Pearson's (r) Limitations

This activity explores two limitations of Pearson's product-moment correlation coefficient (PPMCC), r:

  • Non-linear relationships/models
  • Outliers

The teaching slides are at the bottom of the page because it is important that students first explore the applets and Tasks 1 to 3 below for themselves. The slides can then serve as a helpful visual summary of the findings of those activities.

Teaching slides - Pearson's correlation coefficient: standardising the covariance

It is recommended to start the class with the "Teaching Activities" that follow the presentation below. The slides present and summarise some of the key ideas that students should discover or work out for themselves from those activities. You can click on each image to make it fill the screen and then click through them. You might also consider presentation mode when using these slides.


Teaching Activity - non-linear data

Task 1: Pearson's (r) and non-linear data 1

(a) Enter the data below into your graphic display calculator (GDC), plot it as a scatter graph, and discuss with your group which functions you think could fit the data, e.g. linear, quadratic, cubic or exponential.

(b) Sketch the scatter graph (you can print off this grid and axes) and draw over the data points the model that you, as a group, think best fits the data.

(c) Use your calculator to determine Pearson's coefficient (r) for this data.

(ii) Discuss how well the model fits the data using this value of Pearson's (r). Approximate values of 'r' and their interpretation are given on the right: 

Reflection challenge: what do you think about the degree of accuracy given (number of significant figures/decimal places) for each of the coefficients in the model your group has chosen? Consider the scale of the x-values when justifying the degree of accuracy you think would be appropriate, and why.  
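
For part (c), it can help to see what the calculator is doing when it reports r. The sketch below is a minimal Python illustration of Pearson's (r) as the covariance divided by the product of the standard deviations. The function name `pearson_r` and the five data points are made up for illustration; they are not the Task 1 data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: covariance of x and y divided by the product
    of their standard deviations (the common factor 1/n cancels)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (n times the covariance)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (n times each variance)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration only (NOT the Task 1 data set)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 3))  # 0.775
```

Students can check a value like this against the r reported by their GDC for the same points.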

Task 2: Pearson's (r) and non-linear data 2

(a) In the applet below, 'drag and drop' the data points so that they follow, in turn, each of the models listed below. Once you have positioned the points in a pattern that you think fits the model, click on its checkbox to see the function. Take a screenshot (or print) of the whole graph, so that the data, the function's equation, its curve and Pearson's correlation coefficient are all visible - example screenshot below:

(i) cubic model

(ii) exponential 

(iii) trigonometric sine function

If you can't remember what any of these look like, graph each one in your GDC to remind yourself: a cubic, ax³ + bx² + cx + d; an exponential, ka^(bx) + c; and a trigonometric sine function, a sin(bx) + d.

(b) Discuss how well each model fits the data using its value of Pearson's (r). You can use the approximate values of 'r' and their interpretation given in Task 1 above.

(c) Compare the interpretation suggested by Pearson's coefficient (r) with your own impression, from simply looking at the graphs you have produced, of how well each model fits the data. Do you think Pearson's coefficient offers an accurate measure of this 'goodness of fit', i.e. how close the points lie to the function/model?
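
The comparison in part (c) can also be checked numerically. The sketch below is a minimal Python illustration, assuming three made-up functions and x-values (not the applet's data). In each case every point lies exactly on its curve, so the fit is perfect, and r is computed for the same points.

```python
from math import sqrt, sin, pi

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical cubic y = x^3 - 7x, sampled at x = -3, ..., 3
xs_cubic = [-3, -2, -1, 0, 1, 2, 3]
ys_cubic = [x ** 3 - 7 * x for x in xs_cubic]

# Hypothetical exponential y = 2^x, sampled at x = 0, ..., 10
xs_exp = list(range(11))
ys_exp = [2 ** x for x in xs_exp]

# Hypothetical sine y = sin(x), sampled every 30 degrees over one period
xs_sine = [k * pi / 6 for k in range(12)]
ys_sine = [sin(x) for x in xs_sine]

r_cubic = pearson_r(xs_cubic, ys_cubic)
r_exp = pearson_r(xs_exp, ys_exp)
r_sine = pearson_r(xs_sine, ys_sine)

for name, r in [("cubic", r_cubic), ("exponential", r_exp), ("sine", r_sine)]:
    print(f"{name}: r = {r:.3f}")
```

For these particular points r is 0 for the cubic and clearly short of ±1 for the other two, even though no point is off its curve. This is one way to show that r measures how close the points are to a straight line, not how close they are to the chosen model.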

Task 3: Pearson's (r) as a measure of a model's 'goodness of fit'

Use the applet below (the applet above also works, but the one below has some different models to try) or your calculator. You can reuse the data you entered for Task 1 above, or enter the x and y coordinates of each point shown in the applet below. Tick the boxes next to the different model functions to see how well each fits the data. What happens to Pearson's coefficient (r) for each model? What do you conclude from this about Pearson's coefficient as a measure of how well the different functions fit the data?
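
One way to make this conclusion concrete is to compare r with a measure that does depend on the model. The sketch below is an illustration only: it uses made-up data, a hypothetical candidate model y = x², and the coefficient of determination R² (1 minus the ratio of residual to total sum of squares), which is a swapped-in measure not named in the task. The helper names `pearson_r` and `r_squared` are assumptions for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data lying close to y = x^2 (NOT the applet's data)
xs = [0, 1, 2, 3, 4, 5]
ys = [0.2, 0.9, 4.1, 9.2, 15.8, 25.1]

# r depends only on the data, so it is the same whichever model is ticked
r = pearson_r(xs, ys)

# Least-squares straight line through the data
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
linear_pred = [slope * x + intercept for x in xs]

# Hypothetical quadratic candidate model y = x^2
quadratic_pred = [x ** 2 for x in xs]

r2_linear = r_squared(ys, linear_pred)
r2_quadratic = r_squared(ys, quadratic_pred)

print(f"r = {r:.4f} (unchanged by the choice of model)")
print(f"R^2 for the line:      {r2_linear:.4f}")
print(f"R^2 for the quadratic: {r2_quadratic:.4f}")
```

In this sketch R² for the least-squares line equals r², while R² for the quadratic is higher. The value of r itself never changes, because it is calculated from the data alone.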

Teaching Activity - Outliers

Task 4: Pearson's (r) and Outliers

Below is the hockey player height and weight data we used in the Covariance and Pearson's correlation coefficient activities. One player is clearly atypical, in terms of height and weight, compared with the rest of the players. This player's data is highlighted in pink in the spreadsheet.

(a) Work out the mean height and weight of the hockey players, excluding this 'atypical' player, and replace the 'atypical' data in the spreadsheet with these mean values. What do you notice happens to Pearson's correlation coefficient?

(b) Now replace the same 'atypical' player's data with a height of 62 inches and a weight of 290 pounds. What is the value of Pearson's correlation coefficient now?

(c) In your groups, discuss what your results above imply about the sort of data for which Pearson's (r) is and isn't reliable.

(d) Using the definition of an outlier as a value more than 1.5 times the interquartile range from the nearest quartile, work out whether the original value and the value used in part (b) are outliers or just atypical (uncommon) heights and weights for a professional ice hockey player.
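
The effect explored in parts (b) and (d) can be sketched in Python as follows. The nine 'typical' players are made-up values (NOT the spreadsheet data); only the replacement point of 62 inches and 290 pounds comes from part (b). Quartiles are taken from `statistics.quantiles` with its default method, which is an assumption: a GDC or spreadsheet may use a slightly different quartile convention and give slightly different fences.

```python
from math import sqrt
from statistics import quantiles

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def iqr_fences(values):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(value, values):
    """True if value lies beyond either fence of the data set."""
    lower, upper = iqr_fences(values)
    return value < lower or value > upper

# Made-up 'typical' players: heights in inches, weights in pounds
heights = [70, 71, 72, 72, 73, 73, 74, 74, 75]
weights = [185, 190, 195, 200, 200, 205, 210, 215, 220]

# Add the part (b) replacement point: 62 inches, 290 pounds
heights_b = heights + [62]
weights_b = weights + [290]

r_typical = pearson_r(heights, weights)
r_b = pearson_r(heights_b, weights_b)
print(f"r without the extra point: {r_typical:.3f}")
print(f"r with (62, 290) added:    {r_b:.3f}")

print("62 in is a height outlier: ", is_outlier(62, heights_b))
print("290 lb is a weight outlier:", is_outlier(290, weights_b))
```

With these made-up values a single extreme point is enough to turn a strong positive r into a negative one, and both of its coordinates fall outside the 1.5 × IQR fences.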



Resources


IB exam style questions


Unit Planning

This section will offer teachers advice about the unit planning element of teaching this subtopic. This will cover references to ToK and the ATTL sections.


Useful Links

This section will offer relevant links that teachers may be interested in. In each case these links will have some commentary from the site authors to suggest how they might be of use.
