Regression
Regression is a statistical approach for finding the relationship between variables. It allows you to estimate the value of a dependent variable (Y) from a given independent variable (X). Regression is appropriate when:
- You want to understand how two numerical variables are related.
- Your data is continuous (see Types of Data for more information).
- Your data is linked (paired) – each value of your independent variable has a matching value for the dependent variable (like X-Y coordinates).
Variable X is known as the predictor variable and variable Y is known as the response variable.
There are many types of regression:
Using linear regression, we find the straight line that best "fits" the data, known as the least squares regression line. An online calculator can be found HERE, or you can use a spreadsheet program such as Google Sheets or Microsoft Excel.
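A least squares fit can be sketched in a few lines of NumPy. The data below is made up purely for illustration:

```python
import numpy as np

# Hypothetical paired data (X = predictor, Y = response) -- illustrative values only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with degree 1 returns the least squares slope and intercept.
slope, intercept = np.polyfit(x, y, 1)

# The fitted line predicts Y for any X.
y_hat = slope * x + intercept
```

A spreadsheet's trendline feature performs exactly this calculation behind the scenes.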
An example of a linear relationship is shown below.
If your data has a linear relationship, you can progress to the Pearson Correlation Coefficient, which tells us the direction of the linear relationship (positive, negative or none) between two variables, as well as the strength of that relationship based on the absolute value of the coefficient (weak = 0.0 to 0.29, moderate = 0.30 to 0.49, strong = 0.50+).
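The coefficient and the strength categories above can be sketched with NumPy (the data is made up for illustration):

```python
import numpy as np

# Hypothetical paired data -- illustrative values only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]

# Classify strength using the thresholds above (on the absolute value of r).
strength = "strong" if abs(r) >= 0.50 else "moderate" if abs(r) >= 0.30 else "weak"
```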
Logistic regression is used to fit a regression model that describes the relationship between one or more predictor variables and a binary response variable (such as yes/no or does/does not). For example, researchers want to know how exercise and weight impact the probability of developing diabetes. To understand the relationship between the predictor variables and the probability of developing diabetes, researchers can perform logistic regression because there are only two potential outcomes: either someone develops diabetes, or they do not. An online calculator can be found HERE.
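A minimal sketch of the idea in NumPy, fitting a logistic model by gradient descent on made-up exercise/diabetes data (in practice you would use a statistics package or the online calculator; all numbers here are illustrative):

```python
import numpy as np

# Hypothetical data: hours of weekly exercise (predictor) and whether the
# person developed diabetes (1 = yes, 0 = no) -- illustrative values only.
exercise = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
diabetes = np.array([1,   1,   1,   1,   0,   1,   0,   0,   0,   0])

X = np.column_stack([np.ones_like(exercise), exercise])  # intercept + predictor
w = np.zeros(2)

# Fit by gradient descent on the log-loss (a simple stand-in for a
# statistics package's logistic regression routine).
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))            # predicted probabilities
    w -= 0.1 * X.T @ (p - diabetes) / len(diabetes)

# The fitted model gives P(diabetes) for any exercise level.
prob_at_zero = 1.0 / (1.0 + np.exp(-(w[0] + w[1] * 0.0)))
```

With this toy data the fitted slope is negative: more exercise predicts a lower probability of the outcome.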
Polynomial regression is used to fit a regression model that describes the relationship between one or more predictor variables and a numeric response variable. This is sometimes done after you try linear regression and observe that a polynomial curve would fit the data better. In the figure below, polynomial regression results in a higher R2 value (0.9749 compared to 0.8928) which indicates that the polynomial curve fits the data better. An online calculator can be found HERE.
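The line-versus-curve comparison can be sketched in NumPy. The data below is made up to be roughly quadratic, so the degree-2 fit produces the higher R2, mirroring the figure:

```python
import numpy as np

# Hypothetical curved data -- illustrative values only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.8, 5.2, 9.1, 14.9, 23.2, 33.0])

def r_squared(y, y_hat):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

linear = np.polyfit(x, y, 1)      # degree-1 (straight line) fit
quadratic = np.polyfit(x, y, 2)   # degree-2 (polynomial) fit

r2_linear = r_squared(y, np.polyval(linear, x))
r2_quadratic = r_squared(y, np.polyval(quadratic, x))
```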
Multiple linear regression finds the best-fitting linear model for data comprising two or more independent X variables (e.g. X1 and X2) and one dependent Y variable. For example, if you collected data on height, age and number of flowers of a certain plant species, multiple regression would allow you to predict the number of flowers based on a plant’s height and age. An online calculator can be found HERE.
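The plant example can be sketched with NumPy's least squares solver (all data values are made up for illustration):

```python
import numpy as np

# Hypothetical plant data -- illustrative values only.
height = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])   # cm
age = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])            # years
flowers = np.array([3.0, 5.0, 8.0, 10.0, 14.0, 16.0])

# Design matrix: a column of ones (intercept) plus the two predictors X1, X2.
X = np.column_stack([np.ones_like(height), height, age])
coef, *_ = np.linalg.lstsq(X, flowers, rcond=None)

# Predict the flower count for a hypothetical 22 cm, 2-year-old plant.
predicted = coef @ np.array([1.0, 22.0, 2.0])
```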
Using regression to predict/explain results
Once you have fitted your regression model, you will have an equation that predicts the response variable for different values of the predictor variable. This can be used to make predictions beyond the data you collected, or as a basis for explaining the relationship between the variables.
R-squared is a goodness-of-fit measure for regression models. It is called the coefficient of determination. It uses the differences between each data point and your line/curve, as shown in the figure to the right. It measures the strength of the relationship between your line/curve and the dependent variable on a 0-100% scale. The closer R2 is to 100% (1), the better your line/curve fits the data.
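The definition can be written out directly: R2 compares the scatter of the data around your line (the residuals) with the scatter around the plain mean. A short NumPy sketch on made-up, nearly linear data:

```python
import numpy as np

# Hypothetical, nearly linear paired data -- illustrative values only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.7, 12.2])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# R-squared from the residuals: 1 - SS_res / SS_tot.
ss_res = np.sum((y - y_hat) ** 2)        # scatter around the fitted line
ss_tot = np.sum((y - np.mean(y)) ** 2)   # scatter around the mean
r2 = 1.0 - ss_res / ss_tot
```

Because the points sit close to a straight line, R2 comes out near 1.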
NOTE: This R-squared value (the coefficient of determination) is different to the Pearson Correlation Coefficient and the Spearman Rank Correlation. Don't confuse them!
The graphs below show the difference between a high R2 value and a low R2 value. The data points in the graph on the left are closer to the line than those on the right.
Are low R2 values always bad?
No. Some areas of study have an inherently higher amount of unexplainable variation. In these areas, your R2 values will always be lower. For example, studies that try to explain human behaviour generally have R2 values less than 50%.
However, if you have a low R2 value but the independent variables are statistically significant, you can still draw important conclusions about the relationships between the variables.
Are high R2 values always good?
No! A regression model with a high R2 value can have problems. For example, the regression equation may predict negative values for the response variable, which may be nonsensical. Or, in a polynomial model, the equation may predict that the response variable will begin decreasing after a certain point, which may also be nonsensical.
If your regression analysis reveals a linear relationship, you can continue onto calculating the Pearson Correlation Coefficient.
If your regression analysis reveals a monotonic relationship (this could be a polynomial, exponential or logistic relationship), you can continue onto calculating the Spearman Rank Correlation.
The pictures below show the difference between a positive monotonic relationship (as x-values increase, y-values also increase), a negative monotonic relationship (as x-values increase, y-values decrease) and a non-monotonic relationship (as x-values increase, y-values perhaps decrease and then increase, or increase and then decrease). We can only use the Spearman Rank Correlation in the first two instances.
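Spearman's rank correlation is just the Pearson correlation computed on the ranks of the data. A NumPy sketch on made-up monotonic (but not linear) data:

```python
import numpy as np

# Hypothetical monotonic, non-linear paired data -- illustrative values only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])   # y = x^2: increasing but curved

# Rank each variable (1 = smallest); argsort of argsort gives the ranks
# when there are no ties.
rank_x = np.argsort(np.argsort(x)) + 1
rank_y = np.argsort(np.argsort(y)) + 1

# Spearman's rho is the Pearson correlation of the ranks.
rho = np.corrcoef(rank_x, rank_y)[0, 1]
```

Even though the relationship is curved, the ranks agree perfectly, so rho = 1.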
Your Turn
You decide to investigate the usefulness of Plecoptera (stonefly) nymphs as indicators of environmental factors in streams. Samples from 15 streams are obtained by displacing nymphs from a streambed into a net by means of a standardised-kick technique. Values of water hardness – calcium carbonate concentration – are obtained from the local water authority. The observations are shown in the table below.
You decide to do linear regression on the two variables.
Using the online calculator HERE, enter your data.
The results show the following...
The R2 value is | 0.4231 |
The equation of the line is | Y = -0.1727X + 26.31 |
When water has zero calcium carbonate concentration, the expected number of Plecoptera (stonefly) nymphs is about | 26 |
When water has zero calcium carbonate concentration, we can say with 95% confidence that the expected number of Plecoptera (stonefly) nymphs will be between about | 15 and 38 |
We expect there to be zero nymphs when the calcium carbonate concentration is about | 152 |
The y-intercept (where the regression line crosses the y-axis) indicates the expected number of nymphs when the calcium carbonate concentration is zero.
The x-intercept (where the regression line crosses the x-axis) indicates the concentration of calcium carbonate at which there will be zero nymphs.
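Both intercepts follow directly from the fitted equation Y = -0.1727X + 26.31. A quick check in Python:

```python
# Interpreting the fitted line Y = -0.1727X + 26.31 from the example above.
slope = -0.1727
intercept = 26.31

# y-intercept: expected nymph count at zero calcium carbonate concentration.
nymphs_at_zero = intercept                 # about 26

# x-intercept: the concentration at which the line predicts zero nymphs
# (set Y = 0 and solve for X).
conc_at_zero_nymphs = -intercept / slope   # about 152
```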
The R2 value of 0.4231 is not high. This indicates that the regression line does not fit the data particularly well. We can see this from the graph on the output page.
The graph also highlights that the first three data points look like outliers (or at least require explanation).
Re-run your data excluding the first three points.
You should see the new R2 value of 0.7764. What can you conclude from this?
The new linear regression line fits the data better.
The first three points should clearly be excluded.
If the first three points are indeed valid results, you should try a different form of regression (perhaps polynomial regression).
The new linear regression line does fit the data better.
The first three data points need to be investigated. You should not make the decision to exclude them based solely on the regression results.
It seems reasonable that polynomial regression might fit the complete data set better. However, using the polynomial regression calculator HERE, we can see that the R2 value reduces to 0.6785. It also raises questions: is it reasonable that the number of nymphs starts increasing once the CaCO3 concentration reaches about 120?