IA data processing and p-hacking
Saturday 1 May 2021
I have recently learned the term p-hacking: the misuse of significance testing, which leads to studies making striking claims that go against generally accepted ideas. We have all read headlines claiming that if we just eat a special type of food regularly it will have a miraculous effect on our health. Are these claims the results of p-hacking?
Scientists are accused of lowering scientific standards by p-hacking. Cherry picking, that is excluding the inconclusive data sets and reporting only those showing significance, can make a partly inconclusive study look innovative and groundbreaking. Searching through the data for correlations can be another form of p-hacking. If the hypothesis is written first and the data are tested afterwards, the statistics for testing significance and rejecting the null hypothesis are reliable. Reversing this order, looking for a correlation before writing the hypothesis, can lead to incorrect conclusions.
There is positive feedback too. The more surprising the findings of a study, the more attention the research gets in the media. News headlines attract clicks and views, which is good for publishers. Researchers want conclusive results after years of arduous data collection; conclusive findings help secure the next round of funding, so the stakes are high.
DP Biology students sometimes inadvertently indulge in p-hacking. Desperate to find something significant in an IA study, they might delete anomalous data, collect extra data to lower their p-value, or even test a completely new hypothesis if it looks more significant. We should be able to teach DP Biology students how to recognise p-hacking: it is an important skill in critical thinking. More importantly, there are no marks for statistical significance in the IA, only for correct data processing and methodology. A conclusion correctly stating that the data are inconclusive scores more highly than one which has errors in the statistics.
In the last five minutes of this short video, Zedstatistics explains why p-hacking is a problem in scientific research. It's problematic in IA investigations too, especially those which take data from a database.
Imagine an IB Diploma student who chooses to study the effect of light intensity on the morphology of oak leaves in their IA. Do oak leaves have a different shape in brighter light?
The student is hard working and measures many different leaf characteristics in the hope of finding the one most affected by light. These dependent variables include: mass, length, width, surface area, leaf thickness, leaf colour, etc. In the data processing, a statistical test is carried out separately on each variable to see if the null hypothesis should be rejected (H0: light intensity has no effect on this feature).
After many tests the student finds that the data for leaf thickness differ significantly between the light conditions, and proposes that leaf thickness is affected by light intensity. Success! The student then writes the research question, "What is the effect of light intensity on the thickness of Q. robur leaves?". To keep the IA data relevant, the student includes only the leaf thickness data in the report. This is a classic case of p-hacking. Why is it wrong?
Using a standard test of significance, we reject the null hypothesis if p < 0.05. It is reasonable to say that the probability of collecting these results, if the null hypothesis were true, is small: less than 5% (p < 0.05). Results like these would be unlikely if the null hypothesis were true, so the student may logically conclude that the experimental hypothesis might be a better explanation.
However, roughly one in every twenty tests of significance can give p < 0.05 even when the null hypothesis is true and light intensity has no effect on the feature. By definition, this is the probability level we are using as the test: with a significance level of 0.05, the probability of getting results like these when there is no real difference is one in twenty (5%). So if you test twenty different leaf features, it is more likely than not that at least one will give p < 0.05 in a stats test even when none of them is actually affected by light intensity. Reporting only that one result is p-hacking.
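To see this in action, here is a minimal sketch in Python (not part of the original argument; the feature count, sample sizes and measurements are all invented for illustration). It simulates twenty leaf features for which the null hypothesis is genuinely true, yet running it will usually turn up at least one 'significant' result purely by chance. The final line shows why: the chance of at least one false positive in twenty independent tests is 1 - 0.95^20, roughly 64%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_features = 20   # e.g. twenty different leaf measurements
n_leaves = 30     # leaves sampled per light condition

false_positives = 0
for feature in range(n_features):
    # Both groups are drawn from the SAME distribution, so the null
    # hypothesis is true: light intensity has no effect on this feature.
    bright = rng.normal(loc=10.0, scale=2.0, size=n_leaves)
    shade = rng.normal(loc=10.0, scale=2.0, size=n_leaves)
    t_stat, p = stats.ttest_ind(bright, shade)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' results found by chance: {false_positives} out of {n_features}")

# The chance of at least one false positive across 20 independent tests:
print(f"P(at least one p < 0.05) = {1 - 0.95 ** 20:.2f}")   # about 0.64
```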
To avoid p-hacking:
- Write the hypothesis before analysing the data.
- Avoid doing multiple tests of significance in a study, e.g. repeated t-tests.
- Include all the results in the study; don't be tempted to exclude results which look wrong.
There is a simple correction, the Bonferroni correction, which avoids p-hacking if there is no alternative to multiple t-tests. Count how many t-tests are done in the analysis and divide the 'alpha', the p < 0.05 significance level, by the number of t-tests. If there are ten tests, then the threshold for a significant difference between the observed data and the null hypothesis becomes p < 0.05/10, i.e. p < 0.005.
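As a concrete illustration, here is a short Python sketch of that arithmetic; the ten p-values are invented for the example, and with the corrected threshold of 0.005 only the values below it count as significant.

```python
# Ten invented p-values, one per t-test in the analysis.
p_values = [0.030, 0.210, 0.004, 0.047, 0.500,
            0.120, 0.015, 0.330, 0.060, 0.009]

alpha = 0.05
bonferroni_alpha = alpha / len(p_values)   # 0.05 / 10 = 0.005

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < bonferroni_alpha else "not significant"
    print(f"test {i:2d}: p = {p:.3f} -> {verdict} (threshold {bonferroni_alpha:.3f})")
```

The same adjustment is built into common statistics libraries, for example statsmodels' multipletests(p_values, method='bonferroni').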
If you want to know more, I recommend this article by Geoff Cumming in The Conversation, "Why so many science studies are wrong." If you prefer something more mathematical, read this clear explanation of significance testing and how to avoid the familywise error, from Charles Zaiontz. It includes the Bonferroni correction to the p-value threshold, which can avoid this kind of p-hacking (the familywise error). Zaiontz also mentions the specialised statistics to use in place of multiple t-tests: ANOVA, or the Mann-Whitney and Wilcoxon tests, are better suited to this type of data.
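For readers who like to see the single-test alternative written out, here is a hedged Python sketch with simulated leaf-thickness values (not real data). A one-way ANOVA compares all three light-intensity groups in one test, so there is no familywise error to correct; the Kruskal-Wallis test shown alongside it is the multi-group, non-parametric analogue of the Mann-Whitney test mentioned above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical leaf thickness (mm) for 25 leaves under three light intensities.
full_sun = rng.normal(loc=0.30, scale=0.05, size=25)
part_shade = rng.normal(loc=0.27, scale=0.05, size=25)
deep_shade = rng.normal(loc=0.24, scale=0.05, size=25)

# One-way ANOVA: a single test comparing all three groups at once.
f_stat, p_anova = stats.f_oneway(full_sun, part_shade, deep_shade)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Kruskal-Wallis: the non-parametric equivalent if the data are not normal.
h_stat, p_kw = stats.kruskal(full_sun, part_shade, deep_shade)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```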