Chi Squared, regression and causation

Monday 13 April 2015

Both tests on the same data?

This is a quick blog post about a couple of discussions I have had recently about students using both linear regression and independence tests on the same data. I wrote this in a correspondence recently and thought it worth sharing. Comments welcome!

'There is no official line that you cannot use both tests on the same data. My hypothesis is that using the two tests on the same data is irrelevant, in the sense that the second test adds nothing to the first. As a rule I think it is good to encourage students to use scatter graphs with numerical data and chi² if one or both of the variables are categorical. Clearly correlation does not imply causation, but a chi² test does not imply causation either. The two tests both look for a relationship between two variables. If both variables are numerical then a scatter graph is appropriate; if either or both are categorical then chi² is appropriate. Clearly a chi² test can be done with two numerical variables by grouping them into categories, and it would not be wrong, but I can't see a need for it. The outcome of the chi² test would depend on the class intervals chosen to create the categories, which could be adjusted to suit an outcome.
I am happy to be corrected on this and usually raise it at workshops because I find it interesting. My challenge to teachers is to produce example data where one test genuinely offers something the other doesn't (i.e. no correlation, but dependence). I am not saying it doesn't exist, but given the somewhat arbitrary nature of choosing class intervals I suspect that it doesn't, though I haven't ruled it out. Pragmatically, for Mathematical Studies I think it highly unlikely that students will successfully distinguish between the value of the two tests when both are done on the same data.

HOWEVER, this does not mean students can't use both. For example, a project investigating literacy rates might include a scatter graph of literacy against GDP, and then a chi² test to see whether literacy rate is dependent on continent. This is great. In this case the student has used both tests to investigate a theme involving literacy rates, but not on exactly the same data.
If students have done both tests on the same data then I think it is difficult to mark, because the student would need to justify what one test offers that the other doesn't in order for the second to be considered relevant.

In summary, my advice is that students avoid doing both tests on the same data, but there is no official line that it can't be done. If you want to give students full marks then I would advise noting for the moderator where you see the difference. Otherwise I think the relevance can be questioned.'
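
One contrived candidate for the challenge above is a perfectly symmetric curve, y = x². The Python sketch below is an illustration only, not data from a real project: the values of x, the class boundaries (7.5, 0 and 56.25) and all the function names are assumptions made up for this example. The p-value helper only handles 1 or 2 degrees of freedom, and no continuity correction is applied. It computes the correlation coefficient r, then runs a chi² test on the same data twice, once with three classes for x and once with two.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def classify(value, cuts):
    """Index of the class interval holding value; cuts are sorted boundaries."""
    return sum(value >= c for c in cuts)

def contingency(xs, ys, x_cuts, y_cuts):
    """Observed frequencies: rows are x classes, columns are y classes."""
    table = [[0] * (len(y_cuts) + 1) for _ in range(len(x_cuts) + 1)]
    for x, y in zip(xs, ys):
        table[classify(x, x_cuts)][classify(y, y_cuts)] += 1
    return table

def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for an observed table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df

def p_value(stat, df):
    """Upper-tail probability, using closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# Made-up data: 45 points on the curve y = x^2, symmetric about x = 0.
xs = list(range(-22, 23))
ys = [x * x for x in xs]

r = pearson_r(xs, ys)

# Same data, two different choices of class intervals for x.
three_class = chi_squared(contingency(xs, ys, [-7.5, 7.5], [56.25]))
two_class = chi_squared(contingency(xs, ys, [0], [56.25]))

print(f"r = {r:.3f}")
print(f"three x classes: chi2 = {three_class[0]:.2f}, p = {p_value(*three_class):.2g}")
print(f"two x classes:   chi2 = {two_class[0]:.2f}, p = {p_value(*two_class):.2g}")
```

With these assumptions r comes out as zero by symmetry, and the three-class table (all expected frequencies at least 5) gives a chi² statistic of 45, so the test reports dependence. The two-class table on exactly the same data gives a statistic close to zero and reports independence, which is the class-interval sensitivity described in the correspondence. Whether noise-free, invented data like this "genuinely offers something" for a student project is a separate question.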

... to be continued

