Of blobs and difficulty
Wednesday 15 May 2019
Writing a sample Listening exam (see the page Listening exam #1) has made me think about the ways ‘difficulty’ is structured and assessed in exams. For seven years, I had the job of writing Texthandling papers for English B HL, testing reading comprehension, and so have experience of the problems of judging how difficult a stimulus text might be for second language learners, how difficult the questions might be, and how to meet the specification of how difficulty was to be arranged across the various texts. The standard architecture was that the first text should be ‘easy’, that the following texts should get more ‘difficult’, reaching a peak at the fourth text, and then the final text should be a little ‘easier’ again. This is a simple plan to describe, but not so simple to achieve in practice – because it was permanently challenging to define exactly how ‘easy’ or ‘difficult’ any of the elements might be.
At the time, I was Deputy Chief to a Chief Examiner called Bryan, who was a serious expert in the academic analysis of language assessment techniques. This meant, in particular, that he could apply a vast toolkit of statistical techniques – and I spent much time stumbling along trying to follow his patient explanations of what on earth he was talking about. I am still no statistician (never understood what ‘standard deviation’ really meant), but could at least grasp most of the general principles in practical layman’s terms. Partly because of my mathematical ignorance, but also because of the need to have some form of visual analysis which would provide a clear overall picture, we devised something called a ‘blob chart’.
This inelegant title referred to a large chart of squares, in which there was a column for each of the 60 questions in the reading paper, and a row for each of the students in the sample we examined. The sample contained 100 students, chosen representatively across the range of performance. Some unfortunate IB assistant had the job of entering the results in the chart – if an answer was correct, a mark, or ‘blob’, was put in the relevant box, and if not correct, there was a space.
In addition, the students were organized from top to bottom according to their final overall total – highest mark at the top, lowest at the bottom. And remember that the questions, from left to right, reflected the architecture of increasing difficulty that I mentioned at the beginning.
So, we expected that the top row, showing the best student, would be full of blobs all the way across – and the bottom row, showing the weakest student, would probably have only a few blobs, at the left, where the first and easiest questions were located. In theory, then, there should be a fairly clear diagonal shading from bottom left to top right.
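(Purely as illustration, here is a minimal Python sketch of how such a chart could be knocked together today. The student names and answers below are entirely made up, just to show the shape of the thing – the real charts had 100 students and 60 questions, and were filled in by hand.)

```python
# A minimal 'blob chart' sketch. Each row is one student's results:
# True = correct answer (a 'blob'), False = wrong (a blank space).
# All names and data here are hypothetical.
results = {
    "student_01": [True, True, True, False, True, False],
    "student_02": [True, True, False, False, False, False],
    "student_03": [True, False, True, True, True, True],
}

def blob_chart(results):
    """Print one row per student, highest overall total at the top,
    with a '#' for each correct answer and a space for each wrong one."""
    # Sort students by overall total, highest first (as on the original charts).
    ranked = sorted(results.items(), key=lambda item: sum(item[1]), reverse=True)
    for name, answers in ranked:
        row = "".join("#" if correct else " " for correct in answers)
        print(f"{name:12s} |{row}|")

blob_chart(results)
```

If the questions really do run from easiest (left) to hardest (right), the blobs should shade diagonally, as described above.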
Unfortunately… it never worked out like that! All the blob charts we ever generated were extremely messy – deciphering them was a bit like reading tea leaves, or interpreting a Rorschach blot. But they did demonstrate a few things, which I suggest will apply to all such comprehension tests.
- Some excellent students got ‘easy’ questions wrong, and occasionally very weak students got ‘difficult’ questions right. Possible reasons would be that the excellent students were simply careless, or even dismissive of the ‘easy’ questions and so jumped to conclusions – and the weak students might simply have guessed and struck lucky.
- Some questions generated vertical columns which were full of blobs, or had hardly any blobs at all. If a column was full, that would suggest that the question was so easy that nobody could get it wrong, and so the question didn’t discriminate in any way. If there were no blobs at all, that would suggest that the question was too difficult for any of the students, either because the question itself was hard to understand, or because the concept being tested was beyond the student group’s possible understanding. (Or – once or twice, and to our embarrassment – the answer in the markscheme was actually wrong!) The sketch after this list shows the kind of per-question check this implies.
- Patterns of blobs or spaces indicating groups of questions were sometimes most revealing. Such groups typically tested specific skills – to illustrate, five ‘choose the true statements’ questions would test skimming / generalization skills, whereas ‘short answer’ questions might be testing scanning for detail. Such patterns tended to tell us about individual students’ abilities – for instance, a student who got all five (supposedly ‘easy’) skimming questions wrong, but got all the (supposedly ‘difficult’) short answer scanning questions right, had presumably been trained in scanning, but not in skimming.
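(Again purely as illustration: a rough Python sketch of that per-question check, using the same hypothetical data layout as the earlier blob-chart sketch. The ‘facility’ and crude top-half-minus-bottom-half ‘discrimination’ figures are my own simplification, not anything from the IB’s actual procedures.)

```python
# For each question (column), report its facility (fraction of students who
# answered correctly) and a crude discrimination figure: facility among the
# top half of students minus facility among the bottom half.
# Hypothetical data, same layout as the blob-chart sketch above.
results = {
    "student_01": [True, True, True, False, True, False],
    "student_02": [True, True, False, False, False, False],
    "student_03": [True, False, True, True, True, True],
}

def question_stats(results):
    ranked = sorted(results.values(), key=sum, reverse=True)  # best students first
    half = max(len(ranked) // 2, 1)
    top, bottom = ranked[:half], ranked[-half:]
    for q in range(len(ranked[0])):
        facility = sum(row[q] for row in ranked) / len(ranked)
        discrimination = (sum(row[q] for row in top) - sum(row[q] for row in bottom)) / half
        note = ""
        if facility == 1.0:
            note = "  <- everyone right: tells us nothing"
        elif facility == 0.0:
            note = "  <- everyone wrong: too hard, or a markscheme error?"
        print(f"Q{q + 1}: facility {facility:.2f}, discrimination {discrimination:+.2f}{note}")

question_stats(results)
```

A column that is entirely full or entirely empty shows up immediately as a facility of 1.00 or 0.00 – exactly the ‘no discrimination’ and ‘too difficult’ cases described above.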
The key term there is ‘training’ – teaching students to read (or listen) methodically in the process of the disciplined handling of comprehension questions.
Looking back now, the blob charts were terribly primitive technology – but we pored over them very eagerly, trying to find objective confirmation of how successfully the paper had been designed, or where we should place the grade boundaries. Nowadays, such papers are marked online, so I imagine that, with the appropriate programming, the IB’s computer could generate vast blob charts to cover 15,000+ students, and use big-data statistical operations to detect the kind of patterns that I have mentioned. I wonder if, even now, some graduate is doing a PhD thesis using just this process?