Comparative Judgement: Is it ‘Better’ or ‘Worse’ than conventional assessment methods?

 

There is a new term hitting the blogosphere that runs by the name Comparative Judgement. Strongly advocated by David Didau and Daisy Christodoulou, the concept is attracting growing awareness of what it involves and how to achieve it through commercial services such as No More Marking, and the FFT Education collaboration with No More Marking: Proof of Progress.

Daisy, as head of education research at the Ark chain, clearly and persuasively argues the limitations of marking according to pre-defined criteria (variously described by different authors as rubric-marking, criteria-marking, analytical marking and use of mark schemes). The key criticisms are that the terms used in the criteria are often highly vague; that markers interpret them subjectively, resulting in inconsistency; that they channel students into a constrictive frame of similarly limited ‘acceptable’ responses; and that they fail to credit highly imaginative and unusual – though valid – leaps of conceptual understanding by the candidate that do not feature in the rubric statements.

David Didau has followed up with his work in schools over the last year, using Comparative Judgement with teachers and writing a series of informative blogs that describe his thinking and the process in detail. In response to one of the comments on a blog questioning the innovation, David responds:

“What this does – and ALL it does – is mark essays with greater reliability and validity than is possible with mark schemes. It’s also an order of magnitude more efficient. What’s not to like?”

Isn’t this what we are all after? Concerns over the consistency of exam marking, the difficulties of internal moderation within departments, and the workload millstone of endless hours devoted to marking for individual teachers should mean that departments seriously consider Comparative Judgement if those assertions hold up. So, this post considers the three claims – greater reliability, greater validity and greater efficiency (speed) – because if they’re right, this surely has to be something that more departments get on board with.

(This is a long blog, and if you don’t have the time to read in full, you may wish to scroll down to the ‘Usefulness’ section at the end for a summary of the potential value, and problematic issues of CJ that seem to emerge from the literature.)

Comparative Judgement (CJ)

CJ is no newcomer: with respect to academic use it was implemented by one of the key proponents of the method, Alastair Pollitt, in 1996 to moderate marking standards between A level exam boards. Since 2009 CJ has been used to assess undergraduate work on ITE programmes at the University of Limerick.

Without going into a detailed description of the CJ process (David does this very effectively in his post here), it essentially involves sitting in front of a computer screen on which two student responses are presented. A quick skim read of the two results in you clicking to identify which is ‘better’. It is meant to be fast, superficial scanning rather than reading in depth, and relies on internalised (tacit) knowledge of what ‘quality’ in your subject looks like. No list of mark criteria is to be referred to. Another pair of responses appears, and again you judge which of the two is ‘better’. (It is based on the assertion that we are more consistent when we compare items than when we attempt to ascribe each a particular value linked to descriptive criteria.) After several rounds of comparing pairs, in which each student’s response will be compared many times with all, or a proportion, of the others, an algorithm ranks all the judged pieces into a hierarchical order according to the dominant judgements of the assessors doing the procedure. You know which is the ‘best’ judged piece, which the ‘worst’, and the rank order of all the others in between. (Declaration: I haven’t used the process; this is based on what is publicly available from secondary material.)
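
For readers who like to see the mechanics, the sketch below shows one way pairwise judgements can be turned into a rank order. It is a minimal, illustrative Bradley-Terry style fit with made-up judgement data – not the algorithm of any particular CJ service – and the script labels, learning rate and regularisation value are my own assumptions.

```python
# A minimal, illustrative sketch of turning pairwise judgements into a rank
# order, using a Bradley-Terry style fit. Not the algorithm of any particular
# CJ service; the judgement data below are invented.

import math
from collections import defaultdict

# Each tuple records one judgement: (winner, loser).
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
              ("B", "D"), ("A", "D"), ("B", "C"), ("A", "C")]

scripts = {s for pair in judgements for s in pair}
quality = {s: 0.0 for s in scripts}   # one latent 'quality' estimate per script

# Simple gradient ascent on the Bradley-Terry log-likelihood.
for _ in range(500):
    grad = defaultdict(float)
    for winner, loser in judgements:
        p_win = 1.0 / (1.0 + math.exp(quality[loser] - quality[winner]))
        grad[winner] += 1.0 - p_win
        grad[loser] -= 1.0 - p_win
    for s in scripts:
        # Small ridge term keeps estimates finite if a script never loses.
        quality[s] += 0.1 * (grad[s] - 0.01 * quality[s])

# The rank order: highest estimated quality first.
for s in sorted(scripts, key=quality.get, reverse=True):
    print(s, round(quality[s], 2))
```

The details of the fitting method differ between implementations, but the principle is the same: many quick ‘better/worse’ decisions are pooled into a single quality scale, and the rank order is read off that scale.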

Claim 1: Greater Reliability

All teachers want consistency in exam marking, both in their own batch of mock papers from first to last, and from their exam boards. The frustration of sending for GCSE and A level re-marks and finding a significant discrepancy between the two markers is all too widely experienced. There are persuasive claims for the accuracy (reliability) of CJ. It is usually measured by the SSR (Scale Separation Reliability), which in effect measures the degree of agreement between the assessors in judging pairs of responses. With figures of 0.8 to 0.97 in some CJ exercises (1.0 would be ‘perfect’ agreement), these are higher than what is usually achieved in rubric-marking comparisons (0.56-0.8).
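
For what it is worth, the sketch below shows one common way a separation-reliability figure of this kind can be computed once a CJ fit has produced quality estimates and standard errors. The formulation mirrors Rasch-style separation reliability; the exact statistic reported by any given CJ tool may differ, and the numbers used are invented.

```python
# A hedged sketch of computing a Scale Separation Reliability (SSR) figure from
# the quality estimates and standard errors produced by a CJ fit. Mirrors
# Rasch-style separation reliability; values below are invented and the exact
# statistic reported by a given CJ tool may differ.

import statistics

quality_estimates = [-1.8, -0.9, -0.2, 0.4, 1.1, 1.9]     # illustrative scripts
standard_errors   = [0.35, 0.30, 0.28, 0.29, 0.31, 0.36]

observed_var = statistics.pvariance(quality_estimates)              # total spread
error_var = statistics.mean(se ** 2 for se in standard_errors)      # average noise

# 'True' variance is the spread left once measurement error is removed.
ssr = (observed_var - error_var) / observed_var
print(round(ssr, 2))   # values approaching 1.0 indicate a well-separated scale
```

On numbers like these the SSR comes out around 0.93 – comfortably inside the range quoted above – but note that it is still only a statement about the consistency of the scale, not about what the judges were attending to.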

But then we need to be careful with the term ‘accuracy/reliability’. It suggests precision, objectivity and the dismissal of doubt. Yet what is referred to as ‘accuracy’ in CJ is a term for agreement between the judges (inter-rater reliability). When compared with rubric marking, CJ comes out with a higher measure of inter-rater reliability. But what the judges are agreeing about is ‘rank’, based on a scan reading of compared responses. What rubric marking is doing is identifying a mark that represents an assimilation of a breadth of competencies on a ratio scale; not the same thing. It’s the equivalent of a group of sports journalists who have just watched Chelsea beat Spurs showing high agreement about which team was the better. But when it comes to their ratings of the individual players, the level of agreement drops sharply. Which is more ‘accurate’?

In addition, there is some academic challenge to the veracity of the accuracy levels claimed for some of the favoured forms of CJ. Adaptive Comparative Judgement (ACJ) is a technique whereby early (‘Swiss tournament’) rounds of CJ sort all the responses into broadly similar batches, and thereafter pairs of responses that are closer to each other in ranking are presented. The advantage of ACJ is that fewer judgements are required of each answer, so it is quicker for raters to perform overall. But, as this report by Cambridge Assessment’s Tom Bramley argues, there are strong concerns that ACJ overstates the accuracy it claims due to potential flaws in identifying batches for comparison: “The conclusion is, therefore, that the SSR statistic is at best misleading and worst worthless as an indicator of scale reliability whenever a CJ study has involved a significant amount of adaptivity.” (p. 15)
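
To make the ‘adaptivity’ concrete, here is a minimal sketch of the kind of pair-selection step involved: after some initial rounds, the next pair is drawn from scripts that sit close together in the provisional ranking. It is illustrative only; real ACJ engines use more elaborate selection and stopping rules, and the window size here is an assumption of mine.

```python
# A minimal, illustrative sketch of the adaptive step in ACJ: choosing the next
# pair from scripts that are close in current estimated quality. Real ACJ
# engines use more elaborate selection and stopping rules.

import random

def pick_adaptive_pair(current_estimates, window=3):
    """Pick two scripts that sit near each other in the provisional ranking."""
    ranked = sorted(current_estimates, key=current_estimates.get)
    i = random.randrange(len(ranked) - 1)
    j = min(len(ranked) - 1, i + random.randint(1, window))
    return ranked[i], ranked[j]

estimates = {"A": 1.4, "B": 0.9, "C": 0.1, "D": -0.3, "E": -1.2}
print(pick_adaptive_pair(estimates))   # e.g. ('D', 'C') – neighbours in rank
```

The narrower the pairing window, the more the later judgements are deliberately close contests, which is the context for Bramley’s warning about reading too much into the SSR in adaptive designs.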

A further issue arises over the make-up of the raters, with indications that sub-groups of subject raters judge consistently within their group, but differently to other sub-groups. In his comment beneath David’s blog, Dylan Wiliam cites the example of Hugh Morrison’s study of raters in Northern Ireland, where those teaching in grammar schools judged student responses significantly differently to those in secondary modern schools. It raises the question of what one is drawing upon to judge between pairs of scripts if the intention is to skim-read and not analyse in depth. David refers to it as tacit knowledge: the experience we pick up as we teach groups and receive their exam grades back over successive years, along with our own understanding of progression within our subject. Dylan suggests the different sub-groups of teachers in Northern Ireland may have been drawing on different tacit knowledge bases.

In a large 2012 study by Whitehouse and Pollitt, looking at CJ of responses to an AS level physical geography essay (a 15-marker) for which the marked results from the previous summer’s exam were also obtained, it was found that CJ (holistic marking) achieved greater consistency between markers than rubric (analytical) marking. But within the markers there was less consistency amongst GCSE geography teachers than amongst A level teachers, suggesting that tacit knowledge grows with familiarity with the course, rather than with subject-specialism alone. The authors argue that “geographers do not share a set of criteria on which to base their marking decisions” (p.13, para. 2). This may have implications for the commonality of teaching experience needed by teachers involved in CJ exercises.

In an earlier study in 2011 (ibid., p.16, para. 1) looking at CJ with ESL teachers, it was found that holistic marking (CJ) gave greater consistency between markers than rubric marking, but that a single holistic mark was unable to explain the uneven nature of a candidate’s performance, which is important for reporting on various competencies (think subject-specific progress markers as well as AOs) and for providing feedback.

So, I’m left with the questions:

a) if (A)CJ is ‘more accurate’ than rubric marking, is it comparing like-for-like?

b) if ACJ (the faster form) is overstated in its reliability statistics, does it weaken the CJ case?

c) high accuracy data indicate low inter-rater variability (i.e. good agreement). But are the raters agreeing on the things we judge most important? Just how valid is it to get agreement on the ranking of student work based on a quick, scanned reading, rather than slightly less agreement on a detailed reading where the response lies on a ratio scale (a mark that can be assigned a discrete arithmetical value)?

d) What is tacit knowledge? How much is it a shared product? How do we know if it’s there or lacking? What happens with new teachers who haven’t had time to build it up, teachers taking on a second unfamiliar subject, or experienced teachers dealing with intentionally more robust specifications covering new concepts, knowledge and applications, as at present, whose existing tacit knowledge may have to be parked and new tacit knowledge built up?

e) There is a suggestion that length of response and quality of handwriting can influence raters over content, in which case there may be significant agreement, but based on non-cognitive qualities. So how is it ensured that they are judging the cognitive substance, rather than the superficial style, of a batch of student responses?

 

Claim 2: Greater validity

Any assessment is, hopefully, going to yield useful information about the student: what they know, understand and/or what they can do. ‘Validity’ is the capacity of an assessment to shine a light on the defined area of subject-ability we want to have revealed. A test may simply reveal what students have remembered of a list of topic vocabulary learnt for homework; it won’t unearth their understanding of key concepts. As long as we are clear about what such a spelling test does and does not reveal, it has validity. Summative assessment, however (and that is what CJ offers, with little propensity for formative feedback in its rapid form), will invariably attempt to capture breadth of knowledge and understanding, depth of concept application and fluency with essential subject-specific skills. In other words – a broad palette of cognitive hues and intensities. There has to be a question of whether quick scanning for an impressionistic response to an extended piece of work can capture that range and those subtleties.

Sure, if the piece of student response being judged is short and concise, it is going to be easier. Looking at the pieces Daisy put up on the screen at a recent ResearchEd presentation (video link below, image at the top), and those David uses in his explanatory blogs, they are not at all lengthy, arising from KS2 pupils’ work. But there’s a concern if we are ranking children on such a paucity of evidence of their ability. If we have decided that a rank order of students is a worthwhile exercise, we want that ranking to have validity, and it has to be based on more than a few lines of writing. Or else the spotlight is focused on such a narrow space in the arena that it leaves vast swathes of cognitive potential unlit (an illustrative metaphor that clarifies the distinction and relationship between reliability and validity appears in the final section of this paper by Dylan Wiliam, bottom of pp.4-5).

It’s not just the length (i.e. the opportunity for discursive and analytical writing in open responses, which Daisy argues is constricted for able and imaginative students if bounded by mark criteria) but also the format. In one of the largest studies of its kind, McMahon and Jones (Ian Jones is the senior scientific advisor for No More Marking) conducted a CJ exercise in 2015 with a large school in Ireland on an assessment write-up of a chemistry experiment. In a multi-comparison study, five teachers and 37 students (14-15 years old) undertook judgements of responses from 154 fellow students. Two of the teachers also rubric-marked the same responses to compare the CJ ranking with that using traditional mark criteria, and the subset of students peer-assessed all the responses to compare their ranking with that of the five teachers. There was high reliability in the teachers’ ranking (SSR 0.89) and the students’ judgements matched their teachers’ well (correlation of 0.74). But in the conclusion, the authors state that CJ is more suited to tests built around an open item requiring students to explain their understanding in a sustained manner, and which exclude closed items requiring rote recall or the application of procedures. So, not so valid for Knowledge recall and Skills demonstration, but more useful for assessing Understanding. My question here is: does this require assessments to focus on question formats that probe understanding and exclude questions aimed at knowledge recall and skills use, so they are in a form appropriate for CJ? (In fact, Jones suggests as much when he writes: “More generally, tests designed for comparative judging might produce more valid and reliable outcomes when they consist of an open item that requires students to explain their understanding in a sustained manner, exclude closed items that require rote recall or the application of procedures.” pp.19-20). Or do assessments have two sections – one of short knowledge-recall and skill questions that are rubric marked, the results then combined with a CJ ranking of extended written pieces responding to ‘understanding’ challenges – which complicates and lengthens the procedure?

 

Claim 3: Greater efficiency (speedier)

The 30-second judgement sounds attractive in principle. Get through a batch of comparisons in an hour. Job done. Great. But then the time it takes comes down to what is being presented, and what is being presented is, as students become older, going to be lengthy and consist of answers to multiple questions if it is to have the validity to justify a ranking of students. That chemistry experiment exercise mentioned above – it actually took longer using CJ than rubric marking. Considerably longer. The five teachers made 1550 paired judgements (310 each), each of which took, on average, 33 seconds to make. The total judging time came to over 14 hours (approximately 2.8 hours per teacher). Remember, two of the teachers rubric-marked the 154 answers between them to compare the processes, taking 3 hrs (1.5 hrs each). So for an actual exam paper – something broad enough in scope and comprehensive enough in validity to justify, perhaps, a ranking – CJ took far longer than rubric marking: “Nevertheless, comparative judgement took considerably longer to complete than the marking exercise, challenging the efficiency of the method compared to traditional procedures.” (p.15)
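
The arithmetic is easy to check; the figures below are simply those quoted above, re-run as a back-of-the-envelope calculation rather than new data.

```python
# A back-of-the-envelope check of the timings quoted above for the chemistry
# study (no new data, just the reported figures re-run).

judgements = 1550            # paired judgements made by the five teachers
seconds_per_judgement = 33   # average reported judgement time

cj_hours = judgements * seconds_per_judgement / 3600
print(round(cj_hours, 1), "hours of CJ in total")      # ~14.2 hours
print(round(cj_hours / 5, 1), "hours per teacher")     # ~2.8 hours each

rubric_hours = 3.0           # two teachers rubric-marked the same 154 scripts
print(round(cj_hours / rubric_hours, 1), "times longer than rubric marking")
```

On those quoted figures the CJ exercise works out at roughly four to five times the rubric-marking time for the same set of scripts.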

The conclusion of the study’s authors suggests that CJ is more appropriate for assessments built around open-ended answers in which students explain their understanding, and that exclude closed items requiring rote recall or the application of procedures. (The exam paper had three short-response questions and one open question.) A similar result was obtained in a study by Steedle and Ferrara, which found that CJ required 17% more marking time than rubric-marking (Steedle & Ferrara, 2015, draft, ‘Evaluating Comparative Judgement’, p.20).

In the Whitehouse and Pollitt study of CJ-ing AS level Geography essays (15 marks, 2-page response) referred to above, the median judgement time was 2 mins 13 secs. A 15-marker is the final element of a full 25-mark question involving shorter structured answers of 4, 5 and 6 marks. There are four full questions to be answered on the full paper, which is one of three exam papers for the award. Can CJ be scaled up to exam marking?

In response to one comment on his blog, David answers: “I confidently predict that comparative judgement will replace traditional exam marking for all essay-based subjects within the next five years.” Many would be inclined to welcome this given the shortage of examination markers and concerns over non-specialist examiners marking more challenging questions. However, the study above noted that the 2011 AS geography exam produced 23,000 entries, marked by 60 examiners in the 2-3 weeks allocated for marking (384 scripts each). The authors calculated that moving to CJ for a similar turn-around would require 180 examiners working 8 hrs per day for 21 consecutive days to accomplish the task. (Just let that sink in.) That dedication, from three times as many examiners, suggests it’s not going to happen anytime soon. As the report states: “ACJ would be feasible as an alternative to marking for high stakes assessment only if a substantial proportion of centres provided judges.” (p.15)
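
To see why the numbers balloon, the sketch below sets out the kind of scaling arithmetic involved. The judgements-per-script figure is purely my own illustrative assumption – the study’s actual design choices (number of rounds, re-judging, adaptivity) are not reproduced here – but it shows how sensitive the examiner requirement is to that choice.

```python
# A hedged sketch of the scaling arithmetic for CJ-marking a whole exam entry.
# 'judgements_per_script' is an illustrative assumption, not the study's value;
# the examiner count scales roughly linearly with it.

entries = 23_000
judgements_per_script = 20    # hypothetical: how many times each script is seen
seconds_per_judgement = 133   # median judgement time reported (2 mins 13 secs)
hours_per_day, days = 8, 21

total_judgements = entries * judgements_per_script / 2   # each judgement covers two scripts
total_hours = total_judgements * seconds_per_judgement / 3600
examiners_needed = total_hours / (hours_per_day * days)
print(round(examiners_needed), "examiners under these assumptions")
```

With the illustrative value of 20 judgements per script this gives around 50 examiners; the authors’ own figure of 180 implies substantially more judging per script (or fewer judging hours per day) than assumed here, which only strengthens the point about scale.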

There is a suggestion that, to speed up CJ for exams, exam question formats could be changed to service holistic marking by excluding closed responses that require rote recall or the application of skill procedures, and focusing instead on open responses requiring extended demonstration of understanding. My reaction is that this seems to be massively tail-wagging the assessment dog, on a – consequently – narrow breadth of coverage. So the question I’m left with here is: in the one area where summative ranking of students is a justifiable outcome, why would CJ be considered if a) it takes longer, b) it requires more examiners and c) it narrows the validity of the exam format by adjusting question type to its needs?

 

Usefulness? 

If an innovation does something better, faster or cheaper – or takes a bit longer but with benefits that more than repay the extra time – then it is worth exploring. Or sometimes, just to say ‘what the heck, let’s give it a go and be adventurous’. Taking all my reservations into account, I can visualise a number of situations where a trial of CJ could prove useful:

 

Use 1: Inter-regional moderation. If the system works well enough for inter-board exam moderation, then the same need could be met elsewhere. This report by Schools Week a few weeks ago highlights the issue of regional differences in pupils’ key stage 2 writing results. CJ of student responses may overcome the concerns connected with verifying the reliability of samples. Daisy Christodoulou’s ResearchEd 2016 session examines the plausibility of this very clearly at school level (see video here – it’s long, but interesting).

Use 2: Multiple-school moderation. If I were responsible for a subject across various schools in a MAT, or a support-cluster of LA schools, I could see benefits from the full subject-team engaging in CJ of common KS2 or KS3 assessment essays and GCSE/A level mock exam extended responses. It would enable each teacher to get a sense of the spread of responses across a wider range than just their own school, identify schools (teachers) that are inconsistent (or heavy, or light) in their judgements, and help inform a shared discussion about the key qualities of the subject that students should be encouraged/taught to display at different ages. It can help steer a convergence of subject values and ambition of quality outcomes. I think it could prove worthwhile, carried out once every year or two, possibly looking at a different age-group on each occasion.

Use 3: Within a department, many of the same benefits as in ‘2’ would accrue, particularly in seeing good practice from beyond one’s own student range and identifying key responses that could be used in one’s own classroom as exemplars of particular qualities (relevance, efficiency of expression, high-quality analysis, commentary, concluding…), matched to individual students in one’s own class to guide them in overcoming particular hurdles they are currently stuck at.

Use 4: Peer assessment using CJ to inform learning. One way for students to appreciate the different possible ways of answering an open question would be via anonymised peer assessment. Jones and Wheadon (both of NMM) conducted a study with KS4 students and found high levels of reliability in their ability to judge pairs of responses to a maths question. However, they acknowledge that further work has to be done to improve the learning gains and to generate personalised feedback. My chief concern here is the exclusion from the task of students in less able sets, on the grounds that “some students would struggle with the literacy demands of the assessment task” (p.10). In that case, if CJ peer assessment benefits the more able students, achievement gaps will inevitably widen.

Use 5: You know that final, end-of-year assessment? The one that nobody has the energy or time to mark when you are preparing new schemes of work ready for next year? The one where the feedback will have very little impact anyway if it’s just before the summer holidays? That’s when I’d consider the department using a CJ procedure. An open-ended essay covering key ideas from across the entire year; ACJ QR-coded response sheets; a ranking of responses that will not be shared with students – but can be given to next year’s class teachers at the start of the year to inform and guide their classroom organisation and planning. That could hold a number of virtues. (We used Cognitive Ability Test data in this way from tests carried out early in Year 7 and found the confidential scores highly valuable. They weren’t carried out again, though, so the information was becoming less useful by Year 9. I could see how an annual CJ exercise producing subject-specific, not-to-be-shared, summative ranking information might improve upon that.)

In addition, I think Daisy’s and David’s concerns about the imprecise terms used in some mark criteria are worth taking up: challenging and discussing them as a department to reach a more closely aligned common interpretation, particularly as departments up and down the land will be preparing internal mock papers for new courses. It could at least help develop higher quality internally-produced mark criteria, and also help us agree how we credit the student who makes a highly insightful comment but slips between the bars of the mark criteria, perhaps by means of exemplars.

 

The concerns I have that raise key questions over the use of CJ are:

  • Do I have a valid need to rank students? Particularly if that ranking is shared with students. Or unintentionally leaked. The collateral impact of that is potentially so massive that I would be very wary of justifying a CJ ranking exercise other than in the uses mentioned above.
  • How often do we need summative information within a school phase, other than at the end? Would we not be better off carrying out formative assessment marking – which will guide a feedback response? CJ doesn’t provide opportunity for individual feedback without re-marking responses more closely, which ‘ups’ the time demand.
  • How much ‘accuracy’ do we need? Steedle and Ferrara, along with Dylan Wiliam (see blog ‘comment’), note that efforts to tighten the ‘reliability’ statistic can reduce the ‘validity’ measure. Usually, greater validity in the assessment can (will) reduce marker agreement. Do we alter our assessment formats to get the right reliability number, or work on building internal agreement amongst markers about the key values of our subject, conserving high validity and continuously working to improve reliability?
  • How, with key assessments, does a rank order of students meet the specific need of advising on estimated performance? Using ‘anchor pieces’ (inserting responses from previous exam series with known grade scores) may give you a spread of ranked pieces between the anchors (say, 6 pieces between the ‘B’ grade anchor and the ‘C’). But don’t you need to know if they are ‘just into’ the C, or ‘within a mark or two of the B’? – which a ranking is unlikely to indicate.
  • How frequently do mark schemes actually limit students, constrain them to similar responses or fail to credit insights of comprehension not recognised in the rubric criteria? It strikes me that the majority of students benefit from mark criteria as a scaffold against which they get valid ‘assists’ to work towards the top grades. Do we deliberately hold back useful criteria and let pupils struggle while the criteria for success remain locked in a secret garden? If there is an issue in using rubrics – then it seems it mainly relates to those of highest ability. What is the cost to the majority, if mark schemes are ejected in favour of a ranking system that benefits the very brightest?
  • The suggestion that teachers can use exemplar responses identified during a CJ process to help struggling students see what they need to aim for implies, to me, that in verbalising the key qualities the teacher identifies in the demonstration piece, we are creating a verbal rubric as far as the listening student is concerned – having critically discarded the standard written form of it to justify CJ.
  • It’s not a ‘free lunch’. There are financial costs in subscribing to the services of companies that carry out the CJ process. There is the obtaining of QR-coded response sheets, and the assumption that the formal exercise which uses them is the one that is going to capture the best responses. There is the need to photocopy and upload all the responses. And the time taken depends on the extent of the response, the nature of the question format, the need to identify exemplars, and the need to re-mark more closely when specific feedback is to be given.
  • And – it can take longer than rubric marking for past GCSE and A level exam papers. It’s probably a phase issue: doing a CJ session on KS2 responses is likely to be faster, but KS4 and KS5 responses give more to read, more variations in content to consider, and require longer judgements between closely matched pairs. The time-savings appear to seesaw between different ages of students, becoming time-costs with older students and full exam responses.

 

Would I explore it? Yes – for one or more of the teacher-benefits it can offer, if I had one of the identified needs outlined above. It comes with several riders, one of which is that it does seem to perform a function more for high-ability students: they are the ones more likely to be stimulated by a high ranking if they get wind of their placement; they are more likely to be found operating beyond a criteria-based rubric; and – if peer assessment is involved – they are more likely to have the decoding ability required to independently recognise and analyse the qualities of their peers’ responses to help improve their own.

Even the major academic proponents of CJ recognise the need to tread carefully: the final paragraph of the Whitehouse and Pollitt study identifies the virtues and some of the limiting considerations:

“The ACJ system is simple for users to access and work with; it provides robust statistics that can be operationalised for monitoring the judging as it progresses. ACJ may offer an alternative to marking in subjects where the shared criteria for assessment are limited and there is scope for subjectivity. There is a need, however, for further work to identify the assessments that are most suitable for ACJ and what support is required in the selection of judges and the maintenance of a shared set of criteria to use when making judgments.” (p. 16)

Does it do everything it says on the tin? Well, it’s certainly worth visiting the shop and perhaps making a purchase, as long as you do a careful reading of all the very small print in the terms and conditions that have been stuffed in the bag alongside it.

 

A useful blog by Phil Stock (@JoeyBagstock) on his current experimenting with CJ in secondary English: ‘No more marking? Our experience using comparative judgement’
