Plagiarism arises again and again as an issue in assessment. Suggestions are made that students need to be educated to avoid it, assessment tasks are devised to minimize it, and assessors remain vigilant in detecting it, often by electronic means.
It seems that cheating was a substantial issue in China 1,500 years ago, where those caught cheating on the civil service exam were subject to punishments sometimes as extreme as execution. Candidates submitted to body searches upon entry to the exam. This continued for some 1,300 years, up until around 1900, for a civil service test that involved tens of thousands of candidates at a sitting. Interestingly, papers were not marked but ranked. Thousands of assessors read the papers and were involved in the process of eliminating papers until the top ten were identified. These papers were presented to the Emperor, who decided on the final ranking.
Candidates were anonymous, with seat numbers used in place of names. Handwriting was even disguised: ‘papers were rewritten by clerks, and checked by proof-readers, to remove the influence of calligraphic skills’ (Pollitt, 2012, p. 158).
Both bias and cheating were identified as features that could unfairly influence exam results, and both issues were dealt with. Note that the entire system was judgement based. Pollitt states that marking was ‘invented’ in Europe in 1792, when for the first time marks were allocated to individual questions and added up to produce a final number, which was then ranked against other exam papers. Pollitt goes on to say, ‘thus “judgement” was replaced by “marking”’ (p. 159). However, he doesn’t make it clear whether the 1792 ‘marked’ exams dealt with selected-response assessments, where the correctness of an answer was not particularly contentious, or constructed-response assessments, where professional judgement would have played a significant role.
Constructed-response questions such as essays are more difficult to measure than selected-response questions such as multiple choice. Louis Thurstone published several papers between 1925 and 1935 looking at ways to measure ‘greyness’, which we can interpret as the need to make judgements when responses are neither wholly correct nor wholly incorrect. He conceived a paired system in which a person was asked to choose the better of two responses rather than to assign a value to each. He called it the Law of Comparative Judgement.
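Thurstone’s idea can be illustrated with a short sketch. This is the standard Case V form of his law, with my own illustrative quality values rather than anything from Pollitt’s paper: the probability that a judge prefers one response over another depends only on the difference between the responses’ latent quality values.

```python
from math import erf, sqrt

def thurstone_prefer(mu_a: float, mu_b: float) -> float:
    """Thurstone Case V: probability that a judge prefers response A over
    response B, given latent quality values mu_a and mu_b on a common scale."""
    z = (mu_a - mu_b) / sqrt(2)          # difference of two unit-variance normals
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF computed via erf

# Equal quality is a coin flip; a clearly better response is preferred more often.
print(thurstone_prefer(0.0, 0.0))            # 0.5
print(round(thurstone_prefer(1.0, 0.0), 2))  # 0.76
```

The key point is that no judge ever assigns a number to a response; the scale values emerge afterwards from the pattern of preferences.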
In the 1950s, Rasch developed mathematical models that paralleled Thurstone’s ideas, but in a simplified form that enabled the construction of computer models. Using the Rasch model involves assessors agreeing on a notional scale (a rubric with descriptors) and forming ‘holistic evaluative judgments’ based on it.
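The simplification can be seen in the pairwise form of the Rasch model, which replaces Thurstone’s normal curve with a logistic function. A minimal sketch, with illustrative parameter values of my own:

```python
from math import exp

def rasch_prefer(b_a: float, b_b: float) -> float:
    """Pairwise Rasch form: probability that the script with quality
    parameter b_a is judged better than the script with b_b."""
    return 1.0 / (1.0 + exp(b_b - b_a))  # logistic in the quality difference

print(rasch_prefer(0.0, 0.0))            # 0.5: equal scripts are a coin flip
print(round(rasch_prefer(1.0, 0.0), 2))  # 0.73
```

The logistic curve behaves almost identically to Thurstone’s normal curve but has a closed form that is much easier to compute, which is what made large-scale computer modelling practical.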
Pollitt’s Adaptive Comparative Judgement (ACJ) involves numerous comparisons of pairs of essays or portfolios to determine which is better. This sounds like norm referencing, as essays or portfolios are compared with each other rather than against a standard. In fact, Pollitt does not provide sufficient information to enable a clear understanding of what he is proposing. Are the assessors evaluating each essay against a rubric or standard and then choosing the better of the two? Or are they dispensing with the standard and just choosing the better of the two, rather as the Chinese did all those years ago?
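Whatever Pollitt’s exact protocol, the basic mechanics of turning many pairwise judgements into a rank order can be sketched with standard Bradley–Terry estimation (Zermelo’s iterative algorithm), which is mathematically equivalent to the pairwise Rasch form. This is not code from Pollitt’s paper, and the judgement data below are invented for illustration:

```python
from collections import defaultdict

def rank_from_comparisons(results, iters=100):
    """Estimate Bradley-Terry strengths from pairwise judgements and return
    the items ranked best-first. `results` is a list of (winner, loser) pairs."""
    items = {x for pair in results for x in pair}
    wins = defaultdict(int)   # total wins per item
    n = defaultdict(int)      # number of comparisons per unordered pair
    for winner, loser in results:
        wins[winner] += 1
        n[frozenset((winner, loser))] += 1
    p = {x: 1.0 for x in items}   # initial strength estimates
    for _ in range(iters):        # Zermelo's minorize-maximize update
        new = {}
        for i in items:
            denom = sum(n[frozenset((i, j))] / (p[i] + p[j])
                        for j in items
                        if j != i and frozenset((i, j)) in n)
            new[i] = wins[i] / denom if denom else p[i]
        s = sum(new.values())     # rescale to prevent numerical drift
        p = {k: v * len(items) / s for k, v in new.items()}
    return sorted(items, key=p.get, reverse=True)

# Hypothetical judgements: essay A usually beats B, and B usually beats C.
judgements = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("A", "C"),
              ("B", "C"), ("B", "C"), ("C", "B")]
print(rank_from_comparisons(judgements))  # ['A', 'B', 'C']
```

Note that the output is purely a rank order on a latent scale: nothing in the procedure requires a grade or a mark scheme, which may be what Pollitt means by judges being freed from marking.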
Pollitt claims internal reliability of 0.95 for a spread of assessments involving over 350 portfolios. Effectively, assessors are checked and rated for severe or lenient judging. Intra-rater inconsistency is picked up and some papers are highlighted for rechecking. I’m not sure how his approach differs from a many-facet (FACETS) Rasch analysis. This needs to be much clearer.
He claims that ‘valid assessment requires the design of good test activities and good evaluation of students’ responses using good criteria’ (p. 166). The former, ‘good test activities’, and the latter, ‘good criteria’, are clearly issues of validity. However, his central idea, ‘good evaluation of students’ responses’, seems to be more a matter of reliability.
He also says, ‘since CJ [comparative judgement] requires holistic assessment, a very general set of criteria is needed’ (p. 166). However, it is not easy to see why CJ requires holistic assessment or why a very general set of criteria is needed. He later gives an example of about 160 words that outlines a general set of criteria, but no levels of achievement. He concludes the paragraph with ‘ACJ judges are expected to keep their minds on what it means to be good at Design and Technology: there is no mark scheme to get in the way’ (p. 167). This seems to suggest that the assessors are not awarding grades at all, but simply ranking pairs of essays. Again, his process is not entirely clear.
He states that ‘the method demands and checks that a sufficient consensus exists amongst the pool of judges involved’. However, this doesn’t seem unique: the same would be demanded of most marking sessions involving multiple markers. In a later section he claims high levels of reliability for ACJ, partly as a result of assessors being ‘focused wholly on making valid judgements’. This needs more clarification.
Pollitt’s references to the situations in China and in Europe hundreds of years ago are most interesting. However, they seem to be the stuff of a different paper from his discussion of comparative judgement. His work on comparative judgement needs to be more clearly explained.
Pollitt, A. (2012). Comparative judgement for assessment. International Journal of Technology and Design Education, 22, 157–170. Retrieved on 23/7/2012 from http://www.springerlink.com.libraryproxy.griffith.edu.au/content/237j0t5841h8v150/fulltext.html