Marking Essays Causes Mental Depletion

We always knew it, didn’t we? At the start of a marking session we’re more meticulous than we are later on, when we’re far more likely to make a ‘gut’ or ‘near enough is good enough’ decision before moving quickly to the next paper in the pile.

A 2011 paper on judicial decision making reached a similar conclusion, but in a far higher-stakes environment. It looked at decisions by a parole board on prisoners’ requests to have their sentences reduced. The study found that favourable rulings were more likely at the start of the day and after a meal or snack break; immediately before a break, rulings were less favourable. This pattern held for each of the eight judges in the study.

The essence is that when making repeated decisions, such as marking mountains of student papers, our judgment is likely to become harsher or more perfunctory the longer the session runs.

Well, I’m off to do some marking. I’m only half-way through the 90 papers. The good news for the next few students is that I’ve just had a break! 🙂


Danziger, S., Levav, J., & Avnaim-Pesso, L. (2011). Extraneous factors in judicial decisions. Proceedings of the National Academy of Sciences of the United States of America, 108(17), 6889–6892.

Posted in professional judgment, reliability, standards, writing assessment issues

Off with the head – China had the answer to plagiarism hundreds of years ago.

Plagiarism arises again and again as an issue in assessment. Suggestions are made that students need to be educated to avoid it, assessment tasks are devised to minimize it, and assessors remain vigilant to detect it (often by electronic means).

It seems that cheating was a substantial issue in China 1500 years ago, where those caught cheating on the civil service exam were subject to punishments sometimes as extreme as execution. Candidates submitted to body searches upon entry to the exam. This continued for some 1300 years, up until around 1900, for a test that involved tens of thousands of candidates at a sitting. Interestingly, papers were not marked but ranked. Thousands of assessors read the papers, progressively eliminating them until the top ten were identified. These were presented to the Emperor, who decided on the final ranking.

Candidates were anonymous, as seat numbers were used in place of names. Handwriting was even disguised: ‘papers were rewritten by clerks, and checked by proof-readers, to remove the influence of calligraphic skills’ (Pollitt, 2012, p. 158).

Both bias and cheating were identified as features which could unfairly influence exam results, and both issues were dealt with. Note that the entire system was judgement-based. Pollitt states that marking was ‘invented’ in Europe in 1792, when for the first time marks were allocated to individual questions and added up to produce a final number, which was then ranked against other exam papers. Pollitt goes on to say, ‘thus “judgement” was replaced by “marking”’ (p. 159). However, he doesn’t make clear whether the 1792 ‘marked’ exams involved selected-response assessments, where the correctness of an answer was not particularly contentious, or constructed-response assessments, where professional judgement would have played a significant role.

Constructed-response questions such as essays are more difficult to measure than closed-response questions such as multiple choice. Louis Thurstone published several papers between 1925 and 1935 looking at ways to measure ‘greyness’, which we can interpret as the need to make judgements when responses are neither wholly correct nor wholly incorrect. He conceived a paired system in which a person was asked to choose the better of two responses rather than to assign a value to each. He called it the Law of Comparative Judgement.
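
Thurstone’s idea can be sketched concretely. The snippet below is a minimal illustration, not Thurstone’s own formulation: it uses the closely related Bradley–Terry model (quality differences on a logit scale) to recover a ranking from a handful of hypothetical pairwise judgements.

```python
import math

# Hypothetical pairwise judgements: (winner, loser) for essays A-D.
comparisons = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"), ("A", "C"),
]

essays = sorted({e for pair in comparisons for e in pair})
quality = {e: 0.0 for e in essays}  # latent quality on a logit scale

def win_prob(a, b):
    """Bradley-Terry form: P(a beats b) from the quality difference."""
    return 1.0 / (1.0 + math.exp(quality[b] - quality[a]))

# Crude gradient ascent on the log-likelihood of the observed judgements.
for _ in range(500):
    grad = {e: 0.0 for e in essays}
    for winner, loser in comparisons:
        p = win_prob(winner, loser)
        grad[winner] += 1.0 - p
        grad[loser] -= 1.0 - p
    for e in essays:
        quality[e] += 0.1 * grad[e]
    mean = sum(quality.values()) / len(essays)  # anchor the scale at zero
    for e in essays:
        quality[e] -= mean

ranking = sorted(essays, key=quality.get, reverse=True)
print(ranking)  # A, unbeaten, ranks first; D, winless, ranks last
```

No individual response is ever given a value; the scale emerges entirely from the choices between pairs, which is the essence of comparative judgement.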

In the 1950s Rasch developed mathematical models that paralleled Thurstone’s ideas, but in a simplified form that enabled the construction of computer models. Using the Rasch model involves assessors in agreeing on a notional scale (rubric / descriptors) and forming ‘holistic evaluative judgments’ based on it.
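
The core Rasch formula itself is simple enough to state in a few lines. This sketch shows the standard dichotomous form, where the probability of success depends only on the gap between a person’s ability and an item’s difficulty (both on a logit scale):

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: the chance that a person of the given
    ability succeeds on an item of the given difficulty."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# When ability exactly matches difficulty, the model gives 0.5.
print(round(rasch_probability(1.0, 1.0), 2))  # 0.5
# Ability above difficulty pushes the probability above 0.5.
print(rasch_probability(2.0, 1.0) > 0.5)  # True
```

Extensions of the same form (e.g. many-facet Rasch models) add a term for rater severity, which is what allows harsh or lenient assessors to be identified.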

Pollitt’s Adaptive Comparative Judgement (ACJ) involves numerous comparisons of pairs of essays / portfolios to determine which is better. This sounds like norm referencing, as essays / portfolios are compared to each other rather than to a standard. In fact, Pollitt doesn’t provide sufficient information to enable a clear understanding of what he is proposing. Are the assessors evaluating each essay against a rubric / standard and then choosing the better of the two? Or are they dispensing with the standard and simply choosing the better of the two, rather as the Chinese did all those years ago?

Pollitt claims internal reliability of 0.95 for a spread of assessments involving over 350 portfolios. Effectively, assessors are checked and rated for heavy-handed or soft marking. Intra-rater inconsistency is picked up and some papers are highlighted for rechecking. I’m not sure how his approach differs from a FACETS Rasch analysis. This needs to be much clearer.
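
Pollitt doesn’t spell out the mechanics, but a consistency check of the general kind described might be sketched as follows. The judges, decisions and threshold here are entirely hypothetical; the point is simply to flag a judge whose pairwise decisions run against the pool’s consensus.

```python
# Hypothetical judge decisions, each a (winner, loser) pair on the same essays.
decisions = {
    "judge1": [("A", "B"), ("B", "C"), ("A", "C")],
    "judge2": [("A", "B"), ("B", "C"), ("A", "C")],
    "judge3": [("B", "A"), ("C", "B"), ("C", "A")],  # runs against the pool
}

# Consensus: count how often each essay wins across all judges.
wins = {}
for pairs in decisions.values():
    for winner, _ in pairs:
        wins[winner] = wins.get(winner, 0) + 1

def agreement(judge):
    """Share of a judge's decisions where the chosen winner also has
    more consensus wins than the loser."""
    pairs = decisions[judge]
    agree = sum(1 for w, l in pairs if wins.get(w, 0) > wins.get(l, 0))
    return agree / len(pairs)

# Flag judges who agree with the consensus less than half the time.
flagged = [j for j in decisions if agreement(j) < 0.5]
print(flagged)  # ['judge3']
```

A proper analysis (FACETS or Pollitt’s own) models misfit statistically rather than with a raw cut-off, but the underlying question is the same: which judges are out of step with the pool?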

He claims that ‘valid assessment requires the design of good test activities and good evaluation of students’ responses using good criteria’ (p. 166). The former (‘good test activities’) and the latter (‘good criteria’) are clearly issues of validity. However, his central idea, ‘good evaluation of students’ responses’, seems to be more a reliability issue.

He also says, ‘since CJ [comparative judgement] requires holistic assessment, a very general set of criteria is needed’ (p. 166). However, it is not easy to see why CJ requires holistic assessment or why a very general set of criteria is needed. He later gives a 160-word example that outlines a general set of criteria, but no levels of achievement. He concludes the paragraph with ‘ACJ judges are expected to keep their minds on what it means to be good at Design and Technology: there is no mark scheme to get in the way’ (p. 167). This seems to suggest that the assessors are not awarding grades at all, but simply ranking pairs of essays. Again, his process is not entirely clear.

He states that ‘the method demands and checks that a sufficient consensus exists amongst the pool of judges involved.’ However, this doesn’t seem unique; it would be demanded of most marking sessions involving multiple markers. In a later section he claims high levels of reliability for ACJ, partly as a result of assessors being ‘focused wholly on making valid judgements.’ This needs more clarification.

Pollitt’s reference to the situation in China and in Europe hundreds of years ago is most interesting. However, it seems to be the stuff of a different paper to the information about comparative judgement. His work on comparative judgement needs to be more clearly explained.


Pollitt, A. (2012). Comparative judgement for assessment. International Journal of Technology and Design Education, 22, 157–170.

Posted in plagiarism, professional judgment, reliability, validity

What is Professional Judgement?

In his article on professional judgement and dispositions, Dottin quotes two earlier authors. Both quotes offer shallow, out-of-context ideas about professional judgement (PJ) and do nothing to get to the nub of what it is. After reading Dottin’s paper, basic questions remain. What is PJ, and how academically grounded or experienced does a practitioner need to be to have adequate PJ? Are there different levels of PJ proficiency?

Other writers have pointed out that there is inevitably a degree of subjectivity when marking essays and that assessors necessarily call on their own PJ, but nowhere does this quality seem to be defined.

My modest investigations into the area suggest that there are huge discrepancies in the marks allocated to student papers when decisions are left to PJ. At the very least there needs to be moderation to bring the PJs closer together. Even then there are problems. Clearly, when marking essays, assessors are attempting to measure something that is very difficult to measure, and precision cannot be expected. The aim must be to minimize inter- and intra-assessor discrepancies while maintaining levels of validity.
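
A first-pass moderation process of the sort suggested here can be sketched very simply: collect each assessor’s mark for a paper and flag for re-marking any paper whose spread of marks exceeds a tolerance. The marks and the threshold below are hypothetical.

```python
# Hypothetical marks from three assessors on the same papers (out of 100).
marks = {
    "paper1": [72, 74, 71],
    "paper2": [55, 78, 60],   # wide spread -> send to moderation
    "paper3": [88, 85, 90],
}

THRESHOLD = 10  # maximum tolerable range before a paper is re-marked

needs_moderation = [
    paper for paper, scores in marks.items()
    if max(scores) - min(scores) > THRESHOLD
]
print(needs_moderation)  # ['paper2']
```

This only narrows inter-assessor discrepancies for the flagged papers; it says nothing about intra-assessor drift over a session, which needs a separate check.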


Dottin, E. S. (2009). Professional judgment and dispositions in teacher education. Teaching and Teacher Education, 25, 83–88.

Posted in professional judgment

Maintaining standards in assessment

AUTC is the Australian Universities Teaching Committee. In 2002 it issued several papers aimed at helping universities maintain standards in assessment and respond to new issues.

Paper One – A new era in assessing student learning
Two new issues the paper canvasses are plagiarism and the technological possibilities in assessment. Plagiarism is seen as a developing threat (especially from online sources) and one which can damage confidence in assessment and in academic standards. The paper advocates effective education campaigns and the use of detection methods. Ten years later, the plagiarism issue, which used to centre largely on copying material from the net, has grown to include paid online tutors (or paper mills). This presents a new challenge to assessors.

When discussing emergent technological possibilities, AUTC points to the need to ensure that the full range of higher-order skills is considered. Today technological methods are used in some contexts and have even been used to machine-mark essays. This seems to be a case of pushing the technology well beyond the point where it can claim much validity at all.

A third issue is raised particularly in the context of overseas students in Australia, but it is also applicable elsewhere. Some students have ‘little prior exposure to the unwritten rules and conventions of higher education.’ This is being addressed to some extent by the advent of writing courses in some universities that help students understand the basic premises of referencing, academic writing and the avoidance of plagiarism.

Paper six – Quality and Standards
While Australian universities generally have clear statements of learning outcomes (standards), levels of achievement are less well defined. Grading is frequently norm referenced.

The paper implies that there is a problem with norm-referencing, and I have to agree. Students should be subject to the same standards at different times. If a student’s work merits a C in a current marking session, then it should merit a C in other marking sessions too. However, with norm-referencing, if there is a weak cohort, the C-paper may be elevated to an A.
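
The point is easy to demonstrate. In this hypothetical sketch, the same raw score earns a B against a fixed standard but an A under norm-referencing, because the cohort happens to be weak:

```python
# A weak cohort of hypothetical raw scores (out of 100).
scores = [48, 52, 55, 57, 60, 61, 63, 65, 66, 70]

def criterion_grade(score):
    """Criterion-referenced: fixed standards (80+ A, 70+ B, 60+ C)."""
    if score >= 80:
        return "A"
    if score >= 70:
        return "B"
    if score >= 60:
        return "C"
    return "F"

def norm_grade(score, cohort, top_share=0.2):
    """Norm-referenced: the top 20% of the cohort receives an A,
    whatever the raw score; others fall back to the fixed standards."""
    cutoff = sorted(cohort, reverse=True)[int(len(cohort) * top_share) - 1]
    return "A" if score >= cutoff else criterion_grade(score)

best = max(scores)  # 70: a B against the fixed standard...
print(criterion_grade(best), norm_grade(best, scores))  # B A
```

Against a strong cohort the drift runs the other way, and genuinely A-standard work can be pushed down; either way the grade stops being comparable across sittings.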

‘A degree of subjectivity is inevitable. But this subjectivity must be informed by experienced, professional judgment’ (2002). Everything seems to keep coming back to this idea of professional judgment, and the big question remains: what is professional judgment? Why is it that one assessor can mark an essay at an A level while another marks it at a B or a C? This happens even amongst experienced assessors. So we have large degrees of subjectivity and poor levels of reliability.


Australian Universities Teaching Committee. (2002). A new era in assessing student learning.

Australian Universities Teaching Committee. (2002). Quality and standards.

Posted in plagiarism, professional judgment, standards, technology

Fundamental Assessment Principles …

As his title suggests, McMillan sets out to elucidate the basic principles of assessment. He highlights the role of professional judgment: ‘whether the judgment occurs in constructing test questions, scoring essays, creating rubrics, grading participation, combining scores, or interpreting standardized test scores, the essence of the process is making professional interpretations and decisions’ (p. 2). It is often assumed that teachers are professionals and thus can write and mark tests. Indeed they can, but the degree of validity and reliability may be severely compromised if teachers aren’t aware of the possible hazards.

The role of professional judgment is a fascinating one. Consider just three areas: test questions, rubrics and the weighting of the various components. There is often limited scrutiny of test questions. At times questions may not closely approximate what the students have learnt, or they may target a trifling component of it. Rubrics are a problem too: they may not optimally represent what should be measured, and again there may be little scrutiny of how a particular rubric was created. Another issue is the weighting allocated to individual questions or sections of a test paper; some components are often weighted more heavily than others, and this may not optimally represent the importance of the various components of a course. In these cases teachers are doing their best, but may not have sufficient knowledge to be making the judgments they make.
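
The weighting issue in particular is easy to make concrete: the same section scores yield noticeably different final marks under different weighting schemes. The scores and weights below are hypothetical.

```python
# Hypothetical section scores (each out of 100) for one student.
sections = {"essay": 80, "multiple_choice": 50, "short_answer": 60}

def weighted_total(scores, weights):
    """Weighted average; the weights are assumed to sum to 1."""
    return sum(scores[s] * w for s, w in weights.items())

essay_heavy = {"essay": 0.6, "multiple_choice": 0.2, "short_answer": 0.2}
equal = {"essay": 1 / 3, "multiple_choice": 1 / 3, "short_answer": 1 / 3}

# The weighting scheme, not the student's work, moves the final mark.
print(weighted_total(sections, essay_heavy))      # 70.0
print(round(weighted_total(sections, equal), 1))  # 63.3
```

Whether either number fairly represents the course depends entirely on a judgment call about the weights, which is exactly McMillan’s point.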

McMillan talks about the tensions involved in assessment. In particular he highlights the opposition between constructed-response assessments and selected-response assessments. Classic examples of this opposition would be a written essay and multiple choice questions. Clearly the latter is easier to mark and can be marked more consistently while the former is more difficult to mark and consistency will vary. The former is more likely to be able to lay claims to validity while the latter is more able to lay claims to reliability. So which approach should a test-writer opt for?

Perhaps the most interesting point he raises is one which is very easy to understand but isn’t talked about much: ‘assessment contains error’ (p. 3). Indeed it does. If we have anything less than 100 per cent validity and 100 per cent reliability, then we have error. No test measures up to 100 per cent in both areas, therefore we have error. McMillan notes, ‘typically, error is underestimated’ (p. 3).
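
Classical test theory even puts a number on this. The standard error of measurement, SEM = SD × √(1 − reliability), is zero only for a perfectly reliable test; anything less, and every reported score carries a band of error. The SD and reliability below are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability).
    Only a perfectly reliable test (reliability = 1) has zero error."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical test: score SD of 10 marks, reliability 0.91.
sem = standard_error_of_measurement(10.0, 0.91)
print(round(sem, 1))  # 3.0 -> a reported 65 plausibly sits around 62-68
```

Even a quite respectable reliability of 0.91 leaves a band of several marks either side of a score, which is exactly the error McMillan says is typically underestimated.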

Another principle he covers is that ‘good assessments use multiple methods.’ So in the case of the tension between constructed-response assessments and selected-response assessments a fairer approach might be to include a bit of each.

Finally back in 2000 McMillan was warning that despite the upsides there is potential danger in the use of technology in assessment. In particular he refers to companies developing tests with insufficient evidence of reliability and validity and with little thought to weighting, error and averaging.


McMillan, J. H. (2000). Fundamental assessment principles for teachers and school administrators. Practical Assessment, Research & Evaluation, 7(8), 2–8.

Posted in professional judgment, reliability, validity, writing assessment issues

Spray of ideas from journal article interview with Brian Huot

Huot talks about machine scoring of assessment as the culmination of the pursuit of reliability. And he is right on that one. The machine will spit out the same results no matter when the test is taken or by whom. One hundred per cent reliability. Nice.

He notes that validity and reliability are in tension with one another. Strong reliability comes with low validity and strong validity comes with low reliability. That’s an intriguing notion. I can’t believe that it is uniformly correct. I can see in the case of machine scoring that the reliability is high and the validity is clearly suspect because a test written for a machine to score can only measure what the machine is capable of measuring (range of vocab, number of words, grammatical accuracy [to some extent], number of paragraphs), but it can’t measure the strength of ideas and logic or the relevance to the task.
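
That limitation is easy to demonstrate. The toy scorer below uses only the kinds of surface features listed above (word count, vocabulary range, paragraphing), with hypothetical weights and deliberately no model of meaning, so it cannot distinguish a coherent essay from the same words scrambled:

```python
# A toy surface-feature essay scorer: it counts words, vocabulary range
# and paragraphs, but has no access to the strength or relevance of ideas.
def surface_score(essay):
    words = essay.split()
    features = {
        "word_count": len(words),
        "unique_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
        "paragraphs": essay.count("\n\n") + 1,
    }
    # Arbitrary illustrative weights; the point is what is NOT measured.
    return (
        min(features["word_count"] / 50, 1.0) * 40
        + features["unique_ratio"] * 40
        + min(features["paragraphs"] / 3, 1.0) * 20
    )

coherent = "The model is sound.\n\nIt rests on clear evidence.\n\nIt concludes well."
shuffled = "well concludes It.\n\nevidence clear on rests It.\n\nsound is model The."

# Identical surface features, identical score: the scorer can't tell them apart.
print(surface_score(coherent) == surface_score(shuffled))  # True
```

Real systems such as E-rater use far richer feature sets, but the validity question stands: whatever the features, the machine scores what it can measure, not necessarily what matters.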

This raises the question of how you can optimize both reliability and validity, and whether that is in fact possible. An intriguing notion, and one worth pursuing.

Speaking in an American context, Huot talks about funding issues for writing programs. He notes that student numbers are growing while funding is falling, even as funding in other areas of universities grows. Writing programs are the poor siblings of the more valued mainstream programs. This is a problem for Language Centres in many universities: they are underfunded, their staff aren’t considered academic staff, and pay scales and working conditions reflect this.

Another issue to come up in Huot’s interview is portfolios. He says ‘portfolios are the only way for [me] to teach writing’ (p. 104). In my view portfolios are a breeding ground for plagiarism, but I take Huot’s point that time-limited writing tests are not tests of writing ability so much as tests of writing ability in a limited time – which may be an unrealistic representation of what a student is capable of. So this is another area that is problematic.


Bowman, M., Mahon, W., & Pogell, S. (2004). Assessment as opportunity: A conversation with Brian Huot. Issues in Writing, 14(2), 94–114.

Posted in portfolios, reliability, validity

Machine Scoring – Seriously?

Condon (2011) reviews the various phases of writing assessment over the last century. The developments of the last fifteen or twenty years are particularly interesting. Several important issues are raised including portfolios, machine assessment and the behemoth tests. The relationship between writing theory and assessment theory is discussed as well as the notion of engaging students in the writing process.

Portfolios, first popularized in the 1970s, saw a boom in the 1990s and are still popular today. They are favoured as reflecting authentic writing. However, many advocates of portfolios seem to turn a blind eye to a major problem: the rampant plagiarism that comes with them. Despite all attempts to curb it, plagiarism is a growing issue in tertiary education, and portfolios are a perfect breeding ground.

Condon expresses dismay at the advent of machine scoring, briefly mentioning COMPASS, E-rater and Criterion. A quick check of their websites reveals that all three state that the essay writing score is received immediately after the test is taken (in one case via ScoreitNow!), thus confirming the machine-marking aspect. Machine scoring can likely claim high levels of reliability (the ability to replicate scores at different marking sessions), but questionable levels of validity (whether what is being tested is what should be tested). One can only agree with Condon’s point of view and ask whether we can seriously leave the assessment of a communicative skill to a machine.

The large-scale testing business is another controversial issue in assessment, seen by some as having shortcomings, particularly compromised validity, which it is argued results from a lack of authenticity due partly to the use of timed essays. In this context Condon mentions a number of American tests which he refers to as ‘assessment juggernauts.’ In an international context, perhaps we can extrapolate the argument to IELTS.

Condon says, “if writing assessment engages with writing theory, then the assessment practices that emerge will be consistent with the best that has been thought, researched, and written about writing as a construct, as a set of competencies, and as a social practice” (p.173).  This would seem to be a starting point or an enabling factor rather than a clear if/then relationship.

Finally he canvasses the argument that the “writing we assess needs to be meaningful to writers” (p. 177). Ideally this would be the case, but it raises questions about what exactly ‘meaningful’ might mean and what the purpose of teaching Academic Writing is. In the world outside the classroom, employees and entrepreneurs prepare documents and written arguments in a range of contexts. They are not necessarily contexts that capture their personal interests or have particular meaning. What they have is context and purpose. The tertiary system should not be treating students with kid gloves. It should be equipping them with skills for life in the 21st century: research skills, flexibility, initiative, accountability, technological skills, cultural skills and critical thinking, to name a few. Tertiary education needs to engage students, but not necessarily by framing every task to cater to them exactly. Students need to take responsibility for their learning too.

It would seem that many of the systems we have in place now represent different philosophical underpinnings, are flawed, and may never be perfected, but they have merits that make them amenable to constructive refinement or even just tinkering. Others, however, such as machine scoring, seem highly ambitious and of questionable value.


Condon, W. (2011). Reinventing writing assessment: How the conversation is shifting. WPA: Writing Program Administration, 34(2), 162–182.

Posted in writing assessment issues