Friday, May 9, 2025

AI Is Bad At Grading Essays (Chapter #412,277)

A new study shows results that will be absolutely unsurprising to anyone who has been paying attention. ChatGPT is not good at grading essays.

A good robograder has been the white whale of the ed tech industry for a long time now, and the industry has failed to land one with impressive consistency. Scholar Les Perelman has poked holes in countless robo-grading products, and I've been writing about the industry since I began this blog. And this comment from the Musings of a Passing Stranger blog in 2011 is still applicable:
What Pearson and its competitors do in the area of essay scoring is not a science. It's not even an art. It's a brutal reduction of thought to numbers. The principles of industrial production that gave us hot dogs now give us essay scores.

The main hurdles to computerized grading have not changed. Reducing essay characteristics to a score is difficult for a human, but a computer does not read or comprehend the essay in any usual understanding of those words. Everything the software does involves proxies for actual qualities of actual writing. This paper from 2013 still applies-- robograders still stink.

Perelman and his team were particularly adept at demonstrating this with BABEL (the Basic Automatic B.S. Essay Language Generator), a program that could generate convincing piles of nonsense to which robograders consistently gave high scores. Sadly, it appears that BABEL is no longer online, but I've taken it out for a spin myself a few times-- the results always make robograders look incompetent (see here, here, here, and here).

The study of bad essay grading is deep. We have some classic studies of the bad formula essay. Paul Roberts' "How To Say Nothing in 500 Words" should be required reading in all ed programs. Way back in 2007, Inside Higher Ed ran this article about how an essay that included, among other beauties, a reference to President Franklin Denelor Roosevelt was an SAT writing test winner. And I didn't find a link to the article, but in 2007 writing instructor Andy Jones took a recommendation letter, replaced every "the" with "chimpanzee," and scored a 6 out of 6 from the Criterion essay-scoring software at ETS. You can read the actual essay here. And as the classic piece from Jesse Lussenhop explains, part of robograding's problem is that it has adopted the failed procedures of grading-by-human-temps.

Like self-driving cars, robograding has been just around the corner for years. If you want to dive into my coverage here at the Institute, see here, here, here, here, here and here for starters. Bill Gates was predicting it two years ago, and just last year, an attempt was made to get ChatGPT involved which was not exactly successful and not at all cheap. Which is bad news, because the "problem" that robograding is supposed to solve is the problem of having to hire humans to do the job. Test manufacturers have been trying to solve that problem for years (hence the practice of using undertrained minimum wage temps as essay graders).

That brings us up to the recent attempt by The Learning Agency. TLA is an outfit pushing "innovation." It (along with the Learning Agency Lab) was founded by Ulrich Boser in 2017, and they partner with the Gates Foundation, Schmidt Futures, Georgia State University, and the Center for American Progress, where Boser is a senior fellow. He has also been an advisor to the Gates Foundation, Hillary Clinton's presidential campaign, and the Charles Butt Foundation--so a fine list of reform-minded left-leaning outfits. Their team includes former government wonks, non-profit managers, comms people, and a couple of Teach for America types. The Lab is more of the same; there are more "data scientists" in this outfit than actual teachers.

TLA is not new to the search for better robograding. The Lab was involved in a competition, jointly sponsored by Georgia State University, called The Feedback Prize. It was a coding competition run through Kaggle, in which competitors were asked to root through a database of just under 26K student argumentative essays that had been previously scored by "experts" as part of state standardized assessments between 2010 and 2020 (which raises a whole other set of issues, but let's skip that for now). The goal was to have your algorithm come close to the human scoring results, and the whole thing is highly technical.
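For the non-Kaggle crowd, "come close to the human scoring results" generally means measuring how well machine scores agree with the human scores. I don't know the competition's exact metric, but a common choice for this kind of task is quadratic weighted kappa; here's a minimal sketch using scikit-learn, with invented scores rather than competition data:

```python
# Toy illustration only -- invented scores, not competition data or its official metric.
# Question: how closely do machine-assigned scores track the human "expert" scores?
from sklearn.metrics import cohen_kappa_score

human_scores   = [4, 3, 5, 2, 4, 3, 1, 5]   # hypothetical human ratings on a 1-6 scale
machine_scores = [4, 4, 4, 3, 4, 3, 2, 4]   # hypothetical algorithm output

# Quadratic weighting penalizes big disagreements more than near-misses.
kappa = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance
```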

Now TLA has dug through the data again, to produce "Identifying Limitations and Bias in ChatGPT Essay Scores: Insights from Benchmark Data." They grabbed their dataset of roughly 24,000 argumentative essays and let ChatGPT score them so they could check for some issues.
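For context, "let ChatGPT score them" means feeding each essay to the model with a scoring prompt and recording the number it spits back. The study's actual prompt, model version, and rubric aren't reproduced here; this is just a minimal sketch of the kind of call involved, using the OpenAI Python client with placeholder values:

```python
# Minimal sketch -- the prompt, model name, and 1-6 rubric here are placeholders,
# not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_essay(essay_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are an essay rater. Score this argumentative essay "
                        "from 1 (lowest) to 6 (highest). Reply with the number only."},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content.strip()

print(score_essay("Schools should start later in the day because ..."))
```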

Does ChatGPT show bias? A study just last year said yes, it does, which is always a (marketing) problem because tech is always sold with the idea that a machine is perfectly objective and not just, you know, filled with the biases of its programmers. 

This particular study found bias that it deemed lacking in "practical significance," except when it didn't--specifically, in the gap between Asian/Pacific Islander and Black students, which underlines how Black students come out last in the robograding.
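"Practical significance" here is the effect-size question: is the gap between groups big enough to matter in practice, not just statistically detectable? A toy sketch of that kind of check, with invented numbers and Cohen's d standing in for whatever measure the study actually used:

```python
# Invented scores, purely to illustrate an effect-size check on a group gap.
# Cohen's d is one common yardstick; the study's actual measure may differ.
from statistics import mean, stdev

def cohens_d(a, b):
    pooled_sd = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

group_a = [4, 5, 4, 3, 5, 4]  # hypothetical scores for one student group
group_b = [3, 3, 4, 2, 3, 3]  # hypothetical scores for another

print(f"Cohen's d: {cohens_d(group_a, group_b):.2f}")  # ~0.2 small, ~0.5 medium, ~0.8 large
```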

So yes, there's bias. But the other result is that ChatGPT just isn't very good at the job. At all. There's more statistical argle bargle here, but the bottom line is that ChatGPT gives pretty much everyone a gentleman's C. To ChatGPT, nobody is excellent and nobody is terrible, which makes perfect sense because ChatGPT is not qualified to determine anything except whether the string of words that the writer has created is, when compared to a million other strings of words, probable. ChatGPT cannot tell whether the writer has expressed a piercing insight, a common cliche, or a boneheaded error. ChatGPT does not read, does not understand.
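That "gentleman's C" pattern shows up in the numbers as compressed spread: a grader that hands out mostly middle scores has a much narrower distribution than the humans it is supposed to replace. A toy illustration of what that looks like (invented numbers, not the study's data):

```python
# Invented numbers, purely to illustrate the "everyone gets a C" pattern:
# compressed machine scores show up as a much smaller spread than human scores.
from statistics import mean, stdev
from collections import Counter

human   = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]   # hypothetical human scores using the full range
chatgpt = [3, 3, 3, 4, 3, 4, 3, 4, 4, 3]   # hypothetical model scores bunched in the middle

for name, scores in [("human", human), ("chatgpt", chatgpt)]:
    print(f"{name:8s} mean={mean(scores):.1f}  stdev={stdev(scores):.2f}  "
          f"counts={dict(sorted(Counter(scores).items()))}")
```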

Using ChatGPT to grade student essays is educational malpractice. It is using a yardstick to measure the weight of an elephant. It cannot do the job.

TLA ignores one other question, a question studiously ignored by everyone in the robograding world-- how is student performance affected when they know that their essay will not be read by an actual human being? How does one write like a real human being when your audience is mindless software? What will a student do when schools break the fundamental deal of writing--that it is an attempt to communicate an idea from the mind of one human to the mind of another?

This is one of the lasting toxic remnants of the modern reform movement--an emphasis on "output" and "product" that ignores input, process, and the fact that there are many ways to get a product-- particularly if that's all the people in charge care about. 

"The computer has read your essay" is a lie. ChatGPT can scan your output as data (not as writing) and compare it to the larger data set (also not writing any more) and see if it lines up. Your best bet as a student is to aim for the same kind of slop that ChatGPT churns out thoughtlessly.

Add ChatGPT to the list of algorithmic software that can only do poorly a job it should not be asked to do at all. 

2 comments:

  1. Link to Paul Roberts' "How To Say Nothing in 500 Words" goes to hoax.com. (The original link was to https://mrgunnar.net/ which appears to be for sale.)

    Great article, as always. I regularly appreciate your posts, but rarely say "Thanks".

    1. Thanks, and thanks for the alert. I found another copy and have updated the link.
