Thursday, January 28, 2021

Selling Roboscoring: How's That Going, Anyway?

The quest continues--how best to market the notion of having student essays scored by software instead of actual humans. It's a big, bold dream, a dream of a world in which test manufacturers don't have to hire pesky, expensive meat widgets and the fuzzy unscientific world of writing can be reduced to hard numbers--numbers that we know are objective and true because, hey, they came from a computer. The problem, as I have noted many times elsewhere, is that after all these years, the software doesn't actually work.

But the dream doesn't die. Here's a paper from Mark D. Shermis (University of Houston--Clear Lake) and Susan Lottridge (AIR) presented at a National Council on Measurement in Education conference in Toronto, courtesy of AIR Assessment (one more company that deals in robo-scoring). The paper is two years old, but it's worth a look because it shows the reasoning (or lack thereof) used by the folks who just can't let go of the robograding dream.

"Communicating to the Public About Machine Scoring: What Works, What Doesn't" is all about managing the PR when you implement roboscoring. Let's take a look.

Warming Up

First, let's lay out the objections that people raise, categorized by a 2003 paper as humanistic, defensive and construct.

The humanistic objection stipulates that writing is a unique human skill and cannot be evaluated by machine scoring algorithms. Defensive objections deal with concerns about “bad faith” or off-topic essays and scoring algorithm vulnerabilities to them. The construct argument suggests that what the human rater is evaluating is substantially different than what machine scoring algorithms used to predict scores for the text.

Well, sort of. The defensive objection is misstated; it's not just that robograding is "vulnerable" to bad-faith or off-topic essays, but that those vulnerabilities show that the software is bad at its job.

The paper looks at six case studies, three in which the implementation went well and three in which the implementation "was blocked or substantially hindered." Which is a neat rhetorical trick-- note that the three failed cases are not "cases where implementers screwed it up" but cases where those nasty obstructionists got in our way. Long-time ed reform watchers will recognize this--it's always the implementation or the opponents or "entrenched bureaucracies" but never, ever, "we did a lousy job."

Let's look at our six (or so) cases.

West Virginia

West Virginia has been at this a while, having first implemented roboscoring in 2005. They took a brief break from 2015-2017, then signed up with AIR. Their success is attributed to continuing professional development and connecting "formative writing assignments to summative writing assignments" via a program called Writing Roadmap that combined an electronic portfolio and the same robograder used for the Big Test.

In other words, they developed a test prep system linked directly to the Big Test, and they trained teachers to be part of the teach-to-the-test system. This arrangement allows for a nice, circular piece of pedagogical tautology. Research shows that students who go through the test prep system get better scores on the test. It looks like something is happening, but imagine that the final test score was based on how dependably students put the adjective "angry" in front of the noun "badger." You could spend years teaching them to always write "angry badger" and then watch their "angry badger" test scores go up, then crow about how effective your prep program is and how well students are now doing--but what have you actually accomplished, and what have they actually learned? You can make your test more complicated than the angry badger test, but it will always be something on that order because--and I cannot stress this enough--computer software cannot "read" in any meaningful sense of the word, so it must always look for patterns and superficial elements of the writing.

When AIR came to town in 2018, they did some teacher PD (1.5 whole days!) which was apparently aimed at increasing teacher confidence in the roboscorers by doing the old "match the computer score to human scores" show, a popular roboscore sales trick that rests on a couple of things. One is training humans to score with the same algorithm the computer uses (instead of vice versa, which is impossible). The other is just the statistical oddity of holistic scoring. If an essay is going to get a score from 1 to 5, the chances that a second score will be equal or adjacent are huge, particularly since essays that are sucky (1) or awesome (5) will be rarer outliers.
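That statistical oddity is easy to demonstrate. Here's a quick back-of-the-envelope simulation (the score distribution is my own assumption, not anything from the paper): two "raters" who assign 1-5 scores independently, without reading anything at all, still land on exact-or-adjacent agreement roughly three-quarters of the time.

```python
import random

random.seed(0)

# Toy simulation: two "raters" independently draw holistic scores 1-5.
# Middle scores are weighted more heavily, since 1s and 5s are rarer outliers.
# These weights are an assumption for illustration, not data from the paper.
scores = [1, 2, 3, 4, 5]
weights = [0.05, 0.25, 0.40, 0.25, 0.05]

def draw():
    return random.choices(scores, weights=weights)[0]

trials = 100_000
# Count pairs where the two "scores" are equal or differ by one.
agree = sum(abs(draw() - draw()) <= 1 for _ in range(trials))
print(f"exact-or-adjacent agreement: {agree / trials:.1%}")
```

No reading, no rubric, no training--just the shape of a 1-5 scale with fat middle scores. Any "look how often the computer matches the humans" demo has to clear that baseline before it proves anything.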

Even so, in the 1.5 day training, most teachers didn't meet the 70% exact agreement rate for scores, "suggesting that the training did not result in industry-standard performance." Industry, indeed.

Louisiana

Louisiana's LEAP implemented robo-scoring as a second reader for the on-line version of the test. The advantages listed include "flagging and recalibrating readers" in case their training didn't quite stick. It can also help with rater drift and rater bias; I have my doubts about its usefulness here, but I will agree that these are real things when you are ploughing through a large pile of essays.

Since 2016 that has been handled by DRC, whose own proprietary robo-scorer is now the primary scorer, with some humans doing "monitoring reads" for a portion of the essays. For a while yet another "engine" was scoring open-ended items on the tests because "increasing student use of the program translated into significant hand-scoring costs." The paper doesn't really get into why Louisiana's program was "successful" or how communication, if any, with any part of the public was accomplished.

Utah

Utah has been robo-scoring since 2008, both "summative and formative," a phrase to watch for, since it usually indicates an embedded test prep program--robo-scoring works better if you train students to comply with the software's algorithm. Utah went robotic for the usual three reasons--save money, consistent scores, faster return of scores.

Utah's transition had its bumps. In particular, the paper notes that students began gaming the system and writing bad-faith essays (one student submitted an entire page of "b"s and got a good score). Students also learned that they could write one good paragraph, then write it four or five times. The solution was to keep "teaching" the program to spot issues, and to implement a confidence rating which allowed the software to say, "You might want to have a human look at this one." There were also huge flaps over robo-scorers finding (or not) large chunks of copied text, which led to changes in filters and more PD for teachers.
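To see why that kind of gaming works, here's a deliberately naive toy scorer of my own invention--not Utah's engine, not any vendor's actual algorithm--that rewards the surface features these systems lean on: length and fancy-word density. Repeating one decent paragraph does exactly what the students discovered.

```python
def naive_score(essay: str) -> int:
    """Score 1-5 from surface features only: length and big-word density.
    A caricature for illustration, not any real engine's formula."""
    words = essay.split()
    length_points = min(len(words) / 50, 3)  # longer looks better, up to a cap
    big_words = sum(1 for w in words if len(w) >= 8)
    vocab_points = min(big_words / max(len(words), 1) * 10, 2)  # fancy vocabulary
    return max(1, round(length_points + vocab_points))

# One competent-sounding paragraph (trailing space so copies split cleanly).
paragraph = ("The plethora of considerations surrounding standardized "
             "assessment necessitates substantial deliberation. ")

print(naive_score(paragraph))       # one copy → 2
print(naive_score(paragraph * 20))  # same paragraph pasted twenty times → 5
```

Same words, zero new ideas, three points higher. A real engine uses fancier features, but the failure mode is the same: it scores the pattern, not the thought.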

Utah has had a history of difficulty with test manufacturers for the Big Standardized Test--since this paper was issued, yet another company crashed and burned and ended up getting fired over their inability to get the job done, resulting in a lawsuit.

Ohio

The authors call Ohio an "interesting" case, and I suppose it is if you are the kind of person who goes to the race track to watch car crashes. Ohio had a "modestly successful" pilot, but didn't brief the State School Board on much of anything before the first year of robo-scoring flunked a way-huger-than-previous number of students--including third graders, which is a problem since Ohio has one of those stupid third grade reading retention rules. Turns out the robo-scorer rejected the time-honored tradition of starting the essay response with a restatement of the prompt. Oopsies. It's almost as if the robo-scoring folks didn't even try to consult with or talk to actual teachers. Ohio has been trying to PR its way out of this.

Alberta

Yes, the Canadian province. They started out in 2014 with LightSide, the only "non-commercial product" in the crowded field, though its "main drawback is that it employs a variety of empirical predictors that do not necessarily parallel traditional writing traits." That tasty little phrase leads to this observation:

This makes it difficult to explain to lay individuals how writing models work and what exactly differentiates one model from another. Most commercial vendors employ NLP routines to tease out characteristics that lay audiences can relate to (e.g., grammar errors), though this information does not necessarily correspond to significantly better prediction models (Shermis, 2018).

So, it doesn't use qualities related to actual writing traits? And these robo-scorers are tweaked to kick out things like grammar errors, not because they're using them to score, but because it's something the civilians can relate to? That... does not make me feel better about robo-scoring.

However, the Alberta Teacher's Federation called on Dr. Les Perelman, my personal hero and a friend of the institute, who came on up and made some pointed observations. (The authors also insist on spelling his name incorrectly.) The authors do not like those observations. They say his comments "reflect a lack of understanding of how large-scale empirical research is conducted," which must have made his years as a professor at MIT pretty tough. They also claim that he fell into the classic correlation-causation fallacy when he pointed out the correlation between essay length and score. The authors reply "The correlation is not with word count...but rather the underlying trait of fluency," and right here I'm going to call bullshit, because no, it isn't. They launch into an explanation of how important fluency is, which is a nice piece of misdirection because, yes, fluency is important in writing, but no, there's no way for a hunk of software to measure it effectively.

The authors also take issue with Perelman's point that the system can be gamed. Their response is basically, yeah, but. "Gaming is something that could theoretically happen in the same way that your car could theoretically blow up." As always, robo-score fans miss the point. If the lock on my front door is easy to pick, yeah, I might go months at a time, maybe years, and never have my house broken into. But if the lock is easy to pick, it's a bad lock. Furthermore, the game-ability is a symptom of the fact that robo-scoring fundamentally changes the task, from writing as a means of communicating an idea to other human beings into, instead, performing for an automated audience that will neither comprehend nor care about what you are doing. (Also, the car analogy is dumb, because it's theoretically almost impossible that my car will blow up.)

Robo-scoring is bad faith reading; why should students feel any moral or ethical need to make a good faith effort at writing? Why? Because a bunch of adults want them to perform so they can be judged? 

At any rate, Alberta has decided maybe not on the robo-grading front.

Australia

Oh, Australia. "Early work on machine scoring was successful in Australia, but the testing agency was outmaneuvered by a Teacher's Federation with an agenda to oppose machine scoring." Specifically, the teachers called in the dastardly Dr. Perelman, whose conclusions were pretty rough, including "It would be extremely foolish and possibly damaging to student learning to institute machine grading of the NAPLAN essay, including dual grading by a machine and a human marker." Perelman also noted that even with the robo-scoring taken out of the picture, the NAPLAN writing assessment is "by far the most absurd and the least valid of any test that I've seen."

The teachers named Perelman a Champion of Public Education. The writers of this paper have, I suspect, entirely different names for him. But in railing against him, they reveal how much they simply don't get.

For example, he suggested that because the essays were scored by machine algorithms students would not have a legitimate audience for which to write, as if students couldn’t imagine a particular audience in a persuasive essay task. There was no empirical evidence that this was a problem. 

[Sound of my hand slapping my forehead.] Is an imaginary audience supposed to be a legitimate one? Is the point here that we just have to get the little buggers to write so we can go ahead and score it? Perhaps, because...

 He rightly suggested that computers could not assess creativity, poetry, or irony, or the artistic use of writing. But again, if he had actually looked at the writing tasks given students on the ACARA prompts (or any standardized writing prompt), they do not ask for these aspects of writing—most are simply communication tasks.

Great jumping horny toads! Mother of God! Has any testocrat ever explained so plainly that what the Test wants is just bad-to-mediocre writing? Yeah, all that artsy fartsy figurative language and voice and, you know, "artistic" stuff--who cares? Just slap down some basic facts to "communicate," because lord knows communication is just simple artless fact-spewing. These are people who should not be allowed within a hundred feet of any writing assessment, robo-scored or otherwise.

And in attempting to once again defend against the "bad faith" critique, the paper cites actual research which, again, misses the point.

Shermis, Burstein, Elliot, Miel, & Foltz (2015) examined the literature on so-called “bad faith” essays and concluded that it is possible for a good writer to create a bad essay that gets a good score, but a bad writer cannot produce such an artifact. That is, an MIT technical writing professor can write a bad essay that gets a good score, but a typical 9th grader does not. The extensiveness of bad faith essays is like voter fraud—there are some people that are convinced it exists in great numbers, but there is little evidence to show for it.

First, I'm not sure how that research finding, even if accurate, absolves robo-grading of its badness (I find the finding hard to believe, actually, but I'm not looking at the actual paper, so I can't judge the research done). The idea that the good scores only go to bad essays by good writers is more than a little weird, as if bad writers can't possibly learn how to game the system (pretty sure it wouldn't take a good writer to write the page of "b"s in the earlier example). Is the argument here that false positives only go to students who deserved positives anyway? 

And again, the notion that it doesn't happen often so it doesn't matter is just dumb. First of all, how do you know it happens rarely? Second, it makes bad faith writing the backbone of the system, and beating it the backbone of bad writing instruction. The point of the writing becomes satisfying the algorithm, rather than expressing a thought. We figured out how to beat PA's system years ago-- write neatly, write a lot (even if it's redundant drivel), throw in some big words (I always like "plethora"). Being able to satisfy an algorithm is not not not NOT the same thing as writing well. 

In the end, the writers dismiss Perelman's critique as a "hack job" that slowed down the advance of bad assessment in the land down under.

Florida

Florida was using both humans and software, but the human scores carried greater weight; if the two disagreed, another human would adjudicate. One might ask what the point of having the robo-scorer is at all, but this is Florida, and one assumes that the goal is to phase the humans out.

The Recommendations

The paper ends with some recommendations about how to implement a robo-scoring plan (it does not, nor does the rest of the paper, offer any arguments about why it's a good idea to do so). In general they suggest starting with humans and phasing in computers. Teacher perception also matters. They offer some phases.

Phase 1. Start with a research study endorsed by a technical advisory committee. So, some practice work, showing how the robo-scorer does with validity (training) papers, as well as seeing how it does with responses that repeat text, copy the pledge, copy themselves, or consist of gibberish, off-topic essays, etc. The state "could involve teachers as appropriate," and that phrase should be the end of it all, because if you are developing an assessment program to assess writing and teachers aren't involved from Day One and every day thereafter as a major voice in the program development, then your program should not see the light of day. Robo-scoring underlines how thoroughly much ed tech silences teachers and replaces them with software designers and tech folks. If an IT guy from your local tech shop stopped in and said, "I would like to take over the teaching of writing in your class," you wouldn't seriously consider the offer for even a second. And yet, that is exactly what these robo-score companies propose.

Phase 2. Design initial scoring plan. Use the research to plan stuff. Because it wouldn't be ed tech if we weren't designing the program based on what the tech can do, and not on what actually needs to be done. Nowhere in the robo-scorer PR world will you find an actual discussion of what constitutes good writing--a thorny topic that professional English teacher humans struggle with, but which robo-scorers seem to assume is settled, and settled in ways that can be measured by a computer, even though software doesn't--and I still can't stress this enough--actually read in any meaningful sense of the word.

Phase 3. Design a communication plan. And only now, in phase three, will we develop a plan for convincing administrators and teachers that this is all a great idea. Use "rationale" and "evidence." There are six steps to this, some of which involve non-existent items like "a description of how essay scoring maps to achievement," but at number six, in the last item on the last list in the third phase, we arrive at "An opportunity and method for teachers to ask questions." Nothing here about what to do if the questions are on the order of, "Would you like to bite me?"

Phase 4. Propose a pilot. Get some school or district to be a guinea pig.

Phase 5. Implementation. If it works in that pilot, deploy it on the state level. With communication, and lots of PD, centered on how to use the scoring algorithm and "improve learning" aka "raise test scores." 

Phase 6. Review and revise. Debrief and get feedback. 


Robo-scoring has all the bad qualities of ed tech combined with all the qualities of bad writing instruction, courtesy of the usual problem of turning edu-amateurs loose to--well, I was going to say "solve a problem" but robo-scoring is really a solution in search of a problem. Or maybe the problem is that some folks can't get comfortable with how squishy writing instruction is by nature. Creating robo-scoring software is as hopeless an endeavor as creating a program to rank all the great writers, or using a computer to set the canon for English literature. The best you can hope for is software that quickly and efficiently applies the biases of the programmer who wrote it, filtered through the tiny little lens of what a computer can actually do. 

And while two years ago they were working on the general PR problem of robo-graders, now ed techies are salivating at the golden opportunities they see in pandemic-fueled distance learning, meaning that these things are running around loose more than ever. And it is, unfortunately, everywhere--useful for basic proofreading and word counting and basically any kind of writing check that doesn't involve actual reading. Here's hoping this weed isn't in your own garden of education.
