Monday, March 19, 2018

OH: Computers Are Grading Essays

No sooner had I vigorously mocked the idea of using computers to grade essays than this came across my desk:

CLEVELAND, Ohio - Computers are grading your child's state tests.

No, not just all those fill-in-the-bubble multiple choice questions. The longer answers and essays too.

According to State Superintendent Paolo DeMaria and state testing official Brian Roget (because "state testing official" is now a job-- that's where we are now), about 75% of Ohio's BS Tests are being fully graded by computers.

This is a dumb idea.

"The motivation is to be as effective and efficient and accurate in grading all these things," DeMaria told the board. He said that advances in AI are making this new process more consistent and fair - along with saving time and, in the long run, money.

If you think writing can be graded effectively and efficiently and accurately by a computer, then you don't know much about assessing writing. The saving money part is the only honest part of this.

But all the kids are doing it, Mom. American Institutes for Research (AIR-- which is not a research institute at all, but a test manufacturer) is doing it in Ohio, but Pearson and McGraw-Hill and ETS are all doing it, too, so you know it's cool.

DeMaria said that the research is really "compelling," which is another word for "not actually proving anything," and he also claims that even college professors are using Artificial Intelligence to grade papers. He does not share which colleges, exactly, are harboring these titans of educational malpractice. Would be interesting to know. Meanwhile, Les Perelman at MIT has made a whole second career out of repeatedly demonstrating that these essay grading computers are incompetent boobs.

The shift from human scorers is usually a little controversial, which may be why Ohio just didn't tell anyone it was happening. It came to light only after, the article notes wryly, "irregularities" were noticed in grades. Oddly enough, that constitutes a decent blind test of the software-- folks could tell it was doing something wrong even when they didn't know that software was doing the grading.

Some Ohio board members think the shift is just fine, though one picked an unfortunate example:

"As a society, we're on the cusp of self-driving vehicles and we're arguing about whether or not AI can grade a third grade test?" asked recently-appointed board member James Shephard. "I think there just needs to be some perspective here."

I feel certain that as Shephard spoke, he was unaware that a self-driving vehicle just killed a pedestrian in Arizona.

The actual hiccup that called attention to the shift from meat widget grading was a large number of third grade reading tests that came back with a score of zero. That was apparently because the students quoted too much of the passage they were responding to, even though they are supposed to cite specific evidence from the text. It's the kind of thing that a live human could probably figure out, but since computer software does not actually understand what it is "reading"-- well, zeros. On a test that will determine whether or not the student can advance to fourth grade (because Ohio has that stupid rule, too).
I don't understand a word you just said, but you fail!

The state has offered some direction (30% is the tipping point for how much must be "original") so that now we have the opening shot in what is sure to be a long volley of rules entitled "How to write essays that don't displease the computer." Surely an admirable pedagogical goal for any writing program.

The state reported that of the thousand tests submitted for checking, only one was rescored. This fits with a standard defense of computer grading-- "When we have humans score the essays, the scores are pretty much the same as the computer's." This defense does not move me, because the humans have their hands and brains tied, strapped to the same algorithm that the computer uses. Of course a human gets the same score, if you force that human to approach the essay just as stupidly as the computer does. And computers are stupid-- they will do exactly as they're told, never understanding a single word of their instructions.

The humans-do-it-too defense of computer grading ignores another problem of this system-- audience. Perhaps on the first go round you'll get authentic writing that's an actual measure of something real. But what we already know from stupid human scoring of BS Tests is that teachers and students will adapt their writing to fit the algorithm. Blathering on and on redundantly and repetitiously may be bad writing any other time, but when it comes to tests, filling up the page pleases the algorithm. The algorithm also likes big words, so use those (it does not matter if you use them correctly or not). These may seem like dumb examples, but my own school has had success gaming the system with these rules and rules like them.

And this is worse. I've heard legitimate arguments from teachers who say the computer's ability to sift through superficial details can be one part of a larger, meat-widget based evaluation system, and I can almost buy that, but that's not what Ohio is doing-- they are handing the whole evaluation over to the software.

What do you suppose will happen when students realize that the computer will not care if they illustrate a point by referring to John F. Kennedy's noble actions to save the Kaiser during the Civil War? What do you suppose will happen when students realize that they are literally writing for no human audience at all? How will they write for an algorithm that can only analyze the most superficial aspects of their writing, with no concern or even ability to understand what they are actually saying?

This is like preparing a school band to perform and then having them play for an empty auditorium. It's like having an artist do her best painting and then hanging it in a closet. Even worse, actually-- this is like having those endeavors judged on how shiny they are, still unseen and unheard by human eyes and ears.

Ohio was offered a choice between doing something cheap and doing something right, and they went with cheap. This is not okay. Shame on you, Ohio.


  1. This is one of those rare pieces that elicit my "Yes!" because it's been distressing me for decades. Thank you for writing it!

  2. Don't want computers grading writing? Don't want irrelevant tests graded by computers that spit out irrelevant scores that supposedly reflect educator skill?
    The 1st ELA standard requires students to 'cite specific textual evidence when writing or speaking to support conclusions drawn from the text.' The 1st standard is the kickstand of the ELA assessments. Don't agree with it? Stop tolerating, working with, begrudgingly accepting or outright supporting the Common Core Standards, or the Next Generation Learning Standards, whatever you call them. STOP.

    1. The Common Core standards are pretty good. They do not tolerate testing, much less computerized grading. Testing can be applied to them only by ignoring the standards themselves, which educators at the top seem to be pretty good at.

  3. NH's state assessments, too, beginning this year.

    "But it'll save us so much money!"

    1. Getting rid of the vast majority of this testing would save even more money. Wouldn't that be nice.

    2. Yep. The other big selling point, at least in my district, was that we'll have results back much sooner. They won't mean anything but we'll have them in two weeks!

  4. I've been in the testing room, grading kids' essays for Pearson. This was 2006. Mind-bendingly numb experience. You don't want the humans grading these tests, either; at least, not for $12/hr.