Thursday, April 7, 2016

Grading Good-faith Gibberish

Les Perelman is one of my heroes for his unflinching exposure, time and time again, of the complete inadequacy of using computers to assess writing.

Perelman and his grad students created BABEL (the Basic Automatic B.S. Essay Language Generator), a program that can generate brilliant gibberish. Diane Ravitch, education historian and activist, took a stab at using BABEL and got, in part, this:

Didactics to subjugation will always be an experience of humankind. Human life will always civilize education; many for diagnoses but a few of the amanuensis. Myrmidon at absurd lies in the search for reality and the realm of reality. From the fact that denationalization excommunicates the denouncements involved of civilizations, humanity should propagate absurd immediately.

This scored a mere 4 out of 6. Apparently Ravitch, as an older American, suffers from being the product of our earlier status quo education system. If only she'd been exposed to the Common Core.

The software that scored her essay is PEG Writing, and the site has some lovely FAQ items, one of which Ravitch highlighted.

It is important to note that although PEG software is extremely reliable in terms of producing scores that are comparable to those awarded by human judges, it can be fooled. Computers, like humans, are not perfect.
PEG presumes “good faith” essays authored by “motivated” writers. A “good faith” essay is one that reflects the writer’s best efforts to respond to the assignment and the prompt without trickery or deceit. A “motivated” writer is one who genuinely wants to do well and for whom the assignment has some consequence (a grade, a factor in admissions or hiring, etc.).
Efforts to “spoof” the system by typing in gibberish, repetitive phrases, or off-topic, illogical prose will produce illogical and essentially meaningless results.

In other words, PEG knows it doesn't work. It also assumes a great deal in assuming that students writing pointless essays on boring subjects for baloney-filled standardized tests are "motivated" writers. Can the software accurately score motivated gibberish? Can the program distinguish between frivolous garbage and well-meant garbage?

Probably not. As noted in PEG's response to the question of how the software can evaluate content:

However, analyzing the content for “correctness” is a much more complex challenge illustrated by the “Columbus Problem.” Consider the sentence, “Columbus navigated his tiny ships to the shores of Santa Maria.” The sentence, of course, is well framed, grammatically sound, and entirely on topic. It is also incorrect. Without a substantial knowledge base specifically aligned to the question, artificial intelligence (AI) technology will fail to grasp the “meaning” behind the prose. Likewise, evaluating “how well” a student has analyzed a problem or synthesized information from an article or other stimulus is currently beyond the capabilities of today’s state of the art automated scoring technologies.

PEG bills itself as a "trusted" teaching assistant that can help relieve some of the time pressures that come from having many, many essays to grade. But I can't trust it, and it's unlikely that I ever will.

This is the flip side of Common Core reading, an approach that assumes that reading is a batch of discrete behaviors and tricks that are unrelated to any content. Here we assume that writing is just a series of tricks, and it doesn't really matter what you're writing about, which is a concept so bizarre that I can barely wrap my head around it. Use big words-- even if they have nothing to do with the topic of the essay. Use varied sentence lengths-- but don't worry about what the sentences say.
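
To make the absurdity concrete, here is a toy sketch of what "scoring by tricks" looks like. This is purely my own illustration, not PEG's actual algorithm; the function name, the features (word length and sentence-length variety), and the weighting are all my assumptions. The point is only that when a formula sees nothing but surface features, BABEL's gibberish beats a plain, true sentence.

# Toy illustration only -- NOT PEG's actual algorithm. The features and
# weights here are invented for the example: it rewards long words and
# varied sentence lengths, and meaning never enters the calculation.
import statistics

def surface_score(essay):
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    words = essay.split()
    if not sentences or not words:
        return 0.0
    avg_word_length = sum(len(w) for w in words) / len(words)     # "use big words"
    variety = statistics.pstdev(len(s.split()) for s in sentences)  # "vary sentence lengths"
    return avg_word_length + variety  # content appears nowhere in this formula

babel_gibberish = ("Didactics to subjugation will always be an experience of humankind. "
                   "Myrmidon at absurd lies in the search for reality and the realm of reality.")
plain_and_true = "Columbus sailed three small ships. The Santa Maria was one of them."

print(surface_score(babel_gibberish))  # about 7.0 -- the gibberish wins
print(surface_score(plain_and_true))   # about 5.7 -- true, clear, and penalized

Swap in whatever weights you like; as long as the formula only counts word lengths and sentence statistics, well-constructed nonsense will outscore an honest, plain-spoken answer.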

PEG, like other similar services, offers as proof of its reliability its closeness to human-rendered scores. But that happens only because the human-rendered scores come from a rubric designed to resemble the algorithm of a computer, not the evaluative processes of a human writing teacher. In other words, you make the computer look good by dumbing down the humans used for comparison.

Pearson's continued fascination with AI-directed education, as well as the news that PARCC will use computer essay grading in four of its six states-- these are Bad News, because computer software is simply not up to the job of evaluating writing in any meaningful way. BABEL is just one more demonstration of how completely inadequate the software tools are.

P.S. My favorite line from my own BABEL efforts:

Charter, frequently to an accusation, might innumerably be grout for the allocution.

4 comments:

  1. "It also assumes a great deal in assuming that students writing pointless essays on boring subjects for baloney-filled standardized tests are "motivated" writers"

    This idea needs to be repeated over and over again until it begins to sink into the thick skulls of ed. policy makers.

    The only people involved in the assessment that can produce results that show accurately the best of their knowledge (or at least their performance on the assessment item(s)) are the students. They are also the only people involved who have absolutely no stake in producing such results. This is such an obvious source of nonsensical time-wasting, it's surprising to me that everyone can't immediately see it.

    I wish a bunch of pro-testing policy wonks could have spent the last hour and a half in my homeroom this morning watching my 10th graders do their two 30-minute online test sessions in about 7 minutes and sleep the rest of the time.

    1. "Nonsensical time-wasting." That's my big problem with all the testing. One or two tests that took a day would be fine. But no. My children must take many tests over the year, and the end of grade tests involve weeks of preparation, a week of testing, and several weeks when the teachers tutor the students who failed the first time. My children weren't even allowed to read after finishing the three hour exams in one hour. Absurd.

  2. Dave is right. The only reason for any student to do excellent work on such tests is ... well, I can't really think of one. My understanding is that enormous numbers of high school students don't take them seriously at all. Or, as you say,

    "Charter, frequently to an accusation, might innumerably be grout for the allocution."

  3. Speaking as a foreign-language teacher... Babel, 15 yrs or so ago, was our only translation program. Its results were often gibberish & often rightly pilloried as such. Today we have many translation aids, but seldom does one do the job. You can enter a phrase into Google & ask for it 'in Spanish.' You'll get something, but you can't stop there, for it might be a literal [gibberish] translation. So you try it at linguee.sp, which will offer many alternatives. Then you try the alternative that seems right in wordreference.com, which will offer many suggestions from various Sp-speaking countries. You pick what best fits, & try it out on a general Google search, which will tell you whether it's common usage.

    Hopefully, from this example of researching a foreign-language phrase, you can see that a simple Babel program for scoring English essays could fall far short of the mark.
