Tuesday, March 11, 2014

Essay-Grading Software & Peripatetic Penguins

Education Week has just run an article by Caralee J. Adams announcing (again) the rise of essay-grading software. There are so many things wrong with this that I literally do not know where to begin, so I will use the device of subheadings to create the illusion of order and organization even though I promise none. But before I begin, I just want to mention the image of a plethora of peripatetic penguins using flamethrowers to attack an army of iron-clad gerbils. It's a striking image using big words that I may want later. Also, look at what nice long sentences I worked into this paragraph.

Look! Here's My First Subheading!

Speaking for the software will be Mr. Jeff Pence, who apparently teaches middle school English to 140 students. God bless you, Mr. Pence. He says that grading a set of essays may take him two weeks, and while that seems only a hair slow to me, I would certainly agree that nobody is taking 140 7th grade essays home to read overnight.

But Mr. Pence is fortunate to have the use of Pearson WriteToLearn, a product with the catchy slogan "Grade less. Teach more. Improve scores." Which is certainly a finely tuned set of catchy non sequiturs. Pearson's ad copy further says, "WriteToLearn—our web-based literacy tool—aligns with the Common Core State Standards by placing strong emphasis on the comprehension and analysis of information texts while building reading and writing skills across genres." So you know this is good stuff.

Pearson White Papers Are Cool!

Pearson actually released a white paper, "Pearson's Automated Scoring of Writing, Speaking, and Mathematics," back in May of 2011 (the authors were Lynn Streeter, Jared Bernstein, Peter Foltz, and Donald DeLand-- all PhDs except DeLand).

The paper wears its CCSS love on its sleeve, leading with an assertion that the CCSS "advocate that students be taught 21st century skills, using authentic tasks and assessments." Because what is more authentic than writing for an automated audience? The paper deals with everything from writing samples to constructed-response answers (I skipped the math parts) and in all cases finds the computer better, faster, and cheaper than the humans.

Also, Webinar!

The Pearson website also includes a link to a webinar about formative assessment which heavily emphasizes the role of timely, specific feedback, followed by targeted instruction, in improving student writing. Then we move on to why automated assessment is good for all these things (in this portion we get to hear about the work of Peter Foltz and Jeff Pence, who is apparently Pearson's go-to guy for pitching this stuff). This leads to a demo week in Pence's class to show how this works, and much of this looks usable. Look-- the 6+1 traits are assessed. Specific feedback. Helps.

And we know it works because the students who have used the Pearson software get better scores on the Pearson assessment of writing!! Magical!! Awesome!! We have successfully taught the lab rats how to push down the lever and serve themselves pellets.

Wait! What? Not Miraculous??

"Critics," Adams notes drily, "contend the software doesn't do much more than count words and therefore can't replace human readers." They contend a great deal more, and you can read about their contending at the website humanreaders.org, and God bless the internet, that is a real thing.

"Let's face the realities of automated essay scoring," says the site. "Computers cannot 'read'." They have plenty of research findings and literature to back them up, and they also have a snappy list of one-word reasons that automated assessors are inadequate.

Unlike Pearson, the folks at this website do not have snappy ad copy and slick production values to back them up. They are forced to resort to research and facts and stuff, but their conclusion is pretty clear. Computer grading is indefensible.

There's History

Adams gets into the history. I'm going to summarize.

Computer grading has been around for about forty years, and yet somehow it never quite catches on.

Why do you suppose that is?

That Was A Rhetorical Question

Computer grading of essays is the very enshrinement of Bad Writing Instruction. Like most standardized writing assessment in which humans score the essays based on rubrics so basic and mindless that a computer really could do the same job, this form of assessment teaches students to do an activity that looks like writing, but is not.

Just as reading without comprehension or purpose becomes simply word calling, writing without purpose becomes simply making word marks on a piece of paper or a screen.

Authentic writing is about a writer communicating something he has to say to an audience. It's about sharing something she wants to say with people she wants to say it to. Authentic writing is not writing created for the purpose of being assessed.

If I've told my students once, I've told them a hundred times--good writing starts with the right question. The right question is not "What can I write to satisfy this assignment?" The right question is "What do I want to say about this?"

Computer-assessed writing has no more place in the world of humans than computer-assessed kissing or computer-assessed singing or computer-assessed joke delivery. These are all performance tasks, and they all have one other thing in common-- if you need a computer to help you assess them, you have no business assessing them at all.

And There's The Sucking Thing

Adams wraps up with some quotes from Les Perelman, former director of the MIT Writing Across the Curriculum program. He wrote an awesome must-read take-down of standardized writing for Slate, in which, among other things, he characterized standardized test writing as a test of "the ability to bullshit on demand." He was also an outspoken critic of the SAT essay portion when it first appeared, noting that length, big wordiness, and a disregard for factual accuracy were the only requirements. And if you have any illusions about the world of human test essay scoring, reread this classic peek inside the industry.

His point about computer-assessed writing is simple. "My main concern is that it doesn't work." Perelman is the guy who coached two students to submit an absolutely execrable essay to the SAT. The essay included gem sentences such as:

American president Franklin Delenor Roosevelt advocated for civil unity despite the communist threat of success by quoting, "the only thing we need to fear is itself," which disdained competition as an alternative to cooperation for success.

That essay scored a five. So when Pearson et al tell you they've come up with a computer program that assesses essays just as well as a human, what they mean is "just as well as a human who is using a crappy set of standardized test essay assessment tools." In that regard, I believe they are probably correct.

To Conclude

Computer-assessed grading remains a faster, cheaper way to enshrine the same hallmarks of bad writing that standardized tests were already promoting. Just, you know, faster and cheaper, ergo better. The good news is that the system is easy to game. Recycle the prompt. Write lots and lots of words. Make some of them big. And use a variety of sentence lengths and patterns, although you should err on the side of really long sentences because those will convince the program that you have expressed a really complicated thought and not just I pledge allegiance to the flag of the United States of Estonia; therefor, a bicycle, because a vest has no plethora of sleeves. And now I will conclude by bringing up the peripatetic penguins with flamethrowers again, to tie everything up. Am I a great writer, or what?
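If you want to see just how easy that gaming recipe is, here is a toy caricature of a surface-feature scorer. To be clear: this is my own illustrative sketch, not Pearson's actual algorithm or anyone else's; it simply shows that a scorer built only on word counts, big words, and sentence length will rank verbose nonsense above short, clear prose.

```python
import re

def toy_essay_score(text):
    """A deliberately crude scorer that rewards only surface features:
    total words, long words, and long sentences. It cannot read."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    long_words = [w for w in words if len(w) >= 7]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_len = len(words) / len(sentences)
    # Length, "big words," and sprawling sentences all raise the score;
    # meaning, accuracy, and purpose are invisible to it.
    return len(words) * 0.1 + len(long_words) * 0.5 + avg_sentence_len * 0.2

clear = "We should fear fear itself."
bloated = ("American president Franklin Delenor Roosevelt advocated for "
           "civil unity despite the communist threat of success, which "
           "disdained competition as an alternative to cooperation for "
           "success and enshrined the plethora of peripatetic platitudes.")

print(f"clear: {toy_essay_score(clear):.1f}")
print(f"bloated: {toy_essay_score(bloated):.1f}")
```

Run it and the bloated nonsense handily outscores the clear sentence, which is exactly the failure mode the critics at humanreaders.org describe.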


  1. I cannot imagine the crap that future writers will have to put out there to prove they can write. I took a week-long summer class in writing at Teachers College of Columbia University. We learned that there are all kinds of wonderful writing and that the key is the strong verb; sentence length was not important, although students were encouraged to use complex and compound sentences when they needed them in their writing. The key to improving their writing was teacher and peer conferences. We published every 3-6 weeks depending on what we were writing. Some of the best writing in our grade 5 series of novels came from authors such as Gary Paulsen. Paulsen's "Storm" story, one of a collection about sled dogs, had paragraphs that were one word long. Sentences that were one word long. The kids loved that because it indicated how important that thought was. The best authors write sentence fragments on purpose. I shudder to think what Paulsen would get on the Pearson rubric. You made many good points and I LOVE your topic headings. LOL!

  2. This comment has been removed by the author.

  3. I teach writing. I grade writing. I engage my students in the writing process. My students consistently achieve higher scores on the Georgia State Writing Assessment, which is graded by real people.
    * Software does not grade essays but does provide useful feedback in the assessment process.
    * Software does not teach writing but does provide support to the teacher who is engaged with students in the writing process.
    "For Perelman, he believes that these kinds of systems can work in tandem with real human professors — but they aren't a substitute." - http://www.theverge.com/2014/4/29/5664404/babel-essay-writing-machine
    I agree.

  4. Just wrote about Perelman's new toy. I agree that software has its uses, but the use of computer grading goes far beyond the software's limitations.


  5. I believe we agree that computer grading, as an isolated score generator, is unacceptable. Use of computer assessments to inform the engaged teacher as a part of a grading process is more viable. The teacher is the deciding factor and makes the final call.

  6. I found and posted this yesterday, dealing just with this subject: http://www.thenewatlantis.com/publications/machine-grading-and-moral-learning

    "... However, functionalism’s critics believe there is a question-begging assumption at its heart. The functionalist argues that if two essays are functionally equivalent, then what produced each essay must be a mind, even if one of the essays was in fact produced by a machine. But as philosopher John Searle famously argued in his Chinese Room thought experiment, the functionalist argument ignores the distinction between derived and original meaning. Words have derived intentionality because we use words as artificial vehicles to express concepts. If the mind is like a wellspring of meaning, words are like cups, shells for transmitting to others what they cannot themselves create. The same is true of all conventional signs. Just as a map is not a navigator and an emoticon is not an emotion, a computer is not a mind: it cannot create meaning, but can only copy it. The difference between a Shakespearean sonnet and the same sequence of letters as the sonnet produced randomly by a thousand typing monkeys — or machines — is that a mind inscribed one with semantic meaning but not the other. When Shakespeare writes a sonnet, the words convey the thoughts that are in his mind, whereas when a mindless machine generates the same sequence of words, there are simply no thoughts behind those words. In short, functionalism’s focus on the behavioral concept of “functional equivalence” forgets that a sign depends on the meaning it signifies. We cannot treat syntactically equivalent texts as evidence of semantically equivalent origins.

    Put another way: I don’t give plagiarized papers the same grade I give original papers, even if the text of the two papers is exactly the same. The reason is that the plagiarized paper is no sign: it does not represent the student’s thinking. Or we might say that it is a false sign, meaning something other than what it most obviously appears to. If anything, what I can infer from a plagiarized paper is that its author is the functional equivalent of a mirror. As a mirror is sightless — its images are not its own — so too is a plagiarized paper mindless, all of its meaning stolen from a genuine mind. In a Dantean contrapasso, I grade plagiarizers with a mark harsher than the F that recognizes an original but failed attempt at thought: I drop them from my course and shake the dust from my feet. We should do the same with functionalism.

    John Henry’s Retort

    Of course, essay-grading software is not functionally equivalent to a professor in the first place, even for the narrow purpose of providing feedback on academic essays. It cannot be, because grading is a morally significant act that computers are incapable of performing. Functionalism, in falsely reducing human acts to mechanical tasks, also reduces the polyvalent language of moral value to a single, inappropriate metric.

    If minds are computers, then they should be evaluated by norms appropriate to computers: namely, by their efficiency in mapping inputs to outputs. So if professors and grading software are functional equivalents — outside of the Ivy League, at least — then they should be evaluated using the same criteria: the number of comments they write per paper, their average response time, the degree to which their marks vary from a statistical mean, and so on. This is the latent normative view of functionalism, particularly when it’s turned from a philosophical theory into a technical program: if machines can perform some task more efficiently than human beings, then machines are better at it. However, efficiency is not the moral metric we should be concerned with in education, or in other essentially interpersonal, relational areas of human life...."