Tuesday, May 21, 2024

AI Proves Adept At Bad Writing Assessment

AI is not responsible for the rise of bad writing assessment, but it promises to provide the next step in that little journey to hell.

Let me offer a quick recap of bad writing assessment, much of which I experienced firsthand here in Pennsylvania. The Keystone State a few decades back launched the Pennsylvania System of School Assessment (PSSA) writing assessment. Assessing those essays from across the state was, at first, a pretty interesting undertaking-- the state selected a whole boatload of teachers, brought them to a hotel, and had them spend a weekend scoring those assessments.

I did it twice. It was pretty cool (and somewhere, I have a button the state gave us that says "I scored 800 times in Harrisburg"). But not everything about it was impressive. Each essay was scored by two teachers, and for their scores to "count" they had to be identical or adjacent-- and on a five-point scale, the odds are good that you'll meet that standard pretty easily. We were given a rubric and trained for a few hours in "holistic grading"; the rubric was pretty narrow and focused, but still left room for our professional judgment.
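To put a rough number on how easy that standard is (my own back-of-the-envelope sketch, not anything from the state's documentation): if two scorers picked scores on a five-point scale independently and completely at random, they would still land identical or adjacent more than half the time.

```python
from itertools import product

def chance_agreement(scale_points):
    """Fraction of all score pairs on a 1..scale_points scale that are
    identical or adjacent (differ by at most one point), assuming two
    scorers pick scores independently and uniformly at random."""
    pairs = list(product(range(1, scale_points + 1), repeat=2))
    close = sum(1 for a, b in pairs if abs(a - b) <= 1)
    return close / len(pairs)

print(chance_agreement(5))  # 0.52 -- random scorers "agree" over half the time
print(chance_agreement(6))  # about 0.444 on a six-point scale
```

In other words, two people rolling dice would clear the "identical or adjacent" bar 52% of the time on a five-point scale, which says something about how much that bar actually measures.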

But then the state, like many others, stopped using teachers. It was easier to put an ad on Craigslist, hire some minimum wage workers, train them for half a day, and turn them loose. Cheaper, and they didn't bring up silly things like whether or not a student's argument made sense or was based on actual true facts. (There is a great old article out there by someone who did this, but I can't find it online.)

Pennsylvania used this system for years, and my colleagues and I absolutely gamed it. We taught our students, when writing their test essays, to do a few simple things:

* Fill up the whole page. Write lots, even if what you're writing is repetitive and rambling.

* Use a couple of big words (I was fond of "plethora"). It does not matter whether you use them correctly or not.

* Write neatly (in those days the essays were handwritten).

* Repeat the prompt in your first sentence. Do it again at the end. Use five paragraphs.

Our proficiency rates were excellent, and they had absolutely nothing to do with our students' writing skills and everything to do with gaming the system.

The advent of computer scoring of essays has simply extended the process, streamlining all of its worst qualities. And here comes the latest update on that front from Tamara Tate, a researcher at the University of California, Irvine, and an associate director of her university's Digital Learning Lab. Her latest research-- "Can AI Prove Useful In Holistic Essay Scoring"-- was written up by Jill Barshay at The Hechinger Report.

The takeaway is simple-- across a fairly big batch of essays, ChatGPT's scores were identical to or within a point of (on a six-point scale) the human scorers' (matching exactly 40% of the time, compared to 50% for human-human pairs). This is not the first research to present this conclusion (though much previous "research" came from companies trying to sell their robo-scorers), with some claims reaching the level of absurdity.

The criticism of this finding is the same one some of us have been expressing for years-- it says essentially that if we teach humans to score essays like a machine, it's not hard to get a machine to also score essays like a machine. This seems perfectly okay to people who think writing is just a mechanical business of delivering probable word strings. Take this defense of robo-grading from folks in Australia who got upset when Dr. Les Perelman (the giant in the field of robograding debunkery) pointed out their robograder was junk:

He rightly suggested that computers could not assess creativity, poetry, or irony, or the artistic use of writing. But again, if he had actually looked at the writing tasks given students on the ACARA prompts (or any standardized writing prompt), they do not ask for these aspects of writing—most are simply communication tasks.

Yes, their "defense" is that the test only wants bad-to-mediocre writing anyway, so what's the big deal?

The search for a good robograder has been ongoing and unsuccessful, and Barshay reports this piece of bad news:
Earlier versions of automated essay graders have had higher rates of accuracy. But they were expensive and time-consuming to create because scientists had to train the computer with hundreds of human-graded essays for each essay question. That’s economically feasible only in limited situations, such as for a standardized test, where thousands of students answer the same essay question.
So, the industry will be trying to cut corners because it's too expensive to do the job even sort of well-ish. 

Tate suggests that teachers could "train" ChatGPT on some sample essays, but would that not create the effect of requiring students to try to come close to those samples? One of Perelman's regular tests has been to feed a robograder big word nonsense, which frequently gets top scores. Tate says she hasn't seen ChatGPT do that; she does not say that she's given it a try.

And Tate says that ChatGPT can't be gamed. But then later, Barshay writes:
The next step in Tate’s research is to study whether student writing improves after having an essay graded by ChatGPT. She’d like teachers to try using ChatGPT to score a first draft and then see if it encourages revisions, which are critical for improving writing. Tate thinks teachers could make it “almost like a game: how do I get my score up?”

Yeah, that sounds like gaming the system to me.

Tate has some other odd observations, like the idea that "some students are too scared to show their writing to a teacher until it's in decent shape," a problem more easily solved by requiring them to turn in a rough draft than by running it by ChatGPT.

There are bigger questions here, really big ones, like what happens to a student's writing process when they know that their "audience" is computer software? What does it mean when we undo the fundamental function of writing, which is to communicate our thoughts and feelings to other human beings? If your piece of writing is not going to have a human audience, what's the point? Practice? No, because if you practice stringing words together for a computer, you aren't practicing writing, you're practicing some other kind of performative nonsense.

As I said at the outset, the emphasis on performative nonsense is not new. There have always been teachers who don't like teaching writing because it's squishy and subjective and personal-- there is not, and never will be, a Science of Writing--plus it takes time to grade essays. I was in the classroom for 39 years--you don't have to tell me how time-consuming and grueling it is. There will always be a market for performative nonsense with bells and whistles and seeming-objective measurements, and the rise of standardized testing has only expanded that market. 

But it's wrong. It's wrong to task young humans with the goal of satisfying a computer program with their probable word strings. And the rise of robograders via large language models just brings us closer to a future that Barshay hints at in her final line:

That does give me hope, but I’m also worried that kids will just ask ChatGPT to write the whole essay for them in the first place.

Well, of course they will. If a real human isn't going to bother to read it, why should a real human bother to write it? And so we slide into the Kafkaesque future in which students and teachers sit silently while ChatGPT passes essays back and forth between output and input in an endless, meaningless loop.

If you'd like to read more about this issue, just type "Perelman" into the blog's search bar above.
