
Tuesday, February 20, 2024

TX: Using Computers To Make STAAR Test Worse

There are several fundamental problems with trying to use standardized testing on a large scale (say, assessing every student in the state). One is the tension between turnaround time and quality. The quickest tests to score are those based on multiple choice questions; however, multiple choice questions are not a particularly good measure of student learning. Essay questions are an excellent tool for letting students show what they know, but they are super time-consuming to score.

So we have the undying dream of test manufacturers, the dream of a computer program that can assess student writing accurately. It's a fantasy, a technology that, like self-driving cars, is always right around the corner. And like self-driving cars, the imperfected not-really-functional tech keeps getting purchased by folks who succumb to the sales pitch. Add Texas to the list of suckers and Texas students to the list of victims.

The trouble with software

I've been writing about the shortcomings of these programs for a decade (here, here, here, here, and here, for starters). 

There are a variety of technical problems, including the software's inability to recognize whether the content of an answer is bunk. Did Hitler fight in the Civil War? Your computer does not "know."

The "solution" is "training" the software on the particular question for which it's assessing answers, but that is essentially teaching the software that a good answer looks like these sample good answers it has viewed, which in turn sets some narrow parameters for what students can write. 

Computers are good at recognizing patterns, but that recognition is based on what their trainers show them, like the facial recognition programs that can't see Black faces because they were trained on white ones. When Ohio did a quick pivot to computer-scored essays, it trained its software on essays that did not use the classic "recycle the prompt as your topic sentence" technique used by many teachers (in response to the old algorithm), and a whole lot of students failed. Who is doing the software training and how are they doing it--these are critical questions.

The shift is subtle but important--the software can't tell you if the written answer is good, but it can tell you if it closely resembles the examples that the software has been told are good ones.

Which hints at the philosophical issue here. Using computer scoring fundamentally changes the task. Instead of making a good faith effort to communicate information to another human being, the student is now tasked with trying to meet the requirements of the software.
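For the curious, here's a toy illustration of what "closely resembles the examples" looks like in practice. This is a minimal sketch, not any vendor's actual engine; it uses off-the-shelf scikit-learn pieces (TfidfVectorizer and cosine_similarity), and the exemplar essays and the max-similarity scoring rule are invented for illustration.

```python
# Toy "resemblance" scorer: rates a new essay by how closely its wording
# matches exemplar essays that humans already marked as "good."
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

exemplar_essays = [
    "The author develops the theme of perseverance through vivid imagery.",
    "Using evidence from the passage, the writer shows that perseverance matters.",
]

vectorizer = TfidfVectorizer()
exemplar_vectors = vectorizer.fit_transform(exemplar_essays)

def score(essay: str) -> float:
    """Return the highest similarity to any exemplar (0.0 to 1.0)."""
    essay_vector = vectorizer.transform([essay])
    return cosine_similarity(essay_vector, exemplar_vectors).max()

# An unconventional answer that shares little vocabulary with the exemplars
# scores low; a bland answer that parrots them scores high.
print(score("Grit, the stubborn refusal to quit, threads through every stanza."))
print(score("The author develops the theme of perseverance through imagery."))
```

Nothing in that sketch knows whether perseverance is actually the theme. It only knows what the approved answers looked like.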

I took a look at how things were going in various states in 2021. Not well is the short answer. A favorite dodge is to say that roboscoring works as well as human scoring, but the trick here is to train human scorers to follow a narrow algorithm cemented with examples of how to apply it--in other words, to teach humans to score the essays as a computer would.

The trouble with STAAR

Texas's Big Standardized Test is the STAAR (which does not stand for Some Tests Are Always Ridiculous or maybe Should Throw Away Any Results or even Stupid Tests' Asses Are Raggedy). And the STAAR has a troubled history that includes technical glitches, questions without correct answers, lost crates of answer sheets, tests that just didn't work, and misalignment with state standards. And after many years, it still glitches like crazy.

A big STAAR highlight is covered in this piece by Sara Holbrook, a poet who discovered that A) her own work was being used on the STAAR test and B) she couldn't answer some of the questions about her own work.

After several years of struggling, STAAR went fully online last year, which could only make the idea of roboscoring the written portions more attractive.

So now what

"Constructed responses" will now be scored mostly by computer, an "automated scoring engine." 25% will then be routed past human beings. Spanish language tests will be human scored.

Human scorers will be trained to use the rubrics with practice sets, then required to display their machine-like precision "by successfully completing a qualification set." Short constructed responses (SCRs) are scored on a 0-1 or 0-2 rubric. Extended constructed responses (ECRs) are scored "using an item-specific 5-point rubric that identifies scores based on two traits—development and organization of ideas (up to 3 points) and language conventions (up to 2 points)."

Which raises two questions--who decided that conventions should count for 40%, and how will an algorithm assess development and organization of ideas?
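For the record, that 40% isn't a mystery number; it falls straight out of the rubric as quoted--conventions can earn 2 of the 5 available points. A trivial sketch (the names here are mine, not TEA's):

```python
# Rubric arithmetic for the ECR score described above: development and
# organization of ideas is worth up to 3 points, language conventions up to 2.
DEVELOPMENT_MAX = 3    # development and organization of ideas
CONVENTIONS_MAX = 2    # language conventions

def ecr_score(development: int, conventions: int) -> int:
    """Combine the two trait scores into the 5-point ECR score."""
    assert 0 <= development <= DEVELOPMENT_MAX and 0 <= conventions <= CONVENTIONS_MAX
    return development + conventions

print(f"Conventions weight: {CONVENTIONS_MAX / (DEVELOPMENT_MAX + CONVENTIONS_MAX):.0%}")  # 40%
print(ecr_score(development=3, conventions=0))  # strong ideas, shaky mechanics: 3 of 5
```

Here's what the state says about the automated scoring engine (ASE):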

The ASE is trained on student responses and human scores from the field-test data. It is trained to emulate how humans would score student responses for each constructed-response question...
As part of the training process, the ASE calculates confidence values that indicate the degree to which the ASE is confident the score it has assigned matches the score a human would assign. The ASE also identifies student responses that should receive condition codes. Condition codes indicate that a response is blank, uses too few words, uses mostly duplicated text, is written in another language, consists primarily of stimulus material, uses vocabulary that does not overlap with the vocabulary in the subset of responses used to train the ASE, or uses language patterns that are reflective of off-topic or off-task responses.

Emphasis mine, on that last vocabulary clause. So, "doesn't sufficiently mimic the essays the program was trained on" is a problem on the same order as "left the page blank."
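Here, roughly, is the routing logic that description implies: the engine assigns a score and a confidence value, flags condition codes, and sends flagged or low-confidence responses (plus a sample) to humans. To be clear, the threshold, the decision order, and every name in this sketch are my own illustrative guesses, not anything TEA or its vendors have published; only the 25% figure comes from the plan described above.

```python
# Hypothetical sketch of machine-score routing: score + confidence + condition codes.
import random
from dataclasses import dataclass, field

@dataclass
class MachineResult:
    score: int                 # score the engine assigned
    confidence: float          # how sure the engine is a human would agree (0 to 1)
    condition_codes: list = field(default_factory=list)  # e.g. ["blank", "vocab_mismatch"]

CONFIDENCE_THRESHOLD = 0.8     # invented for illustration
HUMAN_SAMPLE_RATE = 0.25       # "A quarter ... routed past human beings"

def route(result: MachineResult) -> str:
    """Decide whether the machine score stands or the response goes to a human."""
    if result.condition_codes:
        return "human review"  # blank, too short, off-topic... or just unfamiliar vocabulary
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "human review"  # the engine doubts a human would agree with it
    if random.random() < HUMAN_SAMPLE_RATE:
        return "human review"  # routine audit sample
    return "machine score stands"

# Note what shares the flag list: a blank page and an essay whose vocabulary
# simply doesn't overlap with the responses the engine was trained on.
print(route(MachineResult(score=0, confidence=0.95, condition_codes=["vocab_mismatch"])))
```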

Education professor Duncan Klussman commented, “What we don’t wanna do is have a system that moves to completely formulaic writing. Like ‘If I write exactly this way, I get the highest score,'” but that's exactly what you get. 

Well, that's what you get once you adapt teaching to fit the algorithm. Last year, when the STAAR test went online, 54% of Houston fourth graders scored a zero on the written portion. Previously, on the pre-online STAAR, the number was 5%. So did fourth graders turn stupid, or did the test requirements change in ways that teaching hasn't adapted to yet? (Believe it or not, the director of the state's assessment development division says it's not that second one: "it really is the population of testers much more than anything else.")

All of this matters a great deal in a state where schools are still graded largely on student results from the BS Test.

The Dallas News asked a few experts, including my hero and friend of the institute Les Perelman, an absolute authority on the many failings of roboscoring. Perelman notes that having humans backstop only 25% of the writing responses is "inherently unequal," which is an understatement. Imagine telling a class, "Okay, I'm going to actually look at the essays from 25% of you; the rest will just get whatever the computer says."

Perelman also notes that machine scoring

“teaches students to be bad writers,” with teachers incentivized to instruct children on how to write to a computer rather than to a human. The problem, he said, is machines are “really stupid” when it comes to ideas.

Exactly. Computer-assessed grading remains a faster, cheaper way to enshrine the same hallmarks of bad writing that standardized tests were already promoting.

But TEA officials are sure they've got everything under control. They've "worked with their assessment vendors"--Pearson, the well-known 800-pound gorilla of ed tech moneymaking, and Cambium, a sprawling octopus of education-flavored businesses (you can get a taste of their sprawl here). It might have been nice to have worked with actual educators, even to the extent of letting them know what was coming rather than just rolling this out quietly.

Peter Foltz, a professor at the University of Colorado at Boulder, reassured the Dallas News that it's not easy to coach students on how to game a scoring engine. I doubt it. We learned how to game the algorithm in PA when it was applied by humans, and that transferred just fine to roboscorers. All we had to do was replace some actual writing instruction with writing-for-the-test instruction.

Foltz also said that automated scorers must be built with strong guardrails, which just takes me back to the self-driving car manufacturers reminding drivers, "When using Autopilot, drivers are continuously reminded of their responsibility to keep their hands on the wheel and maintain control of the vehicle at all times."

You know what's better than guardrails and safeguards to protect us from the many ways in which software fails to do the job that it is supposed to do but actually can't? Not using software to do a job that it actually can't. 

I'm sure that Cambium and Pearson smell big bucks. Folks at TEA may even smell a way to erase some of STAAR's sad history by being all shiny and new (a thing they presumably know because the sales forces from Pearson and Cambium have told them so). But this is a bad idea. Bad for schools, bad for education, bad for writing, bad for students. Bad.
