Sunday, July 12, 2015

Test Scoring Monkeys

It's been less than a month since Motoko Rich traveled to San Antonio to hear a Pearson test scoring supervisor explain that scoring the tests is like making a Big Mac. Now Claudio Sanchez has made the same journey for NPR, and the results are no more flattering for Pearson than those from Rich's jaunt.

The center uses scorers from many walks of life, though a four-year degree is required. What is not required is any sort of opinion about the quality of the questions.

David Connerty-Marin, a spokesman for PARCC, says it's not up to a scorer or Pearson or PARCC to say, "Gee, we think this is too hard for a fourth-grader."

What is or is not developmentally appropriate, he says, is not an issue because the states have already made that decision based on the Common Core Standards.

One of these rainy summer days, I'll spend some time running up and down the internet and see if I can find, somewhere in the great chain of standards and testing, the person who says, "Me, I'm the one. I'm the guy who decides that this test item is appropriate for an eight-year-old." But until that day comes, we're stuck with test manufacturers who say, "Well, we just follow what the state tells us" and states that say, "Well, we lean on the professionals to design these things" and a whole bunch of people who point and shrug and say, "Well, you know, the standards" as if the standards were dropped down from heaven on the back of a golden cloud that deposited them on top of a burning bush.

The article's description of the scoring process reveals for the gazillionth time that the constructed open-ended responses are not any kind of open-ended response at all, but a bizarre exercise in blind matching.

Sanchez talked to one retired teacher, Ms. Vickers, who has worked eight years for Pearson.

She looks for evidence that students understood what they read, that their writing is coherent and that they used proper grammar. But it's actually not up to Vickers to decide what score a student deserves.
Instead, she relies on a three-ring binder filled with "anchor papers." These are samples of students' writing that show what a low-score or a high-score response looks like.

"I compare the composition to the anchors and see which score does the composition match more closely," Vickers says.

That's not an open-ended response. It's a newer, more gigantic form of multiple choice, where students choose from all the possible combinations of words in the English language in hopes of selecting the one combination that is acceptable to test manufacturers. Those folks in Texas have the same basic task as the guy checking the work of the million monkeys to see which one has typed a Shakespeare play. This is a test where students are given a box full of LEGOs and told to build something, but will only get credit if they build the right thing.

And, of course, reporters can't know any specifics about any of the actual test questions or responses.

Pearson does not allow reporters to describe or provide examples of what students wrote because otherwise, company officials say, everybody would know what's on the test.

I don't even know how to explain how insane that is. In my own classroom, my students know exactly what is going to be on a test. Any test that depends on super-duper secrecy is a terrible test. It is also possibly a test manufactured by cheap money-grubbing slackers who don't want to do the work of updating it annually.

Pearson delivers a backhanded acknowledgement that secrecy has not been their friend. One supervisor notes that since the public doesn't know what Pearson's doing, "misconceptions" abound. But Sanchez gets the last word on that subject:

Most Americans have been in the dark, says Thompson. So the risk for Pearson, PARCC and the states is that by trying to be more transparent this late in the game, people may very well end up with more questions than answers.


  1. I would like to help Ms. Vickers out by providing more clarity about the scoring process. She skims a response for 20 to 30 seconds, decides which anchor paper it most resembles (as best she remembers the anchors from a training class she attended some months ago), assigns a score, and moves on. Remember the old Lucille Ball scene in I Love Lucy where she and Ethel have to pack candies in boxes as the conveyor belt runs by faster and faster? That is what the actual scoring process for writing and constructed responses looks like.

  2. Third paragraph from bottom: "I don't even know how to explain how insane that is. In my won classroom, my students know exactly what is going to be on a test."

    Shouldn't "won" by "own"?

  3. Phila.ken, is "by" a typo or a joke? A joke, I hop.

  4. We all get fat finger syndrome at the keyboard.

  5. If our blogs' audiences are only or mostly educators, are we wasting our time? Anti-reformist blogging demonizing school choice, testing, privatization, charters, low teacher pay, etc. is frustrated educator feel good. I would very much like to read ideas on positive actions we can take that will change the parental and political misperceptions of public schooling.

    1. My thought would be that "demonizing" anything is not really effective in changing minds. Presenting clear arguments that seek to find common ground between people who disagree is much more effective.

    2. I wish that were true, TE. The trouble is, when people have an ulterior agenda, they lie and ignore facts. For example, the American Statistical Association has stated clearly, with clear explanations why, that VAMs cannot be used as valid measures of an individual teacher's effectiveness, and this has been pretty widely disseminated. Yet politicians from Arne Duncan to Andrew Cuomo continue to insist they be used.

    3. Rebecca,

      I have read both the ASA statement and the two Chetty et al articles, and Chetty's response to the ASA statement. I would love to have a discussion about them here if you would like to have one.

    4. Even Chetty et al say that VAMs are "not perfectly reliable" and that "educators and policymakers are likely to make better decisions if they are based on multiple measures of job performance rather than any stand-alone metric." They even state that "other measures of teacher performance, such as principal evaluations, student ratings, or classroom observation, may prove ultimately to be better predictors of teachers' long-term impacts on students than VAMs."

      But Arne, Cuomo, and others haven't gotten the message. They either don't understand, or choose not to understand, the assumptions and limitations of the models. Arne doesn't care if any method other than one based on test scores - with or without value-added - is used. Cuomo insists on VAMs being 50% of teacher evaluations - even though experts say 20% is too much weight - and if a teacher's VAM score isn't good enough, the other parts of the evaluation don't count.

    5. Rebecca,

      I doubt that many policy makers will get the message from posts on blogs, so I think the work that can be done on the blogs is building on the common ground that exists. Your quotes from Chetty et al are a good example of this, pointing out that the authors of the paper have a reasonable position on the effectiveness of this measure of teaching.

    6. If Chetty et al don't have an ulterior agenda, they should be upset that Arne and Cuomo are misapplying their findings and should be vociferously correcting their misapprehensions.

    7. Chetty is a full professor at Harvard, on the short list for a future Nobel prize. There is no ulterior agenda.

      Have you read his recent work on inter-generational income mobility and geography? It was widely covered by the press.

    8. I don't care who he is, I wouldn't want my findings misapplied.

      As far as his recent work goes, it isn't surprising that more segregated areas have less upward mobility; that areas with smaller class sizes and higher local education (property) taxes have higher upward mobility; or that areas with more single parents have less upward mobility. The study says areas with more religious individuals and greater community participation have higher upward mobility; however, many African-American communities have strongly religious populations, but in their case it doesn't seem to lead to higher upward mobility. It seems weird that local market conditions and access to higher education do not correlate with upward mobility. The thing I notice from the map is that rural areas seem to have greater upward mobility, while all the southeast seems to have the least upward mobility. I don't know how any of this leads to practical solutions to poverty.

  6. Another insightful analysis with wonderful Greene analogies.

  7. Thanks, Peter, for the usual excellent commentary. Great comment section at the end of the NPR article.