Benchmark is originally a surveying term. Benchmarks are slots cut into the side of stone (read "permanent") structures into which a bench (basically a little shelf) can be inserted for surveying purposes. We know they're at a certain level because they've been measured in relation to another marker which has been measured in relation to another marker and so on retrogressively until we arrive at a Mean Sea Level marker (everything in surveying is ultimately measured in relation to one of those).
Surveying markers, including benchmarks, are literally set in stone. Anybody with the necessary training can find them always in the same place and measure any other point in relation to them.
This metaphorical sense of unwavering objective measure is what many folks carry with them to their consideration of testing and cut scores. Passing, failing, and excellence, they figure, are all measured against some scholarly Mean Sea Level marker by way of benchmarks that have been carefully measured against MSL and set in stone.
Sorry, no. Instead, cut scores represent an ideal somewhere between a blindfolded dart player with his fingers duct-taped together, and the guy playing against the blindfolded dart player who sets the darts exactly where he wants them.
Writing in the Stamford Advocate, Wendy Lecker notes that the Smarter Balanced Assessment Consortium members (including Connecticut's own committed foe of public education Commissioner Stefan Pryor) set cut scores for the SBA tests based on stale fairy dust and the wishes of dying puppies.
People tend to assume that cut scores-- the borderline between Good Enough and Abject Failure-- mean something. If a student fails The Test, she must be unready for college or unemployable or illiterate or at the very least several grades behind where she's Supposed To Be (although even that opens up the question "Supposed by whom?").
In fact, SBAC declares that the achievement levels "do not equate directly to expectations for 'on-grade' performance" and test scores should only be used with multiple other sources of information about schools and students.
Furthermore, "SBAC admits it cannot validate whether its tests measure college readiness until it has data on how current test takers do in college."
If you are imagining that cut scores for the high-stakes accountability tests are derived through some rigorous study of exactly what students need to know and what level of proficiency they should have achieved by a certain age-- well, first, take a look at what you're assuming. Do you really think we have some sort of master list, some scholastic Mean Sea Level that tells us exactly what a human being of a certain age should know and be able to do, as agreed upon by some wise council of experty experts? Because if you do, you might as well imagine that those experts fly to their meetings on pink pegasi, a flock of winged horsies that dance on rainbows and take minutes of the Wise Expert meetings by dictating to secretarial armadillos clothed in shimmering mink stoles.
It doesn't matter anyway, because there are no signs that any of the people associated with The Test are trying to work from even a hypothetical set of academic standards. Instead, what we see over and over (even back in the days of NCLB) is educational amateurs setting cut scores for political purposes. So SBAC sets a cut score so that almost two thirds of the students will fail. John King in New York famously predicted the percentage of test failure before the test was even out the door-- but the actual cut scores were set after the test was taken.
That is not how you measure a test result against a standard. That's how you set a test standard based on the results you want to see. It's how you make your failure predictions come true. According to Carol Burris, King also attempted to find some connection between SAT results and college success prediction, and then somehow graft that onto a cut score for the NY tests, while Kentucky and other CCSS states played similar games with the ACT.
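For anyone who wants to see how little machinery that trick requires, here is a minimal sketch in Python. It is not anybody's actual procedure: the function names and the twelve raw scores are made up for illustration, and the only point is that once the results are in, a desired failure rate can be turned into a cut score with a sort and an index.

```python
def cut_score_for_failure_rate(scores, target_fail_rate):
    """Pick a cut score AFTER the test so that roughly target_fail_rate fail.

    Hypothetical helper for illustration only. A student fails when
    score < cut. Ties at the cut can push the real rate below the target.
    """
    ordered = sorted(scores)
    # How many students we have decided, in advance, should fail.
    n_fail = round(target_fail_rate * len(ordered))
    if n_fail >= len(ordered):
        # Everybody fails: put the cut just above the top score.
        return ordered[-1] + 1
    # The lowest score we are willing to let pass becomes the "standard".
    return ordered[n_fail]


def failure_rate(scores, cut):
    """Fraction of students scoring below the cut."""
    return sum(1 for s in scores if s < cut) / len(scores)


# Made-up raw scores for twelve students; the same results are used both times.
raw_scores = [12, 18, 23, 27, 31, 35, 38, 42, 47, 51, 55, 60]

harsh_cut = cut_score_for_failure_rate(raw_scores, 0.66)    # "almost two thirds fail"
gentle_cut = cut_score_for_failure_rate(raw_scores, 0.25)   # a friendlier headline

print(harsh_cut, failure_rate(raw_scores, harsh_cut))
print(gentle_cut, failure_rate(raw_scores, gentle_cut))
```

Same students, same answers, two very different "standards." Nothing in the arithmetic refers to what anyone needs to know; the only input besides the scores is the failure rate somebody wanted.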
Setting cut scores is not an easy process. Education Sector, a division of the thinky tank American Institutes for Research (they specialize in behavioral sciency thinking, and have a large pedigree in the NCLB era and beyond), issued an "explainer" in July of 2006 about how states set passing scores on standardized tests. It leads off its section on cut scores with this:
On a technical level, states set cut scores along one of two dimensions: The characteristics of the test items or the characteristics of the test takers. It is essential to understand that either way is an inescapably subjective process. Just as academic standards are ultimately the result of professional judgment rather than absolute truth, there is no “right” way to set cut scores, and different methods have various strengths and weaknesses.
The paper goes on to talk about setting cut scores, and some of it is pretty technical, but it returns repeatedly to the notion that at various critical junctures, some human being is going to make a judgment call.
Educational Testing Service (ETS) also has a nifty "Primer on Setting Cut Scores on Tests of Educational Achievement." Again, from all the way back in 2006, this gives a quick compendium of various techniques for setting cut scores-- it lists eight different methods. And it also opens with some insights that would still be useful to consider today.
The first step is for policymakers to specify exactly why cut scores are being set in the first place. The policymakers should describe the benefits that are expected from the use of cut scores. What decisions will be made on the basis of the cut scores? How are those decisions being made now in the absence of cut scores? What reasons are there to believe that cut scores will result in better decisions? What are the expected benefits of the improved decisions?
Yeah, those conversations have not been happening within anyone's earshot. Then there is this:
It is important to list the reasons why cut scores are being set and to obtain consensus among stakeholders that the reasons are appropriate. An extremely useful exercise is to attempt to describe exactly how the cut scores will bring about each of the desired outcomes. It may be the case that some of the expected benefits of cut scores are unlikely to be achieved unless major educational reforms are accomplished. It will become apparent that cut scores, by themselves, have very little power to improve education. Simply measuring a child and classifying the child’s growth as adequate or inadequate will not help the child grow.
Oh, those crazy folks of 2006. Little did they know that in a few years education reform and testing would be fully committed and devoted to the notion that you can make a pig gain weight by weighing it. All this excellent advice about setting cut scores, and none of it appears to be getting used these days.
I'm not going to go too much more into this document from a company that specializes in educational testing, except to note that once again, the paper frequently notes that personal and professional judgment is a factor at several critical junctures. I will note that they include this step--
The next step is for groups of educators familiar with students in the affected grades and familiar with the subject matter to describe what students should know and be able to do to reach the selected performance levels.
They are also clear that selecting the judges who will set cut scores means making sure they are qualified, have experience, and reflect a demographic cross section. They suggest that policymakers consider fundamental questions such as: is it better to pass a student who should fail, or to fail a student who should pass? And they are clear that the full process of setting the cut scores should be documented in painstaking detail, including the rationale for the methodology and the qualifications of the judges.
And they do refer uniformly to the score-setters as judges, because the whole process involves-- say it with me-- judgment.
People dealing with test scores and test results must remember that setting cut scores is not remotely like the process of surveying with benchmarks. Nothing is set in stone, nothing is judged based on its relationship to something set in stone, and everything is set by people using subjective judgment, not objective standards. We always need to be asking what a cut score is based on, and whether it is any better than a Wild Assed Guess. And when cut scores are set to serve a political purpose, we are right to question whether they have any validity at all.