Saturday, April 14, 2018

The Testing Charade: Buy This Book

This is probably going to be a long post, so let me get the most important parts out of the way first. The book is The Testing Charade: Pretending To Make Schools Better.

Daniel Koretz has published a book that gathers between two covers all the things wrong with the test-centered accountability under which we all currently suffer. His explanations are clear, and his illustrations are vivid. If you want to clarify your thinking about testing-- if you know something is wrong, but it's hard to wrap your head around-- you need this book. If a loved one or a colleague or an administrator asks you, "So what's the big deal, anyway? Why get upset over some simple standardized testing and accountability?" you need to hand them a copy of this book.

Buy this book.

Other things you should know up front. Koretz is the real deal. He is a widely recognized testing expert and scholar based at the Harvard Graduate School of Education. He is not some guy I found who presents my point of view; longtime readers know that I am a standardized test radical (kill them all with fire) and Koretz is not-- he sees valuable uses for the big tests done right, and he pushes some ideas I don't agree with. Nevertheless, I'm telling you to buy this book.


There are fourteen chapters in this book. I'm going to talk about all of them, with quotes, because Koretz says a bunch of important stuff. If any of this lights a fire in your brain, my advice is simple. Buy this book.

1: Beyond All Reason

Pressure to raise scores on achievement tests dominates American education today.

That's the first sentence of the first chapter, which looks at where we have arrived in the bizarre and extreme pursuit of school improvement via test scores. Koretz provides multiple examples of how this accountability approach has gone off the tracks, including tales of excellent schools given "failing" grades. The evidence of the failure of this system has, Koretz says, "been accumulating for more than a quarter century. Yet it is routinely ignored-- in the design of education programs, in public reporting of educational 'progress,' and in decisions about the fates of schools, students, and educators."

Don't make the mistake of thinking that these problems will disappear now that NCLB has finally been replaced. Test-based accountability was well established in this country before NCLB, and it will continue now that ESSA has replaced it. It is true that NCLB was a very poorly crafted set of policies-- a train wreck waiting to happen, some of us said when it was enacted-- and it did substantial harm....ESSA continues the basic model of test-based accountability, while returning to states just a fraction of the discretion they had in implementing this model before NCLB was enacted.

Koretz notes that he believes that standardized tests-- properly used-- can be valuable. And while he is going to damn the current accountability system, he is not arguing against all accountability.

2: What Is a Test?

Everyone knows a test when they see it. However, understanding tests is very different from recognizing them, and unfortunately, many of the people with their hands on the levers in education don't understand what tests are and what they can and can't do.

Koretz offers the useful analogy that a test is like a poll-- a small sampling of a much larger domain, and only useful if the small sample is properly related to the larger area. "The items on a test matter only to the extent that they allow us to predict mastery of the larger subject area from which they were sampled." Standardized tests are not good tools for measuring full mastery, and Koretz lays out the reasons.

First, standardized testing has inherent limitations. Things like complex analytical thinking and problem solving aren't best assessed with a standardized test. Second, test authors must make large numbers of decisions about what is and is not included in test items. These decisions, some deliberate and some not, narrow what the test actually tests. This point, for Koretz, is huge. The sampling decisions introduce error (as in "margin of" and not "oopsies"). The samples will be incomplete measures. And perhaps most importantly, the sampling creates perverse incentives.

High stakes testing creates strong incentives to focus on the tested sample rather than the domain it is intended to represent. If you teach a domain better-- say, geometry-- scores on a good test of that area will go up. However, if you directly teach the small sample measured by a particular test-- for example, memorization of the fact that vertical angles are equal-- scores will increase, often dramatically, but mastery of geometry as a whole will not improve much, if at all.

This is as good an explanation as I've seen of why teaching to the test is a bad idea.

3: The Evolution of Test-Based "Reform"

You know we're going back to 1983 and A Nation at Risk, as Koretz tries to lay out how we arrived at the place where tests are the most central part of any school's everyday life. He notes that nations used as examples of test-based education are actually trying to back away from it, and that even the highest high-stakes testing countries don't do what we do, don't test so often, don't use test scores to evaluate teachers and schools.

He offers the term "measurement-driven instruction" which is shorthand for a world in which "improving performance on the specific test was to be the explicit goal, and higher-quality instruction would be the consequence. This was the tail wagging the dog."

This approach did not start with NCLB, but NCLB gave it national scale. And folks who thought Obama would provide relief "were quickly and sorely disappointed." Arne Duncan's big contribution was to get states to tie individual teacher evaluation to test scores, leading to "some of the most ludicrous uses of test scores." ESSA isn't going to help, other than in allowing some states to add broader measures to the mix.

Koretz gives test-based reformsters credit for good motives, but "Whatever their motives, the proponents were wrong. The reforms caused more harm than good."

Although he'll spend the rest of the book exploring the details, he gives the broad strokes here to explain why test-based reform has failed so badly. First, it "rewards far too narrow a slice of educational practice and outcomes." Second, it is too high pressure. And third is that it left almost no room for human judgment.

...teaching is far too complex a job to evaluate without any judgment, and many of the things we value most in schools aren't captured by tests.

4: Campbell's Law

The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.

Well, you knew Campbell's Law would come up. But what you might not know (well, anyway, I didn't) is that Campbell actually predicted how the law would play out in educational testing:

But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways.

For folks who don't quite get how Campbell's Law works, or why it applies, Koretz has some great examples of the law in action (my favorite: the old Soviet shoe factory story). And he tempers the observations about the Law in action in education by arguing that we have an obligation to consider the balance; some approaches will be more heavily hurt by Campbell's Law than others, and Koretz wants us to make sure that there isn't more damage than benefit.

5: Score Inflation

This was a helpful concept for me. For Koretz, "score inflation" doesn't mean blowing up scores or moving the cut score goal posts-- it means raising test scores by techniques that do not actually raise the level of achievement that the tests purport to measure.

Koretz talks at length about the problems of trying to study test score inflation-- turns out that people who are politically invested in "good" results on the Big Standardized Tests are not interested in letting scholars study how inaccurate those good scores are.

Koretz addresses inflation throughout the book, and his view is nuanced-- on one end of the scale is flat-out cheating, while on the other end we have coaching. At one point, my bosses expected me to put up a poster of the "anchor standards" that would be tested. Would that count as an inflation factor? I'd say yes, because it was aimed at preparing for the test and not aimed at improving reading, writing, speaking and listening skills.

Koretz says that all you need to get inflated scores is a test that is predictable. If, for instance, you are poring through banks of old test questions in order to narrow down what kinds of questions your students need to be prepared for, that's causing score inflation. And if your administrators are requiring you to do that...well, they are part of the reason that this test-driven system is a failure. High stakes matter as a motivation to do test prep, but even in lower-stakes settings, Koretz found a culture of "applied anxiety" about test scores.

Beyond corrupting the scores of individual students through things like test prep, Koretz points out you can corrupt the group results as well. One way is to focus on bubble kids-- the ones who can be dragged across the cut score line. Another is to game the system by rigging which students are tested. And of course there's plain old cheating.

Koretz explains a less-obvious problem of inflation-- it varies from school to school and classroom to classroom, which means "we are identifying the wrong teachers, schools and systems as successes and failures." And it's also worth noting that low-income schools have a higher incentive to inflate, which in turn means that the students who arguably most need an "improved" education are instead getting the education most corrupted by the testing process. IOW, they most need a "real" education, and they are most likely to get a battery of test prep instead.

6: Cheating

By the end of NCLB, as many of us noted, there were only two types of schools-- schools that were failing and schools that were cheating. Measurement-driven instruction has created enormous incentive to cheat, and created some thorny ethical dilemmas. After all, which is worse-- cheating on the student score for a single BS Test, or allowing bad educational policy to take that child's school away from her?

Koretz has a whole batch of cheating tales here, and they are depressing, but necessary if we're going to understand what high-stakes testing pushes people toward.

7: Test Prep

There is no doubt that test-based accountability has resulted in a huge commitment of time and effort to test preparation. I can't be more precise, in part because people don't agree on the dividing line between test prep and regular instruction.

In fact, the blurring of that line is one of the sobering parts of this chapter.

Koretz lists three types of bad test prep-- reallocating time between subjects, reallocating time within subjects, and coaching. Schools shift time, both between classes and within classes, to things that count on the test. No band for you, Pat-- you need to spend a period in ELA test remediation. Today, class, we're going to skip over chapter five, because that stuff is not on the test. Coaching is teaching tricks that have no use except for taking the test. Take process of elimination, which seems harmless enough but allows a student to get a correct answer that she would never have been able to come up with on her own. Life is not multiple choice; process of elimination is a skill that is only useful for taking the test.

The most alarming part of the chapter addresses the idea that test prep is corrupting the idea of good teaching. Koretz has a disturbing bank of anecdotes about teachers who are told their job is not to teach their subject, but to raise test scores in it. And like anyone else who has encountered new teachers in the last decade, he's met young teachers who were taught in college that test scores are their main purpose.

They were telling me that I was missing the boat by seeing test prep as something that competes for time with good instruction. In their experience, raising scores had become the end goal, the mark of a "good" teacher. To an alarming degree, they had been taught that test prep and good instruction are the same thing.

There are so many reasons this is bad, bad news, but here's one that Koretz spots:

For the present, it indicates that one of the few checks against inappropriate test prep-- teachers' own understanding of the differences between test prep and good instruction-- has been eroded.

8: Making up Unrealistic Targets

Yikes. This is a depressing chapter. Koretz notes that the "trust most people have in performance standards is essential, because the entire system now revolves around them." And yet that trust is wildly misplaced.

Koretz outlines several methods of setting performance standards, but the important thing to remember is this:

There is another, perhaps even more important, reason why performance standards can't be trusted: there are many different methods one can use, and there is rarely a really persuasive reason to select one over the other.

And yet those different methods can produce vastly different results.

There are other problems.

A primary motivation for setting a Proficient standard is to prod schools to improve, but information about how quickly teachers actually can improve student learning doesn't play much, if any, of a role in setting performance standards. When panels set standards, they are not given information about practical rates of improvement, and for the most part they are not asked to consider them.

In other words, they are set up to be the educational equivalent of an agricultural board that declares, "What we need is wheat that we can harvest six hours after it's planted. Do that!"

Also, standards are set mainly on the assumption that all kids are the same, with a goal of radically reducing the variability of achievement. I'm just going to sum up on this part: that's a dumb assumption.

9: Evaluating Teachers

Koretz is pretty brutal about the VAM-sauced test-based teacher evaluation systems. He allows that the old system had a problem in that everyone ended up looking great, but the new system surprised him because it solved problems with solutions so dumb it had never occurred to him that anyone would actually use them (e.g. evaluating teachers based on the scores of students they have never met).

This chapter includes Koretz's account of his visit to the Department of Education in Duncan's early days. You'll want to read that yourself.

How have we screwed up test-based accountability for teachers? Let Koretz count the ways:

* Taking test scores out of context (a "deliberate goal" of reformers, but "one of the main reasons the reforms have failed")
* Trying to use tests to explain, not just describe
* Using "Value-Added Modeling" to evaluate teachers (here's as good an explanation as you'll find for why VAM can't possibly work)
* Rating teachers with the wrong test
* Teacher ratings are inconsistent across tests
* Teacher test scores are unstable over time (there are charts and specifics here that drive home how bad this effect is)

10: Will the Common Core fix this?

In a word, no.

Want a longer explanation?

It's not just the Common Core that has been dropped into schools wholesale before we gathered any evidence about impact; this has been true of almost the entire edifice of test-based reform, time and time again. I'll argue later that putting a stop to this disdain for evidence-- this arrogant assumption that we know so much that we don't have to bother evaluating our ideas before imposing them on teachers and students-- is one of the most important changes we have to make.

Koretz also argues that CCSS tests have actually been created in a way that makes them more predictable, and therefore more susceptible to all the Campbell's Lawian Test Preppery he just spent several chapters eviscerating. Koretz dismisses the one size fits allness of the Core, noting that one official once told him that the Core eliminated the distinction between career readiness and college readiness. "Rhetorically, perhaps, but not in actuality," responds Koretz.

It is a new flavor of the same old failed approach, one that gives one size fits all a "grandiose rhetorical wrapper." Underneath all the noise

the basic failed model of educational improvement remains unchanged: set arbitrary performance targets on standardized tests; apply them uniformly, without regard to circumstances; and reward and punish. Whatever its other virtues and vices, the Common Core hasn't changed this. This approach hasn't worked before, and it won't work with the Common Core.

11: Did Kids Learn More?

We know less about this most fundamental of questions than we probably should.

Why not? Partly because the test data is so subject to inflation that it can't be trusted. But also-- and I know I just ran a similar quote, but this point is important:

There is a second reason for the dearth of information, the blame for which lies squarely on the shoulders of many of the reformers. Time after time they declared that they had figured out what would work, and they imposed it on students and teachers on a mass scale without taking the time to evaluate their programs first. It's analogous to a drug company saying that they have figured out, based just on their own beliefs and logic, which drugs will be effective and safe, so they can skip the time-consuming and expensive burden of actually gathering some evidence before selling it to you.

Koretz spends some time crunching some numbers, asking if students learned more, if they learned it because of test-centered accountability, and if what they learned justifies the "huge" costs of these policies (including costs like the corruption of instruction on a broad scale).

The answer is pretty clearly, "No."

12: Nine Principles for Doing Better

Here they are--

Pay attention to other important stuff.

Monitor more than student achievement.

Set reasonable targets.

Stop just kicking the dog harder.

Don't expect schools to do it all.

Pay attention to context.

Accept the need for human judgment.

Create counterbalancing incentives.

Monitor, evaluate and revise.

13: Doing Better

Here Koretz works out his nine principles into a more specific action plan. He considers at length some of the usual models-in-other-countries. And he lays out some ideas.

Measure what matters. Okay, no bonus points for originality here, but then one of the subtexts of this book is that apparently you have to point out the obvious to people in education reform because some of them rush straight into doing things that are obviously stupid. Here also he and I disagree on a role for well-used well-made standardized tests. That's okay; I know I'm out in left field on this. But I do like this:

This is the first and one of the most difficult tradeoffs we face: to measure learning well and to give teachers better incentives, we will have to use measures that have serious drawbacks-- in particular, potential inconsistency from classroom to classroom and school to school.

Yes. The more perfectly something is universally standardized, the less useful it is to an individual teacher.

We need to measure "soft skills" well. This will be hard, and there will be disagreement. Yup-- that's already happening. And he has a useful insight-- part of the reason we got standardized testing "was the notion that educators can't be trusted to evaluate schooling or other educators." But the soft skills measures will have to "give a substantial role to the judgment of professionals." Not standardized SEL tests.

We need a sensible accountability system. Most interesting detail here-- we need various measures that are NOT too closely aligned with each other. If everything's aligned together, everything can be wrong together.

Use tests sensibly. One test cannot do everything.

Provide support to teachers. Monitor and make midcourse corrections (you know-- the way Common Core was specifically designed not to do).

Still here? Good for you.

Chapter 14 is a wrap-up and I'll skip that, just as I've skipped over many specifics and explanations. Really, if I had time to be a better reviewer, I would have given you a much more compact look at what Koretz has created here, which is nothing less than a scholarly, thoughtful, accessible explanation of how test-based reform has taken education into the weeds. You should read this book, and you should pass it on to other folks who care about education and want to both understand the problems and envision some solutions. This is a valuable work, and I'll be coming back to it again and again, and you should, too.


  1. One of the most damaging downsides to high stakes, standards based testing has been the misdirection of educator energy.
    NCLB was bad, but once Gates and Coleman and Duncan upped the ante, millions of teachers and administrators and researchers and bloggers devoted billions of hours focused almost exclusively on Common Core testing. Just think about how billions of hours of effort, if directed productively, might have produced ideas, innovation, policies, programs, and even products that actually helped kids learn. Instead we spent all that time on a big destructive fire. Some of us mistakenly fanned it, many of us tried to put it out. And now five years later, a smoldering heap of charred crap is all we have to show for those efforts. Imagine what should have been.

    1. I'm concerned with anything coming out of Harvard as they seem to be on top of all this personalized learning and new wave of tests.
      I get very concerned around testing soft skills.
      Are you sure he is not just getting us ready for reform 2.0?

  2. And on top of all this, testing offends the teacher-student relationship. It just /feels/ wrong, for both parties (if not for the party mandating it), and sometimes that alone should be enough.