Saturday, October 29, 2016

How (Not) To Grade Schools

Bellwether Education Partners is a right-tilted thinky tank from the same basic neighborhood as the Fordham Institute. Chad Aldeman is one of their big guns, and this month he's out with Grading Schools: How States Should Define “School Quality” Under the Every Student Succeeds Act. It's a pretty thing with thirty-two pages of thoughts about how to implement school accountability under ESSA, and I've read the whole thing so that you don't have to. Let's take a look under the hood.



Aldeman offers a few thoughts to start that give a hint about where he might be headed. School evaluation has been too rigid and rule-bound. We've focused too much on student test scores instead of student growth. But the window is now open for a "new conversation," which kind of presumes that there was an old conversation, and I suppose for people in the thinky tank world it might seem as if there were a conversation, but from out here in the actual education field, school accountability has been imposed from the top down with deliberate efforts to silence any attempts at conversation.

In other words, the news that school accountability has been too rigid and rule-bound is only news to people who have steadfastly ignored the voices of actual teachers, who called that one from the very first moment that No Child Left Behind raised its rigid, inflexible, and not-very-smart head.

So to have this "new conversation," policy folks should brace themselves for a certain amount of "Told you so" or "No kidding" or even "No shit, Sherlock." Or alternately, as this new conversation is probably going to resemble the old one insofar as actual teacher voices will be once again excluded, something along the lines of, "Remember what happened the last time you ignored us?"

What Is Accountability and Why Does It Matter?

Aldeman acknowledges that accountability covers a wide range of functions, from transparency for the general public on one end to rewards and punishments by government on the other end. He posits that somewhere in the middle, "accountability can act as a tool for improvement through goal-setting, performance benchmarking, and re-evaluation." And he also notes that accountability measures are state government's way of signaling what it values.

So accountability can be very many things. Who is it for?

Well, teachers and school leaders, who are supposed to be able to use the data to do a better job. And parents, too. And also the political leaders who are responsible for the oversight of public tax dollars. And on top of that, ESSA requires states to grade schools in order to stack-rank them and target some for some manner of fixing, including the bottom five percent.

Aldeman barrels on, pretending that meeting that last set of ESSA-mandated stack-ranking, school-grading requirements will meet all the various versions of accountability that he has listed. He suggests in passing that we're really talking about different degrees of transparency for different groups of accountability viewers, but that's not really true either.

Neither Aldeman nor, for that matter, the feds have seriously or realistically addressed the problems that come when you try to create an instrument that measures all things for all audiences. This is bananas, and it's why the entire accountability system continues to be built on a foundation of sand and silly putty. The instrument that tells a parent how their child is doing is not the same as the instrument that tells a teacher how to tweak instruction, and neither is the same as the instrument that tells the state and federal government if the school is doing a good job, and none of those are the same as an instrument used to stack-rank all the schools in the state (and, it should also be noted, none of those functions are best done by a Big Standardized Test, and yet policymakers seem unable to let go of the assumption that the BS Tests are good for anything).

It's like weighing the entrees at a restaurant as a way of determining customer satisfaction, chef quality and efficiency, how well the restaurant is managed, compliance with health code regulations, reviews for the AAA guide, and the stability of the building in which the restaurant is housed. It's simply nuts.

Aldeman cites assorted research that is all based on the assumption that narrow, poorly written standardized math and reading tests are actually measuring something useful. They are not. Virtually all of the data generated by these tests is junk, and as their use becomes more widespread and students become more weary of them, the data becomes junkier and junkier.

Bottom line-- real accountability requires a wide range of instruments for a wide range of audiences, and we have not remotely solved that challenge. Not, let me note, that it isn't a challenge worth solving. But as long as we base the whole system on the BS Tests, we will not be remotely in the right neighborhood.

How Should States Select Accountability Measures?

Again, Aldeman is working from some bad assumptions about what the system is for. Can you spot the key word in this sentence?

The trick, then, is to design accountability systems in which schools are competing on measures that truly matter

A competition system is not a measuring system. If I tell you that Chris is the tallest kid in class and Pat is the shortest, you still have no idea of Chris's or Pat's actual height.

Aldeman gets his next point right-- an accountability system should be simple, clear and fair. Well, partly right. His idea of "fair" is that the system only measures things that schools actually have control over. So he's skipped one other key attribute-- the accountability system needs to be accurate and measure what it actually says it measures. So, for instance, we should stop saying "student achievement" when we actually mean "student score on a single narrow standardized math and reading test that has never really passed tests for validity and reliability."

Aldeman notes the four required elements per ESSA:

1) "Achievement rates," aka "test scores."
2) Some other "valid and reliable" academic indicator. The word "other" assumes facts not in evidence.
3) Progress in achieving English language proficiency.
4) Some other indicator of school quality or success.

Aldeman offers a chart in which some possible elements are judged against qualities like simplicity, fairness, disaggregability, and giving guidance to the school. So measuring grit or other personal qualities is iffy because both measuring and teaching them are iffy. Teacher and student surveys get a thumbs up for measuring stuff, but thumbs down for being actionable, though I think a good student or staff survey would provide a school with very specific issues to address.

Aldeman says to avoid redundant measures and reminds us that ESSA doesn't put a maximum limit on measures to be used.

How Can States Design School Ratings Systems That Are Simple, Clear, and Fair?

A fake subheading that simply covers an introduction that says, "And now I will tell you how." It does include a fun sidebar about how K-2 should be included in the accountability system. Aldeman notes that leaving them out previously was because of things like the unsolved challenge of how to assess the littles; he does not offer any new insights about that issue that have turned up since NCLB days. In fact, subjecting the littles to any kind of formal or standardized assessment is a truly, deeply indefensible policy notion, and serves as nothing more than a clear-cut example of putting the desires of policy-makers and data-grubbers over the needs of small children.

Incorporating Student Achievement

Of course, by "student achievement," we just mean "test scores." Aldeman recommends we start out with a simple performance scale index for points. He suggests five performance levels, with emphasis on proficiency because "proficiency is, after all, a benchmark for future success in college and careers." Which-- no, no it's not. There isn't an iota of data anywhere to connect a proficiency level on the BS Tests with college and career success, particularly because the proficiency rating is a normed ranking, so it moves every year depending on the mass of scores and the cut scores set annually by state testocrats.

So we're talking about using the test scores, which are junk, after they have been run through a normed scale, which adds more junk.

Using Growth as the "Other" Academic Indicator

Aldeman pays tribute to the "growth mindset" as a worthy stance for schools, though we are once again talking only about growth as it applies to standardized test scores. If the student grew in some other way, nobody cares.

The problem with coming up with a measure of student growth is, of course, that nobody has successfully done it yet. Aldeman mentions several models.

* Without using the words "value-added," Aldeman nods to the model that uses obtuse, opaque, and unproven mumbo-jumbo to make the claim that student performance can be statistically stripped from other characteristics. Aldeman suggests this is disqualified because it is neither simple nor understandable; he might also mention that it is baloney that has been debunked by all manner of authorities.

* Aldeman mentions the student percentiles model, a stack-ranking competitive model that compares a student's test score to the score of other students who had a similar score last year. Like all such normed models, this one involves goal posts that move every year, and like all percentile-based models, it guarantees the exact same distribution year after year. No amount of school quality will raise all students to the top 25%.
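To see why a percentile-based measure can never show system-wide improvement, here's a toy sketch (the scores are invented, not from Aldeman's report): even if every single student's score rises, the percentile ranks come out exactly the same.

```python
# Toy illustration of why percentile ranks can't register
# across-the-board improvement: they are purely relative.
def percentile_ranks(scores):
    """Rank each score against the others, scaled 0-100."""
    ordered = sorted(scores)
    return [ordered.index(s) / (len(scores) - 1) * 100 for s in scores]

year1 = [410, 450, 500, 550, 590]
year2 = [s + 50 for s in year1]  # every student improves by 50 points

print(percentile_ranks(year1))  # [0.0, 25.0, 50.0, 75.0, 100.0]
print(percentile_ranks(year2))  # identical -- the distribution can't move
```

Somebody is always at the bottom of a ranking, no matter how much everyone learned.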

* Aldeman favors a transitional matrix, judging schools on how many students move from one group to another (say, below basic to basic). This is also a bad idea. Aldeman has elsewhere shown sensitivity to the unintended consequences of some of these policy choices, so I'm not sure how he misses the obvious implications here. A school's best strategy will be to invest its energy in students who are near a threshold and not those for whom there's no real hope of enough improvement.
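The triage incentive can be sketched in a few lines (the cut score and growth numbers here are invented for illustration): with a fixed amount of help to hand out, a school maximizes its matrix score by helping the students nearest the line, not the students furthest behind.

```python
# Hypothetical sketch of the transitional-matrix incentive problem.
# Cut score and growth amounts are invented, not from the report.
BASIC_CUT = 400   # score needed to move from "below basic" to "basic"
BOOST = 20        # points of growth a year of intensive help might buy

students = [310, 325, 385, 390, 395]   # all currently "below basic"

def movers(targeted):
    """Count students who cross the cut if we help only `targeted`."""
    return sum((s + BOOST if s in targeted else s) >= BASIC_CUT
               for s in students)

# Helping the three students nearest the line moves all three up...
print(movers({385, 390, 395}))   # 3
# ...while helping the three furthest behind moves only one.
print(movers({310, 325, 385}))   # 1
```

Under this scoring, the rational school pours resources into the bubble kids and writes off everyone else.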

Creating an Overall Index and Incorporating Subgroup Results

Aldeman wants to use the two indicators we've got so far and average them for an overall index, and this is the score by which we'll "flag" the bottom 5%. These indexes would also be computed for subgroups so that schools can also be flagged for failing to close their achievement gaps.

To be clear, this approach assumes that identifying schools for improvement is an important lever at the state’s disposal. That’s intentional, because there are positive effects associated with the mere act of notifying schools that they need to improve. That’s especially true for accountability systems bearing consequences for schools, but it’s even true in systems relying purely on information and transparency. 

In other words, threats work. At least, they work on raising test scores (and he's got some research from reformster research rock star Eric Hanushek to back it up). This is a deeply irresponsible policy idea, ignoring completely the question of what schools give up and get rid of in order to raise their test scores. Cutting recess, phys ed, art, music, etc. In my own district I have seen schools strip student schedules so that middle school students with low test scores spend their entire day in English and math class, with no history, art, science or other non-tested subjects.

This is the test-centered school at its worst. This is a lousy idea.

Incorporating Other Measures of School Success Into Final School Ratings

Here Aldeman brings out the English model of school inspections, in which trained and experienced educators visit the school for an extended inspection, both detailed and holistic, of how the school works, how well it ticks, how well it serves students, and how well it matches the notion of what a good school should be.

This is a good idea.

Though I can imagine that for schools that have been "flagged" because of test scores, the inspection visit might be a bit harrowing.

I would offer one editing suggestion to Aldeman for his system. Keep the school inspection system and get rid of everything else.

Yes, yes, ESSA has kept us beholden to the BS Testing system. But any sensible, realistic, useful accountability system is going to shrink the use of the BS Test down to the absolute minimum the feds will let the state get away with. Making the test scores the foundation of the rest of the accountability is the absolute wrong way to go.


Aldeman notes that ESSA somehow focuses less attention on punishing "failing" schools than on actually helping them, which, maybe, depending on how you read it. It would be worth it for the feds and states to back away from that, since they have shown absolutely no aptitude for turning around failing schools.

There is one other huge hole in Aldeman's plan, and that is the space where we should find the voice of the community in which the school is located. He has dodged one of the big accountability questions, which is this-- if the community in which a school is located is happy with their school, exactly what reason is there for the state and federal bureaucrats to get involved? I remain puzzled that right-leaning policy folks remain so uninterested in local control of schools.
