The Brown Center on Education Policy at Brookings (the folks who remind us that teachers are everything wrong with education) has released a new report, "Evaluating Teachers with Classroom Observations," a study intended to Tell Us Some Things about teacher evaluation and how to do it best. Is this going to be a trip to the unicorn farm? The very first sentences tell us where this report's heart is:
The evidence is clear: better teachers improve student outcomes, ranging from test scores to college attendance rates to career earnings. Federal policy has begun to catch up with these findings in its recent shift from an effort to ensure that all teachers have traditional credentials to policies intended to incentivize states to evaluate and retain teachers based on their classroom performance.
We are off once more to search for ways to perfect the teacher evaluation system. Grover J. Whitehurst, Matthew M. Chingos, and Katharine M. Lindquist have laid down twenty-seven serious pages of unicorn farming. Let me do my best to take you on a condensed tour.
Focus on the Human Observation
Their big take-away is this: "Nearly all the opportunities for improvement to teacher evaluation systems are in the area of classroom observations rather than test score gains." In other words, the VAM side of evaluations is as good as it can be, but that pesky human-observing-human piece needs to be tightened up. Yikes.
You see, only some teachers are evaluated on test score gains, but all teachers are observed. And here's one thing they get right-- the human observation can provide feedback that's actually good for something, while test results are too late and too vague to be of any use to teachers at all.
But that leads us to this curious thought: Improvements are needed in how classroom observations are measured if they are to carry the weight they are assigned in teacher evaluation. Human observation needs to be measured in a more sciency way. Their big support for this is the finding that teachers with top students tend to get top observation scores. Their reasoning makes sense-- Danielson, for instance, wants you to show off your teaching of higher-order questioning skills. Would you rather do that with your Honors class, or the class where you're hoping the students just remember what you covered yesterday?
The solution? Make human observations more like VAM. The authors suggest that the same sort of demographic factoring adjustments that are used for VAMs should be used for human observation. And if that strikes you as a lousy idea-- well, it only gets better.
History of Bad Evaluation
The authors run down the history of teacher quality pursuits. NCLB defined "highly qualified" as "possessing certain qualifications," but then researchers figured out how to attach numbers to teacher quality and that made things better because, science. Recap of some of the iffy research claiming that a good second grade teacher will help you grow up to be rich. This has laid groundwork for new, federally-approved-and-pushed-but-not-actually-mandated-because-hey-that-would-be-illegal eval systems. Which can still allow for great variety between school districts, and as we all know, variety is bad juju.
So they decided to go study four districts to see if they could find unicorns there.
FINDINGS OF THE STUDY
1) Evaluation systems are sufficiently reliable and valid to be swell. There is strong year-to-year correlation between scores. They are just as reliable as (I am not making this up) systems used to predict season-to-season performance in professional sports.
I am not a statistics guy, but I have to note that the study drew on "one to three years of data from each district drawn from one or more of the years from 2009 to 2012." Am I crazy, or does that not seem like very much data with which to determine year-to-year consistency?
2) Only some teachers are evaluated by VAM. So none of these four districts were in a location where the art teacher gets credit for third grade math scores.
3) Observation scores are more stable from year to year than VAM. Don't get excited-- that's a bad thing, apparently. The fact that your administrator knows you and your work gives him a preconceived notion of how effective you are. So a long-standing relationship with a boss who knows you and your work is not helpful-- it's just a bias.
They have absolutely no answer for a VAM-to-observation ratio in evals, but they recommend that properly handled observations count for at least 50%.
4) School VAM scores throw things out of whack. Good school VAMs hide bad teachers; bad school VAMs hurt good teachers. These should be scrapped or minimized.
5) Better students = better observation ratings. I can think of a zillion reasons for this, but I don't think many teachers disagree. "Please come observe me when I'm teaching my lowest class of the day," said no teacher ever. Then follows several pages of charts and numerical wonkery to reach the conclusion I mentioned above-- observations should be subjected to the same kind of demographic adjustical jim-crackery that goes into VAMs.
6) That kind of adjustment calls for large sample sizes. Which means getting that data-laden legerdemain on a state level. There are charts and graphs here as well.
7) Outside observers are more predictive of next year's VAM scores than inside ones. Principals are influenced by what they know. What's called for is an outside observer who doesn't know anything. Well, not anything except how to observe characteristics that are predictive of VAM scores. This produces the most hilarious recommendation of all-- two-to-three annual classroom observations of each teacher. Before principals decide to go hide in an ashram, note that at least one of these should be conducted by a know-nothing outsider.
There are certainly Bad Principal situations where some relief from bias would be a Good Thing. But if we are accepting the premise that a principal's knowledge and understanding of her staff is somehow an obstacle to be avoided, we are approaching again the reformy place where human interactions are bad for education and the people who work in public education are all dopes. This isn't a trip to the unicorn farm; it's a trip to the robot unicorn factory. Where money trees grow.
A new generation of teacher evaluation systems seeks to make performance measurement and feedback more rigorous and useful.
Could be worse. They could have brought up grit. But we're going to wind up by reminding everyone that even though variations in a system may be useful in that they offer the chance to study lots of variables in action, mostly they are bad because, chaos.
Their final paragraph starts with this sentence:
A prime motive behind the move towards meaningful teacher evaluation is to assure greater equity in students’ access to good teachers.
Also, a bicycle, because a vest has no sleeves. Equal access to great teachers may be the stated motivation for the move toward "meaningful" (a meaningless word in this context) teacher evaluation, but what is still missing is the slimmest shred, the slightest sliver, the most shrunken soupçon of proof that a teacher evaluation system would take us one step closer to that goal. Hell, we haven't even proven that "equal access to great teachers" doesn't exist right now! For all we know, we may be following thinky tanks on these ridiculous field trips to the unicorn farm while actual unicorns are back home, grazing in our front yard.