
Friday, August 29, 2014

Bellwether Flubs Teacher Evaluation Argument

I am fascinated by the concept of think tank papers, because they are so fancy in presentation, but so fanceless in content. I mean, heck-- all I need to do is give myself a slick name and put any one of these blog posts into a fancy PDF format with some professional-looking graphic swoops, and I would be releasing a paper every day.

Bellwether Education, a thinky tank with connections to the standards-loving side of the conservative reformster world, has just released a paper on the state of teacher evaluation in the US. "Teacher Evaluation in an Era of Rapid Change: From 'Unsatisfactory' to 'Needs Improvement.'" (Ha! I see what you did there.) Will you be surprised to discover that the research was funded by the Bill and Melinda Gates Foundation?

In this paper, Chad Aldeman (Associate Partner on the Policy and Thought Leadership Team-- his mom must be proud) and Carolyn Chuong (Analyst-- keep plugging, kiddo) lay out what they see as current trends, which they evaluate against what they think the trends should be. So, see? A smattering of factish information all filtered through a set of personal ideas about how education should be going-- just like me! Let's see what they came up with.

The Widget Effect

Oh, this damn thing. You can go back and read the original TNTP paper, which was earthshattering and galvanized governments to leap up and start using a new piece of jargon. Just in case you missed it, the whole point was that school systems should not treat teachers as if they are interchangeable widgets, but instead should treat them as interchangeable widgets, some of which do widgetty things better than others. In other words, under this approach, all teachers are still widgets in a big machine; it's just that some widgets are better than others. But this theoretical thought-leadery framework is still influential today in the sense that it influenced this paper that I'm reading and you are reading about.

So what did Aldeman and Chuong find? Five things, it turns out. Here they are.

1) Districts are starting to evaluate teachers as individuals.

The "most dramatic finding" in The Widget Effect was that school districts were using binary pass/fail ratings. Now states are moving toward four- or five-tiered systems. Woot!

Some people, apparently, quibble because the new systems still place only a small percentage of teachers in the suck zone, and for many reformsters, a teacher eval system is only good if it finds the gazillions of bad teachers that reformsters just know are out there. But Aldeman and Chuong say that criticism misses two points.

First, they say, don't look at the percent-- look at the number. See how high that number is? That's lots of bad teachers, isn't it? Also, they cite the New York report about tenure rule changes. They think the research says that if you're a bad teacher and your administration says so, you might leave. I think the research also says that if you're a good teacher and your boss gives you a bad evaluation, you might think twice about wanting to work for that boss. But here, as throughout, we will see that the question "Is the evaluation accurate?" never appears on the radar.

Second, did we mention there are more than two categories? And the categories are named with words, and the words are very descriptive. That allows us to give targeted support, which we totally could never do under the old system, because-- I don't know. Principals are dopes and the evaluation rating is the one and only source of data they have about a teacher's job performance?

2) Schools are providing teachers with better, timelier feedback on their practice.

There's no question that this is a need. Traditional evaluations in many states involved getting a quick score sheet as part of a teacher's end-of-the-year check-out process. Not exactly useful in terms of improving practices.

But in this section the writers come close to acknowledging the central problem-- the ineffectiveness of the actual evaluation. They note that research shows that teachers with higher-functioning students tend to get better evaluations.

However, they correctly note that new evaluation techniques encourage a more thorough and useful dialogue between the teacher and the administrator. But, of course, the new evaluation systems are based on the same old one true (and only) requirement-- certain paperwork must be filled out. The new models put huge time requirements on principals who still have a school to run, and the pressure to see the letter of the paperwork law met while trampling its spirit is intense. We'll see how that actually works out.

3) Districts still don't factor student growth into teacher evals

Here we find the technocrat blind faith in data rearing its eyeless head again:

While raw student achievement metrics are biased—in favor of students from privileged backgrounds with more educational resources—student growth measures adjust for these incoming characteristics by focusing only on knowledge acquired over the course of a school year.

This is a nice, and inaccurate, way to describe VAM, a statistical tool that has now been discredited more times than Donald Trump's political acumen. But some folks still insist that if we take very narrow standardized test results and run them through an incoherent number-crunching process, the numbers we end up with represent useful objective data. They don't. We start with standardized tests, which are not objective, and run them through various inaccurate variable-adjusting programs (which are not objective), and come up with a number that is crap. The authors note that there are three types of pushback against using said crap.
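For the curious, here is roughly what that number-crunching boils down to, stripped to its bones. This is a minimal sketch in Python-- not Bellwether's model or any state's actual formula, and every name and number in it is invented-- but it shows the basic move: predict each kid's score from last year's score, call the leftover "growth," and average those leftovers by teacher.

import numpy as np

# A toy value-added-style calculation. NOT any state's real formula;
# all data here is randomly invented for illustration.
rng = np.random.default_rng(0)

# Prior-year and end-of-year scores for 12 hypothetical students,
# split between two hypothetical teachers.
prior = rng.normal(500, 50, size=12)
current = prior + rng.normal(10, 30, size=12)
teacher = np.array(["A"] * 6 + ["B"] * 6)

# Step 1: fit a line predicting this year's score from last year's.
slope, intercept = np.polyfit(prior, current, 1)
predicted = slope * prior + intercept

# Step 2: "growth" is the residual-- actual score minus predicted score.
growth = current - predicted

# Step 3: a teacher's "value added" is the average residual of her students.
for t in ("A", "B"):
    print("Teacher", t, "value added:", round(float(growth[teacher == t].mean()), 1))

Note that every step there is a modeling choice-- which variables to adjust for, what shape the prediction takes, how to average-- and real VAMs just pile on more such choices. That is exactly why the output is not "objective."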

Refuse. California has been requiring some version of this for decades, and many districts, including some of the biggest, simply refuse to do it.

Delay. A time-honored technique in education, known as Wait This New Foolishness Out Until It Is Replaced By The Next Silly Thing. It persists because it works so often. 

Obscure. Many districts are using loopholes and slack to find ways to substitute administrative judgment for the Rule of Data. They present Delaware as an example of how futzing around has polluted the process and buttress that with a chart that shows statewide math score growth dropping while teacher eval scores remain the same.

Uniformly high ratings on classroom observations, regardless of how much students learn, suggest a continued disconnect between how much students grow and the effectiveness of their teachers.

Maybe. Or maybe it shows that the data about student growth is not valid.

They also present Florida as an example of similar futzing. This time they note that neighboring districts have different distributions of ratings. This somehow leads them to conclude that administrators aren't properly incorporating student data into evaluations.

In neither state's case do they address the correct way to use math scores to evaluate history and music teachers.

4) Districts have wide discretion

Their point here is simply that people who worry about the state (and federal) government using One Size Fits All to crush local autonomy into oblivion are "premature" in their concern. "Premature" is a great word here, indicating that the total control hasn't happened yet-- it's just going to happen later.

5) Districts continue to ignore performance when making decisions about teachers

Let me be clear. I used the heading of this section exactly as Aldeman and Chuong wrote it, because it so completely captures a blind spot in this brand of reformster thought.

Look at that again, guys. Is that really what you meant to say? Districts completely ignore performance when making decisions about teachers? Administrators say to each other, "Let's make our decisions about staff based on hair color or height or shoe size, but whatever we do, let's not consider any teacher's job performance ever, at all."

No, that would be stupid. What Aldeman and Chuong really mean is that districts continue to ignore the kind of performance measures that Aldeman and Chuong believe they should not ignore. Administrators insist on using their own professional judgment instead of relying on state-issued, VAM-infested, numbly numbery, one-size-measures-all widget wizardry evaluation instruments. Of course districts make decisions about teachers based on job performance; just not the way Aldeman and Chuong want them to.

Also, districts aren't rushing to use these great evaluation tools to install merit pay or to crush FILO. So the authors beat the same old StudentsFirst anti-tenure drum. I have addressed this business at great length here and here and here and here (or you can click on the tenure tag above), but let me do the short version-- you do not retain and recruit great teachers by making their continued pay and employment dependent on an evaluation system that is no more reliable than a blind dart player throwing backhand from a wave-tossed dinghy.

Recommendations

It's not a fancy-pants thinky tank paper until you tell people what you think they should do. So Aldeman and Chuong have some ideas for policymakers.

Track data on various parts of new systems. Because the only thing better than bad data is really large collections of bad data. And nothing says Big Brother like a large centralized data bank.

Investigate with local districts the source of evaluation disparities. Find out whether there are real functional differences, or whether the data just reflect philosophical differences. Then wipe those differences out. "Introducing smart timelines for action, multiple evaluation measures including student growth, requirements for data quality, and a policy to use confidence intervals in the case of student growth measures could all protect districts and educators that set ambitious goals."

Don't quit before the medicine has a chance to work. Aldeman and Chuong are, for instance, cheesed that the USED postponed the use of evaluation data on teachers until 2018, because those evaluations were going to totally work, eventually, somehow.

Don't be afraid to do lots of reformy things at once. It'll be swell.

Their conclusion

Stay the course. Hang tough. Use data to make teacher decisions. Reform fatigue is setting in, but don't be wimps.

My conclusion

I have never doubted for a moment that the teacher evaluation system can be improved. But this nifty paper sidesteps two huge issues.

First, no evaluation system will ever be administrator-proof. Attempting to provide more oversight will actually reduce effectiveness, because more oversight = more paperwork, and more paperwork means that the task shifts from "do the job well" to "fill out the paperwork the right way," which is easy to fake.

Second, the evaluation system only works if the evaluation system actually measures what it purports to measure. The current "new" systems in place across the country do not do that. Linkage to student data is spectacularly weak. We start with tests that claim to measure the full breadth and quality of students' education; they do not. Then we attempt to create a link between those test results and teacher effectiveness, and that simply hasn't happened yet. VAM attempted to hide that problem behind a heavy fog bank, but the fog is lifting, and it is increasingly clear that VAM is hugely invalid.

So, having an argument about how to best make use of teacher evaluation data based on student achievement is like trying to decide which Chicago restaurant to eat supper at when you are still stranded in Tallahassee in a car with no wheels. This is not the cart before the horse. This is the cart before the horse has even been born.




