CURMUDGUCATION: Norms vs. Standards

Saturday, June 4, 2016

Norms vs. Standards

I've found myself trying to explain the difference between norm and standards reference multiple times in the last few weeks, which means it's time to write about it. A lot of people get this distinction-- but a lot of people don't. I'm going to try to do this in plain(ish) English, so those of you who are testing experts, please forgive the lack of correct technical terminology.

A standards-referenced (or criterion-referenced) test is the easiest one to understand and, I am learning, what many, many people think we're talking about when we talk about tests in general and standardized tests in particular.

With standards reference, we can set a solid immovable line between different levels of achievement, and we can do it before the test is even given. This week I'm giving a spelling test consisting of twenty words. Before I even give the test, I can tell my class that if they get eighteen or more correct, they get an A, if they get sixteen correct, they did okay, and if the get thirteen or less correct, they fail.

A drivers license test is also standards-referenced. If I complete the minimum number of driving tasks correctly, I get a license. If I don't, I don't.

One feature of a standards-referenced test is that while we might tend to expect a bell-shaped curve of results (a few failures, a few top scores, and most in the middle), such a curve is not required or enforced. Every student in my class can get an A on the spelling test. Everyone can get a drivers license. With standards referenced testing, clustering is itself a piece of useful data; if all of my students score less than ten on my twenty word test, then something is wrong.

With a standards-referenced test, it should be possible for every test taker to get top marks.

A norm-referenced measure is harder to understand but, unfortunately, far more prevalent these days.

A standards-referenced test compares every student to the standard set by the test giver. A norm-referenced test compares every student to every other student. The lines between different levels of achievement will be set after the test has been taken and corrected. Then the results are laid out, and the lines between levels (cut scores) are set.

When I give my twenty word spelling test, I can't set the grade levels until I correct it. Depending on the results, I may "discover" that an A is anything over a fifteen, twelve is Doing Okay, and anything under nine is failing. Or I may find that twenty is an A, nineteen is okay, and eighteen or less is failing. If you have ever been in a class where grades are curved, you were in a class that used norm referencing.

Other well-known norm referenced tests are the SAT and the IQ test. Norm referencing is why, even in this day and age, you can't just take the SAT on a computer and have your score the instant you click on the final answer-- the SAT folks can't figure out your score until they have collected and crunched all the results. And in the case of the IQ test, 100 is always set to be "normal."

There are several important implications and limitations for norm-referencing. One is that they are lousy for showing growth, or lack thereof. 100 will be and has always been "normal" aka "smack dab in the middle of the results" for IQ tests. Have people gotten smarter or dumber over time? Hard to say-- big time testers like the IQ folks have all sorts of techniques for tying years of results together, but at the end of the day "normal" just means "about in the middle compared to everyone else whose results are in the bucket with mine." With norm referencing, we have no way of knowing whether this particular bucket is overall smarter or dumber than the other buckets. All of our data is about comparing the different fish in the same bucket, and none of it is useful for comparing one bucket to another (and that includes buckets from other years-- as all this implies, norm referencing is not so great at showing growth over time).

Normed referencing also gets us into the Lake Wobegon Effect. Can the human race ever develop and grow to the point that every person has an IQ over 100? No-- because 100 will always be the average normal right-in-the-middle score, and the entire human race cannot be above average (unless that is also accompanied by above-average innumeracy). No matter how smart the human race gets, there will always be people with IQs less that 100.

On a standards-referenced test, it is possible for everyone to get an A. On a normed-referenced test, it is not possible for everyone to get an A. Nobody has to flunk a standards-referenced test. Somebody has to flunk a norm-referenced test.

What are some of the examples we live with in education?

How about "reading on grade level"? At the end of the day, there are only two ways to determine what third grade "grade level" is-- you can either look at all all the third graders you can get data for and say, "Well, it looks like most of them get up to about here" or you can say "I personally believe that all third graders should be able to get to right about here" and just set it yourself based on your own personal opinion.

While lots of people have taken a shot at setting "grade level" in a variety of ways, it boils down to those two techniques, and mostly we rely on the first, which is norm-referencing. Which means that there will always be people who read below grade level-- always. The only way to show the some, more or all students are reading above grade level is to screw around with where we draw the "grade level" line on the big bell curve. But other than doing that kind of cheating with the data analysis, there is no way to get all students reading above grade level. If all third graders can read and comprehend Crime and Punishment, then Crime and Punishment is a third grade reading level book, and the kid in your class who has trouble grasping the full significance of Raskolnikov's dream of the whipped mare as a symbol of gratification through punishment and abasement is the third grader who gets a D on her paper.

And of course, there are the federally mandated Big Standardized Tests, the PARCC, SBA, PSSA, WTF, MOUSE, ETC or whatever else you're taking in your state.

First, understand why the feds and test fanatics wanted so badly for pretty much the same test to be given everywhere and for every last student to take it. Think back to our buckets of fish. Remember, with norming we can only make comparisons between the fish in the same bucket, so the idea was that we would have a nation-sized bucket with every single fish in it. Now, sadly, we have about forty buckets, and only some of them have a full sampling of fish. The more buckets and the fewer fish, the less meaningful our comparisons.

The samples are still big enough to generate a pretty reliably bell-shaped curve, but then we get our next problem, which is figuring out where on that bell curve to draw the cut score, the line that says, "Oh yeah, everyone above this score is super-duper, and everyone below it is not." This process turns out (shocker) to be political as all get out (here's an example of how it works in PA) because it's a norm-referenced test and that means somebody has to flunk and some bunch of bureaucrats and testocrats have to figure out how many flunkers there are going to be.

There are other norm-referencing questions floating out there. The SAT bucket has always included all the fish intending to go to college-- what will happen to the comparisons if the bucket contains all the fish, including the non-college-bound ones? Does that mean that students who used to be the bottom of the pack will now be lifted to the middle?

This is also why using the SAT or the PARCC as a graduation exam is nuts-- because that absolutely guarantees that a certain number of high school seniors will not get diplomas, because these are norm-referenced tests and somebody has to land on the bottom. And that means that some bureaucrats and testocrats are going to sit in a room and decide how many students don't get to graduate this year.

It's also worth remembering that norm referencing is lousy at providing an actual measure of how good or bad students are at something. As followers of high school sports well know, "champion of our division" doesn't mean much if you don't know anything about the division. Saying that Pat was the tallest kid in class doesn't tell us much about how tall Pat actually is. And with these normed measures, you have no way of knowing if the team is better than championship teams from other years, or if Pat is taller or shorter than last year's tallest kid in class.

Norm referencing, in short, is great if you want to sort students, teachers and schools into winners and losers. It is lousy if you want to know how well students, teachers and schools are actually doing. Ed reform has placed most of its bets on norm referencing, and that in itself tells us a lot about what reformsters are really interested in. That is not a very useful bucket of fish.

44 comments:

JayneJune 4, 2016 at 9:07 AM
Thanks for your explanation.
ReplyDelete
Replies
NY TeacherJune 4, 2016 at 10:01 AM
If the Common Core standards and companion PARRC and SBAC assessments really lived up to the "college and career readiness" claim, then why would cut scores be set post-facto? If the founders knew exactly which specific skills were required for college success, shouldn't their tests be criterion referenced? The fact that they are not completely debunks their bogus and unsubstantiated claim!
ReplyDelete
Replies
Jennifer Borgioli BinisJune 4, 2016 at 12:20 PM
A few points of clarification - any given test can be criterion-referenced or norm-referenced. It depends on what you want to know and how you handle the scores.

Give your spelling test to 100 8th graders across PA. Find the average score. Give the test to your class of 8th graders. Give each student a score in terms of how they scored compared to the normed group (the large group of 8th graders). That's NR.

Look at your spelling test. Determine at what point your expectations (or standards) will be met (i.e. 80% correct.) Give your test. Give each a student a score letting them know how they did in relation to your expectations. That's CR.

Terms like pass or fail muddy the waters a bit. Some standards-referenced tests are pass/fail. You can pass or fail the NYS Regents exams but students do not pass or fail the NYS 3-8 tests. Their scores are reported in terms of proximity to the state's expectations. Both tests are standards-referenced. Both have a different scale with different implications based on where a child's individual score falls. With (as far as I know) the exception of only Illinois, all state tests developed for NCLB/ESSA purposes are CR, including PARCC.

I totally agree using the SAT for graduation criteria is deeply flawed for a whole variety of reasons, especially as it relates to standards-alignment. However, using PARCC for a graduation criteria has a different set of flaws but it's comparable to the NYS Regents' exams.

Finally, I would offer that NRT isn't inherently a problem. The vast majority of tests used to identify learning disabilities(and therefore ensure they get special education services) use an NRT design. This allows the Committee on Special Education to have a sense as to how the child performs as compared to their peers.

That said, all tests are flawed. Which is why it's incredible bad practice to make one test a determining factor for a child's future. Doing so isn't about NR or CR. That's bad policy.
ReplyDelete
Replies
AnonymousJune 4, 2016 at 12:45 PM
While posts about the common core have often made claims about it requiring age inappropriate content, it has never been clear how "age inappropriateness" is established.

Presumably it is a norm referenced standard, but that will likely mean that there are a significant percentage of students from both tails of the distribution for whom age appropriate content is actually inappropriate. Does anyone have any idea about the criteria for determining when something is and is not age appropriate?
ReplyDelete
Replies
FalstaffJune 4, 2016 at 3:59 PM
An engineering professor I know gives a 5 question quiz in Discrete Math every day. 5 attempted and right is an A+. 5 attempted and 1 wrong is 80%, B-. 4 attempted and right is still 80%. 5 attempted and 2 wrong is 60%. 3 attempted and right is still 60%. And so on. That's CR, right?
ReplyDelete
Replies
NY TeacherJune 4, 2016 at 4:21 PM
JBB, Consultant at LCI
She facilitates and supports assessment design programs that wrestle with the messiness of learning and capturing it in ways that are meaningful for students and teachers.Amused, entertained, and intrigued by the question, “What are the implications of reducing learning to a number?” Jennifer can be found arguing the merits of rubrics and large-scale, high-quality assessment on Twitter as JennLCI.

Interesting bio Jennifer.

How can you possibly support the use of MC items for testing SUBJECTIVE skills? This of course is the essence of PARRC and SBAC testing in Common Core ELA. The MC items from these tests break every rule in the book!

ReplyDelete
Replies
DukeJune 4, 2016 at 6:27 PM
A big, important topic, Peter. Thanks for addressing it.

Some of my thoughts:

http://jerseyjazzman.blogspot.com/2015/05/standardized-tests-symptoms-not-causes.html
ReplyDelete
Replies
gadfly1974June 5, 2016 at 7:32 AM
I want to thank everyone for this important conversation.

Full disclosure: my previous school district contracted with Jenn when NY started rolling out NCLB testing, and she helped me understand what you can and can't learn from Big Standardized testing reports. Thank you, Jenn!

I have a question for Jenn and Peter.

When did our education system begin implementing cut scores separate from teacher judgment?

I grew up in the Boston area, so I remember going on field trips to area living museums and learning about apprenticeship relationships.

A kid would ask the guy working on making a horseshoe, "Did you go to school to learn to become a blacksmith?" and he would laugh and say, "What's a school?" and then describe the process of a student learning his craft from a master craftsman.

The kids would all groan and murmur about how lucky he was.

I wonder:

1) What do report cards look like around the world? How do various countries define success?

2) Is there an academic history not just of Big Standardized Testing, but of educational feedback? I grew up in Franklin, Mass., where Horace Mann began his work to build a public education system. I can't imagine that the report card I got from Oak Street Elementary School, a mile or two down the road from Mann's birthplace, looks at all like the feedback a kid would have received when the system was brand new.

3) More philosophically, I wonder if both normed and criterion-referenced tests assume a Platonic Ideal beyond the direct experience of any individual student. In this sense, arguing for the virtues of NR over CR tests misses a deeper point, which is that students in the exclusive private and residential academy systems receive individualized attention and narrative feedback instead of grades based on cut scores.

I guess my thesis is that when my classmates and I groaned about how lucky the blacksmith was, our motives many not have been pure, but our reactions were reasonable.

Everyone has a deep desire to be seen as an individual rather than as a participant in a system based on comparison to an abstract standard or our peers.

If that formal history of educational feedback doesn't exist, I may have an excuse to go back to school and pursue a PhD. Seriously!
ReplyDelete
Replies
Jennifer Borgioli BinisJune 5, 2016 at 10:14 AM
At the risk of making a crass self-promotion... my podcast Ed History 101 seeks to answer those very questions. The two episodes that get to your questions are the ones on bells in schools (the rise of the scientific movement in education in the 1910's) and the history of the NYS Regents (which includes teacher feedback.)

Our first episode gets at what you and your friends picked up with that blacksmith - our education system has never really been about the individual. At various points its been about educating boys to be better men, then educating children to become better Americans, then workers, then thinkers and doers. And ever forward.

One of my favorite quotes on the issue: In our society, that we provide common public schooling is inherently a compromise – We must therefore strive continually to find a creative balance between local and central direction, between diversity and standards, between liberty and equality.
ReplyDelete
Replies
gadfly1974June 5, 2016 at 11:32 AM
I'll check out your podcast, Jenn. Thanks!
ReplyDelete
Replies
Dr. Ed FullerJune 5, 2016 at 2:04 PM
So, one interesting variation is to use a baseline norm-referenced test and then select a cut score based on the distribution of scores. Then hold that cut score constant over tie such that, theoretically, all kids could pass at some point x years after the baseline. New Mexico uses this (i think--at least from what I can gather from their hard to understand acct materials). Texas essentially ran the same approach--having teachers pick a cut score and then holding the cut score constant over time as the distribution of scores changed from normal to skewed to the left. In fact, when looking at the distribution of all students passing all tests at the school level from 2003 to 2011, this is exactly what happened. New Mexico claims to do this with their value-added scores as well, but I dont really understand how they actually accomplish that.
ReplyDelete
Replies
Peter GreeneJune 5, 2016 at 3:21 PM
My apologies to some of you-- as you can see, commenting has been... um... brisk and spirited on this post, and since I've been working graduation all day, I violated one of my cardinal rules and tried to modulate comments via my phone. The reason I have a cardinal rule not to do that is because once my big fat fingers meet the teeny tiny phone buttons comments end up deleted when I meant to approve them.

I think I'm caught up-- so if you posted something trenchant that didn't make it here, I probably accidentally deleted it. Feel free to give it another shot if you're so inclined.
ReplyDelete
Replies
UnknownJune 9, 2016 at 11:11 AM
Thanks for the good work. Reposted on notjustaparent.com
ReplyDelete
Replies

Add comment

Pages

Saturday, June 4, 2016

Norms vs. Standards

44 comments: