Saturday, June 4, 2016

Norms vs. Standards

I've found myself trying to explain the difference between norm-referenced and standards-referenced testing multiple times in the last few weeks, which means it's time to write about it. A lot of people get this distinction-- but a lot of people don't. I'm going to try to do this in plain(ish) English, so those of you who are testing experts, please forgive the lack of correct technical terminology.

A standards-referenced (or criterion-referenced) test is the easiest one to understand and, I am learning, what many, many people think we're talking about when we talk about tests in general and standardized tests in particular.

With standards referencing, we can set a solid, immovable line between different levels of achievement, and we can do it before the test is even given. This week I'm giving a spelling test consisting of twenty words. Before I even give the test, I can tell my class that if they get eighteen or more correct, they get an A, if they get sixteen correct, they did okay, and if they get thirteen or fewer correct, they fail.

A driver's license test is also standards-referenced. If I complete the minimum number of driving tasks correctly, I get a license. If I don't, I don't.


One feature of a standards-referenced test is that while we might tend to expect a bell-shaped curve of results (a few failures, a few top scores, and most in the middle), such a curve is not required or enforced. Every student in my class can get an A on the spelling test. Everyone can get a driver's license. With standards-referenced testing, clustering is itself a piece of useful data; if all of my students score less than ten on my twenty-word test, then something is wrong.

With a standards-referenced test, it should be possible for every test taker to get top marks.
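
To make that concrete, here is a bare-bones sketch in Python (the scores and tier labels are invented for illustration) of how standards-referenced grading works: the cut lines exist before anyone sharpens a pencil, and each student is measured against those lines rather than against classmates.

    # Standards-referenced (criterion-referenced) grading: the cut scores
    # are fixed *before* the test is ever given.
    CUT_SCORES = [(18, "A"), (16, "Okay"), (14, "Shaky")]  # out of twenty words

    def grade(words_correct):
        """Compare one student to the fixed standard, not to the other students."""
        for minimum, label in CUT_SCORES:
            if words_correct >= minimum:
                return label
        return "Fail"  # thirteen or fewer correct

    # Invented class results -- note that every single student CAN earn the A.
    for name, score in {"Ava": 20, "Ben": 19, "Cam": 18, "Dee": 18}.items():
        print(name, grade(score))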

A norm-referenced measure is harder to understand but, unfortunately, far more prevalent these days.

A standards-referenced test compares every student to the standard set by the test giver. A norm-referenced test compares every student to every other student. The lines between different levels of achievement will be set after the test has been taken and corrected. Then the results are laid out, and the lines between levels (cut scores) are set.

When I give my twenty-word spelling test, I can't set the grade levels until I correct it. Depending on the results, I may "discover" that an A is anything over fifteen, twelve is Doing Okay, and anything under nine is failing. Or I may find that twenty is an A, nineteen is okay, and eighteen or fewer is failing. If you have ever been in a class where grades are curved, you were in a class that used norm referencing.
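
For the curious, here is the same kind of rough sketch (Python again, invented scores) of what curve-style norm referencing does: it waits for the whole pile of results and then "discovers" the grade lines from wherever the pile happens to sit. The particular cutoffs here are just one arbitrary way to carve up the curve.

    # Norm-referenced ("curved") grading: the cut scores don't exist until
    # all the results are in, because students are compared to each other.
    import statistics

    results = [20, 19, 17, 16, 15, 14, 14, 13, 12, 9]   # invented class results

    def curved_cuts(all_scores):
        """Discover the grade lines from the distribution itself."""
        ranked = sorted(all_scores, reverse=True)
        n = len(ranked)
        a_cut = ranked[max(0, n // 5 - 1)]            # roughly the top fifth get an A
        okay_cut = statistics.median(ranked)          # the middle of the pack is "okay"
        fail_cut = ranked[min(n - 1, (4 * n) // 5)]   # roughly the bottom fifth fail
        return a_cut, okay_cut, fail_cut

    a_cut, okay_cut, fail_cut = curved_cuts(results)
    print(f"A is {a_cut}+, okay is {okay_cut}+, below {fail_cut} fails")
    # Hand the same test to a stronger class and the same raw score drops a grade.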

Other well-known norm-referenced tests are the SAT and the IQ test. Norm referencing is why, even in this day and age, you can't just take the SAT on a computer and have your score the instant you click on the final answer-- the SAT folks can't figure out your score until they have collected and crunched all the results. And in the case of the IQ test, 100 is always set to be "normal."

There are several important implications and limitations of norm referencing. One is that normed measures are lousy for showing growth, or lack thereof. 100 always has been, and always will be, "normal," aka "smack dab in the middle of the results," for IQ tests. Have people gotten smarter or dumber over time? Hard to say-- big-time testers like the IQ folks have all sorts of techniques for tying years of results together, but at the end of the day "normal" just means "about in the middle compared to everyone else whose results are in the bucket with mine." With norm referencing, we have no way of knowing whether this particular bucket is overall smarter or dumber than the other buckets. All of our data is about comparing the different fish in the same bucket, and none of it is useful for comparing one bucket to another-- and that includes buckets from other years, which is exactly why norm referencing is no good at showing growth over time.
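
If it helps to see the arithmetic, here is a tiny sketch (Python, invented raw scores) of the kind of rescaling a normed measure does. Whatever the bucket of test takers looks like, the average of the bucket gets pinned to 100, which is exactly why the scale can't show the bucket getting smarter.

    # Why a normed scale can't show growth: raw scores are rescaled so that
    # the average of *this* bucket of test takers always lands at 100.
    import statistics

    def to_iq_scale(raw_scores):
        """Rescale raw scores to a mean of 100 and a spread of 15 (the IQ convention)."""
        mean = statistics.mean(raw_scores)
        spread = statistics.pstdev(raw_scores)
        return [100 + 15 * (x - mean) / spread for x in raw_scores]

    cohort_then = [40, 50, 60, 70, 80]    # invented raw scores from one year
    cohort_now = [60, 70, 80, 90, 100]    # a later bucket that answers more items correctly

    print(round(statistics.mean(to_iq_scale(cohort_then))))   # 100
    print(round(statistics.mean(to_iq_scale(cohort_now))))    # still 100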

Norm referencing also gets us into the Lake Wobegon Effect. Can the human race ever develop and grow to the point that every person has an IQ over 100? No-- because 100 will always be the average, normal, right-in-the-middle score, and the entire human race cannot be above average (unless that is also accompanied by above-average innumeracy). No matter how smart the human race gets, there will always be people with IQs less than 100.

On a standards-referenced test, it is possible for everyone to get an A. On a norm-referenced test, it is not possible for everyone to get an A. Nobody has to flunk a standards-referenced test. Somebody has to flunk a norm-referenced test.

What are some of the examples we live with in education?

How about "reading on grade level"? At the end of the day, there are only two ways to determine what third grade "grade level" is-- you can either look at all the third graders you can get data for and say, "Well, it looks like most of them get up to about here," or you can say, "I personally believe that all third graders should be able to get to right about here," and just set it yourself based on your own personal opinion.

While lots of people have taken a shot at setting "grade level" in a variety of ways, it boils down to those two techniques, and mostly we rely on the first, which is norm referencing. Which means that there will always be people who read below grade level-- always. The only way to show that some, more, or all students are reading above grade level is to screw around with where we draw the "grade level" line on the big bell curve. But other than doing that kind of cheating with the data analysis, there is no way to get all students reading above grade level. If all third graders can read and comprehend Crime and Punishment, then Crime and Punishment is a third grade reading level book, and the kid in your class who has trouble grasping the full significance of Raskolnikov's dream of the whipped mare as a symbol of gratification through punishment and abasement is the third grader who gets a D on her paper.

And of course, there are the federally mandated Big Standardized Tests, the PARCC, SBA, PSSA, WTF, MOUSE, ETC or whatever else you're taking in your state.

First, understand why the feds and test fanatics wanted so badly for pretty much the same test to be given everywhere and for every last student to take it. Think back to our buckets of fish. Remember, with norming we can only make comparisons between the fish in the same bucket, so the idea was that we would have a nation-sized bucket with every single fish in it. Now, sadly, we have about forty buckets, and only some of them have a full sampling of fish. The more buckets and the fewer fish, the less meaningful our comparisons.

The samples are still big enough to generate a pretty reliably bell-shaped curve, but then we get our next problem, which is figuring out where on that bell curve to draw the cut score, the line that says, "Oh yeah, everyone above this score is super-duper, and everyone below it is not." This process turns out (shocker) to be political as all get out (here's an example of how it works in PA) because it's a norm-referenced test and that means somebody has to flunk and some bunch of bureaucrats and testocrats have to figure out how many flunkers there are going to be.
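
To see how tightly the cut score and the flunk rate are chained together, here is one more quick sketch (Python, with an invented score distribution): on a bell-shaped pile of scores, picking the cut score is picking the failure rate, and picking a failure rate picks the cut score.

    # On a bell-shaped distribution, choosing the cut score IS choosing
    # the failure rate -- and choosing a failure rate picks the cut score.
    from statistics import NormalDist

    scores = NormalDist(mu=750, sigma=100)   # invented scale-score distribution

    cut = 700
    print(f"Cut at {cut}: about {scores.cdf(cut):.0%} of test takers fall below it")

    target_fail_rate = 0.40
    print(f"Want {target_fail_rate:.0%} below the line? Set the cut at about "
          f"{scores.inv_cdf(target_fail_rate):.0f}")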

There are other norm-referencing questions floating out there. The SAT bucket has always included all the fish intending to go to college-- what will happen to the comparisons if the bucket contains all the fish, including the non-college-bound ones? Does that mean that students who used to be the bottom of the pack will now be lifted to the middle?

This is also why using the SAT or the PARCC as a graduation exam is nuts-- because that absolutely guarantees that a certain number of high school seniors will not get diplomas, because these are norm-referenced tests and somebody has to land on the bottom. And that means that some bureaucrats and testocrats are going to sit in a room and decide how many students don't get to graduate this year. 

It's also worth remembering that norm referencing is lousy at providing an actual measure of how good or bad students are at something. As followers of high school sports well know, "champion of our division" doesn't mean much if you don't know anything about the division. Saying that Pat was the tallest kid in class doesn't tell us much about how tall Pat actually is. And with these normed measures, you have no way of knowing if the team is better than championship teams from other years, or if Pat is taller or shorter than last year's tallest kid in class.

Norm referencing, in short, is great if you want to sort students, teachers and schools into winners and losers. It is lousy if you want to know how well students, teachers and schools are actually doing. Ed reform has placed most of its bets on norm referencing, and that in itself tells us a lot about what reformsters are really interested in. That is not a very useful bucket of fish.


44 comments:

  1. Thanks for your explanation.

  2. If the Common Core standards and companion PARCC and SBAC assessments really lived up to the "college and career readiness" claim, then why would cut scores be set after the fact? If the founders knew exactly which specific skills were required for college success, shouldn't their tests be criterion-referenced? The fact that they are not completely debunks their bogus and unsubstantiated claim!

    Replies
    1. PARCC and SBAC are criterion-referenced. What happens after the test is given is a process of figuring out how raw scores relate to the cut scores, which are set on a different scale. The process is called "bookmarking" and involves classroom teachers and content-area experts.

    2. I really don't care what you want to call the PARCC and SBAC assessments, or what made-up term you use to identify the process, nor do I care that you pretend to involve real teachers. What you are describing is nothing short of using voodoo dolls. What I do know is that not one of your magicians, I mean psychometricians, can tell me what % of the total possible raw score points on the grade "X" PARCC or SBAC ELA test, in advance of administration, accurately indicates that the test taker is college or career ready. End of story.
      Thanks for trying to convince us that your snake oil really does work.

    3. I can offer two things.
      1. I'm happy to point you towards texts that explain why the raw-to-scale conversion is done afterward, if you'd like.
      2. It's not my snake oil. I'm all about performance-based assessment and portfolios. That said, in order to understand designing that stuff (which is REALLY hard), it helps to understand multiple choice test design (which is hard). My $.02: https://jennbbinis.com/2016/05/18/vegetarian-butchers-unite/

  3. A few points of clarification - any given test can be criterion-referenced or norm-referenced. It depends on what you want to know and how you handle the scores.

    Give your spelling test to 100 8th graders across PA. Find the average score. Give the test to your class of 8th graders. Give each student a score in terms of how they scored compared to the normed group (the large group of 8th graders). That's NR.

    Look at your spelling test. Determine at what point your expectations (or standards) will be met (e.g., 80% correct). Give your test. Give each student a score letting them know how they did in relation to your expectations. That's CR.
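
    To make the contrast concrete, here's a rough sketch (Python, every number invented) of one raw score on that same spelling test reported both ways:

        # One raw score on the same twenty-word spelling test, reported two ways.
        from statistics import NormalDist

        raw = 16                              # one student's words correct

        # NR: compare the student to the normed group, summarized here
        # (purely for illustration) as a bell curve of 8th-grade raw scores.
        norm_group = NormalDist(mu=14, sigma=3)
        print(f"NR report: percentile rank of about {norm_group.cdf(raw):.0%}")

        # CR: compare the student to a fixed expectation (80% correct).
        criterion = 0.80 * 20
        print("CR report:", "met the standard" if raw >= criterion else "did not meet the standard")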

    Terms like pass or fail muddy the waters a bit. Some standards-referenced tests are pass/fail. You can pass or fail the NYS Regents exams but students do not pass or fail the NYS 3-8 tests. Their scores are reported in terms of proximity to the state's expectations. Both tests are standards-referenced. Both have a different scale with different implications based on where a child's individual score falls. With (as far as I know) the exception of only Illinois, all state tests developed for NCLB/ESSA purposes are CR, including PARCC.

    I totally agree that using the SAT as a graduation criterion is deeply flawed for a whole variety of reasons, especially as it relates to standards alignment. However, using PARCC as a graduation criterion has a different set of flaws, but it's comparable to the NYS Regents exams.

    Finally, I would offer that NRT isn't inherently a problem. The vast majority of tests used to identify learning disabilities (and therefore ensure students get special education services) use an NRT design. This allows the Committee on Special Education to have a sense as to how the child performs as compared to their peers.

    That said, all tests are flawed. Which is why it's incredibly bad practice to make one test a determining factor for a child's future. Doing so isn't about NR or CR. That's bad policy.

    Replies
    1. Yes, you can pass or fail any test, CR or NR. But only on a CR test is it possible for everyone to pass.

      We've had the conversation about "pass" and "fail" before, and I know you don't like to use the terms. I get why, but I really wish the testing industry would use the terms, if for no other reason than to force themselves to face the reality of how their products are used. Sometimes I think test manufacturers and experts are like nuclear scientists saying, "Well, what we've engineered is a device for harnessing a fission-based intense energy release. Please don't call it a bomb, and we certainly don't want to discuss what would happen if you dropped it on an enemy city."

      You can say all day that the Big Standardized Tests are not tests that students would "pass" or "fail," but when you tell third graders in Florida or North Carolina that they will not be going on to fourth grade because of the results of their test-- everybody who is not a test expert is going to call that "failure." When you attach stakes to a test, you put the words "pass" and "fail" in play, and that's how the regular civilians who deal with this stuff will describe it.

    2. I *completely* get the use of pass/fail when talking about policy and in everyday conversation. I think we covered that before and said horse is resting peacefully. :)

      I brought it up here for two reasons.
      1. Ask 100 students how they did on their SATs and I'm willing to bet not one will say "I failed them." The scoring structure of them is such that the pass/fail construct doesn't fit. That's not a semantics thing, it's a psychometrics thing.

    2. Your first reference to the driver's license test makes it seem like there's a relationship between an SR design and passing/failing a test, when that's not really the case.

    3. If said student got below the minimum score for getting into their safety school, they'd say they failed them.

    4. Isn't the goal of Common Core for all students to be college and career ready? Why wouldn't students be assessed through criterion-referenced testing? The goal was to meet standards-- most students should be able to pass these tests, plain and simple.

    5. Students are assessed through criterion-referenced tests. All NCLB-mandated tests (which includes PARCC and SB) have a CR design.

  4. While posts about the common core have often made claims about it requiring age inappropriate content, it has never been clear how "age inappropriateness" is established.

    Presumably it is a norm referenced standard, but that will likely mean that there are a significant percentage of students from both tails of the distribution for whom age appropriate content is actually inappropriate. Does anyone have any idea about the criteria for determining when something is and is not age appropriate?

    Replies
    1. This comment has been removed by the author.

    2. You might find this text helpful.

      http://www.sciencedirect.com/science/article/pii/S1878929314000516

    3. TE, I'm going to make a lot of suppositions here, because I'm not a reading teacher or an elementary teacher, and I haven't studied Piaget in depth. I'm sure there are many people out here though, who could answer your question.

      To begin with, are we talking about reading comprehension or learning in general? I think everything has to start with Piaget and learning in general. As far as reading goes, I think there may be a difference between "grade level" and "age appropriate".

      I think grade level is usually based a lot on vocabulary, but you also have to think about whether or not the topic is appropriate for a certain grade level or age, that it would be something they have knowledge of or interest in. Then you have Peter's SAT example of whether or not the vocabulary is appropriate socio-economically: the oarsmen in the regatta. I remember that my son was expected to know what a "hedge" (as in shrub border) meant in first grade, and he had no idea; nobody in our neighborhood had one. So there's complexity to begin with.

      It seems to me (not an expert) that to assign a grade level to a reading, you would have to have many children at various grade levels (of normal age for the levels) read that particular reading and somehow show their understanding of it. Perhaps an oral or written summary of it in their own words. But writing would also be testing writing skills, which, though related, are not the same as reading skills. If you use any kind of reading comprehension test questions, like true/false or multiple choice, they would have to be really good ones, because I've seen a lot (at the high school level) that I thought were terrible and could be misleading, not giving accurate results. Then you would have to make a judgment of what was the highest grade level where a majority of the children had a good understanding of it. I don't think just being able to read it aloud, or counting how much of what vocabulary is being used, would be a good indicator.

      I would also want to disaggregate the results by socio-economic status to see if results are different for, say, third grade low, middle, and high family income, to see if there might be a bias there. A difference could mean an "achievement gap," or it could mean the vocabulary or background knowledge is something impossible for lower socio-economic level children to be familiar with. Could be hard to tell.

      Although I remember reading that cloze exercises are used to determine reading grade level: you leave out every fifth word, and if a majority of the students could fill in the blank from context (I don't know if oral or written), it was appropriate for that level. It made sense to me.
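
      In case the mechanics help, a cloze passage is simple enough to sketch out (Python, purely illustrative): blank out every fifth word and then check whether students can restore the blanks from context.

          # Illustrative cloze builder: blank out every fifth word.
          def make_cloze(text, every=5):
              words = text.split()
              answers = []
              for i in range(every - 1, len(words), every):
                  answers.append(words[i])
                  words[i] = "_____"
              return " ".join(words), answers

          passage = ("The old dog slept on the porch while the children "
                     "played quietly in the yard next door")
          cloze_text, answers = make_cloze(passage)
          print(cloze_text)
          print(answers)   # the deleted words, for checking student responses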

    4. (Cont.)
      I think your question about wouldn't something that was deemed "age-appropriate" by norm reference actually then be inappropriate for children on the upper and lower ends of the bell curve is a good one. I think we have to talk about parameters here, and that goes to Piaget. That's why early childhood educators say expecting all kindergarteners to learn how to read is not appropriate, and why play is important. We know from Piaget and others that young children learn best by playing, and that there are many things they need to learn besides academics (or sitting still), like how to get along with others. But although there's usually a certain basic sequencing of skills able to be learned, we know each child goes through this at a different pace, so you can't limit this window to one specific year. The younger the child, the more true this is. We know that sometimes children can't manage all kinds of skills at the same time. Often a child who walks earlier than "normal" will start speaking later than "normal", and vice-versa. Some children talk understandably at 18 months, and some don't talk at all until they're three, but when they do start speaking, they use complete sentences. So this is true of many skills, including reading.

      So to find out if something is reading grade level appropriate, you first have to find a valid way to determine at what grade level the majority of students can understand it, which I assume would be considered norm-referenced. Then you could use this (hopefully) valid reading to make a criterion-referenced test, because you've determined it's valid to use as such. But it should never be used by itself to determine if the student can go on to the next grade, because of the parameters involved in age-appropriate skills, parameters caused by differences in child development and individual learning differences. As I mentioned to you before, my son was failing his third grade reading tests because of his dyslexia, but my friend who was a special education teacher knew that even though he had no phonic sense, he was good at understanding context, which is more important from fourth grade on. If he had been kept back, especially without some kind of intervention, he would only have been frustrated because he was never going to have a good understanding of phonics.

      So something considered "age-appropriate" would have to be considered within parameters; for example, it might be appropriate for children ages 4-7, depending on the child. But something involving more abstract thought, skills, and knowledge might be appropriate for ages 12-15, depending on the child, and should never be considered appropriate for children younger than that. For example, expecting, as PARCC does, that a 9 year old should be able to write an essay about the differences in structure between a prose reading and a poem is obviously ridiculous. I'm not saying that it's impossible to find one child in maybe hundreds of thousands who is precocious in abstract thought development and could be specifically taught to do this, but it's not something that should be expected.

      And anyone who says that teachers decide what's age-appropriate using "gut" feelings is someone with no knowledge of early childhood cognitive development and has no idea what they're talking about.

    5. Unknown,

      Thanks for the link, but it seems to me the paper you linked to is not really helpful in determining what is "age appropriate" in the context of K-12 education.

      Rebecca,

      I suspect that most people think that age appropriateness means something like work that about 50% of students of this age should be able to do. The problem with this in my view is that it also means that 50% of students would find this work inappropriate.

    6. That's why it should just be a guideline and there has to be some leeway about exact ages. But sure, in an ideal world each child's work would be in Vygotsky's "zone of proximal development". But it wouldn't be a computer program; Vygotsky said children need social interaction and a teacher to guide them along.

    7. te,
      I'm not a brain expert, but what I got from that link is that brain research would be very helpful in determining what is age appropriate, especially at certain turning points. Developmental appropriateness is probably many things in addition, but in terms of just plain physical development, the prefrontal cortex and the reorganization of the brain are essential for abstract reasoning. I'm sure an expert could say more about it, but we know that the move from concrete to formal operations happens between the ages of 11 and 13. We also know that the ability to understand complex inferential tasks will vary widely at this juncture. It only makes sense that tests of skill for this period be mindful of the fact that brain development is individual. We should, of course, expose children to abstract reasoning. That's just the good practice of enrichment and differentiation. But it is important that we recognize that there is a difference between higher standards and sooner ones. It borders on abuse to test and stack-rank children based on the earliest case.

    8. And when I say "most", I guess I mean more like at least 70%, not 50%.

    9. Unknown,

      There may come a time when we understand enough about the brain to be able to do what you suggest, but I don't think we are there yet. Recall the article defines adolescence as being between the ages of 10 and 19 and compares adolescent performance to that of children and adults. That is really much too coarse an age category to be useful.

      Rebecca,

      I think perhaps 70% is too wide, as it would mean age appropriate assignments function equally well for students in the sixteenth percentile and students in the eighty-fourth percentile. I would think that in reading, for example, an assignment well suited to a student in the sixteenth percentile would be inappropriately simple for a student in the eighty-fourth percentile.

    10. With a bell curve, 68% of the distribution lies within one standard deviation of the mean.

    11. Rebecca,

      That is true, but do you think that a student reading in the sixteenth percentile should be doing the same "age appropriate" work as a student reading in the eighty-fourth percentile? That is what your post suggested.

      I think that "age appropriate" is much narrower, though I do suspect that there is little other than gut feeling behind the notion of "age appropriate".

  5. An engineering professor I know gives a 5 question quiz in Discrete Math every day. 5 attempted and right is an A+. 5 attempted and 1 wrong is 80%, B-. 4 attempted and right is still 80%. 5 attempted and 2 wrong is 60%. 3 attempted and right is still 60%. And so on. That's CR, right?
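
    Written out as code (Python; the letter labels below the B- line are my own invention), the rule collapses to twenty points per correct answer, read off a fixed table set before the quiz -- a wrong attempt earns exactly the same as no attempt:

        # 20 points per correct answer; letters come from a table fixed in advance.
        LETTERS = [(100, "A+"), (80, "B-"), (60, "C"), (40, "D"), (0, "F")]

        def quiz_grade(correct):
            pct = 20 * correct                 # correct answers out of 5 questions
            letter = next(label for floor, label in LETTERS if pct >= floor)
            return pct, letter

        print(quiz_grade(4))   # (80, 'B-') -- whether 4 or 5 were attempted
        print(quiz_grade(3))   # (60, 'C')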

    Replies
    1. Peter may have a different take than I do but I'd offer that what you're describing is a scoring schema, not necessarily CR or NR.

    2. Yes, you are describing the classic CR test, written and administered by classroom teachers or college professors. However, your engineering professor still has to place a "cut score" for determining the passing (and failing) marks. Schools have traditionally used 65% or 70% as the arbitrary pass-fail cut for teachers' arbitrary CR tests, quizzes, papers, HW, etc.

    3. NY Teacher,

      Without knowing the questions and how the questions were chosen, I do not think you can claim that Falstaff's quiz is a CR test. If I were teaching an introductory economics course at MIT, I would ask different questions and/or possibly expect different answers than I would expect from students at my relatively open admissions public university. In this case, the questions are normed, while the cut scores are not.

    4. If you were teaching at MIT, the students in your class were accepted into one of the most selective colleges in the world. They were selected, in part, using normed entrance exams (SAT; ACT). As an engineering professor you are not norming your 5-item quizzes. And the last thing you are doing is sorting and flunking your students using bell curves and cut scores.

      TE, read Jersey Jazzman's blog post on this topic. It is an excellent read and will help you understand that the difference is in how raw scores are used.

    5. Peter,

      Thanks for the opportunity to re-post my accidentally deleted post.

      NY Teacher,

      If you think it appropriate to ask MIT students different questions than the students at my virtually open admission public university, you are advocating for norming questions based on the student population. As my very wise department chair once said, an exam question, an exam, a class, and a major need to be designed based on the students you actually have, not the students you wish you had. This norming happens long before the first exam is handed out, so any discussion of raw scores is irrelevant to my point.

      It is the norming of questions, course design, and grading criteria that makes comparing grades in the same class across schools impossible and has led to interest in using exams where the questions are not normed, that is, ones with standard questions.

  6. JBB, Consultant at LCI
    She facilitates and supports assessment design programs that wrestle with the messiness of learning and capturing it in ways that are meaningful for students and teachers. Amused, entertained, and intrigued by the question, "What are the implications of reducing learning to a number?" Jennifer can be found arguing the merits of rubrics and large-scale, high-quality assessment on Twitter as JennLCI.

    Interesting bio, Jennifer.

    How can you possibly support the use of MC items for testing SUBJECTIVE skills? This of course is the essence of PARCC and SBAC testing in Common Core ELA. The MC items from these tests break every rule in the book!

    Replies
    1. Yup. That's me! I help schools design portfolio and performance-based assessments, usually to replace their multiple choice final exams. So... I don't so much support MC as I know how they work, because it's my job to know.

    2. Then you agree that PARCC and SBAC cannot possibly quantify college and career readiness?

      Then you agree that the MC format should never be used to test subjective skills à la PARCC and SBAC?

    3. Just looking for a yes or no Jen.

    4. Welp... not sure what to tell you as you asked me two questions. So I'll offer this.

      Until ESSA is re-written, multiple choice will be a part of state tests. Hopefully, more states will shift to performance tasks like New Hampshire and the NYS Performance Consortium. Until then... here we are.

    5. Question #1 - Agree? or Disagree?

      Question #2 - Agree? or Disagree?

      Sorry if I confused you.

    6. If I may ask a clarifying question - why are you focusing on PARCC and SB? That is, is your question also referring to every state test developed to meet NCLB mandates?

  7. A big, important topic, Peter. Thanks for addressing it.

    Some of my thoughts:

    http://jerseyjazzman.blogspot.com/2015/05/standardized-tests-symptoms-not-causes.html

  8. I want to thank everyone for this important conversation.

    Full disclosure: my previous school district contracted with Jenn when NY started rolling out NCLB testing, and she helped me understand what you can and can't learn from Big Standardized testing reports. Thank you, Jenn!

    I have a question for Jenn and Peter.

    When did our education system begin implementing cut scores separate from teacher judgment?

    I grew up in the Boston area, so I remember going on field trips to area living museums and learning about apprenticeship relationships.

    A kid would ask the guy working on making a horseshoe, "Did you go to school to learn to become a blacksmith?" and he would laugh and say, "What's a school?" and then describe the process of a student learning his craft from a master craftsman.

    The kids would all groan and murmur about how lucky he was.

    I wonder:

    1) What do report cards look like around the world? How do various countries define success?

    2) Is there an academic history not just of Big Standardized Testing, but of educational feedback? I grew up in Franklin, Mass., where Horace Mann began his work to build a public education system. I can't imagine that the report card I got from Oak Street Elementary School, a mile or two down the road from Mann's birthplace, looks at all like the feedback a kid would have received when the system was brand new.

    3) More philosophically, I wonder if both normed and criterion-referenced tests assume a Platonic Ideal beyond the direct experience of any individual student. In this sense, arguing for the virtues of NR over CR tests misses a deeper point, which is that students in the exclusive private and residential academy systems receive individualized attention and narrative feedback instead of grades based on cut scores.

    I guess my thesis is that when my classmates and I groaned about how lucky the blacksmith was, our motives may not have been pure, but our reactions were reasonable.

    Everyone has a deep desire to be seen as an individual rather than as a participant in a system based on comparison to an abstract standard or our peers.

    If that formal history of educational feedback doesn't exist, I may have an excuse to go back to school and pursue a PhD. Seriously!

  9. At the risk of making a crass self-promotion... my podcast Ed History 101 seeks to answer those very questions. The two episodes that get to your questions are the ones on bells in schools (the rise of the scientific movement in education in the 1910s) and the history of the NYS Regents (which includes teacher feedback).

    Our first episode gets at what you and your friends picked up with that blacksmith - our education system has never really been about the individual. At various points it's been about educating boys to be better men, then educating children to become better Americans, then workers, then thinkers and doers. And ever forward.

    One of my favorite quotes on the issue: In our society, that we provide common public schooling is inherently a compromise – We must therefore strive continually to find a creative balance between local and central direction, between diversity and standards, between liberty and equality.

  10. I'll check out your podcast, Jenn. Thanks!

  11. So, one interesting variation is to use a baseline norm-referenced test and then select a cut score based on the distribution of scores. Then hold that cut score constant over time such that, theoretically, all kids could pass at some point x years after the baseline. New Mexico uses this (I think-- at least from what I can gather from their hard-to-understand accountability materials). Texas essentially ran the same approach-- having teachers pick a cut score and then holding the cut score constant over time as the distribution of scores changed from normal to skewed to the left. In fact, when looking at the distribution of all students passing all tests at the school level from 2003 to 2011, this is exactly what happened. New Mexico claims to do this with their value-added scores as well, but I don't really understand how they actually accomplish that.
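
    As I understand it, the mechanics look something like this sketch (Python, every number invented): norm once on the baseline year, freeze the cut, and then treat the frozen cut as a fixed criterion in later years.

        # Norm once on a baseline year, then freeze the cut score.
        from statistics import quantiles

        baseline = [48, 52, 55, 57, 60, 62, 65, 68, 71, 75]   # invented baseline scale scores
        cut = quantiles(baseline, n=4)[2]      # e.g., set the cut at the 75th percentile

        def percent_passing(scores):
            return sum(s >= cut for s in scores) / len(scores)

        later = [58, 63, 66, 69, 70, 72, 74, 76, 78, 80]       # the distribution shifts upward
        print(f"Baseline year: {percent_passing(baseline):.0%} pass")   # 20%
        print(f"Later year:    {percent_passing(later):.0%} pass")      # 70%, against the same frozen cut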

    Replies
    1. This makes sense to me. I thought this was how these tests were supposed to work, and that was why education secretaries would say scores would get better with time.

  12. My apologies to some of you-- as you can see, commenting has been... um... brisk and spirited on this post, and since I've been working graduation all day, I violated one of my cardinal rules and tried to moderate comments via my phone. The reason I have a cardinal rule not to do that is because once my big fat fingers meet the teeny tiny phone buttons, comments end up deleted when I meant to approve them.

    I think I'm caught up-- so if you posted something trenchant that didn't make it here, I probably accidentally deleted it. Feel free to give it another shot if you're so inclined.

  13. Thanks for the good work. Reposted on notjustaparent.com
