High-Stakes Testing—Where We've Been and Where We Are
An Interview with Professor Daniel Koretz

Harvard Graduate School of Education
March 21, 2002
A story from Ed., the magazine of the Harvard Graduate School of Education

Interviews in this Series
 What Teachers Need in a High-Stakes World (Pforzheimer Professor Susan Moore Johnson)

 Making Standards Work (Lecturer Paul Reville)

 High-Stakes Testing—Where We've Been and Where We Are (Professor Daniel Koretz)

Daniel Koretz is a faculty member in the Administration, Planning, and Social Policy and Human Development and Psychology areas at the Harvard Graduate School of Education. Jane Buchbinder, editor of Ed. magazine, conducted this interview with Koretz on the current state of standards-based reform in the United States, and the events that have brought us to this point.

Daniel Koretz: My interest for a long time has been the diverse effects of high-stakes testing. I've carried out a variety of studies of them going back more than a dozen years.

The current emphasis on high-stakes testing, which is nearly universal in the policy world, is really not new. It's new in its extremity and its current forms, but it dates back at least twenty-five years. There are some tests that go back further; the New York State Regents tests go back 140 years.

New York State began to use Regents exams for the accountability of college-bound students starting around 1865. For the most part after World War II, however, standardized testing or external testing—testing imposed by people outside of the school—was typically very low stakes. It had consequences for a few people. There were kids whose classification for needing special education or not depended on a test. College admissions tests were certainly high stakes. But, for most people, it really didn't matter very much. Teachers sometimes had reasons to change what they did if they found that kids were strong in one part of the tests and weak on another, but it was basically a very low-stakes system until the early 1970s.

What changed it in the 1970s was the minimum competency testing movement, which started around 1972: a wave of state-level policies imposing generally low-level tests as exit exams—tests you have to pass to get out of high school and get a diploma. There was a very rapid growth in the use of these minimum competency tests over a six- or seven-year period spanning the 1970s. But the tests were easy: typically, they were multiple-choice tests aimed at a very low level. I don't have precise figures on the failure rates, but people who were observing it at the time often argued that failure rates above twelve percent were never tolerated. So it was basically a way to weed out kids at the bottom. And that persisted until the 1980s.

What happened [in the 1980s] was another periodic wave of concern about the American educational system that was sparked by two things. One was some international comparisons that did not look wonderful, and the other, larger one was the perception that performance in the American school system was deteriorating. If that was in fact true, that the performance of American kids was deteriorating in the 1960s and 1970s, it was not—if you look carefully at the data—entirely because of schools. There were a lot of social changes going on that contributed to this.

Jane Buchbinder: What sort of social changes had an impact on this deterioration?

Koretz: It's not entirely clear what the social changes were, but the data was fairly clear that social changes must have had some role: the trends were simply too pervasive to be attributed to educational policy, which is not uniform in the United States. In fact, it even had echoes in Canada. There are a number of suggestions of social changes that are at least consistent with it. What is reasonably clear is that it would be hard to come up with an educational policy that could account for all of this. It happened too broadly, in low-achievement schools and high-achievement schools; it happened all over the place. The irony is that it was all over before people started getting concerned about it. It ended with the cohorts of kids who entered school in the early 1970s, so that by the time they graduated in 1980, the bottom had been reached and things had started going back up.

That produced what was called at the time the educational reform movement, which had a number of elements. The two most pervasive were more statewide testing and stiffer course-taking requirements for graduation from high school. That's where the big increases in the percentage of kids taking algebra started. If you look back at the early 1980s and look at people coming out of high school now, what you find is that the average number of mathematics courses completed has more than doubled. That is where it came from. But the big thing was more testing. States that didn't have statewide testing started implementing it. Tests were made more difficult. The consequences for test use gradually started to rise, but conventional multiple choice tests were still in use.

During that period of time, there were perhaps half a dozen of us in the field of measurement who kept saying, "Look out, this isn't going to work very well." Nobody really paid a lot of attention to those of us who issued this warning. But in the late 1980s, a few of us did the first really rigorous quantitative test of whether or not there was inflation of test scores under what are, by today's standards, relatively low-stakes conditions. At the time, we thought the stakes were high, but we didn't know what high stakes were.

Buchbinder: What were the stakes then?

Koretz: There was no pay differential for teachers based on scores. There were no cash rewards to schools. There was no closing or reconstitution of schools that didn't show improvement. But superintendents and principals used scores as an accountability tool. One superintendent in Maryland referred to it as the strategy of applied anxiety. And it was enough, we found, to create really striking inflation of scores. We found that in one large district, scores were inflated by about half an academic year by the spring of the third grade and by nearly a full year by the end of fifth grade.

About the time that study was done, people began to realize that they really did have a problem on their hands. Scores were not believable, and students and parents were complaining of inappropriate behavior in schools: endless amounts of drill, for example, class assignments that looked just like standardized tests, and so on. This was in about 1990.

The reaction in the policy world was quite uniform. It was not, "Oh, test accountability doesn't work as well as we had hoped." It was, "Test-based accountability was the way to go; we were just using the wrong tests." The big push then became two things: first of all, further ratcheting up of stakes. It was in the early 1990s that we first started seeing systems that actually gave cash awards based on test scores. That's now commonplace but it was precedent-setting when Kentucky did it in the early 1990s, for instance.

The other thing that went along with that was the movement towards tests that people said "were worth teaching to." What that meant in most people's minds was tests that were designed so that if you were doing test preparation, you were actually doing something valuable in class. There is some evidence that some states actually managed to do a little of that.

The argument should be something different. We know that there is a need for more accountability in some public schools. We know that the public is not prepared to abandon a quest for more accountability, and it's hard to imagine that a good accountability system wouldn't somehow take into account what some kids actually learn. Accountability systems are going to stick around, and tests are clearly going to be a big part of them. The argument ought to be: how can we include tests in accountability systems in ways that minimize some of these undesirable effects and maximize what we gain? At this point that question, in my view, is unanswered.

It's a very complex question, and it's one that cannot be answered solely by improvement of the testing system. We do know as psychometricians a number of things that can be done to lessen the problem. They tend to be expensive and they tend to have other problems as well, which makes them politically unacceptable. But even if there were the political will to spend money to improve testing systems as much as we can, that's not going to be enough.

It's also important to refrain from basing serious decisions on a single measure. This axiom of educational measurement is currently much more widely ignored than not. There aren't any great examples out there of the use of multiple measures in accountability systems. Nobody has designed a good multiple measure system in part because policymakers were not really interested in it. So there wasn't much incentive or funding for researchers. There are some really difficult problems that lead me to think that, while we need to improve testing programs, we are going to have to tinker around with accountability systems beyond that to make them work.

Buchbinder: What do you think policymakers were interested in, if not finding the best ways?

Koretz: It's hard to generalize about policymakers, but many of them are looking for fast results because many of them are in office for a fairly short time. They don't want to spend a lot of money. And they don't want things that are messy and subjective that can cause them a lot of argument. Tests are cheap and very powerful and have an aura of objectivity, and many deserve that aura. If you were to say, "The only way to balance the incentives created by reliance on achievement tests is to also place emphasis on direct evaluations of teacher practice," people would draw back in horror because that is enormously expensive, it's somewhat subjective, and it's not clear who's going to do it. If you take the best teachers out of classrooms to judge other teachers, you have just degraded the quality of instruction. So very few people want to deal with this. So when you talk about multiple measures, what policy people tend to do instead is give schools a little bit of credit for improving dropout rates or attendance rates. That doesn't do much, especially in many places where attendance rates are so high there isn't any way for them to improve. And dropout rates are not entirely under the control of schools anyway.

I may be proved wrong. This is somewhat speculative but I'm not convinced that we can get an accountability system to function the way we want without getting involved in some of the mushier and more controversial and expensive things.

Buchbinder: Is there anyone testing the tests themselves? Or testing the test makers?

Koretz: Yes and no. There is a very well-developed set of professional standards in measurement. Those standards are not always adhered to. One of those standards is that you shouldn't make serious decisions based on a single measurement and people do anyway. But there is a tradition of evaluating various aspects of the quality of tests. One aspect typically evaluated is their reliability—the consistency with which they give you the same answer. And a variety of methods are used to evaluate validity: the extent to which the scores support the inference that you are basing on them.

However, those methods are not really up-to-date in that they were designed for a lower-stakes world than we have now. A test can look very good in terms of traditional validation evidence when it is first released and still produce nothing but bogus gains and inflated scores when it's applied because the traditional methods of evaluating tests don't look at inflation of scores over time. One of the reasons there has not been more research on this is that, frankly, many politicians are quite hostile to it. Put yourself in the position of the state chief who has introduced a new test. There is generally a year or two when things look dreadful and then scores start going up, so you can say, "Ah, we've got improvement. Our policies are working." And some researcher comes in and says he or she wants to see if your score increases are meaningful or not. It's not an appealing option, and I've been thrown out of quite a number of states at various times in the course of doing studies for this reason.

What some people in my field are trying to say is, "Look, it doesn't work for researchers to walk behind the elephant with a broom." It just doesn't work. Nobody wants us to do it, and it's not always useful to come into a state two years after a new program starts and then two years later have results, which may be bad news. That's really not very helpful. But that's what researchers have had to do. I'm not sure how successful they will be, but there is a real effort now to make this more of an R&D effort, with research and evaluation helping to shape policy from the beginning. This will work only if policymakers accept it. If they are willing to be sold a bill of goods by somebody who says, "I know exactly how to do it," then they have no incentive to do serious evaluation. But if they understand that to some degree they are creating something out of whole cloth, then it makes sense to admit that to the public and say that they are going to have ongoing evaluation and mid-course corrections. It's a rare thing for this kind of collaboration to happen, but it occasionally has. For instance, I spent four years as the principal investigator of a series of external evaluations of the Vermont portfolio assessment program. The commissioner of education in Vermont stated publicly that he would support independent evaluation, and he maintained that position consistently, regardless of what the studies showed. The commissioner simply said point-blank, "We don't believe that anyone has a package that is adequate for our needs and so we are going to have external people evaluating the program as we go along, and you should expect that we are going to change course from time to time." That is rare.

Buchbinder: And along with that, there were high stakes attached at the time?

Koretz: By today's standards, they were not high stakes, but by Vermont's standards at the time they were. Vermont is a state that has had no history of state testing or state control for that matter. But also because of the political culture of Vermont, people were very receptive to the commissioner announcing that there would be uncensored external evaluation. The first time I had any bad news for them, the commissioner asked me to get it written up in time for their annual data release. He then had his department disseminate a summary of my study rather than my doing it myself. He had a press conference about it in which he spoke for two minutes, introduced me, and said, "Here is the person responsible; ask him anything you want."

That is rare, but it has happened occasionally. What people in the field are trying to do is to find a way to do more formative evaluation, which is more helpful in developing something and refining it than walking in at the end and giving an up-or-down judgment about the quality of the effort.

Buchbinder: Are there people looking at each state's successes and failures?

Koretz: Some of us are trying to. Some people are not; they are convinced that they've got the answer. But the only rational thing to do is to look around at a variety of states and cities and try to get comparative information on how all of the systems work.

The standards movement is based on the notion that the biggest impediment is the lack of standards and accountability. I don't think that is true. I think the lack of standards and accountability is significant but, in many of the schools that I have seen, I would not say that it is the biggest impediment to improved performance. If, for example, you have kids who are highly transient, who don't speak English, who come from dysfunctional homes, it's hard to imagine that a better test is really going to solve the problem.
