Testing the Global Education Tests

Country rankings win headlines, but they often mislead, creating false comparisons and risking policy confusion

April 5, 2018
Comparing Apples to Oranges

When results of big international education assessments are released, the rankings are greeted with fanfare, generating headlines about how one country or region is dominating, or why another is not. But like the “best colleges” lists that gain similar attention, these global rankings are more noise than signal — creating misperceptions, risking ill-conceived policy decisions, and diverting attention from more nuanced (and effective) uses of the data.

That’s according to an article published today in Science by Judith Singer of Harvard University and Henry Braun of Boston College. The paper grew out of their work on a National Academy of Education steering committee, chaired by Singer, that studied the purposes, methods, and policy uses of so-called international large-scale assessments, or ILSAs — tests like the Programme for International Student Assessment (PISA) or the Progress in International Reading Literacy Study (PIRLS).

The steering committee’s full report was also issued today, along with summaries, related papers, and videos.

The Problem with Rankings

The report’s most significant takeaway, Singer and Braun argue, is the need to de-emphasize rankings when assessments are released. Although ILSAs provide a valuable framework for understanding how a jurisdiction’s education system is performing, and for motivating further investment in education, the rankings so dominate the releases that “they become the statement of truth, and the data that are underlying the rankings get lost,” says Singer.


The actual data are far more nuanced than the rankings suggest. As the Science paper details, exclusion criteria are a major wrinkle that most headlines miss. In Shanghai, for example — which had the highest mean PISA score of any jurisdiction in mathematics in 2012 — 27 percent of students in the target population of 15-year-olds did not take the test, due in part to internal migration policies that prevented enrollment; in the United States, the exclusion rate was around 11 percent. In both cases, the excluded students are likely to come from the lower-achieving end of the spectrum, meaning the tested samples aren’t truly representative.

And test scores are affected by many out-of-school factors. In South Korea — another of the Asian countries dominating the top of the 2012 PISA mathematics rankings — about half of the tested students receive private tutoring, a cost to families that adds up to more than half of what the country spends on education overall. “So to say that the Korean education system is among the best in the world seems a somewhat misleading conclusion,” says Singer.

And, of course, comparing results from single cities or a city-state like Singapore with results from a country like France, with its national education system, and then with a highly decentralized system like the United States is clearly an apples-to-oranges (to bananas) proposition. The committee found many instances in which a country’s ranking rose in a given release year even as its mean score fell — meaning the country ranked higher while doing worse — and others where the differences between ranked jurisdictions’ scores are not statistically significant, so that ranking them implies a distinction the data don’t support.
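
To see how a ranking can rise even as performance falls, consider a minimal sketch with made-up scores for three hypothetical jurisdictions (not real PISA data): because rankings are purely relative, a jurisdiction that declines less than its neighbors still moves up the table.

```python
# Hypothetical, made-up mean scores for three jurisdictions (A, B, C).
# Rankings are relative: C's mean falls from 510 to 507 between cycles,
# yet its rank improves from 3rd to 1st because A and B fell further.

def rank_order(scores):
    """Return jurisdictions sorted by mean score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)

cycle_1 = {"A": 520, "B": 515, "C": 510}
cycle_2 = {"A": 505, "B": 500, "C": 507}

for label, scores in (("cycle 1", cycle_1), ("cycle 2", cycle_2)):
    position = rank_order(scores).index("C") + 1
    print(f"{label}: C ranks #{position} with mean {scores['C']}")
# cycle 1: C ranks #3 with mean 510
# cycle 2: C ranks #1 with mean 507
```

The same relativity cuts the other way, too: score gaps smaller than the assessment’s measurement error can still flip the order of two jurisdictions in the table.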

As long as ILSA results are expressed in rankings, nuance will be lost. Rankings are catnip to journalists and policymakers, Singer says, and understandably so. “They give you the lead for the story, and they’re too tempting to ignore.”


A More Nuanced Approach

Rather than drawing simplistic conclusions from top-level rankings, Singer and Braun find more promise in studies that take ILSA data down to a regional or local level and combine it with another data source, like questionnaires from parents, teachers, or students.

Equally promising: comparing within-country analyses across different — but culturally similar — countries, and looking at relationships between different indicators of achievement.

For instance, the committee reviewed research finding that Hong Kong and Taiwan have very similar mean scores on the 2012 PISA test, but that the relationship between scores and socioeconomic status (SES) is three times as strong in Taiwan as in Hong Kong. Canada’s mean score was 37 points higher than that of the United States, but the relationship between scores and SES is much weaker in Canada. These types of analyses are more likely than rankings to lead to productive exploration of equity issues, Singer and Braun say, but they often don’t make the same public impact.
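
The contrast the committee highlights — similar means but very different score-SES gradients — can be sketched with synthetic numbers (not the actual PISA figures): fit a simple least-squares slope of score on a standardized SES index for two hypothetical jurisdictions.

```python
# Synthetic data (not real PISA results): two hypothetical jurisdictions
# with the same mean score but very different score-SES gradients.

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

ses = [-2, -1, 0, 1, 2]                  # standardized SES index
scores_j1 = [540, 550, 560, 570, 580]    # shallow gradient
scores_j2 = [500, 530, 560, 590, 620]    # steep gradient

print(sum(scores_j1) / 5, ols_slope(ses, scores_j1))  # 560.0 10.0
print(sum(scores_j2) / 5, ols_slope(ses, scores_j2))  # 560.0 30.0
```

A league table would treat these two systems as identical; comparing the gradients is what surfaces the equity difference between them.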

Five Strategies for Improving ILSAs and Enhancing Impact

  • All partners — the assessment organizations, testing contractors, policymakers, and media — should move to de-emphasize rankings.
  • When ILSA data are released, they should be linked to other sources of data, so that more nuanced and meaningful analysis can happen more quickly.
  • Researchers and policymakers should take advantage of the move to digitally based assessments and the ability to glean more useful and accurate data.
  • Build pilot programs to add longitudinal tracking components that measure learning over time. 
  • When an ILSA release spurs consideration of policy changes, use the ILSA data to trigger randomized field trials among like countries to test the effects of specific interventions.

Usable Knowledge is a trusted source of insight into what works in education — translating new research into easy-to-use stories and strategies for teachers, parents, K-12 leaders, higher ed professionals, and policymakers. Usable Knowledge is produced at the Harvard Graduate School of Education by Bari Walsh (senior editor) and Leah Shafer (staff writer). Contact us at uknow@gse.harvard.edu.