Standardized Tests: NAEP, PIRLS, TIMSS, PARCC, PISA, ITBS, and CLT

Part 1: In Defense of Standardized Testing
Part 2: Alternatives to Standardized Testing

There are two types of standardized tests: criterion-referenced and norm-referenced.
Criterion-referenced tests are based on some standard (the criterion). The current standards-based movement is a proponent of this approach, and the tests you make in class likely qualify as criterion-referenced too. This approach allows you to measure learning against an external standard that is stable from year to year.
Norm-referenced tests are based on the norm for that particular year. In plain English, this means that students are compared with each other. So, a score in the 51st percentile means that the student scored higher than 51% of the test takers for that particular year.
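
To make the distinction concrete, here is a minimal Python sketch. The scores, the cut score of 25, and the function names are all made up for illustration; real tests use far more elaborate scaling than this.

```python
def meets_standard(score, cut_score):
    """Criterion-referenced: compare a score with a fixed external cut score."""
    return score >= cut_score


def percentile_rank(score, all_scores):
    """Norm-referenced: percent of this year's test takers who scored below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical raw scores for one year's test takers (invented for illustration).
scores_this_year = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
CUT_SCORE = 25  # hypothetical standard, not taken from any real test

student_score = 27
print(meets_standard(student_score, CUT_SCORE))          # judged against the standard
print(percentile_rank(student_score, scores_this_year))  # judged against classmates
```

The first result depends only on the fixed standard, while the second would change if the other test takers' scores changed.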

The National Assessment of Educational Progress (NAEP) 

According to Daniel Koretz, the NAEP is “widely considered to be a gold standard for evaluating educational trends” (The Testing Charade, Ch. 5). One reason it is a gold standard is that its results are not particularly vulnerable to corruption, because it is a low-stakes test. This means that students and teachers are not held accountable for the results, so there is little incentive to cheat or to rely too heavily on test prep.

This is important because many state achievement tests are higher stakes for schools and students and are vulnerable to the aforementioned corruption. So, the NAEP can be used almost as a way to audit the state tests. For example, if students show remarkable growth on the state test, you would also expect to see some growth on the NAEP. If there is little or no growth on the corresponding sections of the NAEP, then it is fair to question whether the state is gaming its own test in order to look good and score political points.

One example of this is New York City. In 2007, Joel Klein was chancellor of the New York City school system and, based on the results of the state achievement test, students appeared to have made excellent progress. However, “when scores on the NAEP were released in 2007, they showed that New York City’s eighth-graders had made no progress whatsoever in mathematics on that test over the previous two years, despite their huge gains on the state test” (Koretz, The Testing Charade, Ch. 5).
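
The audit idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the gains are invented, they are expressed in standard deviation units so that two differently scaled tests can be compared, and the 0.1 tolerance is arbitrary. This is not how Koretz or any state actually runs such a check.

```python
def gains_diverge(state_gain_sd, naep_gain_sd, tolerance_sd=0.1):
    """Return True when the state-test gain outruns the NAEP gain by more than
    `tolerance_sd` (all values in standard deviation units; threshold is hypothetical)."""
    return (state_gain_sd - naep_gain_sd) > tolerance_sd


# Invented numbers: a large state-test gain with no movement on the NAEP.
print(gains_diverge(state_gain_sd=0.4, naep_gain_sd=0.0))

# Invented numbers: both tests show similar, modest growth.
print(gains_diverge(state_gain_sd=0.15, naep_gain_sd=0.12))
```

A flag from a check like this does not prove anyone gamed the test; it only marks a gap that deserves a closer look.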

Nationally representative samples of students in grades 4, 8, and 12 take the NAEP. The main reading and math assessments are given every two years in grades 4 and 8, and less often in grade 12. Each participating state selects 2,500 students per subject to take the test. The NAEP is a criterion-referenced test, so students are not directly compared with each other. Instead, students are compared with an external standard. The NAEP essentially has three achievement levels: Basic, Proficient, and Advanced.

The Proficient level is a bit misleading because it does not correspond with grade-level performance. To reach Proficient on the NAEP, a student needs to perform above grade level. With this high standard, only about one-third of American students are considered to be proficient or better.

Trends in International Mathematics and Science Study (TIMSS)

TIMSS is an international test run by the International Association for the Evaluation of Educational Achievement (IEA) through the TIMSS & PIRLS International Study Center at Boston College, and it is taken in over 60 countries. In America, it is taken by a nationally representative sample of about 10,000 students in fourth grade and 10,000 in eighth grade (FAQ). As it is based on voluntary participation and sampling, this is a low-stakes test for both schools and students.

TIMSS is a low-stakes, criterion-referenced exam. It reports results against international benchmarks that divide achievement into four levels: Advanced, High, Intermediate, and Low (Mullis, Martin, Foy, & Hooper, 2016). If you are interested in what each benchmark means, check out page 19 of this report. Based on my understanding, I’d say that Intermediate should be the minimum acceptable level, meaning that we should essentially be aiming for 100% of students to be at this level or better.

The results of TIMSS paint a much more favorable picture of U.S. education than the NAEP does. We should expect better results, though, since the standard (criterion) of TIMSS is more aligned with grade-level expectations. Over the past 20 years, in math, the U.S. has increased the percentage of students achieving at Intermediate or better by nearly 10% in both 4th and 8th grade. This is good progress. However, we have seen smaller gains in science. We have essentially remained stagnant in 4th grade and seen moderate improvements in 8th grade.

When we compare the U.S. with other countries that took the TIMSS, we see that we are above average, but below the top tier.

Progress in International Reading Literacy Study (PIRLS)

Like TIMSS, PIRLS is run by the IEA through the study center at Boston College and is considered to be a low-stakes, criterion-referenced test, with the same benchmarks of Advanced, High, Intermediate, and Low. If you are interested in what each benchmark means, click here. Unlike TIMSS, PIRLS is given every 5 years instead of every 4. A nationally representative sample of fourth graders takes the test. Students are assessed on an informational text and a literary text. In 2016, there were 61 participating countries.

When compared with the NAEP, PIRLS was found to have readings that were easier by about one grade level (FAQ). So, we should expect better results and that is exactly what we find. 

While 35% of fourth grade students are deemed proficient by the NAEP, PIRLS found that 83% of students achieved at the Intermediate benchmark or better (Mullis, Martin, Foy, & Hooper, 2017).

Program for International Student Assessment (PISA)

PISA is a low-stakes, norm-referenced international test that the OECD started in 2000. Every three years, it assesses the reading, math, and science literacy of a sample of 15-year-olds. About 600,000 students took the test in 2018, representing 79 countries or education systems.

Scores are also divided into various norm-referenced proficiency levels in an attempt to classify students. Being norm-referenced, these proficiency levels will differ slightly from year to year because the cohorts of students will be different, meaning that the average scores will be different.

The test makers note that, “There are no natural breaking points to mark borderlines between stages along this continuum. Dividing the continuum into levels, though useful for communication about students’ development, is essentially arbitrary. Like the definition of units on, for example, a scale of length, there is no fundamental difference between 1 metre and 1.5 metres – it is a matter of degree. It is useful, however, to define stages, or levels along the continua, because they enable us to communicate about the proficiency of students in terms other than continuous numbers. This is a rather common concept, an approach we all know from categorising shoes or shirts by size (S, M, L, XL, etc.).” 

When you look at America’s results, you see that they are more or less in line with the OECD average, while lagging a bit in math. One thing that makes the PISA useful, beyond comparing different education systems, is that it breaks the data down by students’ socioeconomic status. This is important because it helps us see how well we are teaching different groups of students. The OECD’s report found that the gap between advantaged and disadvantaged students in America is 11 points larger than the OECD average. It also breaks down performance by gender. In 2018, American girls performed 24 points better than boys in reading. This “gender gap” is smaller than the OECD average of 30 points. The gender gap in math favored boys by 9 points, larger than the OECD average of 5 points. In science, American boys and girls performed roughly the same.

In reading, 81% of American students were able to at least reach level 2 proficiency, compared with the OECD average of 77%. In the OECD’s words, “At a minimum, these students can identify the main idea in a text of moderate length, find information based on explicit, though sometimes complex criteria, and can reflect on the purpose and form of texts when explicitly directed to do so.”

In math, 73% of our students reached level 2 proficiency or higher, slightly lower than the OECD average of 76%. Essentially, this means that 73% of our students can “interpret and recognise, without direct instructions, how a (simple) situation can be represented mathematically (e.g. comparing the total distance across two alternative routes, or converting prices into a different currency).”

In science, 81% of our students reached level 2 proficiency or higher, slightly better than the OECD average of 78%. Essentially, this means that 81% of our students can “recognise the correct explanation for familiar scientific phenomena and can use such knowledge to identify, in simple cases, whether a conclusion is valid based on the data provided.”
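
For a side-by-side view, the level 2 figures quoted above can be lined up with a short Python sketch. The percentages come straight from the paragraphs above; the variable names are just for illustration.

```python
# Percent of students at PISA level 2 or higher in 2018, as quoted above.
level2_or_above = {
    "reading": {"us": 81, "oecd": 77},
    "math": {"us": 73, "oecd": 76},
    "science": {"us": 81, "oecd": 78},
}

# U.S. share minus OECD average, in percentage points.
differences = {
    subject: shares["us"] - shares["oecd"]
    for subject, shares in level2_or_above.items()
}

for subject, diff in differences.items():
    print(f"{subject}: {diff:+d} percentage points vs. the OECD average")
```

The U.S. comes out a few points ahead in reading and science and a few points behind in math.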

Summary of International Standardized Tests

When we look over results from the international standardized tests, we can take a level of comfort. Even though America has substantial room for improvement, no matter which test you are looking at, we are roughly in line with other higher performing countries. We should recognize this. It is not only doom and gloom. 

But we should also take a good hard look at the criterion-referenced ones (NAEP, TIMSS, PIRLS). The NAEP sets a very high standard, so there is not necessarily a need to fret about the low percentage of students who are measured proficient on that test. But both TIMSS and PIRLS are aligned to grade-level standards, and both show that we fail to get 20-30% of students to achieve at an acceptable level.

Iowa Assessments, formerly Iowa Test of Basic Skills (ITBS) 

While the Iowa Assessments started in Iowa, hence the “Iowa” in the name, they have a national reach. The Iowa Assessments are taken every year from kindergarten through eighth grade, and they assess Language Arts, Reading, Math, Science, and Social Studies. The test underwent a transformation during the 2011-2012 school year in order to be better aligned with the Common Core Standards and the Smarter Balanced Exam (other state standardized tests). To go along with the change in focus, the ITBS was renamed the Iowa Assessments.

This is a norm-referenced test, meaning that students are compared with each other, not with an outside standard, which allows for comparisons between students by using a percentile score. Essentially, if your child receives a score in the 50th percentile, then he/she scored higher than 50% of the test takers in that year. If your child scored in the 86th percentile, then he/she scored higher than 86% of the test takers in that year.

Given the norm-referenced format, the Iowa Assessments are not so easy to compare with each other over time, because each year involves a different set of students, and therefore, a different norm. They are best used to compare with students in the same year who took the same test. If you are looking at the test results over time, I would suggest taking them with a grain of salt.
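
A small Python sketch shows why a shifting norm makes year-to-year comparisons shaky. The two cohorts below are invented, and the percentile calculation is a simplified stand-in for whatever norming the test publisher actually uses.

```python
def percentile_rank(score, cohort_scores):
    """Percent of the cohort that scored below `score`."""
    below = sum(1 for s in cohort_scores if s < score)
    return 100 * below / len(cohort_scores)


# Two hypothetical cohorts; the second one scored higher across the board.
cohort_year_a = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
cohort_year_b = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

raw_score = 70  # the same raw performance in both years

rank_a = percentile_rank(raw_score, cohort_year_a)
rank_b = percentile_rank(raw_score, cohort_year_b)
print(rank_a, rank_b)
```

The same raw score lands at a lower percentile in the stronger cohort, even though the student's performance did not change.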

The Iowa test is not high stakes, but it does have more of an impact on the students than the NAEP, TIMSS, PIRLS, or PISA. Schools will commonly use results from the Iowa Assessments as one factor to place students in talented and gifted programs. As this test has real-life impacts on students, it is particularly important that the test makers check for content bias.

The Partnership for Assessment of Readiness for College and Careers (PARCC)

The PARCC is given annually to a representative sample of students in grades 3-11. It assesses mathematics and English/Language Arts and is aligned with the Common Core Standards.

PARCC is a criterion-referenced test (the Common Core is the criterion), and students are assigned performance levels between 1 and 5, with Level 3 and above considered to be passing.
Level 1: Did not yet meet expectations
Level 2: Partially met expectations
Level 3: Approached expectations
Level 4: Met expectations
Level 5: Exceeded expectations
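
As a rough illustration of how a headline figure like "percent who met or exceeded expectations" is derived from these levels, here is a minimal Python sketch. The list of student levels is invented, and the function name is hypothetical.

```python
# Performance level labels as listed above.
PARCC_LEVELS = {
    1: "Did not yet meet expectations",
    2: "Partially met expectations",
    3: "Approached expectations",
    4: "Met expectations",
    5: "Exceeded expectations",
}


def share_met_or_exceeded(levels):
    """Percent of students at Level 4 (Met) or Level 5 (Exceeded)."""
    met = sum(1 for level in levels if level >= 4)
    return 100 * met / len(levels)


# Hypothetical results for ten students (not real PARCC data).
sample_levels = [1, 2, 3, 3, 4, 4, 5, 2, 3, 4]

print(share_met_or_exceeded(sample_levels))
print(PARCC_LEVELS[3])
```

Note that a student at Level 3 does not count toward the "met or exceeded" share.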

If you want more information about what these performance levels actually mean, click here. If you want to really nerd out, check out this nearly 500 page technical report. Section 9.5, section 10 and section 11 are most relevant.

The PARCC results do not paint a particularly pretty picture of American education. For the 2015-2016 school year, the percent of students who met or exceeded expectations in ELA hovered around 40% at all grade levels. Math starts with 42.5% of students at least meeting expectations, but that already low result plummets over time, finishing at 25.9% in 8th grade. Go ahead and look at the graphs. If you are interested in a breakdown by state or ethnicity, check out this pdf.

This is all the more concerning because the PARCC is aligned with the Common Core Standards, meaning that the tests are at grade level.

PARCC is a high-stakes test. Students may be held back if they do poorly. This makes concerns about bias extremely important.

The Classic Learning Test (CLT)

Meet CLT, the new kid on the block. It was started in 2015 with the intention of providing an alternative to the bigger, more famous standardized tests. It features “passages selected from great works across a variety of disciplines,” and, according to its makers, “the CLT suite of assessments provide a highly accurate and rigorous measure of reasoning, aptitude, and academic formation for students from diverse educational backgrounds.”

The CLT is offered as an alternative to the SAT and ACT, so the CLT is high stakes. However, our focus will be on its other tests. The CLT8 and CLT10 are standardized tests for 8th and 10th graders. These tests are norm-referenced, with the norm being based on a nationally representative sample of the CLT10 population.

Content-wise, the CLT10 and CLT8 cover verbal reasoning (reading comprehension), grammar, writing, and quantitative reasoning (math). These exams are designed to be comparable to the PSAT, and the scores between the tests can be compared. If you are interested in comparing the scores, look at pages 29-33 of this link. If you are interested in how students performed based on income or race, look at Chapter 10 of the technical report. Unfortunately, the scores for race are only broken down into two categories, white and non-white. I would guess that this is due to sample size issues and that future reports will offer more detailed breakdowns, sample size allowing.

There Are No Better Options

The data we get from the NAEP, TIMSS, PIRLS, PISA, and PARCC leaves plenty of room for concern. Internationally, we are essentially average in education, which is nothing to brag about. But when we look at how our students perform on grade-level assessments, there is real cause for concern. According to the PARCC exam, only around 40% of our students in grades 3-8 meet or exceed the standard in English Language Arts. In math, the story is much worse. Without these standardized tests we would only have a vague idea about these problems, so, until there is a better option, I am for standardized tests. It is important for us to know where educational inequalities and inefficiencies exist. Currently, if we were to replace standardized tests with any alternative, at best we would get fuzzier data.


Koretz, D. (2017). The Testing Charade: Pretending to Make Schools Better. University of Chicago Press.

Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2016). TIMSS 2015 International Results in Mathematics. Retrieved from Boston College, TIMSS & PIRLS International Study Center website:

Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2017). PIRLS 2016 International Results in Reading. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: