Standardized Tests: NAEP, PIRLS, TIMSS, PARCC, PISA, ITBS, and CLT

Part 1: In Defense of Standardized Testing
Part 2: Alternatives to Standardized Testing

There are two types of standardized tests: criterion-referenced and norm-referenced.
Criterion-referenced tests are based on some standard (the criterion). The current standards-based movement is a proponent of this approach, and the tests you make in class likely qualify as criterion-referenced too. This approach allows you to measure learning against an external standard that is stable from year to year.
Norm-referenced tests are based on the norm for that particular year. In plain English, this means that students are compared with each other. So, a score in the 51st percentile means that the student scored higher than 51% of the test takers for that particular year.
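
To make the difference concrete, here is a minimal sketch in Python. The cohort scores, the student's score, and the cut score of 80 are all made up for illustration, and the function names are my own. It only shows that a norm-referenced result depends on the other test takers, while a criterion-referenced result depends on a fixed standard.

```python
from bisect import bisect_left


def percentile_rank(score, cohort_scores):
    """Norm-referenced: percent of the cohort scoring strictly below `score`."""
    ordered = sorted(cohort_scores)
    below = bisect_left(ordered, score)  # count of scores lower than `score`
    return 100.0 * below / len(ordered)


def meets_standard(score, cut_score):
    """Criterion-referenced: compare the score with a fixed external cut score."""
    return score >= cut_score


# Hypothetical cohort of ten test takers (not real data).
cohort = [52, 61, 67, 70, 74, 78, 81, 85, 90, 96]
student_score = 78

print(percentile_rank(student_score, cohort))       # 50.0
print(meets_standard(student_score, cut_score=80))  # False
```

In this made-up case the same student lands at the 50th percentile and still falls short of the standard, which is why the two kinds of scores answer different questions.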

The National Assessment of Educational Progress (NAEP) 

According to Daniel Koretz, the NAEP is “widely considered to be a gold standard for evaluating educational trends” (The Testing Charade, Ch. 5). One reason it is a gold standard is that its results are not particularly vulnerable to corruption because it is a low-stakes test. This means that students and teachers are not held accountable for the results, so there is little incentive to cheat or overly rely on test prep.

This is important because many state achievement tests are higher stakes for schools and students and are vulnerable to the aforementioned corruption. So, the NAEP can be used almost as a way to audit the state tests. For example, if students show remarkable growth on the state test, you would also expect to see a level of growth on the NAEP. If there is little or no growth on the corresponding sections of the NAEP, then it is fair to question whether the state is gaming its own test in order to look good and score political points.

One example of this is New York City. In 2007, Joel Klein was chancellor of the New York school system and, based on the results of the state achievement test, students had made excellent progress. However, “when scores on the NAEP were released in 2007, they showed that New York City’s eighth-graders had made no progress whatsoever in mathematics on that test over the previous two years, despite their huge gains on the state test” (Koretz, The Testing Charade, Ch. 5).
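
The audit idea can be sketched as a simple comparison. This is only an illustration under my own assumptions: the gains are expressed in standard deviation units so that two tests with different scales can be compared, the 0.5 tolerance is arbitrary, and the two "states" and their numbers are invented, not taken from Koretz or the NAEP.

```python
def flag_suspicious_gain(state_gain, naep_gain, tolerance=0.5):
    """Return True when the state-test gain outpaces the NAEP gain by more
    than `tolerance` (both gains in standard deviation units, an assumption)."""
    return (state_gain - naep_gain) > tolerance


# Hypothetical (state_gain, naep_gain) pairs for the same grade and subject.
results = {
    "State A": (0.60, 0.05),  # big jump on the state test, flat on the NAEP
    "State B": (0.20, 0.15),  # modest gains that roughly agree
}

flagged = [name for name, (state, naep) in results.items()
           if flag_suspicious_gain(state, naep)]
print(flagged)  # ['State A']
```

A flag like this proves nothing on its own. It just tells you where to start asking questions.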

A nationally representative sample of students in grades 4, 8, and 12 takes the NAEP on a recurring schedule (reading and math are assessed every two years in grades 4 and 8). Each participating state selects 2,500 students per subject to take the test. The NAEP is a criterion-referenced test, so students are not directly compared with each other. Instead, students are compared with an external standard. The NAEP essentially has three achievement levels: Basic, Proficient, and Advanced.

The Proficient level is a bit misleading because it does not correspond with grade-level performance. In order to reach Proficient on the NAEP, a student needs to perform above grade level. With this high standard, only about ⅓ of American students are considered to be proficient or better.

Trends in International Mathematics and Science Study (TIMSS)

TIMSS is an international test put on by the International Association for the Evaluation of Educational Achievement (IEA) at Boston College and taken in over 60 countries. In America, it is taken by a nationally representative sample of about 10,000 students in fourth grade and 10,000 in eighth grade (FAQ). As it is based on voluntary participation and sampling, this is a low-stakes test for both schools and students.

TIMSS is a criterion-referenced exam. It uses international benchmarks to divide achievement into four levels: Advanced, High, Intermediate, and Low (Mullis, Martin, Foy, & Hooper, 2016). If you are interested in what each benchmark means, check out page 19 of this report. Based on my understanding, I’d say that Intermediate should be the minimum acceptable level, meaning that we should essentially be aiming for 100% of students to be at this level or better.

The results of TIMSS paint a much more favorable picture of U.S. education than the NAEP. We should expect better results, though, since the standard (criterion) of TIMSS is more aligned with grade-level expectations. Over the past 20 years in math, the U.S. has increased the percentage of both 4th and 8th grade students achieving at Intermediate or better by nearly 10%. This is good progress. However, we have seen fewer gains in science. We have essentially remained stagnant in 4th grade and seen moderate improvements in 8th grade.

When we compare the U.S. with other countries that took the TIMSS, we see that we are above average, but below the top tier.

Progress in International Reading Literacy Study (PIRLS)

Like TIMSS, PIRLS is put on by IEA at Boston College and is considered to be a low-stakes test that is criterion-referenced, with the same benchmarks of Advanced, High, Intermediate, and Low. If you are interested in what each benchmark means, click here. Unlike TIMSS, PIRLS is given every 5 years instead of every 4. A sample of nationally representative fourth graders take the test. Students are assessed on an informational text and literary text. In 2016, there were 61 participating countries.

When compared with the NAEP, PIRLS was found to have readings that were easier by about one grade level (FAQ). So, we should expect better results and that is exactly what we find. 

While 35% of fourth grade students are deemed proficient by the NAEP, PIRLS found that 83% of students achieved at the Intermediate benchmark or better (Mullis, Martin, Foy, & Hooper, 2017).

Program for International Student Assessment (PISA)

PISA is a low-stakes, norm-referenced international test that the OECD started in 2000. It assesses a sample of 15-year-olds’ reading, math, and science literacy every three years. In 2018, 600,000 students took the test, representing 79 countries or education systems.

It has also been divided into various norm-referenced proficiency levels in an attempt to classify students. Being norm-referenced, these proficiency levels will differ slightly from year to year because the cohorts of students will be different, meaning that the average scores will be different.

The test makers note that, “There are no natural breaking points to mark borderlines between stages along this continuum. Dividing the continuum into levels, though useful for communication about students’ development, is essentially arbitrary. Like the definition of units on, for example, a scale of length, there is no fundamental difference between 1 metre and 1.5 metres – it is a matter of degree. It is useful, however, to define stages, or levels along the continua, because they enable us to communicate about the proficiency of students in terms other than continuous numbers. This is a rather common concept, an approach we all know from categorising shoes or shirts by size (S, M, L, XL, etc.).” 

When you look at America’s results, you see that they are more or less in line with the OECD average, while lagging a bit in math. One thing that makes the PISA useful, beyond comparing different education systems, is that it breaks the data down by the student’s socioeconomic status. This is important because it helps us see how well we are teaching different groups of students. The OECD’s report found that the gap between advantaged and disadvantaged students in America is 11 points larger than the OECD average. It also breaks down performance based on gender. In 2018, girls performed 24 points better than boys in reading. This “gender gap” is better than the OECD average of 30 points. The performance gender gap in math favored boys by 9 points, larger than the OECD average of 5 points. In science, American boys and girls performed roughly the same.

In reading, 81% of American students were able to at least reach level 2 proficiency, compared with the OECD average of 77%. Essentially, this means that 81% of our students fit the following description: “At a minimum, these students can identify the main idea in a text of moderate length, find information based on explicit, though sometimes complex criteria, and can reflect on the purpose and form of texts when explicitly directed to do so.”

In math, 73% of our students reached level 2 proficiency or higher, slightly lower than the OECD average of 76%. Essentially, this means that 73% of our students can “interpret and recognise, without direct instructions, how a (simple) situation can be represented mathematically (e.g. comparing the total distance across two alternative routes, or converting prices into a different currency).”

In science, 81% of our students reached level 2 proficiency or higher, slightly better than the OECD average of 78%. Essentially, this means that 81% of our students can “recognise the correct explanation for familiar scientific phenomena and can use such knowledge to identify, in simple cases, whether a conclusion is valid based on the data provided.”

Summary of International Standardized Tests

When we look over results from the international standardized tests, we can take a level of comfort. Even though America has substantial room for improvement, no matter which test you are looking at, we are roughly in line with other higher performing countries. We should recognize this. It is not only doom and gloom. 

But we should also take a good hard look at the criterion-referenced ones (NAEP, TIMSS, PIRLS). The NAEP sets a very high standard, so there is not necessarily a need to fret about the low percentage of students who are measured proficient on that test. But both TIMSS and PIRLS are aligned to grade-level standards, and both show that we fail to get 20-30% of students to achieve at an acceptable level.

Iowa Assessments, formerly Iowa Test of Basic Skills (ITBS) 

While the Iowa Assessments started in Iowa, hence the “Iowa” in the name, they have a national reach. The Iowa Assessments are taken every year from kindergarten through eighth grade, and they assess Language Arts, Reading, Math, Science, and Social Studies. The test underwent a transformation in the 2011-2012 school year in order to be better aligned with the Common Core Standards and with other state standardized tests such as the Smarter Balanced exam. To go along with the change in focus, the ITBS was renamed Iowa Assessments.

This is a norm-referenced test, meaning that students are compared with each other, not to an outside standard, which allows for comparisons between students by using a percentile score. Essentially, if your child receives a score in the 50th percentile, then he/she scored higher than 50% of the test takers in that year. If your child scored in the 86th percentile, then he/she scored higher than 86% of the test takers in that year.

Given the norm-referenced format, the Iowa Assessments are not so easy to compare with each other over time, because each year involves a different set of students, and therefore, a different norm. They are best used to compare with students in the same year who took the same test. If you are looking at the test results over time, I would suggest taking them with a grain of salt.
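
Here is a small sketch of why I would be cautious. It assumes, as the paragraph above does, that the norm is recomputed from each year's test takers. The two cohorts and the raw score of 70 are invented for illustration and are not Iowa Assessments data.

```python
def percentile_rank(score, cohort_scores):
    """Percent of the cohort scoring strictly below `score`."""
    below = sum(1 for s in cohort_scores if s < score)
    return 100.0 * below / len(cohort_scores)


# Two hypothetical cohorts; the second one simply scored higher overall.
cohort_year_1 = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
cohort_year_2 = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

raw_score = 70  # the same raw performance in both years
rank_year_1 = percentile_rank(raw_score, cohort_year_1)
rank_year_2 = percentile_rank(raw_score, cohort_year_2)

print(rank_year_1)  # 60.0
print(rank_year_2)  # 40.0
```

The identical raw score drops twenty percentile points only because the comparison group changed, not because the student learned less.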

The Iowa test is not high stakes, but it does have more of an impact on the students than the NAEP, TIMSS, PIRLS, or PISA. Schools will commonly use results from the Iowa Assessments as one factor to place students in talented and gifted programs. As this test has real-life impacts on students, it is particularly important that the test makers check for content bias.

The Partnership for Assessment of Readiness for College and Careers (PARCC)

The PARCC is given to a representative sample of students in grades 3-11 annually and assesses mathematics along with English/Language Arts and is in alignment with the Common Core Standards.

PARCC is a criterion-referenced test (the Common Core is the criterion), and students are assigned performance levels between 1 and 5, with Level 3 and above considered to be passing.
Level 1: Did not yet meet expectations
Level 2: Partially met expectations
Level 3: Approached expectations
Level 4: Met expectations
Level 5: Exceeded expectations

If you want more information about what these performance levels actually mean, click here. If you want to really nerd out, check out this nearly 500 page technical report. Section 9.5, section 10 and section 11 are most relevant.

The PARCC results do not paint a particularly pretty picture of American education. For the 2015-2016 school year, the percent of students who met or exceeded expectations in ELA hovered around 40% at all grade levels. Math starts at 42.5% of students at least meeting expectations, but that lowly result plummets over time, finishing at 25.9% in 8th grade. Go ahead and look at the graphs. If you are interested in a breakdown by state or ethnicity, check out this pdf.

This is all the more concerning because the PARCC is aligned with the Common Core Standards, meaning that the tests are at grade level.

PARCC is a high-stakes test. Students may be held back if they do poorly. This makes concerns about bias extremely important.

The Classic Learning Test (CLT)

Meet CLT, the new kid on the block. It was started in 2015 with the intention of providing an alternative to the bigger, more famous standardized tests. It features “passages selected from great works across a variety of disciplines, the CLT suite of assessments provide a highly accurate and rigorous measure of reasoning, aptitude, and academic formation for students from diverse educational backgrounds.”

The CLT is offered as an alternative to the SAT and ACT, so the CLT is high stakes. However, our focus will be on their other tests. The CLT8 and CLT10 are standardized tests for 8th and 10th graders. These tests are norm-referenced, with the norm being based on a nationally representative sample of the CLT10 population. 

Content-wise, the CLT10 and CLT8 cover verbal reasoning (reading comprehension), grammar, writing, and quantitative reasoning (math). These exams are designed to be comparable to the PSAT, and the scores between the tests can be compared. If you are interested in comparing the scores, look at pages 29-33 of this link. If you are interested in how students performed based on income or race, look at Chapter 10 of the technical report. Unfortunately, the scores for race are only broken down into two categories, white and non-white. I would guess that this is due to sample size issues and that future reports will offer more detailed breakdowns, sample allowing.

There Are No Better Options

The data we get from the NAEP, TIMSS, PIRLS, PISA, and PARCC leaves plenty of room for concern. Internationally, we are essentially average in education, nothing to brag about. But when we look at how our students perform on grade-level assessments, there is real cause for concern. According to the PARCC exam, only around 40% of our students in grades 3-8 meet or exceed the standard in English Language Arts. In math, the story is much worse. Without these standardized tests we would only have a vague idea about these problems, so, until there is a better option, I am for standardized tests. It is important for us to know where educational inequalities and inefficiencies exist. Currently, if we were to replace standardized tests with any alternative, at best we would get fuzzier data.

Sources: 

Koretz, D. (2017). The Testing Charade: Pretending to Make Schools Better. University of Chicago Press.

Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2016). TIMSS 2015 International Results in Mathematics. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2015/international-results/

Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2017). PIRLS 2016 International Results in Reading. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/pirls2016/international-results/

Alternatives to Standardized Testing


Part 1: In Defense of Standardized Testing
Part 2: Alternatives to Standardized Testing
Part 3: Standardized Tests: NAEP, PIRLS, TIMSS, PARCC, PISA, ITBS, and CLT

Standardized testing comes with a sordid history of intentional discrimination, perverse incentives, suspicious discrepancies in scores, and outright cheating. What are the alternatives?

In my research for this blog series, a 2015 article by NPR about alternatives to standardized testing was referenced repeatedly. There were four main alternatives.
1. Sampling
Summary: This is essentially the same as standardized testing, but instead of testing all students, it would test a statistically representative group of students. This is what the NAEP and PISA do.
My Thoughts: I am not completely against this approach. It could be a decent compromise. But I would want my child to be assessed each year. I think it is valuable to see where my child stands in relation to children in the school, district, state, and nation. This isn’t an attempt to boast about the score. It gives valuable information to parents because the tests provide a reference point that goes beyond classroom grades and that is comparable with other locations. Does the test score roughly match my child’s grades? This ERIC Digest provides an excellent summary of how to use/interpret the results of a standardized test.
2. Stealth Assessment
Summary: This is basically gamification. Students are assessed based on their performance in a computer program.
My Thoughts: Technology can be amazing. But I don’t think this would be a wise direction to move towards. I have not seen any data on the validity of stealth assessment (I don’t think there is much research here yet). It would also bring up even more equity issues than the current set of standardized tests.
3. Multiple Measures
Summary: Instead of measuring based on one assessment (the test) it could use social and emotional skills surveys, game-based assessments (stealth assessment) and performance or portfolio-based assessments.
My Thoughts: There is important data here that would help parents, teachers, administrators, and policy makers, and it would seem obvious to me that we should assess schools and teachers on multiple measures. But wouldn’t the same accusations of bias involved in standardized testing be there for the surveys as well? And, since they are about social and emotional skills/norms, wouldn’t that be even more controversial than standardized academic tests?
Portfolio assessments should not be considered as a replacement for standardized tests because, based on what they are, it is impossible to standardize them. They can be great tools at the teacher/school level though.
I’ll spend some space talking about performance assessments later. They are the most promising alternative.
4. Inspections
Summary: An inspector will come and assess a variety of factors in the school.
My Thoughts: Even with observations, we cannot reliably assess individual teachers because there are so many variables (Wiliam, Leadership for Teacher Learning, Ch 2). Evaluating an entire school or school system in this manner would be exponentially more difficult.
Using inspections would give us good data (we should have some sort of inspection data as part of a multiple measures approach), but it would be much more expensive than standardized testing due to the required man hours and would be a very different type of data. It would not tell us much about what students are or are not learning.

The Most Promising Alternative

The specific alternative to standardized tests I find most promising is a type of performance-based assessment. There are, though, very significant hurdles that performance assessments will have to clear before I would be willing to consider replacing standardized tests with them.

The performance assessment would have to be externally imposed on schools in a similar way standardized tests currently are. The assessment would also have to be standardized. The purpose here is two-fold. Standardization allows for comparisons between different groups of students and it helps control the bias.

If the assessment is not standardized and given in a standardized manner, then the data generated will not be very useful for anything broader than the context the assessment was given in. There would be too many variables. The performance assessment should also be externally imposed because these assessments should function as a type of audit on the system. Is it working? Are all students being educated?

The last hurdle may be the largest. There is a paucity of research on performance assessments, and alternatives to standardized tests in general (Garcia & Pearson, 1994). I was not able to find anything more recent. It could be that I just don’t know the right search terms. If you are aware of more recent research on possible replacements for standardized tests, please send it my way either in the comments below or on Twitter (@Teacher_Fulton). We should not replace standardized tests with performance assessments until they have developed a track record at least as reliable as standardized tests.

The next post in this series will give an overview of several common standardized tests. (coming soon)

Sources:
Wiliam, D. (2016). Leadership for Teacher Learning: Creating a Culture Where All Teachers Improve So That All Students Succeed. Learning Sciences International.

In Defense of Standardized Testing

This series of articles is primarily concerned with standardized tests in compulsory education (Iowa Test of Basic Skills, PISA, TIMSS, PIRLS, NAEP). These tests differ from college entrance exams (ACT, SAT) in that, except for some state achievement tests, the tests tend to be low or no stakes for both the students and schools. 

Many educators have an aversion to standardized testing, and this is not without reason. Teachers spend an inordinate amount of time preparing their students for many of these tests and, beyond that, these tests have led to a narrowing of the curriculum. This happens in the misguided attempt to focus on reading and math by reducing the time spent on science, social studies, art, etc. (sometimes drastically!). This is misguided because, while it seems to make sense that you could increase these scores by spending more time on said subjects, doing so actually reduces background knowledge, which, after decoding, is the key to comprehension.

But It Gets Worse

Standardized tests have been intentionally used by educators to exclude minorities. For one example, you can look into the case of Larry P, a black student in California who was wrongly sent into special education. You can also read this article from Time Magazine for an overview of the negatives.

Other times, the blind spots of the test writers caused them to discriminate against girls, as Garcia and Pearson (1994) note:

“When girls outscored boys on the 1916 version of the test designers, apparently operating under the assumption that girls could not be more intelligent than boys, concluded that the test had serious faults. When they revised the 1937 version, they eliminated those items on which girls outperformed boys. By contrast, they did not revise or eliminate items that favored urban over rural children or children of professional fathers over children day laborers (Mercer, 1989); these cultural differences apparently matched developers’ expectations of how intelligence and achievement ought to be distributed across groups (Kamin, 1974; Karier, 1973a, 1973b; Mercer, 1989).”

Whether these blind spots are willful or simply ignorant is irrelevant for our purposes. What is important is that we acknowledge that this type of discriminatory bias is still a possibility in standardized tests today. 

Content Bias

This is the type of bias that is most often pointed out in standardized tests. Content bias is simply when the content of the test favors one particular culture over another, typically favoring the majority culture. This, by default, disadvantages minorities and so it is important to be able to counter content bias if we want standardized tests to be meaningful.

Thankfully, modern standardized test creators take bias seriously.

They “have used a variety of techniques to create unbiased tests (Cole & Moss, 1989; Linn, 1983; Oakland & Matuszek, 1977). Among others, they have examined item selection procedures, examiner characteristics, and language used on the tests as possible sources of bias. One of the most common methods used to control for test bias is that of examining the concurrent or predictive validity of individual tests for different groups through correlational or regression analysis.” (Garcia and Pearson, 1994).

For more detail on what this looks like in practice, read this EdSurge article. Managing content bias will always be a challenge, even with knowledge of history, advanced statistical tools, and a good heart.

Perverse Incentives

Many standardized tests also suffer from the Campbell effect. This simply means that when tests are important (high-stakes) for students or teachers, it is more likely that the results will be corrupted by any number of means.

Think about it: when teachers and schools are assessed based on their students’ performance, they will do what they can to look good. And when your job is on the line, you may be driven to take certain... “shortcuts.”

This often leads to the aforementioned narrowing of the curriculum, which disproportionately affects students in impoverished areas. 

On top of this, there are numerous cases of outright illegal behavior. Schools engaged in the practice of scrubbing, unenrolling students or encouraging a temporary truancy. There have also been cases of students being held back in grade 9 and then, after repeating said year, they jump up to grade 11, conveniently skipping the standardized tests (Koretz, The Testing Charade, Ch 5).

And then there are the cases of traditional cheating, the most famous of which is the disaster in Atlanta, where 11 educators were given felony convictions and 22 other teachers reached plea agreements.

We know that cheating is unfortunately not an isolated problem. It has been estimated that, on the low end, at least 5% of these high-stakes standardized tests involve cheating in some fashion (Jacob & Levitt, 2003).

Discrepancies in Test Scores

Poor students tend to score lower than wealthy students. Minority students tend to score lower than white students. This certainly should raise some red flags because it shows that there are real problems somewhere, though not necessarily with the test itself. Once we work to reduce the variables and compare students of different ethnicities who share a similar socioeconomic status and language level, the achievement gap is greatly decreased, but still significant (Garcia & Pearson, 1994), showing that there is at least one other, but likely multiple significant problems, somewhere.

The challenge here is two-fold. Is the primary problem with the standardized tests themselves or with unequal schools, differing home situations, etc? Both?

The Importance of Standardization

In America, 80% of teachers are white (NCES, 2019). Even if you choose to assume the best, it is foolish to assume that the average teacher is knowledgeable about every culture and can adequately adjust for content bias.

Standardization allows for a level of control over the bias because you only need to provide oversight to one group, not millions of teachers. In addition, the makers of standardized tests are specifically trained to create them and to analyze them for bias. This doesn’t mean they are perfect, but they are certainly better at making tests and adjusting for bias than the average teacher.

The main value provided by standardized tests is that they give data. Without this data, we would not be aware of the discrepancies in performance based on race or income mentioned above.

Now, we tend to use the data in order to make excuses. “These disparities exist because of economic inequality, we really need to fix that.” And, true enough. But economic inequality is not relevant for teachers to do their job. Our job is to teach students as they are. We need to get results with the students we have in the schools we’re at. If you use a student’s social situation to excuse their lack of learning, get out of education. Social situations provide context, not excuses. 

The data shows where teachers and schools are failing to educate their students. The data shows where problems are. We should use this to help schools help children. We should use this data as a tool to help us identify successful teaching methods. If we get rid of standardized assessments, we also get rid of this data. To do so is to choose to make ourselves blind, not a wise choice.

The scope of the problem is huge. Are there valid alternatives to standardized testing? (coming soon)

America fails too many of her students, but it isn’t all doom and gloom, though there is a fair share of it. Just take a look at how her students perform (coming soon).


Sources:
García, G. E., & Pearson, P. D. (1994). Chapter 8: Assessment and Diversity. Review of Research in Education, 20(1), 337–391. https://doi.org/10.3102/0091732X020001337
Jacob, Brian A. and Steven D. Levitt. “Rotten Apples: An Investigation Of The Prevalence And Predictors Of Teacher Cheating,” Quarterly Journal of Economics, 2003, v118(3,Aug), 843-878.
Koretz, D. (2017). The Testing Charade: Pretending to Make Schools Better. University of Chicago Press.