Standardized tests are administered under standard conditions, scored in a standard way, and yield quantifiable results; they can include multiple-choice or performance-based questions (like essays); and, while usually administered to groups, they may also be individually administered. Standardized tests are used in schools for a variety of reasons (to determine the achievement or aptitude of individual students; to evaluate programs and curricula for improvement; to hold schools accountable), and they may be either low stakes (that is, having no or low-level rewards and consequences) or high stakes (that is, having serious consequences attached to the scores).
II. Standardized Testing Timeline
III. Mental Measurement Goes to School
IV. The Technical and Sociopolitical Nature of Measurement
V. Eugenics and Testing
VI. Biases in Standardized Tests
VII. Effects of Standardized Tests on Test Takers
VIII. Standardized Tests and Accountability
Standardized testing in schools has always been controversial. The controversy centers on the uses and misuses of the tests; cultural, class, and gender biases in the tests; whether they are benign or have a negative effect on those who take them; and whether they are a good indicator of the quality of learning or schools. Standardized tests have both the promise to reward merit and the reality of advantaging the already advantaged; they are the means to overcome racism and a source of racism; and they are the means of establishing accountability in schools and a constraint on what a good education is.
Beginning in kindergarten, standardized test results are used to sort, track, and monitor the abilities, achievements, and potentials of students. A concern is that standardized test results may be weighed more heavily than they ought to be, that decisions once made cannot or will not be reversed, and that other compelling information may be ignored. The uses of standardized testing are far-ranging. While there is considerable variation from one school district to the next, children will be administered at least one, but typically many more, standardized tests each year. Every state except Iowa and Nebraska administers state-mandated English and mathematics tests in grades 3 through 8, and, of those 48 states, 31 administer state-mandated tests in at least two of grades 9 through 12.
Standardized Testing Timeline
- 1845: Administration of the Boston Survey, the first known written examination of student achievement.
- 1895: Joseph Rice develops common written tests to assess spelling achievement in Boston schools.
- 1900–1920: Major test publishers open for business: the College Board, Houghton-Mifflin, Psychological Corporation, California Testing Bureau (now CTBS), and World Book.
- 1901: The first common college entrance examinations are administered.
- 1905: Alfred Binet and Theodore Simon develop the Binet-Simon Scale, an individually administered test of intelligence.
- 1914: Frederick Kelly invents the multiple choice question.
- 1916: Lewis Terman develops a U.S. version of the Binet-Simon Scale, which becomes a widely used individual intelligence test, and he develops the common concept of the intelligence quotient (or IQ).
- 1917: Army Alpha and Beta tests are developed by Robert Yerkes, then president of the American Psychological Association, to efficiently separate potential officers from soldiers as the United States entered World War I.
- 1926: The Scholastic Aptitude Test (SAT), adapted directly from the Army Alpha Test, is first administered.
- 1955: The high-speed optical scanner is invented.
- 1969: The National Assessment of Educational Progress is first administered.
- 1980s: Computer-based testing is developed.
- 2001: No Child Left Behind is signed into law, reauthorizing the Elementary and Secondary Education Act, and using student achievement test scores to hold schools accountable.
- 2009: Race to the Top, a federal program designed, in part, to amend No Child Left Behind, is launched; by mid-2010, 27 states have implemented new math and English standards based on the program’s recommendations.
Mental Measurement Goes to School
The use of standardized student achievement testing in U.S. schools dates back to 1845 with the administration of the Boston Survey. Horace Mann, then secretary of the Massachusetts State Board of Education, oversaw the development of a written examination covering topics such as arithmetic, geography, history, grammar, and science. The test battery, 154 questions in all, was given to 530 students sampled from the more than 7,000 children attending Boston schools. Mann was moved to create these tests because of what he perceived to be a lack of consistency and quality in the Boston schools.
This was followed not long after by Joseph Rice’s work, also in Boston. In the decade beginning in 1895, Rice organized assessment programs in spelling and mathematics in a number of large school systems. Much as Horace Mann wanted to see more consistency in what was taught in schools, Rice was motivated by a perceived need to standardize curriculum.
About this same time, in 1904, E. L. Thorndike, known as the father of educational testing, published the first book on educational measurement, An Introduction to the Theory of Mental and Social Measurement. He and his students developed many of the first achievement tests emphasizing controlled and uniform test administration and scoring.
But the real impetus for the growth of testing in schools came from the then-developing emphasis on intelligence tests, particularly those that could be administered to groups rather than individuals, work that built on the basics of Thorndike’s achievement tests. In 1917, the Army Alpha (for literate test takers) and the Army Beta (for illiterate test takers) intelligence tests were developed. During World War I, Robert Yerkes, Lewis Terman, and others took up the challenge of helping the military distinguish between recruits who were officer material and those who were better suited to the trenches. Within a year and a half, Terman and his student Arthur S. Otis had tested more than 1.5 million recruits.
Terman also created the Stanford Achievement Test, which he used in his longitudinal study of gifted children and from which came the term intelligence quotient, or IQ. The Rockefeller Foundation was so taken with Terman’s work that it supported his recommendation that every child be administered a “mental test” and in 1919 gave Terman a grant to develop a national intelligence test. Within the year, tests were made available to public elementary schools.
Test publishers quickly recognized the potential of testing in schools and began developing and selling intelligence tests. Houghton-Mifflin published the Stanford-Binet Intelligence Test in 1916. The commercial publication of tests is critical, since many of the efficiencies of the testing industry, such as machine scanning, resulted from efforts to gain market share. In turn, the ability to process large quantities of data permitted ever more sophisticated statistical analyses of test scores, certainly with the intention of making the data more useful to schools, teachers, and counselors.
Until the onset of the current high-stakes testing movement, achievement and ability tests served a number of purposes, but in his 1966 book The Search for Ability: Standardized Testing in Social Perspective, David Goslin summarized what were at the time the typical uses of standardized tests in schools:
- to promote better adjustment, motivation, and progress of the individual student through a better understanding of his abilities and weaknesses, both on his own part and on the part of his teachers and parents
- to aid in decisions about the readiness of the pupil for exposure to new subject matter
- to measure the progress of pupils
- to aid in the grade placement of individuals and the special grouping of children for instructional purposes within classes or grades
- to aid in the identification of children with special problems or abilities
- to provide objective measures of the relative effectiveness of alternative teaching techniques, curriculum content, and the like
- to aid in the identification of special needs from the standpoint of the efficiency of the school relative to other schools
While Goslin’s list represents the emphasis on local uses of testing, at this same time, during the administration of John F. Kennedy, there was a growing interest in national assessment. During this period, Ralph Tyler was called upon to oversee the development of a national testing system, which would become the National Assessment of Educational Progress (NAEP), first administered in 1969 by the Education Commission of the States. The creation of NAEP allowed for state-by-state comparisons and a common metric for all U.S. students, and all states are now required to participate in NAEP testing.
The Technical and Sociopolitical Nature of Measurement
The development of standardized means for measuring intelligence, ability, and achievement coincided with a remarkable explosion of scientific knowledge and technological advance across a wide range of domains. The industrial growth during most of the 20th century and the information technology growth of the late 20th century are the context for the use and development of assessments that differentiate individuals for the allocation of scarce resources such as jobs, postsecondary education, and scholarships.
Without the power of ever more advanced and complex technology, both in data management and in statistical analysis, it is doubtful that student assessment would be the driving force of the accountability demanded in the current standards-based reform movement. The development of testing technology is a series of changes, each responding to a contemporary constraint on testing and each enhancing the efficiency of testing, that is, the ability to test more people at less cost and in less time. Charles Spearman’s invention of factor analysis in 1904, Lindquist’s invention of the optical scanner in 1955, the development of item response theory in the early 1950s by Fred Lord and Darrell Bock, as well as the variant developed by Georg Rasch in 1960, and the development of matrix sampling by Darrell Bock and Robert Mislevy in the 1960s and 1970s are examples of these technological enhancements. Even so, a number of areas in student assessment, such as strategies for standard setting, remain astonishingly unsophisticated. In many ways, the educational measurement community has operated on the assumption that the appropriate use of assessment in schools is a matter of making good tests and being able to manipulate the scores in sophisticated ways. However, testing is also a sociopolitical activity, and even technically sound measures and procedures are transformed when they are thrown into the educational policy and practice arena.
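As a rough illustration of the kind of statistical machinery mentioned above, the variant of item response theory associated with Georg Rasch can be sketched in a few lines. In that model, the chance of a correct answer depends only on the gap between a test taker's ability and an item's difficulty. The function name, the item labels, and all numeric values below are hypothetical, chosen for illustration rather than drawn from any real test.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the one-parameter (Rasch) model.

    Both arguments are on the same logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item difficulties (in logits), assumed for illustration only.
items = {"easy": -1.0, "medium": 0.0, "hard": 1.5}

# Hypothetical test-taker abilities, also in logits.
for ability in (-1.0, 0.0, 1.0):
    probs = {name: round(rasch_probability(ability, b), 2) for name, b in items.items()}
    print(ability, probs)
```

When ability equals difficulty the probability is exactly one half, and it rises as ability exceeds difficulty, which is what lets items and test takers be placed on a common scale.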
There is and has been great optimism about what testing and measurement in schools can accomplish. Robert Linn, contemporary father of educational measurement, suggests in his 2000 Educational Researcher article that we are overly optimistic about the promises of what can be delivered:
I am led to conclude that in most cases the instruments and technology have not been up to the demands that have been placed on them by high-stakes accountability. Assessment systems that are useful monitors lose much of their dependability and credibility for that purpose when high stakes are attached to them. The unintended negative effects of high-stakes accountability uses often outweigh the intended positive effects. (19)
Eugenics and Testing
Early U.S. work on mental measurement was deeply informed by a presumed genetic basis for intelligence and for differences in intelligence among people. In 1905, the French psychologist Alfred Binet developed a scale for measuring intelligence that was translated into English by the American psychologist Henry H. Goddard, who was keenly interested in the heritability of intelligence. Although Binet did not hold the view that intelligence was inherited and thought tests were a means for identifying ways to help children having difficulty, Goddard and other U.S. hereditarians disregarded his principles. Goddard believed that “feeble-mindedness” was the result of a single recessive gene. He would become a pioneer in the American eugenicist movement.
“Morons” were Goddard’s primary interest, and he defined morons as “high grade defectives” who possess low intelligence but appear normal to casual observers. In addition to their learning difficulties, Goddard characterized morons as lacking self-control, susceptible to sexual immorality, and vulnerable to other individuals who might exploit them for use in criminal activity.
Lewis Terman was also a eugenicist and popularized Binet’s work (that is, Goddard’s translation of it) with the creation of the Stanford-Binet Test. A critical development, and one that still sets the parameters for standardized testing, was Terman’s standardizing of the scale of test scores: 100 was the average score, and the standard deviation was set at 15. Terman (along with others, including E. L. Thorndike and R. M. Yerkes) promoted group testing for the purpose of classifying children in grades three through eight, tests that were published by the World Book Company (the current-day Harcourt Brace). The intent of these tests was clear: to identify the feeble-minded and curtail their opportunity to reproduce, thus saving America from “crime, pauperism, and industrial inefficiency.” Although a lively debate (most especially with Walter Lippmann) about the value and justifiability of Terman’s claims was waged in the popular press, this eugenicist perspective persisted.
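The score scale described above, with an average of 100 and a standard deviation of 15, amounts to a simple linear rescaling of raw scores. The following is a minimal sketch of that arithmetic, not a reconstruction of Terman's actual procedure; the function name and the raw scores are hypothetical.

```python
from statistics import mean, pstdev

def to_standard_scale(raw_scores, target_mean=100.0, target_sd=15.0):
    """Rescale raw scores so the group has the target mean and standard deviation."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population standard deviation of the norming group
    if sd == 0:
        raise ValueError("all raw scores are identical; cannot rescale")
    # Convert each raw score to a z-score, then stretch and shift it.
    return [target_mean + target_sd * (x - m) / sd for x in raw_scores]

# Hypothetical raw scores for a small norming sample (illustrative only).
raw = [12, 15, 18, 21, 24]
scaled = to_standard_scale(raw)
print([round(s, 1) for s in scaled])
```

On such a scale a score carries meaning only relative to the group used for norming: a score of 115 says the test taker is one standard deviation above that group's average, and nothing more.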
In addition, Terman’s classifying of student ability coincided with ideas emerging among progressive educators in the 1910s. Progressives believed curriculum and instructional methods should be scientifically determined, and Terman’s tests and interpretations fit the bill. Few seriously questioned his assumptions about the hereditary nature of intelligence or that IQ was indeed a valid measurement of intelligence. By “scientifically” proving that recent immigrants and blacks scored lower than whites due to an inferior mental endowment, he catered strongly to the nativism and prejudice of many Americans.
Although most contemporary experts in mental measurement eschew these eugenicist beginnings, the debate lives on, manifest more recently in the work of Herrnstein and Murray in the much-debated book The Bell Curve. The authors of this treatise have used intelligence testing to claim African Americans are genetically intellectually inferior. But their arguments are connected to class as well, and herein may lie the most obvious connections to the advocacy of testing by powerful politicians and corporate CEOs. Questions and answers they pose are:
How much good would it do to encourage education for the people earning low wages? If somehow government can cajole or entice youths to stay in school for a few extra years, will their economic disadvantage in the new labor market go away? We doubt it. Their disadvantage might be diminished, but only modestly. There is reason to think that the job market has been rewarding not just education but intelligence. (96)
Race and class, which are inextricably linked in contemporary society, remain important considerations in measurement and assessment. There is ample evidence suggesting that achievement test scores predict parental income better than they predict anything else.
Biases in Standardized Tests
Standardized tests used in schools have always been criticized for their potential biases, especially since test developers may differ from many of the students taking the tests and may take for granted their own cultural, class, and racial upbringing and education. Although test developers strive to develop fair tests, it is difficult to make a single test that accurately and fairly captures a student’s achievement rather than his or her life experiences. For example, a test item that asks about the motion of two trains moving on parallel tracks may be obvious to adults but quite confusing to a child who has never had occasion to ride on a train. Another example is the obvious class bias in the oft-cited analogy question on the SAT: runner is to marathon as oarsman is to regatta. The bias in standardized test questions may be based on cultural, class, ethnic, gender, or linguistic differences.
The increased use of high-stakes standardized tests (those whose results are used to make important decisions resulting in rewards or punishments) has, however, reinforced the disadvantages standardized tests present for students of color and those living in poverty. High-stakes testing is disproportionately found in states with higher percentages of people of color and people living in poverty. A recent analysis of the National Educational Longitudinal Survey shows that 35 percent of African American and 27 percent of Hispanic eighth graders will take a high-stakes test, compared to 16 percent of whites. Viewed along class lines, 25 percent of low socioeconomic-status (SES) eighth graders will take a high-stakes test, compared to 14 percent of high-SES eighth graders. Students of color are more likely to take high-stakes tests, and they also score lower than white students. With the advent of high-stakes testing, dropout rates for students of color have increased, either because they do not do well on the tests required for graduation or because they are pushed out by school districts that are judged by their overall test scores.
Effects of Standardized Tests on Test Takers
Although standardized tests are meant to facilitate educational decision making at many levels, it is important to consider the experience of the test taker. Generally, policymakers, test developers, and educational bureaucrats assume the taking of standardized tests will be benign, with relatively modest positive or negative effects on children. Research, however, suggests standardized testing contributes to unhealthy levels of student stress, resulting sometimes in serious mental health problems and even suicide.
A less dramatic concern is that emphasizing performance on standardized tests (and even grades) diminishes students’ motivation to learn. Rather than focusing on the value of learning, educational contexts that emphasize outcomes focus students on getting the grade or test score, that is, on what is required to do well on the test rather than on genuine learning. Lifelong learning and critical thinking are not key educational outcomes when the focus is on test scores.
Standardized Tests and Accountability
Despite cautions about the value of standardized testing, test scores are now the common language used by education bureaucrats, politicians, and the media to summarize the quality, or lack of quality, of schools. This is a recent use of standardized tests, though, and the last decade has seen a turn to high-stakes testing. While test scores have been used within the education bureaucracy for many decades, it was not until the late 1970s, when the College Board reported declines in SAT scores, that these scores became the means for describing the quality of schools. And the debate about the meaning and value of those scores for such purposes began immediately.
Culturally, Americans are drawn to statistical and numerical indicators, which are perceived to be factual and objective. Perhaps this is a result of public policy debate in a complex democracy where there is a tendency to gravitate to simple means for resolving differences in deeply held value positions. Perhaps it is a romance with technology and science. In a few decades, standardized test scores have infiltrated the popular culture as the obvious, inevitable indicator of the quality of education and schooling. In such a short time, we have forgotten there are many other sorts of evidence that have been and can be used to describe the quality of schools and schooling.
The reporting of test scores in the media, primarily in newspapers, is now expected and often uncritically examined. Typically, the test scores on state-mandated standardized tests are published annually in local newspapers. Little, if any, information is provided about the meaning of the scores. Topics like measurement error (which, for example, would tell one that on the SAT, a difference of at least 125 points between two scores is necessary to be confident there is any real difference between the two test takers) and test validity (whether the standardized test being used is one that was developed for that particular use) are usually not included, either because they are not understood by education reporters or are perceived as too complex for the average reader. Similarly, schools and districts are often ranked—an absolute folly given the error in the scores and the conceptual problem of truly being able to justifiably rank order hundreds of things.
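The reasoning behind a minimum meaningful score difference can be sketched with the usual standard error of the difference between two scores. This is a minimal sketch under stated assumptions: both scores share the same standard error of measurement (SEM), their errors are independent, and the SEM of 30 points is a made-up value for illustration, not a published figure for the SAT or any other test. The function names are hypothetical as well.

```python
import math

def min_meaningful_difference(sem, z=1.96):
    """Smallest gap between two scores unlikely to reflect measurement error alone.

    Assumes both scores have the same SEM and independent errors, so the
    standard error of their difference is sqrt(2) * sem. The default z of
    1.96 corresponds to roughly 95 percent confidence.
    """
    se_diff = math.sqrt(2) * sem
    return z * se_diff

def differ_reliably(score_a, score_b, sem, z=1.96):
    """True if the two scores are far enough apart to suggest a real difference."""
    return abs(score_a - score_b) >= min_meaningful_difference(sem, z)

# Hypothetical SEM of 30 scale points (an assumption for illustration only).
threshold = min_meaningful_difference(sem=30)
print(round(threshold, 1))
print(differ_reliably(1100, 1150, sem=30))
print(differ_reliably(1100, 1200, sem=30))
```

The same logic explains why ranking schools by average score is hazardous: neighbors in a ranking often differ by less than the threshold, so their order may be little more than noise.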
While test developers have always cautioned that tests should be used for the purposes for which they were developed, this desire for a simple common metric to judge a student’s learning, the quality of a school, a teacher’s performance, or a state or country’s educational attainment has led to more misuses of tests. More and more standardized tests are required, and too often tests are being used in ways they were not intended to be used. The most common example of this is when general achievement tests developed specifically to create a spread of scores from very low to very high with most scores in the middle (like the Iowa Test of Basic Skills) are used to determine what students have actually learned. James Popham, notable measurement expert, likens this misuse to taking “the temperature with a tablespoon.”
Standardized tests have been a part of education and schools for many decades and will continue to be a part of the educational landscape. Controversy will surely continue to surround their uses and the psychological, social, and political aspects of their use.