When the National Center for Education Statistics (NCES) reports differences in results, those differences are statistically significant. Understanding statistical significance in large-scale assessments, how results are estimated, and the influence of sample size is important when interpreting NAEP data in The Nation's Report Card. This guide explains how NAEP results are estimated and how statistical significance is used in NAEP data.
Like any survey based on a sample, NAEP results are subject to uncertainty. This uncertainty is reflected by the standard error of NAEP estimates; the more precise the estimate, the smaller the standard error.
The first source of uncertainty arises from the fact that NAEP assesses only a sample of students, rather than every eligible student (a census). The sample consists of a number of randomly selected students. Carefully constructed surveys can yield very precise estimates of population quantities, but a different, equally good sample of students could have been selected, and the results based on that second sample would be slightly different. Thus, the first component of the standard error is due to the sampling of students, termed "sampling variance."
In a good sampling design, the sampling variance decreases as the number of students selected increases. Results for large groups will tend to have smaller standard errors than results for smaller groups. A NAEP national assessment typically contains about 10,000 students. Some NAEP assessments include separate, state-level samples of over 2,000 students per state, which are combined to produce national results. These state-national assessments result in total samples of approximately 140,000 students. Thus, results for the nation based on NAEP state-national assessments will have much smaller standard errors than results from NAEP national-only assessments.
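The relationship between sample size and precision can be sketched with a simple calculation. The example below is only illustrative: it assumes simple random sampling and a hypothetical score standard deviation of 35 points, whereas actual NAEP standard errors reflect a more complex sample design.

```python
import math

# Illustrative only: NAEP's actual standard errors come from a complex
# sample design, not the simple random-sampling formula used here.
def approx_standard_error(score_sd, sample_size):
    """Under simple random sampling, the SE of a mean is sd / sqrt(n)."""
    return score_sd / math.sqrt(sample_size)

score_sd = 35  # hypothetical spread of student scores on a 0-500 scale

print(approx_standard_error(score_sd, 10_000))    # national-only sample   -> ~0.35
print(approx_standard_error(score_sd, 140_000))   # combined state samples -> ~0.09
```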
The second source of uncertainty in NAEP results is due to "measurement." Measurement variance arises from the fact that a student's proficiency in a subject (e.g., how good the student is at mathematics) is not directly observed but has to be estimated from the answers that the student provides to the items on the assessment. It is possible that, were the assessment given on a different day, the student might provide slightly different answers. Similarly, a different version of the assessment, composed of different but equally valid items, would give slightly different estimates of students' proficiency. These two factors give rise to what is typically termed "measurement variance."
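A small simulation can make this concrete. The sketch below uses hypothetical numbers: a student's fixed "true" proficiency is estimated several times under slightly different conditions, and the spread of those estimates illustrates measurement variance.

```python
import random
import statistics

random.seed(2)

# Illustrative only: a student's true proficiency is fixed, but each
# administration (a different day, or a different but parallel set of
# items) yields a slightly different estimate. The spread of those
# estimates illustrates measurement variance for that student.
true_proficiency = 260   # hypothetical value on a 0-500 scale
measurement_sd = 12      # hypothetical noise per administration

estimates = [random.gauss(true_proficiency, measurement_sd) for _ in range(5)]

print(estimates)                        # five slightly different estimates
print(statistics.variance(estimates))   # spread = measurement variance
```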
NAEP assessments contain an additional, related source of measurement uncertainty due to the sampling of items. The contents of all NAEP assessments are created according to the specifications of a framework, which is developed by the National Assessment Governing Board. NAEP frameworks are quite broad and multifaceted, and the resulting assessments are long: taking the full assessment would require approximately 5-6 hours for each student, which is an unreasonable burden to place on students. To limit the burden on individual students, NAEP items are grouped into blocks requiring 25-30 minutes to complete, and each student receives a booklet of two blocks. The fact that no student takes the entire assessment is an additional source of measurement uncertainty.
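The idea behind the block design can be sketched as follows, using a hypothetical pool of ten blocks; the operational NAEP design balances which blocks are paired and how often, rather than simply enumerating all possible pairs.

```python
from itertools import combinations

# Illustrative only: suppose a framework yields 10 blocks of items, each
# taking roughly 25-30 minutes. Pairing blocks into two-block booklets
# keeps each student's testing time manageable, while the blocks together
# still cover the full framework.
blocks = [f"block_{i}" for i in range(1, 11)]

# Every possible two-block booklet (operational NAEP uses a carefully
# balanced assignment rather than all pairs).
booklets = list(combinations(blocks, 2))

print(len(booklets))   # 45 possible two-block booklets
print(booklets[0])     # ('block_1', 'block_2')
```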
A third source of uncertainty affects some NAEP comparisons. In 2017, NAEP began its transition from a paper-and-pencil format to a digital format. In the digital assessment, the items are presented, and students respond, on a tablet. The transition from paper mode to digital mode required a special study in which two parallel (randomly equivalent) groups of students took the assessment, one on paper and the other on tablet. Based on the responses of these two groups of students, the scale of the digital assessment was "linked" to the existing paper-and-pencil NAEP scale. Because the linking is based on samples of students, the link could have been slightly different if different students had been sampled for the study. This source of uncertainty is termed "linking variance."
Linking variance is relevant when results from a digital assessment (e.g., 2017 mathematics) are compared to paper-based results (e.g., 2013 mathematics). Standard errors for comparisons within the same mode of testing, such as paper-to-paper comparisons (2015 vs. 2013 mathematics) or digital-to-digital comparisons (2017 boys' proficiency compared with 2017 girls' proficiency), do not include linking variance.
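As a rough illustration with hypothetical numbers, the standard error of a difference between two independent estimates combines the two groups' variances, with a linking-variance term added only when the comparison crosses the paper-to-digital link.

```python
import math

# Illustrative only, with hypothetical numbers: the standard error of a
# difference between two independent NAEP estimates combines both groups'
# variances; a linking-variance term is added only when the comparison
# crosses the paper-to-digital link.
def comparison_se(se_a, se_b, linking_variance=0.0):
    return math.sqrt(se_a**2 + se_b**2 + linking_variance)

# Same mode (e.g., 2015 vs. 2013 paper): no linking variance.
print(comparison_se(0.3, 0.3))                          # ~0.42
# Across modes (e.g., 2017 digital vs. 2013 paper): linking variance added.
print(comparison_se(0.3, 0.3, linking_variance=0.04))   # ~0.47
```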
The final variance of a NAEP result is the sum of these sources of variance. The standard error is the square root of that variance.
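A minimal worked example, using hypothetical variance components:

```python
import math

# Illustrative only, with hypothetical variance components: the total
# variance of a NAEP estimate is the sum of its components, and the
# standard error is the square root of that sum.
sampling_variance    = 0.50
measurement_variance = 0.20
linking_variance     = 0.04   # included only for cross-mode comparisons

total_variance = sampling_variance + measurement_variance + linking_variance
standard_error = math.sqrt(total_variance)

print(standard_error)   # sqrt(0.74) ~ 0.86
```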