Chapter 11
Concluding Thoughts: Creating Next-Generation Assessments That Last

Linda Darling-Hammond, Frank Adamson, and Thomas Toch

Ask smart people—workforce experts, labor economists, university faculty—about the skills that schools should be teaching students, and they all talk about the same things: problem solving, analyzing and synthesizing information, thinking creatively, communicating clearly. These are the sorts of skills that higher-paying jobs increasingly demand—skills that reformers of various stripes have been pushing public education to teach since a blistering US Department of Education–funded critique of American education in the early 1980s ignited the school reform movement that continues today.

Whether the context is the changing nature of work, international competitiveness, or, most recently, calls for common standards, the premium today is not merely on students’ acquiring information but on recognizing what kind of information matters, why it matters, and how to combine it with other information—what many people now call twenty-first-century skills. Remembering information is no longer the highest priority in classrooms; instead, the emphasis is on figuring out what students can do with that knowledge in new situations. And with new research revealing that young children are far more able to engage in complex thinking skills than once thought—and that problem-solving skills are the basics on which other skills are built—teaching students how to become analytical and strategic in applying what they learn is now important in elementary classrooms as well as in high schools.

The federal No Child Left Behind Act (NCLB), passed by Congress in 2001 to promote school improvement by holding local educators accountable for their students’ achievement, called for “high standards of academic achievement [for] all public elementary school and secondary school students.” The law put the entire public education system on a new outcomes-based footing, and it cast in sharp relief the second-class educational status of students of color and those from disadvantaged backgrounds. However, the standardized tests that states introduced to comply with NCLB have generally not sought to gauge students’ grasp of the thinking skills that experts say students should master. Instead, tight NCLB testing time lines, the scale of testing required under the federal law, and pressure from state elected officials to lower costs led to tests that rely heavily on multiple-choice questions measuring mostly low-level tasks like the recall of information in reading passages. These questions can be administered and scored rapidly and inexpensively, but by their very nature, they are not well suited to judging students’ ability to express points of view, marshal evidence, and display other advanced skills.

A number of states that implemented performance assessments in the early 1990s scaled them back as a result of technical concerns, implementation burdens, or costs, especially when NCLB increased testing requirements to reach every child every year. In addition, the federal Department of Education was often unwilling to approve innovative testing systems.

Because teachers tend to teach what is tested, especially when high stakes are attached to the scores, the expansion of multiple-choice measures of simple skills has narrowed the opportunities for lower-achieving students to attain the higher standards that NCLB has sought for them, and it has placed a glass ceiling over many more advanced students who are unable to demonstrate the depth and breadth of their abilities on such exams. The tests have discouraged teachers from teaching more challenging skills and from having students conduct experiments, make oral presentations, write research papers, use new technologies, and do other activities that teach such skills and pique students’ interest in learning at the same time.

The NCLB school accountability model and the standardized testing that undergirds it may have established an academic floor for the nation’s students, but they have not catalyzed the pursuit of genuinely higher standards of thinking and performance. It is therefore perhaps not surprising that US students outperform many of their international counterparts on tests such as the Trends in International Mathematics and Science Study (TIMSS), which measures knowledge as given, while they do much less well on international tests that gauge students’ ability to apply knowledge, such as the Program for International Student Assessment (PISA). With the exception of a few states like Massachusetts, we are today, under NCLB, still pursuing the basic skills testing that was introduced in the 1980s, when policymakers took their first tentative steps toward holding schools accountable for their students’ performance, and long before it became clear that success in a rapidly changing and increasingly complex world required students to master much more than the ability to recognize one answer out of five.

NEW OPPORTUNITIES

The advent of the Common Core State Standards and the Next Generation Science Standards, together with the emergence of new accountability systems under federally approved waivers from NCLB, provides a potential opportunity to address this fundamental misalignment between our aspirations for students and the assessments we use to measure whether they are achieving those goals. The United States has an opportunity to create a new generation of assessments that build on NCLB’s strengths, including its commitment to accountability for the education of traditionally underserved groups of students, while measuring a wider range of skills and expanding the definition of accountability to include the teaching of such skills.

These new assessments would rely more heavily on the kinds of performance measures described in this book: tasks requiring students to craft their own responses to complex problems rather than merely select from among multiple-choice answers that in many instances require little thinking and reward guessing. They range from short-answer tasks such as constructing and explaining a problem solution to extended work like writing essays, engaging in research, and conducting laboratory investigations. Like the road test that virtually all adults have taken to gain a driver’s license, these performance assessments ask students to demonstrate what they can actually do with their knowledge when it is applied in practice.

As we have described, there are many examples of large-scale performance assessments in the United States and other countries that feature tasks in virtually all subject areas. They range from Kentucky’s long-standing writing portfolio and the New York State Regents Examinations to the hands-on science experiments and computer-simulated tasks of the National Assessment of Educational Progress (NAEP), Connecticut’s and Vermont’s high school science assessments, the Collegiate Learning Assessment, England’s General Certificate of Secondary Education exams, and similar assessments in Hong Kong, Singapore, and Australia.

Research shows that well-designed performance assessments yield a more complete picture of students’ abilities and weaknesses and can overcome some of the validity challenges of assessing English learners and students with disabilities. The use of performance measures has been found to increase the intellectual challenge in classrooms and support higher-quality teaching. Students who routinely engage in instruction where they are expected to demonstrate applications of their knowledge and explain and defend their answers have often been found to outscore other students on both traditional tests and more complex measures.

When teachers are involved in scoring essays and other performance measures, as they are in the assessment systems of high-achieving nations and some states today, they can become more knowledgeable about how to evaluate and teach to challenging standards. Teacher involvement in scoring has been found to offer a powerful professional development opportunity that translates into a stronger ability to design and implement standards-based curriculum. Such tests are thus tied more closely to the improvement of classroom instruction and can support more expansive and productive student learning.

All of these factors are driving the increased use of performance assessments around the world. As the Hong Kong Examinations and Assessment Authority (2009) explained while introducing new school-based performance assessments into its examination system:

The primary rationale for school-based assessments (SBA) is to enhance the validity of the assessment, by including the assessment of outcomes that cannot be readily assessed within the context of a one-off public examination, which may not always provide the most reliable indication of the actual abilities of candidates. . . . SBA typically involves students in activities such as making oral presentations, developing a portfolio of work, undertaking fieldwork, carrying out an investigation, doing practical laboratory work or completing a design project, [that] help students to acquire important skills, knowledge and work habits that cannot readily be assessed or promoted through paper-and-pencil testing. Not only are they outcomes that are essential to learning within the disciplines, they are also outcomes that are valued by tertiary institutions and by employers.

There are numerous challenges to using performance measures on a much wider scale, such as ensuring the measures’ rigor and reliability. But valuable lessons for addressing such challenges can be learned from a growing number of high-achieving industrialized nations that have successfully implemented performance assessments for many years, from a series of state experiments with performance assessments in the 1990s, from the expansion of the Advanced Placement (AP) and International Baccalaureate (IB) programs, and from the growth of performance measures in the military and other sectors, developments that have been aided by substantial advances in testing technology. This large body of work suggests that performance assessments pay significant dividends to students, teachers, and policymakers alike and that the assessments can be built to produce confident comparisons of individual student performance over time and comparisons across schools, school systems, and states.

Our goal in this book has been to provide a thorough analysis of the prospects for and challenges of introducing and sustaining standardized performance assessments on a large scale. We have studied extensively the history and current uses of performance assessments in the United States and abroad, the technical advances that have been made, and the impacts that have been documented by researchers.

CHALLENGES AND LESSONS

The challenges associated with using performance measures on a large scale include the need to ensure the tests’ rigor and technical reliability and to manage their cost and time requirements. The experiences of a growing number of high-achieving nations that use large-scale performance assessments effectively, the record of the IB and AP testing programs, successful state experiences with performance assessments, and the growth of performance measures in the military and other sectors illustrate how such assessments can be reliably and cost-effectively incorporated into testing systems. And studies have demonstrated that performance tasks can be designed in ways that allow them to measure student achievement accurately and permit the comparison of results across students and schools and from year to year—necessary features of tests used to hold schools accountable for their students’ results.

Lessons

The research reviewed in this book shows that reliable, valid, feasible, and cost-effective performance assessments can be developed with attention to these topics:

  • Careful task design based on a clear understanding of the specific knowledge and skills to be assessed (important disciplinary content that validly represents core concepts and abilities) and an understanding of how students develop cognitively. Tasks should be clear about what criteria define a competent performance and should be vetted through rigorous field testing to ensure that the items or tasks are understandable and are measuring the intended concepts and abilities fairly and without bias. When these principles are followed, studies have found that assessments can be made comparable and valid across time, tasks, and raters.
  • Reliable scoring systems based on standardization of tasks and well-designed scoring rubrics, training of scorers, moderation of the scoring process to ensure consistency in applying the standards, and auditing of the system to double-check and upgrade comparability. Well-developed systems with these features have produced interrater reliability with levels of agreement of 90 percent or higher, comparable to the AP exams and other well-respected tests.
  • Methods for ensuring fairness based on the use of universal design principles, careful linguistic choices to avoid sources of confusion unrelated to the content being measured, cultural review of items, and pilot testing of tasks to see how they perform with different test takers. Compared with traditional standardized tests, carefully designed performance assessments have been found to produce more successful evaluations of knowledge for English learners, special education students, and students who struggle in other ways.
  • Effective use of technology to deliver and administer assessments; enable simulations, research tasks, and other sophisticated assessment opportunities; adapt assessments to better measure student abilities and growth; and support both human scoring and machine scoring of open-ended items, which is becoming more reliable and effective. (As a measure of the potential for technology to streamline performance testing, the NAEP has found that human and computer scoring of a set of physics simulations matches 96 percent of the time.)
  • Professional development that enables educators to learn to build, use, and score assessments that will inform and guide their teaching. Many systems have demonstrated that teachers can develop this knowledge rapidly when given the support. In successful systems, teachers are engaged in curriculum alignment, performance task development, scoring processes, and data analysis so that they understand the system and can teach productively to the standards. The processes include a peer review or moderation system that provides a feedback loop, checks on quality, and offers directions for staff development.
  • Administrative support from education agency officials and legislators at the state and federal levels, offering targeted assistance to teachers, administrators, and school systems that allows their effective participation in new assessment systems and leverages improvements in teaching. In addition to professional development, this includes widespread information, extensive training in both use and scoring, redesign of curriculum materials to ensure alignment with and support of new assessments, and redesign of school schedules to provide in-class time for more in-depth work on the part of students and out-of-class time for teachers’ planning, analysis, and scoring of student work, as is common in other countries.
  • Proper use that reveals areas needing improvement and leads to curriculum and professional learning supports. Tools such as task blueprints, rubric specifications, and training and scoring protocols should be developed to support the proper use of performance assessments. Use of assessments for information rather than sanctions allows the development of more ambitious tasks aimed at higher standards and less corruption of the assessment system. This framework for assessment has driven stronger learning and higher achievement in many nations abroad.
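
The interrater agreement levels mentioned in the list above (90 percent or higher) can be made concrete with a small calculation. The sketch below is only an illustration, not a method described in this book: it assumes two hypothetical raters who each score the same ten student responses on a 1-to-4 rubric, and it computes the simplest agreement statistic, the share of responses on which both raters gave exactly the same score. The scores and the function name are invented for the example.

```python
def percent_agreement(rater_a, rater_b):
    """Return the percentage of responses on which two raters gave the same score."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same number of responses")
    if not rater_a:
        raise ValueError("At least one scored response is required")
    # Count the responses where the two scores match exactly.
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical rubric scores (1-4) for ten student responses.
scores_a = [3, 4, 2, 3, 1, 4, 3, 2, 4, 3]
scores_b = [3, 4, 2, 3, 1, 4, 3, 2, 3, 3]

agreement = percent_agreement(scores_a, scores_b)
print(f"Exact agreement: {agreement:.0f} percent")
```

Exact agreement is the easiest figure to interpret, but it does not correct for matches that would occur by chance; operational scoring programs typically report additional statistics alongside it.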

Costs

Costs, especially for scoring, are another concern. Studies have found that performance-based tests tend to be about twice as expensive as tests that rely exclusively on multiple-choice questions. But a detailed cost modeling study grounded in real-world prices shows that it is possible to construct large-scale assessments that combine multiple-choice questions and performance measures for no more than today’s much-less-informative tests—about twenty-five dollars per pupil for English language arts and math combined. This can be accomplished by taking advantage of the economies of scale that will accompany states banding together in consortia, tapping the efficiencies of technology in administering tests and supporting scoring, and using teachers strategically in the scoring of performance items.

Appropriate, affordable, and educationally supportive scoring models must be developed. In most European and Asian systems, and in those used in several US states, scoring of assessments is conducted by teachers and time is set aside for this aspect of teachers’ work and learning. While teacher time to create and score the assessments can be substantial, these activities lead to more skilled and engaged teachers. Teachers often report that some of the best professional development of their careers occurs when they have opportunities to examine, score, and discuss student work. Importantly, international assessments have strategically captured teacher professional development time to evaluate and validate student work. Capitalizing on this time can both lower costs and establish a common language around curriculum standards and assessment.

While the use of performance tasks does require time and expertise, educators and policymakers in high-achieving nations believe that the value of rich performance assessments far outweighs their cost. Nations around the world have expanded their use of performance tasks because these deeply engage teachers and students in learning, make rigorous and cognitively demanding instruction commonplace, and, leaders argue, increase students’ achievement levels and readiness for college and careers. While looking to economize, it is also important to put the costs of high-quality assessment into perspective. Even if states spent fifty dollars per pupil on assessments each year (more than twice the estimated costs of a balanced system), this would still be far less than 1 percent of the costs of United States education overall.
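
The comparison in the preceding paragraph can be checked with simple arithmetic. The sketch below is a rough illustration under a stated assumption: the fifty-dollar and twenty-five-dollar figures come from the text, while the total spending of $10,000 per pupil per year is a hypothetical round number chosen for the example, not a figure reported in this book.

```python
# Per-pupil assessment costs taken from the text (dollars per year).
BALANCED_SYSTEM_COST = 25.0   # estimated cost of a balanced system
HIGH_END_COST = 50.0          # the "even if" scenario in the text

# ASSUMPTION: a hypothetical round figure for total annual spending per pupil.
ASSUMED_SPENDING_PER_PUPIL = 10_000.0


def share_of_spending(assessment_cost, total_spending):
    """Return assessment cost as a percentage of total per-pupil spending."""
    return 100.0 * assessment_cost / total_spending


balanced_share = share_of_spending(BALANCED_SYSTEM_COST, ASSUMED_SPENDING_PER_PUPIL)
high_end_share = share_of_spending(HIGH_END_COST, ASSUMED_SPENDING_PER_PUPIL)

print(f"Balanced system: {balanced_share:.2f} percent of spending")
print(f"Fifty-dollar scenario: {high_end_share:.2f} percent of spending")
```

Under this assumption, even the fifty-dollar scenario stays below 1 percent of per-pupil spending, and any higher spending figure would make the share smaller still.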

POLICY RECOMMENDATIONS

Performance assessment is a key component in a balanced assessment system that responds to fast-paced changes placing greater demands on education and knowledge development in the United States and around the rest of the world. Images of what students will need to do with their knowledge should help shape formulations of curriculum, instruction, and assessment policy at the national, state, and local levels. As a starting point for the development of the next generation of assessments, policymakers must begin with a vision of young people as lifelong learners who deeply understand core concepts and modes of inquiry within the disciplines and who can also work across disciplines to evaluate evidence, frame and solve problems, express and defend their ideas, and create new ideas, technologies, and solutions.

We have noted that consortia of states have undertaken efforts to refine standards for learning, so that they are internationally benchmarked and are fewer, higher, and deeper. To ensure that new assessments are developed to fully represent the new standards, federal and state policy should:

  1. Fund an intensive development effort that enables states and consortia of states, in collaboration with development experts in federal centers, nonprofit organizations, and universities, to
    • Develop, validate, and test high-quality performance assessments that are part of balanced assessment systems and are guided by thoughtful, coherent standards and curriculum frameworks.
    • Train the field of practitioners—ranging from psychometricians to a new generation of state and local curriculum and assessment specialists to teachers—who can be skillfully involved in the development, administration, and scoring of these assessments in valid and reliable ways.
    • Conduct high-quality research on the validity, reliability, instructional consequences, and equity consequences of these assessments.
  2. Encourage improvements in federal, state, and local assessment practice:
    • Provide incentives and funding for states to introduce high-quality state assessments that include performance components, as well as locally administered performance assessments, that evaluate critical thinking and applied skills. Support states in making such assessments reliable, valid, and practically feasible through teacher professional development and scorer training, moderated and audited scoring systems, and calibration systems, as well as research.
    • As part of these efforts, develop more appropriate assessments and accommodations for special education students and English language learners by underwriting efforts to strengthen the validity and reliability of existing performance assessments for these populations, properly adapt new assessments under construction, and create, as needed, new assessments of performance in the content areas for these students, based on professional testing standards that consider principles of universal design as well as specific needs for valid assessment of students in these groups.
    • To model high-quality items and better measure the standards, support the further development and implementation of the new blueprints, already under way, for the NAEP, which include more performance-oriented items that assess students’ abilities to evaluate evidence, solve problems, and explain and defend their ideas. These kinds of tasks were part of NAEP when it was launched in the 1960s and are common in other nations’ large-scale assessments, as well as in PISA. Their introduction would need to be incorporated carefully over time in a purposeful fashion that maintains existing trend data and continues to enable comparisons among states.
  3. Enable the incorporation of new assessments into the Elementary and Secondary Education Act accountability system:
    • Replace the current “status model” for measuring school progress with a continuous progress model that sets expectations for schools—and groups of students within them—to show progress on a set of measures that include multiple assessments of student learning, including performance measures, as well as school progression and graduation rates. In such a model, which reports information on multiple indicators, states could choose to include subject areas beyond reading and mathematics, such as writing, science, and history, which are important in their own right and essential to encourage and evaluate students’ literacy skills as they are applied in the content areas. Within a given subject, an index could accommodate assessments of student learning that capture a wider array of skills, including the more complex inquiry and problem-solving skills demanded by twenty-first-century colleges and jobs.

CONCLUSION

Current accountability reforms are based on the idea that standards can serve as a catalyst for states to be explicit about learning goals, and the act of measuring progress toward meeting these standards is an important force toward developing high levels of achievement for all students. However, an on-demand test taken in a limited period of time on a single day cannot measure all that is important for students to know and be able to do. As described by Achieve (2004), a national organization of governors, business leaders, and education leaders, the limitation of traditional on-demand tests is that they cannot measure many of the skills that matter most for success in the worlds of work and higher education:

States . . . will need to move beyond large-scale assessments because, as critical as they are, they cannot measure everything that matters in a young person’s education. The ability to make effective oral arguments and conduct significant research projects are considered essential skills by both employers and postsecondary educators, but these skills are very difficult to assess on a paper-and-pencil test. (p. 3)

Balanced systems of assessment that include performance assessments have the potential to strengthen curriculum and instruction by evaluating the full range of standards in valid and appropriate ways, providing rich information about student learning that is useful to classroom teachers, and providing diverse means for students to demonstrate their learning. Developed carefully and used properly, such assessments can stimulate more thoughtful teaching, become an engine for ongoing improvement and professional development, and create a commitment to standards that shape more powerful learning.
