Chapter 3
Where We Are Now: Lessons Learned and Emerging Directions

Raymond Pecheone and Stuart Kahl
with the Assistance of Jillian Chingos and Ann Jaquith

Chapter 2 described past efforts by states to use performance assessments; this chapter describes current state efforts to use performance assessment in large-scale state accountability systems. The states highlighted here present a window into promising assessment practices currently in place that can help shape the development of the next generation of assessment in this country. These pioneering efforts offer insight into the challenges and opportunities of using performance measures within the context of state assessment policy. The lessons described here are intended to guide the development and implementation of statewide performance assessment components of high quality and utility.

As the first two chapters have noted, the use of performance measures is important because the nature of large-scale assessments significantly affects the attitudes, behaviors, and practices of students and teachers (Shepard, 2002; Wood, Darling-Hammond, Neill, & Roschewski, 2007; Coe, Leopold, Simon, Stowers, & Williams, 1994). Research in the early 1990s showed that reliance on all-multiple-choice tests in a high-stakes environment negatively affected instruction by reducing the complexity of task demands and the opportunities for students to develop and demonstrate certain thinking and performance skills (Cizek, 2001; Wilson, 2004; Conley, 2010; Flexer, 1991; Hiebert, 1991; Koretz, Linn, Dunbar, & Shepard, 1991; Madaus, West, Harmon, Lomax, & Viator, 1992; Shepard et al., 1995).

In fact, the multiple-choice and constructed-response forms of a question often tap different skills. For example, there is an important difference between actually solving a quadratic equation and using the lower-level prealgebra skill of substituting answer options in the equation to identify the correct answer. Likewise, there is a difference between drawing and justifying one’s own conclusions after reading a passage and picking the best conclusion from a set of four multiple-choice options. One of many lessons we have learned during the age of high-stakes statewide testing is that what gets tested is what gets taught.

These tests fall short when it comes to preparing students for postsecondary programs and the twenty-first-century workplace (National Association of State Boards of Education, 2009; Schleicher, 2009; Alliance for Excellent Education, 2009; National Center on Education and the Economy, 2007). Many high school students are bored in school and not motivated to learn (Quaglia Institute, 2008), resulting in disturbing dropout rates—a leading indicator that educational approaches and testing practices need reform. Furthermore, many high-stakes statewide accountability programs currently use assessment instruments reminiscent of the minimal competency and basic skills tests employed twenty-five years ago (Tucker, 2009). Given that what gets tested is what gets taught, it is problematic that most current tests, and the teaching they encourage, neither require nor offer students the opportunity to demonstrate twenty-first-century skills such as critical thinking, problem solving, and communication (Wood et al., 2007; Shepard, 2002).

Concern about American students’ low levels of engagement, as well as the perception that many high school graduates lack necessary skills, has led to heightened interest in curriculum-embedded performance assessment, an approach to assessment that many believe is better suited to measuring these higher-order skills (Wood et al., 2007; Tucker, 2009). A growing number of policymakers believe that future educational reforms must improve educational practice and raise performance standards so that all students graduate college and career ready. It is anticipated that in the next generation of assessment, performance assessments will play a vital role in measuring the higher-order skills cited as critical to college and career success (Conley, 2010).

A CONTEXT FOR CONSIDERING PERFORMANCE ASSESSMENT

When people speak of performance assessment today in the context of twenty-first-century skills, they are often referring to more substantial activities—either short-term, on-demand tasks or curriculum-embedded, project-based tasks that yield reliable and valid scores. The most common example of such performance assessment in education is a directed writing assessment—administration of writing prompts—that requires students to produce essays or other forms of extended student writing. Other scorable products or performances could include responses to constructed-response questions following some activity, research reports, oral presentations, and debates. For the purposes of this chapter, performance measures are defined as an opportunity for students to show how they can apply their knowledge and skills in disciplinary and interdisciplinary tasks focused on key aspects of academic learning.

Large-scale performance assessment is not new. In the late 1980s, dissatisfaction with off-the-shelf tests not designed for evaluation of school programs led states to undertake customized, statewide testing programs (Hamilton, Stecher, & Yuan, 2008; Kahl, 2008). Heavily influenced by curriculum experts, many of these programs included constructed-response items, performance tasks, and portfolios. These performance components were considered “authentic assessments” in that they were intended to engage students in “real-world” activities that they might encounter outside school (Wiggins, 1998).

In states with the greatest emphasis on authentic assessment, teachers made extensive use of released constructed-response questions and performance tasks (Khattri, Kane, & Reeve, 1995; Koretz, Barron, Mitchell, & Stecher, 1996; Coe et al., 1994; see also chapter 2, this volume). Through professional development, teachers learned the value of evaluating actual student work for informing instructional practice. They also gained the competency to use (and even develop) scoring rubrics. Companies providing supplementary curriculum materials supported these efforts by producing materials that addressed higher-order thinking skills.

Although authentic assessment was curtailed by the every-child, every-year testing requirements of No Child Left Behind (NCLB), some states continued to use open-ended assessments, and others have since begun to return to these practices. The next section profiles promising assessment practices that are part of current state assessment systems. We focus in particular on those that include student performance components—that is, measures affording the opportunity for students to show how they can apply their knowledge and skills in disciplinary and interdisciplinary tasks focused on standards of academic learning.

BUILDING ON CURRENT STATE PERFORMANCE ASSESSMENT MODELS

In this section, we review state performance assessments used for a range of purposes, from offering alternative approaches to high-stakes state assessments to formative assessments designed to improve instructional practice and student learning. Performance assessment practices in states can include on-demand, constructed-response items; real-world, classroom-embedded performance tasks; and collection of student work through portfolios. In addition, performance tasks can be complex projects spanning days or weeks to complete, such as science experiments, student-designed disciplinary research inquiries, and assembly and interpretation of evidence about an historical question.

We offer a description of current state assessment practices in four individual states—Connecticut, New Jersey, New York, and Washington—and four more that are part of the New England Common Assessment Program (NECAP)—Maine, New Hampshire, Rhode Island, and Vermont—to illustrate the role performance assessments play in operational state accountability systems. (Table 3.1 provides an overview.) We especially highlight secondary education, because it is the pathway to college and career success and represents the primary focus of the national debate on rethinking accountability practices.

Table 3.1 State Use of Performance Assessments at the Secondary Level

States | High School Assessments (Percentages Based on Number of Items) | Assessment Graduation Requirements
Connecticut
Connecticut Academic Performance Test (CAPT)
  • Mathematics (grade 10): 25% constructed response (CR), 75% multiple choice (MC)
  • Science (grade 10): 20% CR, 92% MC
  • Reading for information (grade 10): 33% CR, 67% MC
  • Response to literature (grade 10): 100% open ended
  • Writing (grade 10): 70% essays, 30% MC
CAPT results must be included in district-generated graduation requirements, but may not be the sole basis of a graduation decision. Generally districts mandate a score of at least proficient (level 3 of 5) on the writing and mathematics sections.
New Jersey
High School Proficiency Assessment (HSPA)
  • Language arts literacy (grade 11): variety of MC, CR, and performance-based tasks, including speaking
  • Mathematics (grade 11): 17% CR, 83% MC
Students must pass either HSPA or SRA in both language arts literacy and mathematics.
Special Review Assessment (SRA)
  • Performance assessment tasks (PATs) completed in the area(s) in which student did not pass the HSPA
End-of-course examinations
  • Biology/life science: approximately 6% CR, 94% MC
  • Performance assessment prompt field-tested in 2008 and 2009
  • Algebra I: 4% CR, 11% short answer, 85% MC
New York
Regents Examinations (end-of-course assessments)
  • Comprehensive English: essay and MC; number varies
  • Global history and geography: essay, CR, MC
  • US history and government: essay, CR, MC
  • Mathematics B: 41% CR, 49% MC
  • Mathematics A, integrated algebra: 23% CR, 77% MC
  • Geometry: 26% CR, 74% MC
  • Biology: 33% CR, 67% MC
  • Chemistry: 38% CR, 62% MC
  • Earth science: performance-based assessment and written test (41% CR, 59% MC)
  • Languages other than English (French, German, Hebrew, Italian, Latin, Spanish): speaking, CR, MC
Students must pass commencement-level Regents Examinations with a score of 55–64 to qualify for a local diploma, or 65 or higher for a Regents diploma, in:
  1. Comprehensive English
  2. Mathematics
  3. Global history and geography
  4. US history and government
  5. Science
Schools in the New York Performance Standards Consortium are authorized to substitute a portfolio of tasks that includes written products and oral defenses in all areas but English.
Washington
Washington Assessment of Student Learning (WASL)
  • Reading (grade 10): 5% CR, 24% short answer, 70% MC
  • Mathematics (grade 10): 10% CR, 26% short answer, 64% MC
  • Writing (primarily grade 10): 100% CR
  • Science (grade 10): 7% CR, 26% short answer, 67% MC
All students must earn a Certificate of Academic Achievement (CAA) by meeting learning standards in reading, writing, mathematics, and science on the WASL or through an approved CAA option, which may include:
  1. SAT, ACT, or AP scores
  2. Sets of classroom work samples
  3. Grades in specified math or ELA classes compared with grades of students who passed the test
District-determined assessments. State provides models for performance tasks in:
  • Social studies
  • Health and fitness
  • Arts
  • Civic assessment, grade 11 or 12
Students must also complete a culminating project.
New England Common Assessments Program (NECAP): Maine, New Hampshire, Rhode Island, Vermont
  • Writing (grade 11): essay questions
  • Reading (grade 11): 14%–20% CR, 80%–85% MC (’08)
  • Mathematics (grade 11): 11% CR, 44% short answer, 44% MC
  • Science (grade 11): inquiry task and exams, 11% CR, 89% MC
In Rhode Island: Each district determines proficiency-based graduation requirements in the six core academic areas. NECAP exams must count as one-third of their total assessment in English and mathematics. The other measures must include at least two additional performance-based diploma assessments.
In New Hampshire and Maine: NECAP scores may be used to contribute to the graduation decision but must be accompanied by evidence from classroom work and performance assessments.

Connecticut

Connecticut has two assessment programs: the Connecticut Mastery Test (CMT), which assesses reading, writing, and mathematics of students in grades 3 through 8, and the Connecticut Academic Performance Test (CAPT), which assesses reading, writing, mathematics, and science in grade 10 (Connecticut State Department of Education, 2009). The CAPT includes performance-based components in each subject area.

By scoring student results relative to established state goals in each content area, CAPT is designed to measure progress toward the educational goals reflected in the Connecticut curriculum frameworks. Moreover, by statute, CAPT must be included as one indicator of performance to support a graduation decision but cannot be used as the sole criterion for graduation; CAPT scores must be combined with other “measures of successful course completion” (Connecticut State Department of Education, 2009, p. 26).

The CAPT assessment was designed as a balanced assessment with multiple-choice, constructed-response, and curriculum-embedded performance tasks to assess student content knowledge. Performance-based components are scored by teachers who are trained for the task and include a variety of item types and tasks:

  • Reading. The Reading across the Disciplines section of CAPT has two tests that assess students’ reading abilities: Response to Literature and Reading for Information. The Response to Literature test asks students to read a short story and respond to a series of essay questions requiring them to describe, interpret, connect to, and evaluate the story. The Reading for Information test “assesses a student’s ability to independently read, thoroughly comprehend, and thoughtfully respond to three authentic nonfiction texts” (Connecticut State Department of Education, 2006, p. 11), which are drawn from magazines, newspapers, and journals. Students answer a combination of multiple-choice and short-answer questions about the meaning of each article and the way the author wrote it. Short-answer responses on the Reading for Information test are scored by trained scorers who have met calibration standards, using a 0-to-2-point rubric (Connecticut State Department of Education, 2006).
  • Response to Literature. The Response to Literature subtest assesses students on their ability to “independently read, thoroughly comprehend, and thoughtfully respond to one authentic fictional text through four constructed-response questions” (Connecticut State Department of Education, 2006, p. 8). Two independent readers score each of the four written responses on a six-point rubric.
  • Writing across the Disciplines. The writing assessment contains two tests that assess students’ writing abilities: Interdisciplinary Writing and Editing and Revising. In the Interdisciplinary Writing test, students are given a set of source materials (e.g., newspaper and magazine articles, editorials, charts, and graphs) representing different perspectives on an issue. They are asked to read the materials and use the information to write a persuasive piece, such as a letter to a congressperson or a letter to the editor of their local newspaper, and cite given sources as support for their argument. Students are required to take two interdisciplinary writing tests about separate issues. Each response is scored holistically. Two independent readers score responses using a six-point scale (Connecticut State Department of Education, 2006). In the Editing and Revising test, students read passages with embedded errors and answer questions to indicate corrections.
  • Mathematics. The mathematics section uses questions requiring written responses to assess students’ abilities to solve problems, communicate, compute, and estimate in four major content areas (number and quantity; measurement and geometry; statistics, probability, and discrete mathematics; and algebra and functions). Half the total points on the test come from these performance-based questions; the remainder are based on multiple-choice questions (Connecticut State Board of Education, 2009).
  • Curriculum-Embedded Science Tasks. The CAPT science section currently uses a combination of multiple-choice questions and open-ended written responses. Prior to 2007, this section also included a performance task that assessed experimentation using a hands-on laboratory activity; starting in 2007, the laboratory activity was shifted to a curriculum-embedded classroom task. Currently two real-world tasks—a laboratory activity and a science, technology, and society investigation aligned to science standards and curriculum frameworks—are given to schools in each of the science content strands for grades 9 and 10. Students are required to complete these tasks in class; they are asked to formulate a hypothesis, conduct experiments, analyze data, and write a lab report to demonstrate their ability to engage in scientific reasoning. In the on-demand component of the CAPT science assessment, the specific scientific skills and processes needed to complete the embedded assessment are independently tested through the use of constructed-response items aligned to the locally embedded performance tasks (Connecticut State Department of Education, 2007a).

This testing methodology presents an innovative approach to high-stakes assessment in science, using both curriculum-embedded performance tasks scored by classroom teachers and an on-demand, constructed-response assessment of student knowledge to independently measure student learning (Connecticut State Department of Education, 2007a). The approach balances the power of performance assessment to transform classroom practice with the need to ensure that measures of student learning are comparable and objective.

Given the recent development of the Common Core State Standards in English language arts and mathematics and the Next Generation Science Standards, Connecticut’s CAPT serves as one proof point of an assessment designed to predict college and career success. To examine the impact of student performance on CAPT and its relationship to college and workplace success, Connecticut funded a major study tracking five cohorts of tenth-grade students between 1996 and 2000 over eight years beyond high school. The study found that students scoring higher on CAPT were more likely to attend and graduate from college and were more successful in the workplace as well (Coelen, Rende, & Fulton, 2008). (An example of a Connecticut science assessment task is shown in appendix A.)

New Jersey

In New Jersey, all students must pass the High School Proficiency Assessment (HSPA) or an approved alternative in order to graduate (New Jersey Department of Education, 2005). The state has by statute developed an alternative pathway to graduation through the use of a performance-based assessment system. Students who do not pass the HSPA at its first administration in March of their junior year can retake the assessment in their senior year. In addition, students failing to meet state standards must begin remedial instruction to prepare for the performance-based Special Review Assessment (SRA).

The SRA is an “individually, locally administered, untimed, state-developed, locally scored assessment” (New Jersey Department of Education, 2008b, p. 7). Students must participate in a school-designed SRA instructional program for the content areas in which they did not meet proficiency on the HSPA prior to being administered SRA performance assessment tasks (PATs). The SRA Instructional Program is continued until the SRA teacher decides that students are successful on the required PATs. The PATs are open-ended tasks in each of the key competency areas to which students respond in writing, showing and explaining their ideas and solutions. (For an example, see appendix A.)

For content areas in which they have not scored at least 200 on the HSPA, students must successfully complete two PATs in each content area cluster or standard. Language arts literacy has two clusters, and mathematics has four standards. Selection of PATs “is based solely on the results of the student’s first HSPA administration” (New Jersey Department of Education, 2008b, p. 6). “If a student is not successful on a specific PAT, additional PATs may be administered until the student successfully completes the required number of PATs for that content area” (p. 7). In addition, to earn a diploma, “Students with disabilities who are in grade 11 . . . must participate in the HSPA or the APA [Alternate Proficiency Assessment]” (p. 12). “The Individualized Education Program (IEP) team for each student determines which assessment (HSPA or APA) the student will take for each content area addressed” and “must also determine if the student who is taking the HSPA in one or both content areas will be required to pass the HSPA in those content areas in order to graduate” (pp. 12–13).

New Jersey has been a leading state in developing and validating alternative assessment pathways to graduation. The state, through developing the SRA and APA, allows all students (including special needs students) alternative pathways to obtain a high school diploma. More important, the SRA alternatives offer diverse learners greater access to college through embedded performance measures that assess academic progress using testing formats more sensitive to various learning modalities. As a result of implementing these policies, New Jersey reports one of the highest graduation rates (83 percent) in the country, a rate that encompasses all racial and ethnic groups.

Although concerns were recently raised about the reliability and comparability of the locally managed and scored assessment results, the performance assessments were maintained after a formal state review. The review reinforced the value of this approach to supporting and evaluating learning. Based on the review recommendations, a more carefully controlled process for administering and scoring the PATs was put in place (New Jersey Department of Education, 2013).

New York State

New York has a 135-year history of state-level assessment that includes both on-demand and classroom-based performance tasks (New York State Education Department, 1987, p. 18). To earn a New York State diploma, students must pass commencement-level Regents Examinations in comprehensive English, global history and geography, US history and government, mathematics, and science, in addition to completing course requirements. Different cut scores on these syllabus-based, end-of-course tests are used for a local diploma and a state Regents diploma. Alternative assessments, approved by the state, can also be used for these diplomas.

Performance-based components of the Regents Examinations include a variety of tasks. In English, students write responses to both spoken and written texts. In addition, they are asked to write an essay discussing a controlling idea within two literary texts and the authors’ use of literary elements and techniques, and, in a separate essay, “interpret a statement provided to them about some aspect of literature and write an essay using two works they have read to support their interpretation of the statement” (Shyer, 2009, p. 3). The Regents Science Examination includes a laboratory performance test completed near the end of the course and a written test with a large number of open-ended questions (Shyer, 2009). In history and social studies, students complete essays about document-based questions that require analysis of a set of documents and artifacts. (See appendix A for an example.)

Generally at least two teachers must independently rate all Regents Examinations that lead to a Regents diploma, except mathematics, which requires at least three scorers. All teachers rate exams according to the scoring key and rubrics provided by the Department of Education, which have directions for scoring multiple-choice and constructed-response questions and, if applicable, guidelines for rating the essay or performance components (New York State Education Department, Office of State Assessment, 2008). Teachers are trained to score all extended writing tasks using benchmark performances and rubrics (University of the State of New York State Education Department, 2009a, 2009b). (See the New York State history assessment task in appendix A.)

The Regents have also approved a local diploma option that allows development of equivalent academic tasks, often part of a portfolio-based system, that can be substituted for the Regents Examination. All local options must be reviewed and approved by the state department of education. For example, the New York Performance Standards Consortium, a group of twenty-seven secondary schools, has received a state-approved waiver allowing their students to complete a graduation portfolio in lieu of the Regents Examinations in all subjects but English. This portfolio includes a set of ambitious performance tasks—a scientific investigation, a mathematical model, a literary essay, a history/social science research paper, an arts demonstration, and a reflection on a community service or internship experience—that meet a set of common standards and are scored through social moderation processes using common scoring rubrics (Performance Standards Consortium, 2013).

Washington

Washington’s standards, developed in the 1990s, feature ambitious goals in a wide range of areas, including reading; writing; communication; mathematics; social, physical, and life sciences; civics and history; geography; arts; and health and fitness. The Washington Assessment of Student Learning (WASL) was developed to assess student mastery of these standards. WASL included a combination of multiple-choice, short-answer, essay, and problem-solving tasks. In addition, the Washington assessment system included classroom-based assessments in subjects not included in WASL. Although the WASL was replaced in 2009–2010 with more traditional tests (the Measurements of Student Progress in grades 3 to 8 and a High School Proficiency Exam in grades 10 to 12), the state expanded the use of classroom-based assessments, including performance assessments, to gauge student understanding of the learning standards in social studies, the arts, health/fitness, and educational technology.

Districts may use the state-designed assessments or may substitute assessments they have designed locally. They must report to the state that they are implementing the assessments or strategies in those content areas, but individual student scores are not reported. Surveys show that districts agree that the assessments support continuing focus on these areas of study; identify intentional models of instruction for students; help teachers identify areas of strength and weakness and modify instruction accordingly; and allow a more consistent standard of assessment across the state as they show what student performance should look like (Office of the Superintendent of Public Instruction, 2012). (For examples of some of these assessments in civics, economics, health and fitness, and the arts, see appendix A.)

New England Common Assessment Program

The New England Common Assessment Program (NECAP), an unprecedented state collaboration around development and design of common reference examinations, was launched by New Hampshire, Rhode Island, and Vermont, and recently expanded to include Maine. NECAP is offered in grades 3 to 8 plus grade 11 (Measured Progress, 2009). It is a hybrid assessment comprising both multiple-choice and constructed-response items, including performance tasks. In addition, each of these states supports more extended performance assessments that are designed and administered at the local level. We describe both below.

Performance-based components of NECAP include a range of standards-based constructed-response items and performance tasks. In writing, students respond to two writing prompts that are scored using a common set of rubrics. (See appendix A for an example.) In science, both constructed-response and performance items are included. The science constructed-response items “require students to respond to a question by using words, pictures, diagrams, charts, or tables to fully explain their response,” and an inquiry task asks students “to hypothesize, plan, and critique [scientific] investigations, analyze data, and develop explanations” (New England Common Assessment Program, 2009, pp. 1, 15).

All NECAP scorers are trained and go through a calibration process prior to scoring. The writing prompt is “scored by two independent readers both on quality of the stylistic and rhetorical aspects of the writing and on the use of standard English conventions” (Measured Progress, 2009, p. 8). The other constructed-response answers are scored using an item-specific rubric with score point descriptions. Common cut scores were established through representative expert committees in the NECAP states to ensure comparability across states. Professional development materials to support the NECAP assessment were developed by content specialists at the New Hampshire, Rhode Island, and Vermont departments of education in partnership with the Education Development Center and the National Center for the Improvement of Educational Assessment.

Through this collaboration, the NECAP states were able to maintain common content standards for the region; this fosters sharing of instructional and curriculum resources within and across state borders. In addition, NECAP states significantly lowered the cost of assessment while maintaining high standards of quality; these cost savings enabled the consortium to develop a more balanced assessment, including performance-based, constructed-response items, which would have strained the testing budgets of the individual states.

Collaboration among states in test design and administration could become a model for the country in developing innovative assessments that meet high standards of test quality, measure a broad range of skills and abilities, and are administratively feasible and cost-effective. By pooling resources, states are better able to afford the development of richer performance measures designed to address the skills and abilities needed to be college and workplace ready in the twenty-first century. Finally, state-based consortia can promote and support development of regional learning networks, which enable teachers and administrators to share promising practices and resources across states. These increasingly include networks that are designed to strengthen the local performance assessment systems that each state has previously nurtured individually. The Quality Performance Assessment (QPA) network, for example, is enabling schools, districts, and states across New England to strengthen local assessment systems by introducing common tasks that evaluate critical thinking, problem-solving, and communication skills and engaging school faculties in moderated-scoring processes (Center for Collaborative Education, 2012). (See chapter 7 in this volume.)

Maine and New Hampshire

Maine and New Hampshire established policies that include and encourage using performance assessment in conjunction with their large-scale state accountability systems. In Maine, local assessments are organized around the state’s learning results in eight areas: English, mathematics, science, social studies, health and physical education, career preparation, visual and performing arts, and world language. The state offers extensive professional development to local districts in developing common performance tasks, rubrics, portfolios, and exhibitions of student work.

New Hampshire passed legislation to develop a competency-based system for graduation that no longer relies on Carnegie units (New Hampshire Department of Education, 2005). The system uses a “mastery of learning” approach in which students earn high school credit through course-based performance assessments completed both in and out of school. To ensure its students’ preparation for college and careers, New Hampshire has launched a new system of assessments that is tightly connected to curriculum, instruction, and professional learning. In addition to the Smarter Balanced Assessments in English language arts and mathematics, to be adopted in 2014–2015, this system will include a set of common performance assessments of high technical quality in the core academic subjects; locally designed assessments with guidelines for ensuring quality; regional scoring sessions and local district peer review audits to ensure sound accountability systems and high interrater reliability; a web-based bank of local and common performance assessments; and a network of practitioner “assessment experts” to support schools. (See chapter 10, this volume.)

The state’s view is that a well-developed system of performance assessments that augment the traditional tests will drive improvements in teaching and learning, as they “promote the use of authentic, inquiry-based instruction, complex thinking, and application of learning . . . [and] incentivize the type of instruction and assessment that support student learning of rich knowledge and skills” (New Hampshire Department of Education, 2013, p. 9). The system will also offer a strategic approach for building the expertise of educators across the state by organizing professional development around the design, implementation, and scoring of these assessments, which model good instruction and provide insights about teaching and learning.

Like Connecticut, both Maine and New Hampshire have put in place assessment policies requiring that high school graduation decisions cannot be based solely on state high school examinations. Rather, examination results must be used in conjunction with other performance measures, among them curriculum-embedded performance tasks, portfolios, and other locally determined graduation indicators.

Rhode Island

To earn a high school diploma in Rhode Island, all students are required to demonstrate proficiency on both the NECAP and a locally developed school-based portfolio. Student portfolios for graduation must include a “composite measure of each student’s overall proficiency for graduation in the six core academic areas” locally developed and validated in each district (Rhode Island Board of Regents for Elementary and Secondary Education, 2008). Student results on the NECAP examinations count as one-third of the total assessment in English, mathematics, and science “as designated by the Board of Regents,” and the local system must include “at least two additional performance-based diploma assessments” in other subject areas (2008, L-6–3.3).

Districts must include in their local assessment system “a combination of at least two of the following performance-based assessments: graduation portfolios, exhibitions, comprehensive course assessments, or a combination thereof, such as a Certificate of Initial Mastery” (Rhode Island Board of Regents for Elementary and Secondary Education, 2008, L-6–3.2). Schools must develop a review process to score performance-based diploma assessments at the local level. For exhibitions, portfolios, or Certificates of Initial Mastery to be considered part of the schoolwide diploma assessment, schools must meet state requirements such as supplying “sufficient evidence” and “using valid and reliable rubrics and/or an independent review process for each entry in a portfolio” (Rhode Island Department of Education & Education Alliance at Brown University, 2005b, p. 2). Teachers involved in portfolio scoring must be trained and meet calibration standards on the rubric in order to reliably score student work.

To ensure “opportunity to learn,” the state department guidelines require that “existing course offerings must now give students frequent opportunities to practice applying their skills and knowledge” (Rhode Island Department of Education & Education Alliance at Brown University, 2005a, p. 4). These guidelines ensure that courses prepare students “for the more formal demonstrations of proficiencies necessary to earn a diploma. Naturally, high school courses will also continue to administer routine assessments such as tests, quizzes, papers, labs and so forth” (Rhode Island Department of Education & Education Alliance at Brown University, 2005a, p. 4).

The Rhode Island approach puts teachers at the heart of the assessment process and teacher scoring as the basis for judgments of student learning. Rhode Island is breaking new ground in developing a balanced assessment system that takes into account school-based portfolio assessments, in combination with an on-demand standardized assessment (NECAP). In addition, use of a judgmental weighting system to combine information from standardized assessment (NECAP) and portfolio scores illustrates one approach to developing a composite score employing both performance and standardized test data to support a graduation decision.
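To make the weighting concrete, the following is a minimal sketch in Python, assuming hypothetical component names, score scales, and cut points; Rhode Island districts set the actual combinations locally, so everything in the example beyond the one-third NECAP weight is an illustrative assumption.

```python
# Minimal sketch of a judgmentally weighted composite for a graduation decision.
# The 0-100 score scales below are hypothetical; only the one-third NECAP weight
# reflects the policy described in the text.

def composite_score(necap_score, performance_scores, necap_weight=1/3):
    """Combine a standardized test score with performance-assessment scores.

    necap_score        -- NECAP result rescaled to 0-100 (hypothetical scale)
    performance_scores -- list of 0-100 scores from portfolios, exhibitions, etc.
    necap_weight       -- share of the composite carried by NECAP (one-third here)
    """
    if len(performance_scores) < 2:
        raise ValueError("At least two performance-based diploma assessments are expected.")
    performance_avg = sum(performance_scores) / len(performance_scores)
    return necap_weight * necap_score + (1 - necap_weight) * performance_avg

# Example: NECAP counts for one-third; a graduation portfolio and an exhibition
# share the remaining two-thirds equally.
score = composite_score(necap_score=72, performance_scores=[85, 78])
print(f"Composite: {score:.1f}")  # (1/3)*72 + (2/3)*81.5 = 78.3
```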

Vermont Local Comprehensive Assessment System

Vermont was an early pioneer in using embedded classroom assessments for accountability and to guide curriculum development. As a result of NCLB requirements, these assessments were replaced by NECAP for federal accountability purposes and became instead part of Vermont’s School Quality Standards mandate. The standards mandate requires each school to develop a local comprehensive assessment system “aligned with the Vermont Framework of Standards and Learning Opportunities [that] is consistent with the Vermont Comprehensive Assessment System, adopted by the State Board of Education in November 1996” (Vermont State Board of Education, 2006).

Currently, each school’s local comprehensive assessment system must assess students in the required standards not covered by the state assessment (Vermont State Board of Education, 2006). The state furnishes a variety of assessment tools that schools may use in developing their local comprehensive assessment system. For example, in the content areas of mathematics and writing, the state offers benchmarks, rubrics, calibration materials, and data analysis tools to use mathematics and writing portfolios as effective local classroom assessments. According to the deputy commissioner of education and the director of standards and assessment, the local assessment provisions of the school quality standards are intended to place “classroom assessment at the core of the assessment system—common grade, team, school, and state assessments would round out the Local Comprehensive Assessment System” (Pinckney & Taylor, 2006, p. 1).

The materials, items, and tasks supplied by the state for optional use are predominantly performance based. (See an example of a Vermont information technology task in appendix A.) In addition, the Department of Education reviews district-based assessment systems and gives specific guidance to teachers and other educators responsible for scoring common assessments (M. Hock, personal communication, September 17, 2009). For example, districts “need to use common, agreed upon criteria for student expectations; scoring scales or rubrics; and benchmark performances in order to make consistent judgments about the quality of student work” (Vermont Department of Education, n.d.-a).

Although the NECAP assessment is used as the primary pathway for Vermont students to earn a diploma, they also can earn a diploma through meeting the requirements of a performance-based local option. A student “meets the requirements for graduation if, at the discretion of each secondary school board: ‘The student demonstrates that he or she has attained or exceeded the standards contained in the Framework or comparable standards as measured by results on performance-based assessments’” (Vermont Department of Education, 2006).

Over almost two decades, Vermont’s leadership in performance assessment has created a collaborative professional culture around curriculum and instruction that engages teachers in principled discussions about the quality of student work. Here is how Richard Murnane, a Harvard professor, described the conversation of Vermont teachers who came together to score student portfolios: “Often heated, the discussion focused on what constitutes good communication and problem solving skills, how first-rate work differs from less adequate work, and what types of problems elicit the best student work” (Murnane & Levy, 1996). Formal school-based structures designed to bring teachers together to discuss student work not only serve to deepen teacher knowledge of student skills and abilities but can change how professional development is practiced in schools and districts. Teacher-led discussions of student work are often cited as the best and most consequential professional development that can lead to higher student achievement (Wei, Darling-Hammond, Andree, Richardson, & Orphanos, 2009).

PROMISING AND EMERGING ASSESSMENT PRACTICES

There are many innovative assessment projects currently under way that are worth noting because they illustrate the lessons that have been learned from previous and ongoing efforts. This section highlights three such initiatives that focus on higher-order thinking skills and are designed to predict college and workplace readiness: the College and Work Readiness Assessment (CWRA), the College Readiness Performance Assessment System (C-PAS), and the Ohio Performance Assessment Pilot Project (OPAPP). These projects illustrate three related but different approaches to measuring higher-order thinking: (1) using on-demand, computer-adapted, constructed-response items (CWRA); (2) assessing cognitive strategies that enable college-bound students to learn, understand, retain, use, and apply content from a range of disciplines (C-PAS); and (3) developing rich course- and curriculum-embedded projects in the core academic disciplines that can be combined with high-stakes state accountability measures to support a high school graduation decision (OPAPP).

College and Work Readiness Assessment

The CWRA is a high school senior version of the Collegiate Learning Assessment (CLA), described in chapter 2 (Klein, Benjamin, Shavelson, & Bolus, 2007; Klein, Freedman, Shavelson, & Bolus, 2008; Shavelson, 2008). Both assessments were developed at the Council for Aid to Education.

The CLA was developed to measure undergraduates’ learning—in particular, their ability to think critically, reason analytically, solve problems, and communicate clearly. The assessment comprises performance tasks and critical writing components. The performance task component presents students with a real-world problem and an in-basket of information and asks them to either solve the problem or recommend a course of action based on the furnished evidence. The analytic writing prompts ask students either to take a position on a topic or to critique an argument. A ninety-minute, entirely constructed-response exam, the CLA is delivered over the Internet. The assessment focuses on performance at the institution level or on performance at the program level within an institution. Institution or program-level scores are reported as observed performance and as value added beyond what would be expected from entering student SAT scores.

Both the CLA and its high school counterpart, the CWRA, differ substantially from most other standardized tests, which are based on an empiricist philosophy and a psychometric/behavioral tradition. From this stance, everyday complex tasks are divided into components, and each component is analyzed to identify the abilities required for successful performance. For example, suppose that components such as critical thinking, problem solving, analytical reasoning, and written communication are identified. A separate measure of each component would then be constructed, and students would take each test. At the end of testing, student scores would be added up to construct a total score to describe their performance—not only on the assessment at hand but also generalizing to a universe of complex tasks similar to those the tests were intended to measure.

The conceptual underpinnings of the CLA and CWRA are embodied in what has been called a criterion sampling approach to measurement. This approach assumes that the whole is greater than the sum of its parts and that complex tasks require an integration of abilities that cannot be captured if divided into and measured as individual components. The criterion-sampling notion is straightforward: if you want to know what a person knows and can do, sample tasks from the domain in which she is to act, observe her performance, and infer competence and learning. For example, if you want to know whether a person not only knows the laws that govern driving a car but can also actually drive a car, do not just give her a multiple-choice test. Also administer a driving test with a sample of tasks from the general driving domain (starting the car, pulling into traffic, turning right and left in traffic, backing up, parking). On the basis of this sample of performance, it is possible to draw more general, valid inferences about driving performance.

The CLA/CWRA follows the criterion-sampling approach by defining a domain of real-world tasks that are holistic and drawn from life situations. It samples tasks and collects student-generated responses that are modified with feedback as the task is carried out. (For an example of a CLA task, see appendix B.) The skills tested by these assessments do indeed appear to be related to college success. Research shows that high school grade point average (GPA) combined with scores on CWRA performance tasks offers as good a measure as high school GPA combined with SAT scores for predicting a student’s first-year college GPA, and a better measure for predicting cumulative college GPA (Klein et al., 2009).
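The kind of predictive comparison reported by Klein et al. (2009) can be illustrated with a short sketch: fit one linear model predicting college GPA from high school GPA plus a CWRA-style task score, fit another using high school GPA plus an SAT score, and compare the variance each explains. The data below are simulated and the coefficients invented purely for illustration; nothing here reproduces the actual study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (not real) student records.
hs_gpa = rng.normal(3.0, 0.4, n)
cwra   = rng.normal(1100, 150, n)                 # performance-task score
sat    = rng.normal(1500, 200, n)                 # admissions test score
college_gpa = (0.6 * hs_gpa + 0.0012 * cwra + 0.0002 * sat
               + rng.normal(0, 0.3, n))           # invented relationship

def r_squared(predictors, outcome):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(outcome))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    residuals = outcome - X @ beta
    return 1 - residuals.var() / outcome.var()

print("HS GPA + CWRA:", round(r_squared([hs_gpa, cwra], college_gpa), 3))
print("HS GPA + SAT :", round(r_squared([hs_gpa, sat], college_gpa), 3))
```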

The CWRA follows a similar approach. Its twofold mission is to (1) improve teaching and learning by using performance tasks to connect classroom practice with authentic institutional assessment and (2) evaluate student readiness to do college work. It is based on the same structure as the CLA but adapted to test high school seniors’ critical thinking, analytical reasoning, problem solving, and communication abilities. The CWRA is designed to provide greater access to the test content by adjusting the reading level to make it appropriate for a range of high school students. The adapted, open-ended performance tasks are administered in conjunction with other standardized tests of critical thinking to produce reliable individual student scores.

If these assessments are to encourage productive teaching, it is crucial to work with teachers to offer a model for how curricular and pedagogical interventions can help students develop these higher-order skills. CWRA offers workshops called “Performance Task Academies” to provide hands-on training grounded in the literature on learning theory, critical thinking, and authentic assessment. The academies show teachers how to create classroom projects that are a hybrid of case studies and performance-based learning, with a special focus on higher-order skills. The Performance Tasks Library is a teacher-created teaching resource.

The College Readiness Performance Assessment System

C-PAS is another tool designed to evaluate the college readiness of high school students. Developed by David T. Conley, director of the Educational Policy Improvement Center at the University of Oregon, C-PAS is a response to the problem that current tests do not effectively gauge students’ ability to apply the learning strategies they will be expected to demonstrate in entry-level college courses and beyond. As American secondary school classrooms attempt to become more data driven and as more high school students set college as their goal, it is critical that teachers have the right set of data to enable them to make instructional decisions that prepare their students for postsecondary education and that students have a clearer picture of their readiness for college courses. C-PAS is designed to supply this type of information to teachers and students to ensure that high school instruction leads to college readiness for all students.

C-PAS is a series of curriculum-embedded performance tasks that teachers administer within the context of their curriculum and score with a common scoring guide. The result is a performance profile for each task composed of scores on up to five key cognitive strategies measuring students’ ability to reason, solve problems, interpret information, conduct research, and generate work with precision and accuracy. The teacher separately grades the task for inclusion as a component of the course grade, thereby increasing student engagement in the task. The tasks are carefully designed to encourage student development of the key cognitive strategies that research identifies as important elements of entry-level college courses. (For an example of a C-PAS task, see appendix B.)

C-PAS development is rooted in psychometric principles and practices in order to achieve a high degree of technical adequacy, which helps ensure that the scores generated are valid and accurate indicators of student development of key cognitive strategies associated with college success. This is achieved in a number of ways. The five key cognitive strategies are carefully analyzed using item response theory to determine the degree of interaction among them and establish task difficulty. Scoring guides are refined so that they focus on the key attributes of each cognitive strategy. All task writers are carefully selected and then trained to use task shells to ensure the structural similarity of all tasks and minimize task variance on extraneous dimensions. Teachers must follow common conditions of administration when introducing the tasks in class. Finally, after teachers score their students’ tasks using the common scoring guides, master scorers rescore a subset of the tasks to ensure consistency of teacher scoring.
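One way to operationalize the rescoring audit described above is to compare teacher scores with master-scorer scores on the audited subset, reporting exact agreement alongside a chance-corrected index such as quadratically weighted kappa. The sketch below uses invented scores on a 0-to-5 rubric; C-PAS’s actual scales and audit procedures may differ.

```python
import numpy as np

def exact_agreement(a, b):
    """Proportion of audited tasks where teacher and master scorer agree exactly."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def quadratic_weighted_kappa(a, b, n_levels=6):
    """Chance-corrected agreement for ordinal scores 0..n_levels-1."""
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    levels = np.arange(n_levels)
    weights = (levels[:, None] - levels[None, :]) ** 2 / (n_levels - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Invented teacher scores and master-scorer rescores on a 0-5 rubric.
teacher = [3, 4, 2, 5, 3, 1, 4, 0, 2, 3]
master  = [3, 4, 3, 5, 2, 1, 4, 1, 2, 3]
print("Exact agreement:", exact_agreement(teacher, master))          # 0.7
print("Weighted kappa :", round(quadratic_weighted_kappa(teacher, master), 2))
```

In practice, a program might also set a threshold (for example, a minimum weighted kappa) that triggers rescoring or retraining when teacher and master scores diverge too far.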

Ohio Performance Assessment Pilot Project

The goal of the Ohio Performance Assessment Pilot Project (OPAPP) is to contribute to developing a world-class assessment that raises learning expectations for all students, is balanced, and uses a multiple measures approach to assessment and accountability. One important purpose of this approach is to support improvement in instructional practice and align student achievement to college readiness standards and international benchmarks of student performance. The initial phase of the project focused on developing curriculum-embedded, teacher-managed, rich performance tasks that are content focused, skills driven, and aligned to college and workplace readiness standards.

Course-embedded performance assessment tasks measure the content knowledge and skills learned in eleventh- and twelfth-grade courses. Students complete tasks in and out of class over a period of one to four weeks as an embedded part of course curriculum. Teachers administer the tasks under the supervision of their districts and state coaches. Content-area and state department curriculum experts develop each task in consultation with teachers, higher education faculty, and national content experts.

Performance outcomes are the content-specific knowledge and skills, described by content experts, that are needed for college and career success in the twenty-first century. Ohio teachers, higher education faculty, and state curricular experts arrive at a consensus as to the relative importance and validity of these outcomes. The performance outcomes are a set of high-leverage content and skills aligned with the Ohio content standards and the national Common Core standards, with national content standards (such as those of the National Council of Teachers of Mathematics and the National Council of Teachers of English), with college readiness standards (Conley, 2007), and with international benchmarks. Explication of performance outcomes serves as the blueprint for designing the course-embedded performance assessment tasks.

Common scoring rubrics are a set of evaluative criteria aligned with the performance outcomes, designed to assess all performance tasks within a specific disciplinary focus (e.g., scientific inquiry and investigation, mathematics problem solving, English language arts inquiry, and communication).

The scoring system is based on a set of training protocols and benchmarks designed to ensure high scorer reliability. In addition, the project has developed moderation procedures to audit local teacher scores and a data feedback loop to give districts and schools formative information to improve teaching and learning.

OPAPP is designed to support consequential decisions; the design of the scoring is therefore grounded in psychometric principles and practices to support the validity of the measure. A high degree of technical adequacy is needed to ensure that the scores generated are credible and defensible and that they measure important dimensions of learning associated with college and career success. This is accomplished in part by using item response theory to examine interactions among the scored dimensions and to establish estimates of student ability and task difficulty.
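As a simplified illustration of how item response theory yields the ability and difficulty estimates mentioned above, the sketch below fits a basic Rasch (one-parameter logistic) model to simulated right/wrong task data by maximizing a joint log-likelihood. Operational programs such as OPAPP would use polytomous models, professional calibration software, and real response data; every number and name here is an assumption for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_students, n_tasks = 100, 5

# Simulated "true" parameters, used only to generate dichotomous task outcomes.
true_ability = rng.normal(0.0, 1.0, n_students)
true_difficulty = np.linspace(-1.5, 1.5, n_tasks)
p_correct = 1 / (1 + np.exp(-(true_ability[:, None] - true_difficulty[None, :])))
responses = (rng.random((n_students, n_tasks)) < p_correct).astype(float)

def neg_log_likelihood(params):
    """Joint Rasch likelihood: P(correct) = logistic(ability - difficulty)."""
    theta = params[:n_students]          # student abilities
    b = params[n_students:]              # task difficulties
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    eps = 1e-9                           # guard against log(0)
    return -np.sum(responses * np.log(p + eps)
                   + (1 - responses) * np.log(1 - p + eps))

result = minimize(neg_log_likelihood, np.zeros(n_students + n_tasks),
                  method="L-BFGS-B")
est_difficulty = result.x[n_students:]
est_difficulty -= est_difficulty.mean()  # fix the model's location indeterminacy
print("Estimated task difficulties:", np.round(est_difficulty, 2))
```

The centering step corresponds to anchoring the scale; teacher-scored polytomous tasks would call for a partial-credit or graded-response model rather than this dichotomous simplification.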

The performance assessments offer a rich, authentic measure of higher-order thinking and focus on discipline-specific thinking skills deemed necessary for college and career success. Assessment results supply formative information to teachers about student achievement of key performance outcomes, as well as credible, defensible evidence that can be used as part of a summative high-stakes accountability decision. In addition, the approach is designed specifically to create a two-way flow of information and engagement from the classroom level to the school and district, and from the state and systems level back into the classroom. These performance tasks are aligned with state curriculum frameworks and course syllabi and modeled after proven assessment practices that are already in place in high-performing countries worldwide.

Development of the course-embedded content knowledge and skills assessments is designed to fit the redesign of the Ohio accountability system as a component of an end-of-course examination system. Evidence and scores from course-embedded, performance-based assessments are specifically designed for use in combination with state-developed end-of-course exams. The reference exams include on-demand multiple-choice items as well as constructed-response items such as essays and problem solutions, and they are complemented by curriculum-embedded, extended performance tasks that may require more extended writing, research, and inquiry. The tasks will be constructed by high school faculty and college faculty under the leadership of the state department of education and are intended to inform the grade given to the student in the course as well as to combine with end-of-course exams to support high school graduation decisions.

The assessments may also be used to satisfy Ohio’s senior project requirement. The senior project is defined to include a variety of formats. One format might be a single project in an area of deep interest to students; a second could be a graduation portfolio that includes performance assessments in a selected number of content areas—for example, subject areas chosen by the student, as is common in other countries. (See chapter 4, this volume.) Students demonstrating mastery in additional content areas could receive diploma endorsements (“merit badges”) recognizing their outstanding achievement. These endorsements of accomplishment could be taken into account as part of a student’s application for college or in conjunction with a placement exam used by colleges to determine course eligibility. (See the Ohio Performance Assessment Task in appendix B.)

LESSONS LEARNED FROM CURRENT AND EMERGING PERFORMANCE ASSESSMENTS

Both past and current performance assessment efforts have taught us many lessons, which we treat in the sections that follow. With an eye toward including performance assessments as a component of accountability measures for both individual and group results, these lessons suggest ways to strike a balance between local and centralized responsibilities so as to ensure the high quality of products (the tasks or projects, associated materials, and measurement attributes), while also attending to the feasibility, in time and money, of taking performance assessment to scale.

Task Quality

Traditional state assessments typically emerge from a development process that involves multiple reviews by curriculum and assessment experts, field testing, psychometric analyses, and further review and revision. But several programs with performance components rely on local development and selection of tasks. As expected, there is varying quality in locally developed tasks. The next sections examine task quality in terms of several critical attributes: alignment to standards; rich, scorable products; and technical quality, including scoring reliability and informative score scales.

Alignment to Standards with Supports

One early problem with locally developed tasks was that they were not always closely tied to standards. In all fairness, many states did not have the kind of content standards in the early 1990s that Title I mandated a few years later; many had only general curriculum guidelines. Local curricula and teaching were also often unaligned to the content standards. This is especially important in a high-stakes environment. It is generally agreed (and affirmed by the courts) that it is inappropriate to assess students on concepts and skills—and attach negative consequences to poor performance relative to those concepts and skills—if the students do not have an opportunity to learn and perform them.

A state’s responsibility in this regard is to make clear to schools, through content standards, curriculum resources, and professional development opportunities, just what students should know and be able to do. In the early 1990s, for example, New Jersey conducted three years of “due notice testing” in conjunction with its HSPA program. This gave schools time to adjust curricula to new standards and get used to new tests before high stakes were attached to the results. Because the state planned carefully, supported implementation, and helped develop high-quality performance assessment tasks aligned to the state standards, the assessments and their supports for students were well implemented, and schools responded with stronger teaching. New Jersey was one of the few states to increase its performance on the National Assessment of Educational Progress as well as on state tests, while it increased graduation rates (Darling-Hammond, 2010). Current efforts to implement the Common Core State Standards should heed this lesson, ensuring not only that tasks are aligned to standards but also that teachers and students have opportunities to learn the expected content through strong curriculum materials and associated professional development.

Lesson: For performance assessments to be instructionally supportive and be used for accountability purposes, there must be means to ensure the tasks the students are asked to tackle are reasonable reflections of the standards they are intended to measure and the curriculum students have experienced. For this attribute and others, thoughtful assessment development followed by teacher training is essential, along with careful development of curriculum resources linked to assessment tasks.

Rich, Scorable Products

When they were introduced, performance assessments were as unfamiliar to local Kentucky educators as the state’s new standards were. Nonetheless, the Kentucky writing portfolios were of sufficient quality to be counted toward the accountability index because teachers could easily find ways to get students to generate texts that provided rich opportunities for demonstrating the standards. Although writing topics for Kentucky portfolios were never standardized or common across students, the specifications for entries were clear enough to ensure a degree of data comparability that justified continuing to use the approach for almost two decades.

The teachers were not as successful at identifying activities in mathematics that led to robust representations of the standards that were also readily scorable. This is likely because of long-standing traditions of teaching mathematics as discrete items with single answers rather than as applications of complex problem solving to real-world contexts. As a result, high-quality tasks for mathematics portfolios did not emerge readily from the curriculum and, in the aggregate, did not become as well developed as those in writing. The mathematics portfolios were ultimately replaced by performance tasks developed with the assistance of professional assessment developers in collaboration with teachers.

Unlike the assignments leading to portfolio entries in Kentucky, which were left to teachers to devise, the hands-on performance tasks administered by trained administrators sent into the Kentucky schools were developed through the joint efforts of department and contractor staff and advisory committees. Because creating useful portfolio assessments in mathematics also proved difficult in other states, such as Massachusetts and Vermont, similar strategies for developing performance tasks have been adopted elsewhere as well.

Lesson: A critical requirement for performance assessments for high-stakes, statewide programs is the need for tasks to yield rich, scorable products, closely tied to standards, that yield credible evidence of learning and represent the full range of individual student capabilities. Where such tasks are not common in classroom practice, skilled assessment developers should work with teachers to develop model tasks.

Technical (Measurement) Quality

Issues of technical quality—sometimes real and sometimes perceived—contributed to the demise of the authentic assessment movement more than a decade ago. The challenges of performance assessment were identified and exposed, but not all states made the effort to work through the technical problems of reliability and validity. Given the demands of continuous development of new tests, coordination and administration of tests statewide, and analysis and reporting, the assessment and accountability staff in the state departments of education and their contractors had more than enough on their plates. Few had time to publish and disseminate information about what they were learning—including evidence of strong technical quality, when it existed—beyond the requirements of the programs and contracts. Unsubstantiated criticism was common and often left unanswered. For example, this was clearly the case with respect to human scoring, which is relied on heavily in performance assessment.

Scoring Accuracy and Reliability

Some outspoken critics of open-ended performance assessments have argued that the process is subjective and focused on values and attitudes (Schlafly, 2001). There is a mistaken but all-too-common belief that tests requiring human scoring are inherently unreliable. True, human scorers evaluating complex student performance are not perfectly calibrated when they begin the process. However, with training and moderation, it is not uncommon to achieve high rates of interrater reliability—often 90 percent or higher (Measured Progress, 2009; New York State Education Department, 1996).
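Interrater agreement rates of the kind cited above are typically computed as the percentage of double-scored responses on which two independent scorers assign exactly the same rubric score, sometimes supplemented by an exact-plus-adjacent rate. A minimal sketch, using invented scores on a four-point rubric:

```python
# Illustrative sketch (hypothetical data): exact and adjacent interrater
# agreement for double-scored responses on a 4-point rubric.
rater_a = [4, 3, 2, 4, 1, 3, 2, 2, 4, 3]
rater_b = [4, 3, 3, 4, 1, 3, 2, 1, 4, 3]

exact = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"exact agreement:   {exact:.0%}")     # 80% for this invented sample
print(f"exact or adjacent: {adjacent:.0%}")  # 100% for this invented sample
```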

The process that states typically use to score student responses (to constructed-response questions, for example) has many elements designed to render it objective. Scorers typically do not know student names or schools. Using preestablished rubrics describing the characteristics of work earning each point value, scorers are really being asked to categorize responses. The rubrics are developed with the tasks, and both are field-tested and then improved if necessary. Scorers generally must have a background in the relevant subject area; they are trained on each test question using the rubric and many samples of student work of varying quality; they have to qualify (calibrate) by achieving a certain level of accuracy on benchmarked student responses before being allowed to score; and although they are scoring for the record, various approaches to blind double scoring are used to monitor their accuracy, with corrective action taken when necessary. All of these practices are described in “Operational Best Practices,” a document produced jointly by state testing directors and testing company experts at the request of former Secretary of Education Margaret Spellings (Association of Test Publishers & Council of Chief State School Officers, 2010).

In addition, student knowledge can often be characterized more accurately through scored performance tasks than through multiple-choice questions. For example, students can be credited with partial knowledge on the basis of their performance, which is more accurate than receiving full credit for guessing correctly on a multiple-choice question. The same level of test reliability can be achieved with eight to ten four-point constructed-response questions as with fifty multiple-choice items. This is a matter of fact, documented in the technical manuals associated with hundreds of state tests.
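The reliability figures reported in such technical manuals are typically internal-consistency statistics such as coefficient alpha. The sketch below, with an invented score matrix, shows how alpha is computed for a short test of four-point constructed-response items; it illustrates the statistic itself, not any particular state's results.

```python
# Illustrative sketch (invented data): coefficient alpha for a short test of
# four-point constructed-response items.
import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's alpha for a (students x items) score matrix."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

# eight constructed-response items, each scored 0-4, for six students
cr_scores = [
    [4, 3, 4, 3, 4, 4, 3, 4],
    [2, 2, 1, 2, 3, 2, 2, 1],
    [3, 3, 3, 2, 3, 3, 4, 3],
    [1, 0, 1, 1, 0, 1, 2, 1],
    [4, 4, 3, 4, 4, 3, 4, 4],
    [2, 1, 2, 2, 1, 2, 1, 2],
]
print(f"coefficient alpha: {coefficient_alpha(cr_scores):.2f}")
```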

The same techniques for human scoring of student responses to constructed-response questions and writing prompts are applicable to any of the scorable products resulting from performance tasks or projects. Consequently, when it comes to scoring performance assessments, expertise and experience have been established. Taking such assessments to scale is an issue of time and expense. Methods for dealing with this problem are addressed in the next section.

Lesson: Many assessment programs have demonstrated that open-ended items and tasks can be reliably scored if the tasks are designed to be comparable and well defined; if scorers are well trained with a rubric that is sufficiently clear about the features of the task to be scored; and if raters are involved in a moderation or calibration activity that provides benchmarks for each score point and feedback about the accuracy of scoring until a consistent level is reached. This work should be carefully documented. It is not enough to create and use high-quality measures; it is also necessary to convincingly demonstrate to a variety of audiences that they are indeed of high quality in terms of both what they measure and how reliably they measure it.

Informative Score Scales

Scoring should be not only reliable but sufficiently informative. Scorers of writing portfolios in the 1990s Kentucky assessment program assigned a score of 1, 2, 3, or 4 to each portfolio—scores that corresponded to the four performance levels used in the state for all subject areas. This practice was a mistake for a number of reasons. The idea of a portfolio is to accumulate a larger body of evidence of a student’s capabilities. (Seven writing samples were initially required in every student’s portfolio.) Reducing a portfolio to a number from 1 to 4 so early in the process negated the advantages of a wider score scale and in fact reduced the measurement value of portfolios to that of a single constructed-response item.

The Kentucky portfolio scoring approach also led to inflated scoring in some schools, which the audit process identified, as described later in this chapter. Teachers knew that student writing had improved, but the only way they could show improvement was by assigning higher scores to the work. With so few score points, each point corresponded to a wide range of quality, and for many students, their improvement was not enough to cross the line to the next performance level. This problem was verified in many audited schools. (Interestingly, the more limited, on-demand performance events administered in Kentucky had much larger score ranges.)

Lesson: Although no particular score range is optimal, a wider range of possible scores on a task or project allows more finely grained distinctions to be made and growth or gains to be more appropriately noted. Of course, for a given task or project, several products or performances could contribute to that range (e.g., a writing sample, an oral presentation, a model, and even responses to follow-up questions). Collapsing the score points into fewer score ranges can then be done later for purposes of performance-level reporting and even equating. Also, different measures related to a task or project might be counted toward different subject area scores.
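As an illustration of this lesson, the sketch below uses hypothetical point allocations and cut scores: several products contribute to one task's wide raw-score scale, and collapsing into four performance levels is deferred to the reporting stage.

```python
# Illustrative sketch (hypothetical point allocations and cut scores).
component_points = {               # products contributing to one task's score
    "written report": 12,
    "oral presentation": 6,
    "follow-up questions": 4,
}
max_score = sum(component_points.values())
print(f"raw-score scale for the task: 0-{max_score} points")

cut_scores = [6, 12, 18]           # hypothetical cuts separating levels 1-4

def performance_level(raw_score: int) -> int:
    """Collapse the wide raw-score scale into four reporting levels."""
    return 1 + sum(raw_score >= cut for cut in cut_scores)

for raw in (5, 11, 17, 20):
    print(f"raw score {raw:2d} -> performance level {performance_level(raw)}")
```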

Feasibility

It is not enough for performance assessments to be of high quality. They must also be feasible to administer and maintain. For example, some assessment strategies that can be used with smaller samples of students may become unmanageable for larger populations, and solutions must be found for enabling administration and scoring to be handled in sustainable ways. We discuss lessons associated with creating manageable assessments, including supporting scoring so that it can be both affordable and reliable over time and using technology in new and innovative ways.

Manageability for Large-Scale Administration

At the start of the so-called authentic assessment period, the emphasis of many performance assessment programs was on group (school or statewide) results. Almost thirty years ago, a performance assessment project in the United Kingdom involved trained administrators going into the schools to administer performance tasks (Burstall, 1986; Burstall, Baron, & Stiggins, 1987). Students worked in pairs on the tasks, out of the belief that a one-on-one situation might be intimidating to students, whereas working with a peer would be more comfortable even in the presence of an adult outsider.

The same approach was used in performance testing projects in Connecticut and Massachusetts during the 1980s, both of them small-scale studies with a limited sample of students (Badger, Thomas, & McCormack, 1990; Kahl, Abeles, & Baron, 1985). Trained task administrators (local or otherwise) followed detailed instructions and used prepackaged kits of materials. These two efforts were one-time probes, not intended for accountability. The sampling designs did not enable reporting of local results. Instead, reports focused on what the activities revealed about student understanding and on instructional implications. Thus, the findings were reported in much the same way as those of pre-1980 NAEP.

Although this approach was plausible for sample testing, it proved unsustainable for population testing. In Kentucky, for example, all students in three grades participated in performance testing during the first three years of the program. Trained administrators carried kits of materials to the schools and conducted testing one class at a time, with teams of three or four students working on different tasks during the same fifty-minute session. In this case, students worked together for the first part of the period and individually in the latter part to produce unique, individual, scorable products that were returned to the assessment contractor for central scoring. With students taking unequated tasks, no attempt was made to report student-level results; instead, results were aggregated and reported at the school level. Because of the high quality of the tasks and scoring, the state was able to count the results of the performance testing toward the accountability index for every school.

This component of Kentucky’s assessment program was expensive and labor intensive for the state and burdensome for the schools. Also, because the tasks were not necessarily related to the content being taught at the time, the efforts were of little immediate instructional value. The material and personnel needs of the programs posed challenges in terms of time, logistics, and expense. Thus, the approach did not prove sustainable over time.

An alternative is to create a system that incorporates curriculum-embedded assessments. This would eliminate the costs associated with administering isolated tasks statewide using external administrators who transport cartons of materials to every school. Similar savings apply to scoring the resulting student work: the logistics of transporting student work (which can take many forms, not just writing) for central scoring, and the central scoring itself (which would be in addition to scoring that might already be necessary for on-demand constructed-response testing), can be time-consuming and expensive.

Local Administration with Central Auditing

For this kind of system to work, a set of procedures must be in place to create comparable administration and scoring. This can be accomplished through centralized specification of tasks, training or auditing processes, or combinations of these features. For example, in the statewide portfolio assessment programs in writing and mathematics developed by Vermont and Kentucky, the types of work samples to be permitted in the individual student portfolios were ultimately specified, while task development and selection were left to the teachers. In Vermont, teachers came together and were trained to conduct central scoring; in Kentucky, teachers scored their own students’ portfolios. Kentucky used an audit procedure by which samples of portfolios were scored centrally and audit results reported back to schools with some additional scorer training provided to teachers on a limited basis.

These portfolio assessments, successful in many ways, made performance assessment feasible on a large scale by placing a great deal more responsibility for various aspects of the process in the hands of local educators. As described in chapter 2, these assessments became more standardized in the design and administration of tasks in order to gain sufficient scoring reliability. Over time, the Kentucky model in particular achieved this goal to a high degree.

Kentucky’s solution (similar to the strategy used for the New York State Regents Examinations) was to have local educators score their students’ work in the writing portfolio, while the state audited the local scoring on a sampling basis and provided additional training as needed. For example, at the end of the second year of assessment, Kentucky audit results showed that the scores submitted by some schools were inappropriately high. (One reason for this was associated with the limited score scale.) These audit results were verified by an audit of the audit. Teachers in schools whose scores were found to be inaccurate were given extra training; they rescored their portfolios with close monitoring for accuracy, and the new scores, which were considerably more comparable, became the scores of record. The following year, the writing portfolio scores in the previously audited schools where extra training was furnished were found to be accurate (Kentucky Department of Education, 1997). The audit sample design was such that over a three-year period, all schools would have their portfolio scores audited and derive the benefit of additional training, if needed.

Lesson: Statewide performance testing of the general population of students using external administrators and supplying specialized materials is quite expensive. Cost efficiencies can be realized if tasks are administered by local school personnel and required materials are readily available in the schools, in homes, or online. To justify the burden on teachers and students, it would be best if the tasks were curriculum embedded. That is, they should be relevant, instructionally sensitive, and syllabus based.

Consistency of local scoring across schools, with comparability of results, can be accomplished if states make a commitment to teacher training, as well as an audit process and associated “remediation” to yield scoring that is consistent across a whole state. Some audit sampling and feedback approaches can significantly cut down on the need for remediation. For example, the use of interim measures (e.g., curriculum-embedded performance assessments) throughout the year, long before accountability results must be reported, allows time for feedback to local scorers. For each performance task, schools could be asked to submit the work of just a few students for central audit scoring, and the scores from that process, reported back to the schools, could be used as benchmarks to anchor the scoring of the work of the rest of the students. This practice is consistent with the guidance in “Operational Best Practices” (Association of Test Publishers & Council of Chief State School Officers, 2010) regarding real-time monitoring of scoring accuracy.
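A minimal sketch of the audit logic just described, with invented scores and a hypothetical tolerance: a few locally scored pieces of work per task are rescored centrally, and a school whose local scores drift too far from the audit scores is flagged for benchmark feedback, additional training, and monitored rescoring.

```python
# Illustrative sketch (invented scores, hypothetical tolerance): comparing a
# school's local scores with central audit scores for a small sample of work.
audit_sample = [
    # (student, local score, audit score) on a 20-point task
    ("A", 17, 14),
    ("B", 15, 12),
    ("C", 9, 8),
]

TOLERANCE = 2.0   # hypothetical policy threshold, in raw-score points

diffs = [local - audit for _, local, audit in audit_sample]
mean_diff = sum(diffs) / len(diffs)

if abs(mean_diff) > TOLERANCE:
    print(f"mean local-minus-audit difference {mean_diff:+.1f}: "
          "flag school for additional scorer training and monitored rescoring")
else:
    print(f"mean difference {mean_diff:+.1f}: local scores accepted as scores of record")
```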

Technology

As the CLA has demonstrated, technological advances are beginning to enable highly reliable computer-based scoring of complex student responses. Coupled with appropriate use of human scoring to help produce the data for developing a scoring algorithm, check on its reliability, and score outlier responses that cannot be evaluated by machine, this technology can also enhance the feasibility of performance assessments. (See chapter 6, this volume.)
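A toy sketch of this general approach, not the CLA's actual engine: human-assigned scores on a small set of responses are used to fit a simple text-regression model, and responses that look too unlike the calibration set are routed back to human scorers. The responses, scores, and novelty threshold are all invented.

```python
# Toy sketch of machine scoring calibrated against human scores; not any
# operational engine. Real systems use richer features, much larger training
# sets, and validated routing rules.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

human_scored = [                     # (response text, human score on a 0-6 scale)
    ("The data support the claim because the sample was randomly assigned.", 5),
    ("I think it is true because it just is.", 1),
    ("The evidence is mixed because the control group differed in age.", 4),
    ("The author assumes correlation implies causation, which is unwarranted.", 6),
]
texts, scores = zip(*human_scored)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)          # calibration responses as tf-idf vectors
model = Ridge().fit(X, scores)               # fit to the human scores

def machine_score(response, novelty_threshold=0.15):
    """Return a machine score, or None to route the response to a human scorer."""
    x = vectorizer.transform([response])
    if cosine_similarity(x, X).max() < novelty_threshold:   # unlike anything in training
        return None
    return float(model.predict(x)[0])

print(machine_score("The claim is unsupported because causation was assumed."))
print(machine_score("xkcd zzz"))             # routed to a human: prints None
```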

In addition, moving to a more balanced, multiple-measure accountability system depends in part on developing intelligent technologies to capture and transform information beyond simple test scores, including both formative and summative student performance data, ranging from simple text to digital media that can display oral and visual exhibitions of student work. Many data management systems currently in use yield accessible and relevant demographic and test score data. These systems, however, are not generally structured to produce actionable “just-in-time” evidence of academic factors that schools, districts, and states can use to guide curriculum, instruction, and assessment. More important, these systems do not generally focus on putting classroom-based formative and diagnostic data in the hands of teachers and educators to support struggling students and to continuously monitor student progress over time—so-called early-warning and on-track measures. Ready access to actionable data embedded in the school’s culture and norms can guide development of preventive and proactive strategies to strategically target resources to high-leverage areas of need, which will lead to improved student outcomes and school improvement.

A smart technology system is designed to create viable multimedia tools for teachers and schools to advance learning through the interactive use of information and communication technologies that allow people to share, develop, and process information. A platform that integrates traditional and nontraditional data (such as performance data and digital media) is critically needed to support high-performing schools in the future.

Lesson: Technologies that will enable educators to use and capture classroom performance data as the data unfold in real time can be powerful tools to support strategic decision making for an immediate difference at the classroom, school, and district levels. Accomplishing such change is possible if intelligent technology platforms are developed to build capacity and support accountability tools that can have a positive influence on curriculum, instruction, and assessment.

A PROPOSAL FOR NEW STATE SYSTEMS OF ASSESSMENT

If local educators are to teach for higher-order skills, then the next generation of high-stakes accountability assessment should include challenging performance tasks aligned to the demands of college and career in the twenty-first century. This can be accomplished by a two-pronged approach involving (1) a more rigorous statewide, on-demand test that includes higher-order, constructed-response questions and (2) a locally administered and scored, curriculum-embedded performance assessment component that addresses skills not measurable by the on-demand test. In the latter, the local educator role is more substantial than in traditional testing because of participation in implementation and scoring of performance assessment. As in many examination systems abroad (see chapter 4), the scores from these two components can be combined to produce a total score.

On-Demand Component

It is important that both assessment components model good classroom practices. As discussed earlier, sole reliance on multiple-choice tests can and will narrow curriculum and drive instruction toward tested skills (Amrein & Berliner, 2002; Shepard et al., 1995). These authors recommend that an on-demand component (either paper and pencil or computer delivered) include a significant number of constructed-response questions carrying substantial weight toward the total score. Teachers tend to model the state tests in their classroom testing; as Kentucky showed, on-demand, constructed-response testing can indeed lead teachers to place greater emphasis on this format, using rubrics for scoring and gaining the benefit of seeing more actual student work while focusing more on higher-order thinking skills.

Over the past few decades, several states have used a test with common and matrix-sampled questions. Matrix sampling is the practice of generating multiple forms of an assessment, each with a different set of items (though some items overlap across forms), and administering these different forms to different students within classrooms and schools. Common questions are the same across all forms of a test at a grade level and are the basis for individual scores. Matrix-sampled questions differ across forms and serve several purposes. If included for successive years, they can be used for test-equating purposes. Also, matrix sampling is a means of field-testing items for use in future years, replacing common items that are released or held for reuse in later years. Embedding field-test items in operational tests constitutes the most effective means of field-testing items because students do not know which items are operational and which are being field-tested. Consequently, student motivation is the same for both. Matrix-sampled items can also be used to bolster measurement in subtest areas for which school results are produced.
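A minimal sketch of the common/matrix-sampled idea, with a hypothetical item pool: every form carries the same common items (the basis for individual scores), while the matrix-sampled items (equating and field-test items) are dealt out so that each appears on only one form.

```python
# Illustrative sketch (hypothetical item pool): assembling forms that share a
# common block and distribute matrix-sampled items across forms.
from itertools import cycle

common_items = ["C01", "C02", "C03", "C04", "C05", "C06"]     # on every form; basis for scores
matrix_items = ["M01", "M02", "M03", "M04", "M05", "M06",     # equating and field-test items,
                "M07", "M08", "M09", "M10", "M11", "M12"]     # spread across forms
N_FORMS = 4

forms = {f"Form {i + 1}": list(common_items) for i in range(N_FORMS)}
assignments = cycle(forms)                    # deal matrix items out to forms in turn
for item in matrix_items:
    forms[next(assignments)].append(item)

for name, items in forms.items():
    print(name, items)                        # 6 common items + 3 matrix items per form
```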

The NECAP uses a common/matrix-sampled design similar to the one used previously by several consortium states. This design allows the equating items to approximate the proportions of the various item types in the common test. The widely cited Massachusetts Comprehensive Assessment System has used a test design similar to the NECAP’s since the late 1990s. Matrix sampling of multiple item types is an efficient way to develop multiple test forms within grade levels, as well as to support the development of a vertical score scale across grade levels (K–12).

Matrix sampling is a cost-efficient and technically sound practice that allows sampling across a content domain to measure the full range of learning outcomes within a school or district. During the 1990s, matrix sampling allowed school-level reporting in a policy context in which schools were the focus of accountability policies in many states (e.g., Maryland, Nebraska, and Kentucky). By the early 2000s, however, with the authorization of NCLB, student-level reporting on state tests had become a policy imperative; matrix-sampling approaches are now rare in state-level summative assessment.

Curriculum-Embedded Performance Component

There is growing belief that to support education of students who can compete effectively in the digital age, assessment systems must be broadened to include locally administered, curriculum-embedded performance assessments (Darling-Hammond, 2010; Popham, 1999). Much of what is considered core knowledge can be assessed by traditional summative tests, but higher-order skills that traditional tests address either inadequately or not at all should be the focus of curriculum-embedded performance tasks designed to measure higher-order cognitive ability. A performance assessment component that capitalizes on the valuable lessons from the past could be built in the following way:

  1. The state posts models online: tried-and-true tasks and projects calling for individual, scorable products closely aligned to standards, with a total score range of at least twenty points for each task or project. The tasks use materials and other resources readily available in schools, homes, or online. The posting also includes sample student work, scoring rubrics, and specifications for tasks and projects.
  2. Teachers use the state-provided tasks and projects in their own instruction and as models for tasks and projects they develop themselves to submit to the state for review. The state also conducts professional development training sessions using both online and train-the-trainer or coaching models.
  3. The state reviews teachers’ submissions, selecting, rejecting, or revising them, and furnishes feedback to the teachers.
  4. The state selects high-quality tasks or projects for pilot testing, collects associated student work, and then posts the tasks and projects, rubrics, and sample student work online for local use. This development, vetting, field testing, and posting sequence is ongoing.
  5. The state holds back (does not post) selected tasks or projects, saving them for later use in the local performance assessment component of accountability testing.
  6. The state posts a set of tasks or projects for schools to administer within a specified time frame. Teachers score the resulting student work and submit the scores to the state.
  7. Each school identifies a low-, middle-, and high-performing student for each task or project and submits the work of those students to the state using electronic portfolio platforms. The teachers’ scoring for those students is audited (rescored) by content specialists.
  8. Audit scores are sent back to the schools, and local personnel adjust the scores of their students to be consistent with the benchmarks obtained through the audit process.
  9. The next year, steps 6, 7, and 8 are repeated three times, with the tasks and projects for each round chosen to coincide as closely as possible to the time during the year when relevant instruction is given.
  10. The results of the performance component are combined with those of the on-demand assessment component, thereby contributing to both student- and school-level results (a simple weighted combination is sketched after this list).
  11. States support and supply resources for creating learning networks that build and spread educator capacity to strengthen instructional practice by means of effective use of performance assessments.
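One simple way to carry out step 10, sketched with hypothetical weights and score ranges; the actual weighting and scaling would be state policy decisions.

```python
# Illustrative sketch (hypothetical weights and score ranges): combining the
# on-demand component with the curriculum-embedded performance component.
ON_DEMAND_MAX = 60        # points available on the on-demand exam
PERFORMANCE_MAX = 40      # points across the year's performance tasks
WEIGHTS = {"on_demand": 0.6, "performance": 0.4}   # hypothetical policy choice

def total_score(on_demand_points: float, performance_points: float) -> float:
    """Weighted composite on a 0-100 scale."""
    on_demand_pct = on_demand_points / ON_DEMAND_MAX
    performance_pct = performance_points / PERFORMANCE_MAX
    return 100 * (WEIGHTS["on_demand"] * on_demand_pct
                  + WEIGHTS["performance"] * performance_pct)

print(total_score(45, 32))   # 45/60 on demand and 32/40 on tasks -> 77.0
```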

This approach, which builds on successful models of balanced assessment systems in the United States and abroad, could help create high-quality, reliable, and feasible assessments that are also educationally valuable. Creating systems that support the teaching and learning process also renders assessment more consequentially valid, and more validly consequential, as we describe next.

A New Approach to Validity

The external validity argument for performance assessments begins with the fact that to meet the new demands of the Common Core standards (aimed at ensuring that all students are college and career ready), measures of higher-order thinking need to be a core component of the next generation of assessment and accountability in this country. Performance tasks are necessary to evaluate whether students can think critically, reason analytically, solve problems, and communicate effectively.

Achieving these goals requires a more balanced view of assessment and accountability that includes both formative information—how students develop and access learning resources to complete challenging tasks—and summative judgments of student learning, based on performance tasks that are aligned to national standards and can be used for district and state accountability purposes. Because these metacognitive tasks are situated in the learning process in classrooms, they change the nature of the validity argument supporting use and interpretation of embedded performance tasks. A broader conception of a validity argument is suggested—one that begins with a detailed description of the constructs being measured, takes into account multiple types of evidence at different levels of the scale, and is sensitive to the dynamic interaction between the student and the task, as the act of inquiry continuously shapes student learning.

Validity theory is centered on claims about the appropriateness of interpretations of data in relation to student performance on a test or performance task (Cronbach, 1971; Frederiksen & Collins, 1996; Mislevy, Steinberg, & Almond, 2003; Pellegrino, Chudowsky, & Glaser, 2001; Moss, Girard, & Haniford, 2006). According to Campbell and Stanley (1963), to claim test validity means that the evidence obtained from the assessment supports the intended interpretation to the extent that this interpretation is stronger than any alternative explanation (e.g., internal validity).

Applying this paradigm to performance assessment focuses on the scorer’s collecting and presenting the evidence used to make judgments about the knowledge and skills the student exhibits. Teachers or expert scorers who have been trained and calibrated to score consistently using a scoring protocol or rubric often make judgments rooted in cognitive theory about student learning. As they analyze student work more completely and accurately, they are better able to understand the abilities that have been developed and are being exhibited by students. (See chapter 7, this volume.)

This chain of validity evidence to support curriculum-embedded work requires clear understanding of the skills to be measured and rich description of the performance to support interpretation of a student’s metacognitive ability to carry out complex performance tasks. In addition, teachers can use this chain of evidence to inform their practice by evaluating the impact of specific instructional strategies on student learning, including subgroup analyses (Moss et al., 2006).

Teachers and students need to know the learning demands of the tasks the students are expected to master, and teachers need to create instructional opportunities for students to complete the tasks successfully. This interaction between standards and tasks in the classroom depends on teachers and students developing a common understanding of the skills that will be measured, along with a clear description of the performance indicators used to interpret students’ performance on the basis of standards, student work samples, benchmarks, and rubrics. To this end, the goal of creating more transparency in assessment is to signal to students the knowledge and skills necessary for success in the classroom.

In addition, engaging students in self-assessment and peer assessment of student work causes a significant shift in how we think about test administration and validation. That is, the standardized administration procedures used to ensure the objectivity and comparability of on-demand tests are confounded when students have the opportunity to collaborate on performance tasks. Research suggests that opportunities for peer collaboration, coupled with formative feedback to students, are a leading indicator of student knowledge of the subject matter and a strong predictor of future success (Black & Wiliam, 1998; Bransford & Schwartz, 2001).

Finally, performance assessment data go well beyond supplying scores. They are designed to inform students and teachers about what is important to learn, what learning looks like, and how learning is shaped by the context of the learning environment and its learners (Engeström, 1999). As a result, broadening the conception of validity to address richer and more complex performance tasks requires considering how assessment functions in various instructional and school-based contexts and how the learner is influenced (shaped) by the learning environment.

CONCLUSION

The United States has adhered to an assessment paradigm that principally favors one form of measurement, driven by a restrictive focus on developing standardized tests to meet standards of performance as defined by NCLB. Instead of ongoing battles between competing policies and testing philosophies, it is time to take a bold and inclusive step forward by focusing efforts on keeping what works in NCLB, fixing its problems, and broadening the prevalent conception of accountability to include performance measures of higher-order thinking.

Other high-performing nations are implementing more balanced content- and skills-driven accountability systems that use classroom-based performance measures in combination with national examinations to assess student learning. (See chapter 4, this volume.) Although several of these countries stepped up to the new global and economic realities and reformed their education systems, the United States has not yet fully done so.

This chapter describes the lessons learned from states’ current and past efforts to use performance assessments. It builds on those lessons to offer a new vision for making performance assessments an integral part of a statewide, multiple-measure, balanced assessment system. Identifying promising practices, as well as prior missteps, in building performance assessment for statewide use can inform and guide the development of the next generation of assessment in this country. The lessons learned can inform policymakers and practitioners alike. Future work to develop new measures of student learning should be accompanied by an equally vigorous effort to develop and evaluate a system of technical, organizational, and human resource supports for states and institutions to enable them to make better use of accountability data, including performance data. Designed and used well, development of the next generation of state accountability systems has the potential to strengthen instruction, curriculum, and assessment, as well as serve as a catalyst to reform schools and districts.
