Chapter 2
Looking Back: Performance Assessment in an Era of Standards-Based Educational Accountability

Brian Stecher

Performance assessment—judging student achievement on the basis of relatively unconstrained responses to relatively rich stimulus materials—gained increasing favor in the United States in the late 1980s and 1990s. At least one national commission advocated the replacement of multiple-choice tests with performance assessments (National Commission on Testing and Public Policy, 1990); the National Assessment of Educational Progress (NAEP) conducted extensive pilot testing of hands-on assessment tasks in science and mathematics (Educational Testing Service, 1987); and performance assessments were adopted by a number of state testing programs, including Vermont, Kentucky, Maryland, Maine, Nebraska, Washington, and California.

Yet despite this enthusiasm, performance assessment almost disappeared from large-scale K–12 testing in the United States in the intervening years (Council of Chief State School Officers, 2009) and is just beginning to return to the educational policy landscape. A number of factors account for the failure of performance assessment to maintain a large role in achievement testing in this country, despite its wide use in some other countries, and this history can inform educators and education policymakers looking for better ways to test students and schools in an era of standards-based accountability.

The chapter begins with a definition of performance assessment and suggests ways to classify different types of performance tasks. Then it provides background information on large-scale testing to familiarize readers with key terms and concepts. A review of some recent performance assessment efforts in the United States follows, along with a summary of research on the quality, impact, and burden of performance assessments used in large-scale K–12 achievement testing. The chapter concludes with a discussion of the relevance of performance assessment to contemporary standards-based educational accountability and offers recommendations to support effective use of this form of assessment.

ARGUMENTS FOR PERFORMANCE ASSESSMENT

Tests define desired knowledge and skills through both the content they cover and the format of their items, that is, the prompts and, in the case of multiple-choice items, the response options used to elicit student choices. These format choices reduce the generality of the test results to some degree because they represent only one of many possible ways that knowledge or skills might be demonstrated. Such preferences on the part of test developers are often unrecognized, but they limit what test scores tell us about student understanding of the domain of interest. Furthermore, in some cases these incidental features of test items have become the focus of test preparation, further eroding the meaning of the test scores (Koretz, McCaffrey, & Hamilton, 2001).

Although item writers can, in theory, write multiple-choice items to measure a wide range of skills, including complex reasoning, some behaviors can be only weakly approximated in this format. For example, you cannot fully measure a person’s ability to write a persuasive essay, conduct a scientific investigation, or perform a somersault using multiple-choice items.1 Therefore, if a test uses only multiple-choice items, some aspects of the domain may be excluded from the test specifications. Furthermore, novice item writers may find that the format limits the kinds of skills they can measure and the ways they can measure them.

In general, the greater the distance between the specifications and items that constitute the test and the academic content standards that describe the domain—in terms of content, cognitive demands, and format—the less confidence we can have that the test score will reflect understanding of the domain. One of the potential advantages of performance tasks over multiple-choice items is that they offer broader windows onto student understanding. By using more complex prompts and permitting more unconstrained responses, performance assessments can represent many domains in more complete ways.

Advocates of performance assessment argue that the fixed set of responses in multiple-choice tests (and their cousins, true-false tests and matching tests) is inauthentic. That is, the tests do not reflect the nature of performance in the real world, which rarely presents people with structured choices. With the possible exception of a few game shows, people demonstrate their ability in the real world by applying knowledge and skills in settings where there are no predetermined options. A person balances his or her checkbook; buys ingredients and cooks a meal; reads a news article in the paper and frames an opinion of the rightness or wrongness of the argument; assesses a customer’s worthiness for a mortgage; interviews a patient, then orders tests and diagnoses the status of his or her health; listens to a noisy engine running at low and high revolutions per minute and identifies the likely cause. Even in the context of school, the typical learning activity involves a mix of skills and culminates in a complex performance: a homework assignment, a persuasive letter, a group project, a research paper, a first down, a band recital, a sketch. Rarely does a citizen or a student have to choose among four distinct alternatives.

Multiple-choice tests are also criticized because the stimulus materials tend to be very limited (usually short passages or problem statements with simple figures or diagrams) and because the demands of the format encourage test developers to focus on declarative knowledge (knowing that) or procedural knowledge (knowing how) rather than schematic knowledge (knowing why) or strategic knowledge (knowing when, where, and how our knowledge applies) (Shavelson, Ruiz-Primo, & Wiley, 2005).

Advocates of performance assessment argue that the limits of multiple-choice tests can be overcome in part by replacing them with tests in which respondents have to construct responses instead of selecting responses from a predetermined set. Such situations are common in schools, suggesting that they are a more authentic way to judge what students have learned. In almost every subject, homework includes open-ended tasks: mathematics problems where students work out the solution (even earning partial credit for partial solutions); English assignments where students reflect on a text from their own point of view; chemistry laboratory where students record their observations while conducting an experiment; choral music ensembles where students sing a score. All of these situations are forms of performance assessment. Once students leave school, the tasks they will encounter in their lives will be even less structured in terms of both the assignments and the nature of the responses. For this reason, many educators use the term authentic assessment to emphasize the similarity of the task to the real world.

In some instances, advocates for performance assessment have argued that certain kinds of very rich activities are valuable in their own right as demonstrations of a set of core understandings and abilities and are not just proxies that allow generalizations about a larger construct (Haertel, 1999). These demonstrations—which combine many kinds of knowledge and skills into a major undertaking that represents the way in which work in that domain is generally done—might include performances like designing and completing an independent science investigation, an independently designed computer program, or a doctoral dissertation. Others have supported efforts to infuse the curriculum with rich performance tasks because they reveal more about students’ thinking and are useful for instructional planning and classroom assessment. This chapter focuses on large-scale testing in which items, including performance tasks, represent a more bounded set of concepts or skills and are viewed as samples from a domain.

If performance events are so pervasive in our lives and performance assessments have such advantages over multiple-choice tests, why do we rely almost exclusively on multiple-choice tests when making important judgments about students (promotion to the next grade level, graduation), schools (making adequate yearly progress), and, recently, teachers (value-added judgments about effectiveness)? Should educators and policymakers be trying to overturn the tyranny of multiple-choice testing that exists in educational accountability systems? If so, what would they need to do to ensure that new assessments could survive and become a successful tool of education policy?

DEFINING PERFORMANCE ASSESSMENT

In this chapter, the terms test and assessment are used interchangeably, although multiple choice is paired with test and performance with assessment. Individual multiple-choice questions are called items, and individual performance activities are called tasks.

Constructed Response versus Selected Response

For many educators, performance assessment is most easily defined by what it is not; specifically, it is not multiple-choice testing. In performance assessment, rather than choosing among predetermined options, the examinee must construct or supply an answer, produce a product, or perform an activity (Madaus & O’Dwyer, 1999). From this perspective, performance assessment encompasses a wide range of activities, from completing a sentence with a few words (short answer), to writing a thorough analysis (essay), to conducting a laboratory investigation and writing a descriptive analysis of the process (hands-on). Given this range, it is surprising how often people make general claims about performance assessment without differentiating among types.

Elements of Performance Assessment

In the literature, different authors use the term performance assessment to mean different things. Some emphasize the cognitive processes demanded of the students, some the format of the desired response, and others the nature and content of the actual response (Palm, 2008). These differences in emphasis underscore one of the lingering problems facing performance assessment, which is that different educators and policymakers have different implicit meanings for the term.

For the purposes of this chapter, I define a performance assessment primarily in terms of the performances required of test takers involved in large-scale assessments. A performance task is a structured situation in which stimulus materials and a request for information or action are presented to an individual, who generates a response that can be rated for quality using explicit standards. The standards may apply to the final product or the process of creating it. A performance assessment is a collection of performance tasks.

This definition has four important elements. First, each task must occur in a structured situation, meaning the task is constrained with respect to time, space, access to materials, and so on. The standardized structure makes it possible to replicate the conditions, so the same assessment can be administered to different people and their performances can be compared. The requirement that there be structure with respect to administrative conditions does not exclude from consideration complex, extended tasks, such as conducting a scientific experiment and reporting the results; instead, structure ensures that tasks can be replicated at different times and in different places. The requirement for structure does exclude from consideration a number of familiar instances of assessment, including oral examinations in which the examiner asks a different question of each student; ratings of behaviors in natural settings, such as ratings of athletes during competition (Nadeau, Richard, & Godbout, 2008) or observations of students’ literacy behaviors (Meisels, Xue, & Shamblott, 2008); and assessments that attempt to compare change in performance over time (Wentworth et al., 2009).

Second, each performance task contains some kind of stimulus material or information that serves as the basis for the response. In this respect, performance tasks can be very similar to multiple-choice items. They might begin with a mathematical problem, a text passage, a graph, or a figure. However, because the response options are unconstrained, the stimulus material does not have to lead to four specific choices. That freedom permits performance tasks to include more varied, complex, and novel stimuli than are typically used in multiple-choice assessments.2

Third, the task must have directions indicating the nature of the desired response. The directions can be part of the stimulus materials (e.g., “How many 3-inch by 4-inch tiles would it take to cover a circular dance floor 8 feet in diameter?”) or separate from them (e.g., “Read these descriptions of the Battle of the Bulge written by soldiers on opposing sides and write an essay to support or refute the claim that the Allied and Axis soldiers were more similar than different”). The key feature of these directions is that they be explicit enough that two test takers would have a similar understanding of what they are asked to do. Because the responses can be much broader than the responses for multiple-choice tests, vague queries (e.g., “What do you think about that?”) that can be widely interpreted must be avoided.

Fourth, the task must prompt responses that can be scored according to a clear set of standards. It is usually the case that the standards are fully developed before the task is given. If the task developer does not know what constitutes a good response, it is unlikely that the task is measuring something clearly enough to be useful. In some cases, however, the scoring rubrics will need to be elaborated on the basis of responses received. For example, students may come up with legitimate responses the developers did not anticipate.

TYPES OF PERFORMANCE ASSESSMENTS

The definition I gave in the previous section admits a wide range of performance assessments, and it is helpful to have a system for classifying them into categories for discussion.

Classifying Based on Stimulus Materials and Response Options

I suggest a two-way classification scheme based on the structural characteristics of the task, particularly the nature of the stimulus materials and the nature of the response options. (This scheme is inspired by the work of Baxter & Glaser, 1998, discussed subsequently.) The stimulus materials can be classified in terms of complexity along a dimension that runs from simple to complex. (See table 2.1.)

Table 2.1 Classification Based on Task Structural Characteristics

                      Response: Simple                Response: Complex
Stimulus: Simple      Simple stimulus,                Simple stimulus,
                      simple response                 complex response
Stimulus: Complex     Complex stimulus,               Complex stimulus,
                      simple response                 complex response

A math task that asks the student to solve an equation for x represents a relatively simple stimulus. In contrast, a language arts task that asks the student to read an essay and a poem and to look at a reproduction of a painting as the basis for a comparative task presents a relatively complex set of stimulus materials. Similarly, the response options can be classified in terms of freedom along a dimension that runs from constrained to open. A short-answer history question requiring the test taker to fill in a short phrase to correctly place an event offers a relatively constrained response space. In comparison, a science task in which students are given a set of leaves to observe and are asked to create at least two different classification schemes and arrange the leaves into groups based on each scheme offers a relatively open range of responses.

By crossing the stimulus and response dimensions, we create four quadrants that can be used to classify all performance tasks. A written, short-answer (fill-in-the-blank) question is an example of a relatively simple, relatively constrained task. A math word problem that requires setting up equations, using a graphing calculator, and performing other calculations is an example of a relatively simple, relatively open task. Such tasks can also be found in many international examinations, such as the A-level examinations used in England, in which relatively simple prompts require extended, open-ended responses demonstrating the application of knowledge in real-world settings (see Darling-Hammond, chapter 4, this volume).

The language arts task in which an essay, a poem, and a piece of art serve as prompts for a comparative essay on the similarities and differences in the artists’ visions of the subject is an example of a relatively complex, relatively open task. The College and Work Readiness Assessment is another example; it includes complex, open tasks in which students must review a set of materials that might include graphs, interviews, testimonials, or memoranda; extract information from each; assess the credibility of the information; and synthesize it to respond to a problem situation (Hersh, 2009). An interesting feature of this assessment is that it can be administered online with both prompts and responses transmitted over the Internet.

At this point, a savvy reader should object that the stimulus-response classification scheme focuses on surface features of the task and ignores more important cognitive and performance aspects of the activity. I agree and discuss these aspects below. Yet this simple classification will be useful when thinking about some of the practical aspects of performance assessment, including feasibility, burden, and costs.

Classifying Based on Content Knowledge and Process Skills

Baxter and Glaser (1998) suggest a way to classify science performance tasks by their cognitive complexity, and this approach could be used more generally. They divide the science assessment space into four quadrants depending on whether the process skills demanded are open or constrained and whether the content knowledge demanded is lean or rich. They provide examples of science tasks corresponding to each quadrant. For example, “Exploring the Maple Copter” is an example of a content-rich and open-process task. In this task, high school physics students are asked to design and conduct experiments with a maple seed and develop an explanation of its flight for someone who does not know physics. The two-way content-process classification is helpful for characterizing the cognitive complexity of performance tasks. This distinction is useful when thinking about the inferences that are appropriate to make from scores on performance assessments and the kinds of information that would be needed to validate those inferences.

Classifying Based on Subject Field

The examples I have given suggest that subject field may be another useful way to classify performance tasks. Expertise in one subject is demonstrated very differently from expertise in another, and performance assessment lets these distinctive styles of thinking and performing come to the fore. For example, “doing science” involves observing events, designing experiments, imposing controls, collecting information, analyzing data, and building theories, and rich performance tasks in the sciences incorporate these kinds of skills and behaviors. In contrast, “doing mathematics” is somewhat more abstract. Although mathematics begins with observations of objects in space, the discipline focuses more on manipulating numbers, building facility with operations, translating situations into representations, explicating assumptions, and testing theories. Performance assessment in mathematics is more likely to involve solving problems with pencil, paper, and calculators, representing relationships in different forms such as graphs, and so on.

Similarly, performance assessment in the arts involves performing—music, dance, drawing, and more. In language arts, performance assessments are likely to focus on the written word: reading; comprehending; interpreting; comparing; and producing text that describes, explains, and persuades, for example. Because of these disciplinary ways of thinking and acting, educators in different fields may be thinking about very different kinds of activities when they refer to performance assessments.

Finally, some performance assessments are designed to be less discipline bound. For example, the Collegiate Learning Assessment includes tasks that are designed to measure the integration of skills across disciplines (Klein, Benjamin, Shavelson, & Bolus, 2007). Similarly, Queensland, Australia, has developed a bank of rich tasks that call for students to demonstrate “transdisciplinary learnings” (Darling-Hammond & Wentworth, 2010; see chapter 4 in this volume). Other performance tasks are designed to measure transferable reasoning skills that can be applied in many contexts.

Portfolio Assessments

A portfolio is a collection of work, often with personal commentary or self-analysis, assembled by an individual over time as a cumulative record of accomplishment. In most cases, the individual selects the work that goes into the portfolio, making each collection unique. This was the case with the Vermont and Kentucky portfolio assessments in the early 1990s (described later in this chapter). Although both became more standardized in the definition of tasks over time, students and teachers could still choose which pieces of work within each requested genre to place in the portfolio that would be scored. This is also the case with most of the currently popular portfolios in teacher education programs.

According to the definition of performance assessment I gave above, nonstandardized individual collections of work are not performance assessments because each portfolio contains different performance tasks. One student might include a persuasive essay, while another includes a set of poetry. One might include the solution to a set of math word problems that are completely omitted from the portfolio of another. It is not just narrow-mindedness that leads us to exclude free-choice portfolios from the realm of performance assessment; rather, the lack of standardization undermines the value of such collections of work as assessment tools.

The initial Vermont experience is a case in point. Teachers and students jointly selected student work to include in each student’s mathematics and writing portfolios. Thus, two portfolios selected from a given teacher’s class contained some common pieces and some unique choices. This variation was exacerbated across teachers and made the portfolios difficult to score reliably (Koretz, Stecher, Klein, & McCaffrey, 1994). Studies have reported much higher rates of scoring consistency for more standardized portfolios featuring common task expectations and analytical rubrics, like those that evolved in Vermont and were ultimately developed in Kentucky (Measured Progress, 2009). Over time, these portfolio programs developed rules about how the tasks were produced, both to ensure additional commonality and to verify that each entry was the work of a particular student.

RECENT HISTORY OF PERFORMANCE ASSESSMENT IN LARGE-SCALE TESTING

In 1990, eight states were using some form of performance assessment in math or science, or both, and another six were developing or piloting alternative assessments in math, science, reading, and/or writing. An additional ten states were exploring the possibility of or developing plans for various forms of performance assessment. In total, twenty-four states were interested in, developing, or using performance assessment (Aschbacher, 1991). Twenty years later, the use of performance assessment has been scaled back significantly, although it has certainly not disappeared.

No Child Left Behind (NCLB) was a factor in many state decisions. For example, the requirement that all students in grades 3 through 8 receive individual scores in reading and math presented a major obstacle for states like Maryland that were using matrix sampling and reporting scores only at the school level. Before 2002, states were required to report scores only once in each grade span (e.g., grades 3, 6, and 9), rather than annually as NCLB required. Many states could not afford the costs of scoring extended performance tasks in all grades on an annual basis. Concerns about technical quality, costs, and politics contributed to changing assessment practices in other states. In this section, we recap some of this history to explore why the enthusiasm of the 1990s was tempered in the 2000s. (Chapter 3 contains a further discussion of state performance assessments that are currently in operation, as well as some new assessments that are on the horizon.)

The Appeal of Performance Assessment

The use of performance assessment can be traced back at least two millennia to the Han dynasty in China, and its history makes fascinating reading (Madaus & O’Dwyer, 1999). For our purposes, it is sufficient to look back two or three decades to the educational reform movement of the 1980s. This period was marked by an increased use of standardized tests for the purposes of accountability with consequences for schools and students (Hamilton & Koretz, 2002). Minimum competency testing programs in the states gave way to accountability systems in which test results variously determined student grade-to-grade promotion or graduation, teacher pay, and school rankings, financial rewards, or interventions. In this period, educators also began to subscribe to the idea that tests could be used to drive educational reform. The term measurement-driven instruction was coined to describe the purposeful use of high-stakes testing to try to change school and classroom behaviors (Popham, Cruse, Rankin, Sandifer, & Williams, 1985).

By the end of the decade, educators began to recognize a number of problems associated with high-stakes multiple-choice testing, including degraded instruction (for example, narrowing of the curriculum to tested topics and excessive class time devoted to test preparation) that led to inflated test scores (Hamilton & Koretz, 2002). They worried about persistent differences in performance between demographic groups, which many incorrectly attributed to the multiple-choice testing format. There were also pointed criticisms from specific content fields. Science educators, for example, complained that multiple-choice tests emphasized factual rather than procedural knowledge (Frederiksen, 1984).

Not wanting to give up the power of measurement-driven instruction to shape teacher and student behavior, many educators began to call for a new generation of “tests worth teaching to.” Under the banner of “what you test is what you get,” translated into the catchy acronym WYTIWYG (Resnick & Resnick, 1992), advocates of performance assessment thought they could bring about improvements in curriculum, instruction, and outcomes by incorporating more performance assessments into state testing programs. They also recognized that performance assessments could more easily be designed to tap higher-order skills, including problem solving and critical thinking (Raizen et al., 1989). This enthusiasm led many states to incorporate forms of performance assessment into their large-scale testing programs. We briefly summarize some of the more notable efforts in the following sections.

Vermont Portfolio Assessment Program

Educators in Vermont began to develop the Vermont Portfolio Assessment Program in 1988. They had twin goals: to provide high-quality data about student achievement (sufficient to permit comparisons of schools or districts) and to improve instruction. The centerpiece of the program was portfolios of student work in writing and mathematics collected jointly by students and teachers over the course of the school year. Teachers and students had nearly unconstrained choice in selecting tasks to be included in the portfolios. In writing, students were expected to identify a single best piece and a number of other pieces of specified types. In mathematics, students and teachers reviewed each student’s work and submitted the five to seven best pieces. The portfolios were complemented by on-demand “uniform tests” in writing (a single, standardized prompt) and mathematics (primarily multiple choice). The program was implemented in grades 4 and 8 as a pilot in 1990–1991 and statewide in 1991–1992 and 1992–1993. Early evaluation studies, however, raised concerns with the reliability of the scoring and the overall validity of the portfolio system (Koretz, Klein, McCaffrey, & Stecher, 1993).

As the tasks became more standardized, reliability began to increase (Koretz et al., 1994). In the late 1990s, though, the portfolio assessment program was replaced for accountability purposes with the New Standards Reference Exam (Rohten, Carnoy, Chabran, & Elmore, 2003), which included some on-demand performance tasks but not portfolios. Most local districts continued to use the portfolios for their own purposes, but they are no longer used for state-level reporting. More recently, Vermont joined with New Hampshire and Rhode Island to develop the New England Common Assessment Program, which includes multiple-choice and short constructed-response items in reading, math, and science, as well as a writing assessment. (See chapter 3 for more on this program and Vermont’s current assessment program.)

In addition, as part of the Vermont Developmental Reading Assessment (DRA) for second grade, students are asked to read short books and retell the story in their own words.3 Teachers score the student’s oral reading for accuracy and his or her retelling for comprehension. To ensure reliability, teachers who administer the assessment first have their scoring calibrated through an online process. The results of the DRA are reviewed annually at the Summer Auditing Institute. Note that the Vermont accountability system does not have high stakes for students; student promotion and high school graduation do not depend on test scores (Rohten et al., 2003).

Kentucky Instructional Results Information System

In response to a 1989 decision by the Kentucky Supreme Court declaring the state’s education system to be unconstitutional, the state legislature passed the Kentucky Education Reform Act of 1990. This law brought about sweeping changes to Kentucky’s public school system, including changes to school and district accountability for student performance. The Kentucky Instructional Results Information System (KIRIS) was a performance-based assessment system implemented for the first time in the spring of 1992.4 KIRIS tested students in grades 4, 8, and 11 in a three-part assessment that included multiple-choice and short-essay questions, performance “events” requiring students to solve practical and applied problems, and portfolios in writing and mathematics in which students presented the “best” examples of their classroom work collected throughout the school year. Students were assessed in seven areas: reading, writing, social science, science, mathematics, arts and humanities, and practical living/vocational studies (US Department of Education, 1995).

KIRIS was designed as a school-level accountability system, and schools received rewards or sanctions based on the aggregate performance of all their students.5 School ratings were based on a school accountability index that combined cognitive and noncognitive indicators (including dropout rates, retention rates, and attendance rates) and was reported in biennial cycles. Schools were expected to have all students at the proficient level, on average, within twenty years, and their annual improvement target was based on a straight-line projection toward this goal. Every two years, schools that exceeded their improvement goals received funds that could be used for salary bonuses, professional development, or school improvement. In 1994–1995, about $26 million was awarded, with awards of about $2,000 per teacher in eligible schools. The state also devoted resources to support and improve low-performing schools, including assigning “distinguished educators” to advise on school operations.
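To make the straight-line projection concrete, the sketch below computes a hypothetical biennial improvement target. The index scale, goal value, and baseline used here are illustrative assumptions, not the actual KIRIS parameters.

```python
# Illustrative sketch only: the index scale, goal, and baseline below are
# hypothetical, not the actual KIRIS accountability parameters.

def biennial_target(baseline_index: float,
                    goal_index: float = 100.0,
                    years_to_goal: int = 20,
                    cycle_years: int = 2) -> float:
    """Expected index at the end of the next accountability cycle, assuming
    equal (straight-line) improvement toward the long-term goal."""
    annual_gain = (goal_index - baseline_index) / years_to_goal
    return baseline_index + cycle_years * annual_gain

# A school starting at an index of 40 would be expected to reach
# 40 + 2 * (100 - 40) / 20 = 46 by the end of the next two-year cycle.
print(biennial_target(40.0))  # 46.0
```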

Concerns raised by researchers (Hambleton, Jaeger, Koretz, Linn, Millman, & Phillips, 1995; Catterall, Mehrens, Flores, & Rubin, 1998) and the public (Fenster, 1996) about some aspects of the assessments contributed to changes in the program. In 1998, the Kentucky legislature replaced KIRIS with the Commonwealth Accountability Testing System (White, 1999), which incorporated some of the components (performance tasks, the writing portfolio) from KIRIS but replaced the mathematics portfolios with more structured performance tasks. Many factors contributed to this decision, including philosophical disagreements over the “valued outcomes” adopted for education, disputes about the correct way to teach mathematics and literacy, and a switch in the political balance in the legislature (Gong, 2009). More recently, Kentucky switched to a criterion-referenced test for NCLB reporting, the Kentucky Core Content Test (KCCT), covering math (grades 3 through 8 and 11), English language arts (grades 3 through 8 and 10), and science (grades 4, 7, and 11). These tests include a fifty-fifty weighting of multiple-choice and constructed-response items.

An on-demand writing assessment is part of the KCCT and provides students in grades 5 and 8 with the choice of two writing tasks, a narrative writing prompt and a persuasive writing prompt; students in grade 12 are given one common writing task and the choice of one of two additional writing tasks (Kentucky Department of Education, 2009). Until 2012, the KCCT assessed student achievement in writing using the writing portfolio in grades 4, 7, and 12 and an on-demand writing assessment in grades 5, 8, and 12. A four-piece portfolio was required in grade 12 and a three-piece portfolio in grades 4 and 7. The required content included samples of reflective writing, personal expressive writing/literary writing, transactive writing, and (in grade 12 only) transactive writing with an analytical or technical focus. The writing portfolio was discontinued in 2012, while the on-demand writing assessment remains in place. As of this writing, the Kentucky Department of Education was deliberating about what kind of performance assessment will take the place of the portfolio.

Maryland School Performance Assessment Program

The Maryland School Performance Assessment Program (MSPAP) was created in the late 1980s and early 1990s to assess progress toward the state’s educational reform goals. The MSPAP, first administered in 1991, assessed reading, writing, language use, mathematics, science, and social science in grades 3, 5, and 8. All of the tasks were performance based, ranging from short-answer responses to more complex, multistage responses to data, experiences, or text. Human raters scored all responses.

MSPAP tasks were innovative in several ways. Activities frequently integrated skills from several subject areas, some tasks were administered as group activities, some were hands-on tasks involving the use of equipment, and some tasks had preassessment activities that were not scored. MSPAP items were matrix sampled: every student took a portion of the exam in each subject. As a result, there was insufficient representation of content on each test form to permit reporting of student-level scores. MSPAP was designed to measure school performance, and standards-based scores (percentage achieving various levels) were reported at the school and district levels. Schools were rewarded or sanctioned depending on their performance on the MSPAP (Pearson, Calfee, Walker Webb, & Fleischer, 2002).
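The sketch below illustrates the basic logic of matrix sampling. The item blocks, roster, and simulated scores are hypothetical; the point is simply to show why this design supports school-level but not student-level reporting.

```python
from statistics import mean
import random

random.seed(0)

# Hypothetical example: three item blocks that together cover the content
# domain; forms are "spiraled" so each student takes only one block.
blocks = ["A", "B", "C"]
students = [f"s{i:02d}" for i in range(30)]
assigned = {s: blocks[i % len(blocks)] for i, s in enumerate(students)}

# Simulated block scores (0-10 points) stand in for real responses.
scores = {s: random.randint(0, 10) for s in students}

# School-level reporting: mean performance on each block, then overall.
block_means = {b: mean(scores[s] for s in students if assigned[s] == b)
               for b in blocks}
print({b: round(m, 2) for b, m in block_means.items()},
      round(mean(block_means.values()), 2))
# No student takes the full exam, so matrix sampling yields school-level
# coverage of the domain without producing individual student scores.
```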

Many of the features of MSPAP were unusual for state testing programs, and some stakeholders raised concerns about the quality of its school results. A technical review committee commissioned by the Abell Foundation in 2000 reported generally positive findings with respect to the psychometric aspects of MSPAP (Hambleton, Impara, Mehrens, & Plake, 2000). The foundation also suggested changes to remove some aspects of the program, such as the group-based preassessment activities. The technical review committee criticized the content of the tests, and some objected to the Maryland Learning Outcomes on which the test was based (Ferrara, 2009). According to the Washington Post (Shulte, 2002), MSPAP school-level scores fluctuated widely from year to year, leading the superintendent of one of Maryland’s largest districts to demand the delay of the release of the test scores until the fluctuations could be explained. Partially due to concerns with scoring and partially due to a desire (and an NCLB requirement) to have individual student scores, the Maryland School Assessment replaced MSPAP in 2002 (Hoff, 2002). The Maryland School Assessment tests reading and mathematics in grades 3 through 8 and science in grades 5 and 8 using both multiple-choice and brief constructed-response items.

Washington Assessment of Student Learning

In 1993, the Washington legislature passed the Basic Education Reform Act, including the Essential Academic Learning Requirements for Washington State students. The learning requirements defined learning goals in reading; writing; communication; mathematics; social, physical, and life sciences; civics and history; geography; arts; and health and fitness. The Washington Assessment of Student Learning (WASL) was developed to assess student mastery of these standards. WASL included a combination of multiple-choice, short-answer, essay, and problem-solving tasks. In addition, the system included classroom-based assessments in subjects not included in WASL.

WASL was implemented in fourth grade in 1996 and in other grades subsequently. Eventually it was administered in reading (grades 3 through 8 and 10), writing (grades 4, 7, and 10), mathematics (grades 3 through 8 and 10), and science (grades 5, 8, and 10). Initially listening was assessed as part of the WASL, but this test was discontinued in 2004 as part of a legislative package of changes in anticipation of WASL’s use as the high school exit exam starting with the class of 2008.

The use of WASL as the state’s high school exit exam was controversial because of the low pass rates of tenth graders, especially in mathematics (Queary, 2004). Other concerns included considerable variation in student performance (percent proficient) in reading and mathematics from year to year, and even greater variation at the strand level (Washington State Institute for Public Policy, 2006). In 2007, the governor delayed the use of the math and science sections, and in 2008 mandated that scores for the math portion of the WASL not be used.

The WASL was replaced in 2009–2010 with the Measurements of Student Progress (MSP) in grades 3 through 8 and the High School Proficiency Exam (HSPE) in grades 10 through 12. The MSP and HSPE tests include multiple-choice and short-answer questions; the essay questions have been eliminated from the reading, math, and science tests.

Interestingly, Washington uses classroom-based assessments, including performance assessments, to gauge student understanding of the Essential Academic Learning Requirements in social studies, the arts, and health and fitness. Districts must report to the state that they are implementing the assessments and strategies in those content areas, but individual student scores are not reported. (For examples of some of these assessments, see appendix A.)

California Learning Assessment System

The California Learning Assessment System (CLAS) was designed in 1991 to align the testing program with the state’s curricular content, measure students’ attainment of that content using performance-based assessment, and provide performance assessments for both students and schools (Kirst & Mazzeo, 1996). First administered in 1993, CLAS assessed students’ learning in reading, writing, and mathematics in grades 4, 8, and 10. In reading and writing, CLAS used group activities, essays, and short stories to measure students’ critical thinking. In math, students were asked to show how they arrived at their answers. The performance assessment was based not only on the annual exams but also on portfolios of student work.

Controversy over CLAS arose shortly after the first round of testing, when some school groups and parents claimed that the test items were too subjective, encouraged children to think about controversial topics, or asked about students’ feelings, which some parents said was a violation of their children’s civil rights (McDonnell, 2004; Kirst & Mazzeo, 1996). In addition, the debate in California highlighted fundamental conflict about the role of assessment in education, with policymakers, testing experts, and the public often voicing very different expectations and standards of judgment (McDonnell, 1994). The California Department of Education did not help matters when it initially declined to release sample items from the exams, citing the cost of developing new items. A series of newspaper articles and state-level committee reports were critical of the test’s sampling procedures and of the objectivity of the scoring. In 1994, the legislature reauthorized CLAS in a bill that increased the number of multiple-choice and short-answer questions to complement the performance tasks, but this change came too late to save the program; CLAS was administered for the last time later that year.

After a four-year hiatus from statewide achievement testing, the Standardized Testing and Reporting (STAR) exams began in 1998. STAR uses multiple-choice questions to measure the achievement of California content standards in English language arts, mathematics, science, and history/social science (in grades 2 through 11). Initially the STAR program used the Stanford Achievement Test (ninth edition); however, beginning in 2001 the state began to replace it with the California Standards Tests, which are largely multiple-choice tests aligned to the California standards and include a writing component at specific grade levels.

NAEP Higher-Order Thinking Skills Assessment Pilot

In 1985–1986, the National Science Foundation funded the National Assessment of Educational Progress (NAEP) to conduct a pilot test of techniques to study higher-order thinking skills in mathematics and science. Adapting tasks that had been used in the United Kingdom, NAEP developed prototype assessment activities in a variety of formats, including pencil-and-paper tasks, demonstrations, computer-administered tasks, and hands-on tasks. In all, thirty tasks were developed and piloted with about one thousand students in grades 3, 7, and 11. The hands-on tasks were designed to assess classifying, observing and making inferences, formulating hypotheses, interpreting data, designing an experiment, and conducting a complete experiment.

For example, in Classifying Vertebrae, an individual hands-on task, students were asked to sort eleven small animal vertebrae into three groups based on similarities they observed, record their groups on paper, and provide written descriptions of the features of each group. In Triathlon, a group pencil-and-paper activity, students were given information about the performances of five children on three events (Frisbee toss, weight lift, and fifty-yard dash) and asked to decide which child would be the all-around winner and to write an explanation of their reasoning.

According to NAEP, the results were promising: students “responded well to the tasks and in some cases, did quite well” (NAEP, 1987, p. 7). Older students did better than younger students, and across grade levels, students did better on tasks involving sorting and classifying than those that required determining relationships and conducting experiments. The researchers also concluded that conducting hands-on assessments was both feasible and worthwhile, although they found it to be “costly, time-consuming, and demanding on the schools and exercise administrators” (Blumberg, Epstein, MacDonald, & Mullis, 1986). Perhaps for these reasons, the hands-on items were not used in the 1990 NAEP assessment in science.

Summary

These six examples were among the more ambitious attempts to use performance assessment on a large scale in the United States during the past two decades, but many other states incorporated performance assessments in some form in their testing programs, and many continue to use performance assessments today. (See chapter 3, this volume.) The examples above were selected because they were pioneering efforts, because performance assessment played such a prominent role in each system, and because they offer lessons regarding technical quality, impact, and burden associated with performance assessment that continue to be relevant today. This brief history should not be interpreted to mean that performance assessment has no future. The demise of some of these assessments was the result of a confluence of factors unique to each time and setting. Although there are lessons to be learned from these histories (as the rest of the chapter will discuss), it would be incorrect to infer from these cases that large-scale performance assessment is infeasible or impractical. Many states are using performance assessment successfully today for classroom assessment, end-of-course testing, and on-demand assessment. Instead, the history highlights the kinds of challenges that have to be addressed if performance assessments are to be used successfully on a large scale.

RESEARCH FINDINGS

Researchers studied many of the state performance assessment initiatives to understand how these highly touted reforms operated in practice. In addition, a number of researchers undertook research and development efforts of their own. These efforts produced a rich literature on the technical quality of the assessments, their impact on practice, and their feasibility for use in large-scale assessment.

Technical Quality

Research on the technical quality of performance assessments provides information about agreement among raters (reliability of the rating process), the reliability of student scores, the fairness of performance assessment for different population groups, and the validity of scores for particular inferences. In reviewing the evidence, it is important to remember that the research was conducted in many different contexts—mathematics portfolios, hands-on science investigations, writing tasks, music performances—and the body of evidence may not be complete with respect to technical quality for any specific type of performance assessment.

Agreement among Raters

When students construct rather than select answers, human judgment must be applied to assign a score to their responses. As performance tasks become more complex (i.e., as process skills become more open and content knowledge richer), it becomes more difficult to develop scoring criteria that fully reflect the quality of student thinking (Baxter & Glaser, 1998). For example, a review of nine studies of direct writing assessment reported rater consistency estimates that ranged from 0.33 to 0.91 (Dunbar, Koretz, & Hoover, 1991). The authors speculated that rater consistency was affected by many factors, including the number of score levels in the rubric and the administrative conditions under which ratings are obtained.
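For readers unfamiliar with how such estimates are computed, the sketch below shows one common rater-consistency statistic: the correlation between two raters' scores on the same set of responses. The scores and the six-point rubric are hypothetical, illustrative assumptions only.

```python
import numpy as np

# Hypothetical data: two raters each score the same ten essays on a
# six-point rubric (1 = lowest, 6 = highest).
rater_a = np.array([4, 3, 5, 2, 6, 4, 3, 5, 1, 4])
rater_b = np.array([4, 2, 5, 3, 6, 3, 3, 4, 2, 4])

# One common rater-consistency estimate is simply the correlation between
# the two raters' scores across responses.
consistency = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(float(consistency), 2))
```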

As a broad generalization, in most cases it is possible to train qualified raters to score well-constructed, standardized performance tasks with acceptable levels of consistency using thoughtful rating criteria. Of course, the adjectives qualified, well constructed, and thoughtful are not insignificant obstacles. The keys to achieving consistency among raters on performance tasks seem to be these:

  1. Selecting raters who have sufficient knowledge of the skills being measured and the rating criteria being applied
  2. Designing tasks with a clear idea of what constitutes poor and good performance
  3. Developing scoring guides that minimize the level of inference raters must make to apply the criteria to the student work
  4. Providing sufficient training for raters to learn how to apply the criteria to real examples of student work
  5. Monitoring the scoring process to maintain calibration over time

When all of these elements are in place, it is usually possible to obtain acceptable levels of agreement among raters.

One way to achieve the third goal is to develop analytic scoring guides, which tell raters exactly what elements to look for and what score to assign to each type of response. However, success has also been achieved using holistic rules, which call for overall judgments against more global standards. Klein et al. (1998) compared analytic and holistic scoring of hands-on science tasks and found that the analytic scoring took longer and led to greater interreader consistency; however, when scores were averaged over all the questions in a task, the two methods were equally reliable. Not all methods are interchangeable, however. While item-by-item (analytic) scoring and holistic scoring yielded similar scores on mathematics performance assessments, trait scoring for conceptual understanding and communication was sensitive to different aspects of student performance (Taylor, 1998).

Nonstandardized portfolios present tougher challenges for raters. For example, in Vermont, rater consistency was very low for both the writing portfolios and the mathematics portfolios: piece-level correlations among raters in writing and mathematics averaged about 0.40 during the first two years. Even when aggregated across pieces and dimensions to produce student scores, the correlations between raters were only moderate in writing in the second year (0.60), although they were somewhat higher in mathematics in the second year (0.75). Initially Vermont was not able to achieve agreement among raters that was high enough for the scores to be useful for accountability purposes (Koretz et al., 1994). The difficulty in scoring was attributed to a number of factors, including the quality of the rubrics, the fact that the portfolios were not standardized (so raters had to apply common rubrics to very different pieces), and the large number of readers who had to be trained.

Over time, rater reliability improved (reliability for total student scores averaged 0.65 in writing and 0.85 in mathematics across the grades by 1995), suggesting that insufficient rater familiarity and training may have played a large role in unreliability in the early years.6

In Kentucky, raters assigned a single score for the portfolio as a whole (not a separate score for each piece). In the early years, rater reliability for these overall scores was comparable to reliability for total scores in Vermont: 0.67 for grade 4 writing portfolios and 0.70 for grade 8 writing portfolios.7 Although these scores were reasonably high, there was concern because on average, students received higher scores from their own teachers than from independent raters of their portfolios (Hambleton et al., 1995). In later years, scoring reliabilities improved as tasks were more clearly specified, analytic rubrics were developed, and training was strengthened. By 1996, scoring reliability for the writing portfolio had increased significantly. An independent review of 6,592 portfolios from one hundred randomly selected schools found an agreement rate of 77 percent between independent readers and the ratings given at the school level. By 2008, the agreement rate (exact or adjacent scoring) for independent readers involved in auditing school-level scores was over 90 percent (Measured Progress, 2009, p. 92).
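As an illustration of how agreement rates of this kind are calculated, the sketch below computes exact and exact-or-adjacent agreement for a hypothetical set of portfolio scores on a four-level scale; the numbers are invented and do not reproduce the Kentucky audit data.

```python
# Hypothetical data: school-assigned scores and independent audit scores for
# the same ten portfolios on a four-level scale (1 = lowest, 4 = highest).
school_scores = [3, 2, 4, 1, 3, 3, 2, 4, 2, 3]
audit_scores = [3, 3, 4, 1, 2, 3, 2, 3, 2, 4]

pairs = list(zip(school_scores, audit_scores))
exact = sum(a == b for a, b in pairs) / len(pairs)
exact_or_adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

# Exact agreement counts identical ratings; "exact or adjacent" also counts
# ratings that differ by one scale point.
print(f"exact: {exact:.0%}, exact or adjacent: {exact_or_adjacent:.0%}")
```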

The NAEP writing portfolio pilot did not have the benefit of years of rater training and auditing, and rater consistency remained problematic (National Center for Education Statistics, 1995). To facilitate comparison of performance, students were asked to include in their portfolios pieces representing particular genres (e.g., persuasive writing, descriptive writing). When readers reviewed the portfolios, they first classified each piece as to genre, and then they scored the pieces that fell into genres for which scoring rubrics had been developed. (The remainder of the portfolio was not scored.) Even with this simplification, the level of interrater consistency was only moderate: from 0.41 for persuasive writing in grade 4 to 0.68 for informative writing in grade 8.

Reliability of Student Scores

For accountability purposes, the results from multiple performance tasks may be combined to produce an overall score for each student. (This is analogous to combining results from many multiple-choice items to produce a total student score.) In an accountability context, the reliability of these student-level scores is of greater importance than the consistency of ratings, although score reliability depends in part on rater consistency. Unfortunately, research suggests that student performance can vary considerably from one performance task to the next due to unique features of the task and the interaction of those features with student knowledge and experience. This task sampling variability means that it takes a moderate to large number of performance tasks to produce a reliable score for a student (Shavelson, Baxter, & Gao, 1993; Dunbar et al., 1991; Linn, Burton, DeStafano, & Hanson, 1996). In addition, researchers have found that performance on complex tasks can vary by occasion, further complicating interpretation of student performance (Webb, Schlackman, & Sugrue, 2000).

The number of tasks needed to obtain a reliable score for a student is probably a function of the complexity of each task, the similarity among tasks,8 and the specific task-related knowledge and experiences of the student. As a result, researchers working in different contexts have reported estimates of the minimum number of performance tasks needed for a reliable student score that range from two tasks per student to well over twenty tasks per student. For example, the number of writing tasks required to obtain a score reliability of 0.8 ranged from two to ten in six studies reviewed by Koretz, Linn, Dunbar, and Shepard (1991). Three class periods of hands-on science tasks were required to produce score reliability of 0.8 (Stecher & Klein, 1997). To produce a student score with reliability of 0.85, as many as twenty-five pieces would have to be included in the student’s mathematics portfolio in Vermont (Klein, McCaffrey, Stecher, & Koretz, 1995). Over twenty mathematics performance tasks that were relatively similar would be needed to produce reliable student scores, and many more would be needed if dissimilar mathematics performance tasks were used (McBee & Barnes, 1998).
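To see why these estimates climb so quickly, the sketch below applies the standard Spearman-Brown prophecy formula. The studies cited above do not necessarily use this exact formula (several rely on generalizability analyses), and the single-task reliability values are hypothetical, but the basic arithmetic is the same: low per-task reliability drives the required number of tasks up sharply.

```python
import math

def composite_reliability(single_task_rel: float, n_tasks: int) -> float:
    """Spearman-Brown projection: reliability of an n-task composite score,
    assuming the tasks are parallel measures of the same construct."""
    r = single_task_rel
    return n_tasks * r / (1 + (n_tasks - 1) * r)

def tasks_needed(single_task_rel: float, target_rel: float) -> int:
    """Smallest number of parallel tasks needed to reach the target reliability."""
    r, t = single_task_rel, target_rel
    return math.ceil(t * (1 - r) / (r * (1 - t)))

# With a hypothetical single-task reliability of 0.30, about ten tasks are
# needed to reach 0.80; at 0.15 per task, more than twenty are needed.
print(tasks_needed(0.30, 0.80), round(composite_reliability(0.30, 10), 2))
print(tasks_needed(0.15, 0.80), round(composite_reliability(0.15, 23), 2))
```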

Thus, there is no simple answer to the question of how many performance tasks are needed to produce reliable scores. It might be possible to produce more consistent estimates by studying separately tasks with different levels of complexity and different response characteristics. In theory, score reliability could also be improved by developing tests that combined performance tasks with multiple-choice items, assuming the items were assembled to represent a domain in a conceptually sound manner.

Fairness

Some advocates of performance assessment hope that the tasks will reduce score differences between population groups that are commonly reported on multiple-choice tests. They interpret persistent group differences as evidence of inherent bias in multiple-choice tests, although researchers admonish that mean group differences are not prima facie evidence of bias, and over the years, bias reviews have removed items that show differences that are not associated with overall ability. Nevertheless, many hope that performance assessments will reduce traditional group differences. Research does not find that the use of performance assessments changes the relative performance of racial/ethnic groups, however (Linn, Baker, & Dunbar, 1991). For example, differences in scores among racial/ethnic groups on hands-on science tasks were comparable to differences on multiple-choice tests of science in grades 5, 6, and 9 (Klein et al., 1997). Similar results were obtained on NAEP mathematics assessments, which included both short, constructed-response tasks and extended-response items (Peng, Wright, & Hill, 1995).

Validity of Inferences from Performance Assessment Scores

Validity is not a quality of tests themselves, but of the inferences made on the basis of test scores. Researchers gather a variety of types of evidence to assess the validity of inferences from a given measure: for example, expert judgments about the content of the measure, comparisons among scores from similar and dissimilar measures, patterns of correlations among elements that go into the total score, comparisons with concurrent or future external criteria, and sensitivity to relevant instruction. In fact, Miller and Linn (2000) identify six aspects of validity that are relevant for performance assessments. None of the performance assessments described in this chapter has been subject to a thorough validity study encompassing all these elements. In many cases, the assessments were developed for research purposes and were not part of an operational testing program where the intended uses of the scores would be clear, but few operational programs examine validity as thoroughly as they might.

One of the challenges in establishing validity for performance assessments is lack of clarity about precisely what the assessments are intended to measure and what relationships ought to be found with other measures of related concepts. For example, would we expect to find high or low correlations between student scores on an on-demand writing task and scores on their writing portfolios? Both are measures of writing, but they are obtained in different situations. Are they measuring the same or different writing skills?9 In many situations where performance tasks have been used, students also complete multiple-choice items covering related content (this is generally the case with NAEP, for example). Yet there is often little theoretical or empirical justification for predicting how strongly scores on performance tasks and multiple-choice items in the same subject should be related, or how strongly they should be related to scores on performance tasks and multiple-choice items measuring a different subject.

For example, in the early days in Vermont, mathematics portfolio scores correlated as highly with scores on the uniform test of writing as they did with scores on the uniform test of mathematics, and researchers concluded that the portfolio scores were not of sufficient quality to be useful for accountability purposes (Koretz et al., 1994). Similarly, in Kentucky, scores on KIRIS were improving while comparable scores on NAEP and on the American College Testing program were not (Koretz & Barron, 1998), even though the state standards were generally consistent with the NAEP standards. Kentucky teachers were more likely to attribute the score gains to test familiarity, practice tests, and test preparation than to broad gains in knowledge and skills, which would have appeared on other tests of the same content.

Overall, there is insufficient evidence to warrant broad claims about the validity of performance assessments as a class. Recall that one of the primary justifications for using performance assessment is to learn things about student knowledge and skills that cannot be learned from multiple-choice tests. Yet few would argue that there is no relationship between the skills measured by performance assessments and those measured by multiple-choice tests in the same subject. Thus, psychometricians generally look for some relationship between the two measures but would not expect an extremely high correlation. This ambiguity about predicted relationships makes it difficult to establish a simple concurrent validity argument for a given performance assessment. As a result, performance assessments are often validated primarily on the basis of expert judgment about the extent to which the tasks appear to represent the constructs of interest. Even here there are complications (Crocker, 1997). As Baxter and Glaser (1998) note, it can be difficult to design performance assessments to measure complex understanding; as a corollary, it can be just as difficult to interpret evidence from complex performance assessments.

Impact of Assessment Initiatives

Tests used for standards-based accountability send signals to educators (as well as students and parents) about the specific content, styles of learning, and styles of performing that are valued. An abundance of research suggests that teachers respond accordingly, emphasizing in their lessons the content and the styles of learning and performing that are manifest on the tests.10 In reviewing this literature, Stecher (2002) concluded that “large-scale high-stakes testing has been a relatively potent policy in terms of bringing about changes within schools and classrooms.” On the positive side, high-stakes testing is associated with more content-focused instruction and greater effort on the part of teachers and students. Performance assessment, in particular, has been found to lead to greater emphasis on problem solving and communication in mathematics and to more extended writing in language arts. For example, researchers in Vermont reported that the portfolio assessment program had a powerful positive effect on instruction, leading to changes that were consistent with the goals of the developers. Mathematics teachers reported devoting more time to problem solving and communication, and they spent more time having students work in pairs or small groups (Stecher & Mitchell, 1995).

Likewise, researchers studying the Kentucky reforms found considerable evidence that teachers were changing their classroom practices to support the reform, for example, by emphasizing problem solving and communication in mathematics and writing (Koretz, Barron, Mitchell, & Stecher, 1996). Similarly, researchers in Maryland found that statewide, most mathematics teaching activities were aligned with the state standards and performance assessments, although classroom assessments were less consistent with state assessments (Parke & Lane, 2008). Teachers in Maryland reported making positive changes in instruction as a result of MSPAP, and schools in which teachers reported the most changes saw the greatest score gains (Lane, Parke, & Stone, 2002). In general, these instructional effects are not a function of the format of the test but of attaching consequences to measured student outcomes; they occur with both multiple-choice tests and performance assessments. However, Kentucky teachers were more likely to report that the open-response items and portfolios had an effect on practice, lending credence to the idea of tests worth teaching to.

On the negative side, Stecher (2002) concluded, “Many of these changes appear to diminish students’ exposure to curriculum.” This conclusion was drawn primarily from research in which teachers reported that they changed instruction in ways that narrowed the curriculum to topics covered by the tests (Shepard & Dougherty, 1991). In addition, researchers documented substantial shifts in instructional time from nontested to tested subjects and, within subjects, from nontested to tested topics. For example, teachers increased coverage of basic math skills, paper-and-pencil computations, and topics included in the tests and decreased coverage of extended project work, work with calculators, and topics not included in the test (Romberg, Zarinia, & Williams, 1989).

More recently, researchers have documented the phenomenon of educational triage, where teachers focus resources on students near the cutoff point for proficient at the expense of other students (Booher-Jennings, 2005). Although these studies were conducted in the context of multiple-choice testing, it seems fair to predict that similar effects would be observed with high-stakes performance assessments if they focused on some parts of the curriculum or some students more than others. In fact, the curriculum-narrowing problem might be exacerbated with the use of performance assessments because each task is more “memorable” than a corresponding multiple-choice item, increasing the likelihood that teachers might focus on task-specific features rather than broader skills.

Finally, the research documents instructional changes that can be associated with the format of the high-stakes test. For example, teachers engaged in excessive test preparation, with students spending between one and four weeks per year, and up to one hundred hours per class, practicing multiple-choice tests (Herman & Golan, n.d.). Others noted instances of coaching that focused on incidental aspects of the test (e.g., the orientation of the polygons) that were irrelevant to the skills that were supposed to be measured (Koretz et al., 2001). Other researchers found that teachers had students engage in activities that mimicked the format of the tests; for example, teachers had students find mistakes in written work rather than produce writing of their own (Shepard & Dougherty, 1991).

Using performance assessments rather than multiple-choice tests might reduce the prevalence of these effects because the tasks are more representative of the reasoning embodied in the standards. But performance assessment is not immune from negative effects when used in a high-stakes context. For example, teachers in Vermont were found to engage in “rubric-driven instruction,” in which they emphasized the aspects of problem solving that led to higher scores on the state rubric rather than problem solving in a larger sense (Stecher & Mitchell, 1995). Performance assessments are also more costly than multiple-choice tests (see chapters 8 and 9, in this volume), and they take more time to complete (Stecher & Klein, 1997).

Ultimately policymakers would like to know whether the benefits of performance assessment (in terms of more valid measurement of student performance and positive impact on classroom practice, for example) justify the burdens (in terms of development costs, classroom time, scoring costs, and so on). The expenditures and administrative burdens associated with performance assessments may be viewed as high relative to multiple-choice tests. Yet that is not the end of the story.

First, the benefits may justify the burdens from the perspective of education. Vermont teachers and principals thought their state’s portfolio assessment program was a “worthwhile burden.” In fact, in the first years, many schools expanded their use of portfolios to include other subjects (Koretz et al., 1994), and even in recent years, most Vermont districts have continued the use of the writing and mathematics portfolios, even though they are not used for state accountability purposes. Similarly, Kentucky principals reported that although they found KIRIS to be burdensome, the benefits outweighed the burdens (Koretz et al., 1996).

Second, as described later in this book, the costs associated with performance assessments have declined in recent years, making it more attractive to incorporate some degree of performance assessment into state testing programs. In spite of the complicated history of performance assessment, its future looks promising.

CURRENT EXAMPLES OF LARGE-SCALE PERFORMANCE ASSESSMENTS

Despite the technical and practical challenges that confront large-scale use of performance assessment, there are testing programs in operation in the United States that rely on performance assessments. Many of the contemporary state-level initiatives are described in chapter 3. I describe four performance assessment programs that operate at the national level.

Collegiate Learning Assessment

The Council for Aid to Education created the Collegiate Learning Assessment (CLA) in 2000 to help postsecondary faculty improve teaching and learning in higher education institutions (Benjamin et al., 2009). The intent of the test is to provide an assessment of the value added by the school’s instructional and other programs with respect to desired learning outcomes (Klein et al., 2007).

The CLA is entirely performance based and uses two types of tasks: performance tasks and analytic writing tasks. The performance tasks present students with problems that simulate real-world issues and provide an assortment of relevant documents, including letters, memos, summaries of research reports, newspaper articles, maps, photographs, diagrams, tables, charts, interview notes, or transcripts. Students have ninety minutes to review the materials and prepare their answers. These tasks often require students to marshal evidence from different sources; distinguish rational from emotional arguments and fact from opinion; understand data in tables and figures; deal with inadequate, ambiguous, or conflicting information; spot deception and holes in the arguments made by others; recognize information that is and is not relevant to the task at hand; identify additional information that would help to resolve issues; and weigh, organize, and synthesize information from several sources. Students’ written responses to the problems are evaluated to assess their abilities to think critically, reason analytically, solve problems, and communicate clearly and cogently.

The analytic writing tasks ask students to respond to two types of essay prompts: a make-an-argument prompt, which asks them to support or reject a position on some issue, and a critique-an-argument prompt, which asks them to evaluate the validity of an argument made by someone else.

The CLA is computer administered over the Internet, with all supporting documents contained in an online document library. Online delivery permits the performance assessments to be administered, scored, analyzed, and reported to students and their institutions more quickly and less expensively. Initially, trained readers scored the tasks using standardized scoring rubrics; starting in fall 2008, a combination of machine and human scoring was used. Scores are aggregated by institution and are not reported at the individual student level (Collegiate Learning Assessment, 2009). A recent study of machine scoring suggested that the correlations between hand and machine scoring are high enough that when the institution is the unit of analysis, machine scores alone can be relied on (Klein, 2008). Even at the student level, the correlation between machine and human scores is 0.86 (Klein et al., 2007).
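
The claim that machine scores can be relied on when the institution is the unit of analysis follows from a general statistical point: averaging over many examinees washes out much of the student-level disagreement between two scoring methods. The Python sketch below illustrates the idea with simulated data; the sample sizes and noise levels are arbitrary assumptions for illustration, not CLA parameters.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: 100 institutions, 50 examinees each.
    n_inst, n_stu = 100, 50
    inst_effect = rng.normal(0.0, 1.0, size=(n_inst, 1))                # true between-institution differences
    true_score = inst_effect + rng.normal(0.0, 1.0, size=(n_inst, n_stu))
    human = true_score + rng.normal(0.0, 1.0, size=(n_inst, n_stu))     # human rating error
    machine = true_score + rng.normal(0.0, 1.0, size=(n_inst, n_stu))   # machine scoring error

    student_r = np.corrcoef(human.ravel(), machine.ravel())[0, 1]
    institution_r = np.corrcoef(human.mean(axis=1), machine.mean(axis=1))[0, 1]

    print(f"student-level correlation:     {student_r:.2f}")     # roughly 0.67 under these assumptions
    print(f"institution-level correlation: {institution_r:.2f}")  # close to 1.0

Aggregation does not repair scoring that is biased in level, but it does help explain why a student-level correlation of 0.86 can support institution-level reporting.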

National Assessment of Educational Progress

The NAEP conducts periodic assessments in mathematics, reading, science, writing, the arts, civics, economics, geography, and US history. NAEP results are based on representative samples of students at grades 4, 8, and 12 for the main assessments or samples of students at ages 9, 13, or 17 for the long-term trend assessments (National Assessment of Educational Progress, 2009a). In all of the subject areas but writing, the NAEP items are a combination of multiple-choice and constructed-response items, which require short or extended written responses. In the three subject areas discussed next, NAEP also uses other forms of performance-based tasks.

Science Assessment Hands-On Experiments

In the 2009 Science Assessment, administered to students in grades 4, 8, and 12, a sample of students performed hands-on experiments, manipulating selected physical objects to solve a scientific problem. Specifically, half of the students in each participating school received one of three hands-on tasks and related questions. These performance tasks required students to conduct actual experiments using materials provided to them and to record their observations and conclusions in their test booklets by responding to both multiple-choice and constructed-response questions. For example, twelfth graders might be given a bag containing three different metals, sand, and salt and be asked to separate them using a magnet, sieve, filter paper, funnel, spoon, and water and to document the steps they used to do so.

Arts Assessment Creative Tasks

NAEP has used performance-based assessments of the arts since 1972 (music) and 1975 (visual arts). Both music and visual arts were assessed in 1997 and most recently in 2008; the next assessment is planned for 2016. In 1997, NAEP assessed students in four arts disciplines: dance, music, theater, and visual arts. The 2008 assessment included music and visual arts only because of budget constraints and the small percentage of schools with dance and theater programs.

The 2008 arts assessment of a sample of eighth graders used a combination of “responding” tasks (written tasks, multiple-choice items) and “creative” performance-based tasks. The music portion of the assessment was composed of responding questions only, such as listening to pieces of music and then analyzing, interpreting, critiquing, and placing the pieces in historical context. The visual arts assessment also included creative response tasks in which, for example, students were asked to create a self-portrait that was scored for identifying detail, compositional elements, and use of materials (Keiper, Sandene, Persky, & Kuang, 2009). Responding questions asked students to analyze and describe works of art and design. For example, the test asked students to describe specific differences in how certain parts of an artist’s self-portrait were drawn.

Writing Assessment

In 2007, NAEP assessed writing in a nationwide sample of eighth and twelfth graders (Salahu-Din, Persky, & Miller, 2008). Students were provided with narrative, informative, and persuasive writing prompts.

Currently NAEP scans all open-ended responses, and the scanned responses are sent to trained human readers for scoring. In 2005, NAEP examined whether the mathematics and writing exams, including the written constructed responses, could be machine scored (Sandene et al., 2005). In the mathematics exam, eight of the nine constructed-response items included in the computer-based test in grades 4 and 8 were scored automatically. For both grades, the automated scores for items requiring simple numeric entry or short text responses generally agreed as highly with the scores assigned by two human raters as the raters agreed with each other.

Questions requiring more extended text entry showed less agreement between the automated scores and the scores assigned by two human raters. The Writing Assessment presented two essay questions to eighth graders. The results showed that the automated scoring of essay responses did not agree well with the scores awarded by human readers: automated scoring produced mean scores that were significantly higher than those awarded by human readers, and the automated scores agreed with the readers less often, in both level and rank order, than the readers agreed with one another. Nonetheless, the 2011 NAEP Writing Framework called for assessment of “computer-based writing” using word processing software with commonly available tools in grades 8 and 12 (National Assessment of Educational Progress, 2009b).
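
The distinction between agreement “in level” (identical scores) and agreement “in rank order” (which essays are judged better than others) matters for interpreting these results, because an automated scorer can rank essays much as a human does while still assigning systematically higher scores. The sketch below uses invented scores on a 1-to-6 scale purely to illustrate the two agreement measures; it does not reproduce the NAEP analysis or its data.

    import numpy as np
    from scipy.stats import spearmanr

    # Invented essay scores on a 1-6 rubric; not NAEP data.
    reader_a  = np.array([3, 4, 2, 5, 3, 4, 1, 6, 3, 2])
    reader_b  = np.array([3, 4, 3, 5, 3, 4, 2, 6, 3, 2])
    automated = np.array([4, 5, 3, 6, 4, 5, 2, 6, 4, 3])   # systematically about one point higher

    def exact_agreement(x, y):
        """Share of essays receiving identical scores (agreement 'in level')."""
        return float(np.mean(x == y))

    print("reader vs. reader:    level", exact_agreement(reader_a, reader_b),
          " rank", round(spearmanr(reader_a, reader_b)[0], 2))
    print("automated vs. reader: level", exact_agreement(reader_a, automated),
          " rank", round(spearmanr(reader_a, automated)[0], 2))
    print("mean scores:", reader_a.mean(), reader_b.mean(), automated.mean())

In the NAEP study, the automated scores fell short of the readers on both measures, which is one reason human scoring of essays was retained.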

National Occupational Competency Testing Institute

The National Occupational Competency Testing Institute (NOCTI) is a nonprofit organization founded in the early 1970s to coordinate and lead the efforts of the states in developing competency tests for occupational programs (NOCTI, 2009). Today NOCTI provides “job-ready” assessments to measure the skills of an entry-level worker or a student in secondary or postsecondary career and technical programs. Most job-ready assessments include both multiple-choice and performance components. For example, the Culinary Arts Cook II Assessment includes, as a performance component, preparing a recipe for chicken with sauce; the task is scored on organization, knife skills, use of tools and equipment, preparation of the chicken and sauce, safety and sanitation procedures, and the appearance and taste of the finished product. The Business Information Processing Assessment includes a performance task requiring the student to create a spreadsheet; the task is scored on header and placement, spreadsheet and column headings, data entry, formula entry, computation of totals, use of functions, formatting, creating a pie chart, saving the spreadsheet, printing materials, and overall timeliness of job completion. More than seventy job-ready assessments are available in sixteen industry and occupational categories. The performance assessments may be scheduled over one to three days (NOCTI, 2009). Depending on the subject area and the test site, they range in cost from less than one hundred dollars to as much as seven hundred dollars per student.

Performance Assessments of Teachers

The previous examples all have low stakes for the students being assessed, but performance assessments are also being used in cases where the examinee’s performance has important consequences. The National Board for Professional Teaching Standards uses a variety of assessment center exercises as well as an annotated portfolio to certify teachers with advanced teaching skills, a credential that often qualifies them for salary bonuses (National Research Council, 2008). The portfolios include teachers’ plans, videotapes of their instruction, and samples of student work illustrating learning over time, along with teachers’ commentary explaining their decisions and interpreting the outcomes.

California requires teacher preparation programs to use a similar teaching performance assessment for beginning teachers as one component in their credentialing decision. The Performance Assessment for California Teachers (PACT), modeled on the National Board portfolio, has been found to be a valid measure for this purpose (Pecheone & Chung, 2006). In addition, teachers’ scores on both the National Board assessment and the PACT have been found to predict teachers’ effectiveness in supporting gains in student achievement (Darling-Hammond, Newton, & Wei, 2013). A similar assessment, the edTPA, developed from the PACT model, is now beginning to be used in many states across the country (Darling-Hammond, 2012).

PERFORMANCE ASSESSMENT IN THE CONTEXT OF STANDARDS-BASED ACCOUNTABILITY

This review suggests that large-scale testing for accountability in the United States could be enhanced by the thoughtful incorporation of standardized performance assessment. The enhancements would come from better representation of academic content standards, particularly those describing higher-order, cognitively demanding performance; from clearer signals to teachers about the kinds of student performances that are valued; and from reduced pressures to mimic the multiple-choice frame of mind in classroom instruction.

How Performance Assessments May Be Best Used

The appropriate role for performance assessments should be determined in part by an analysis of content standards. Such an analysis should reveal which standards are served well by which types of assessments. To the extent that the standards call for mastery of higher-order, strategic skills, they may favor the use of performance assessment. Perhaps more important, to the extent that standards are silent about the nature of the performances expected of students, they cede responsibility for these decisions to others. Thus, it may be important to revisit standards documents to make sure they provide adequate guidance with respect to desired student performance.

The research reminds us that subject domains are different, and mastery of each domain is manifest in unique ways. Rich, thoughtful, integrated writing can be observed under different circumstances and in different ways from rich, thoughtful, integrated scientific inquiry or rich, thoughtful, integrated musical performance. When the Mikado sings, “Let the punishment fit the crime” (Gilbert & Sullivan, 1885), the educator should reply, “Let the assessment fit the domain.”

There are costs and benefits associated with testing for accountability in whatever form that testing takes. We are used to the current high-stakes, multiple-choice model, but that does not mean it is cost free or benefit rich. Adopting performance assessments for some or all accountability testing will have trade-offs, and we are more likely to make wise decisions if we understand these trade-offs better. Performance assessments themselves differ; the costs and benefits associated with short-answer, fill-in-the-blank items are not the same as those associated with prompted writing tasks, equipment-rich investigations, or judged real-time performances.

At the same time, computer-based scoring of tasks with rich responses (like the Collegiate Learning Assessment described earlier) has reached levels of agreement with human scoring that make a greater range of options both possible and affordable. (See chapter 5 for further discussion of computer scoring of open-ended tasks.) In addition, costs could be controlled by using the scores from performance tasks as indicators of achievement at the school level rather than the individual level. This approach is consistent with current NCLB and state accountability systems, in which the primary unit of accountability is the school.11

If this situation changes—for example, if incentives are assigned to individual teachers, as they are in pay-for-performance schemes—it would still be possible to use a matrix sampling approach within classrooms to offset some of the cost increases associated with performance assessment. Student-level accountability policies (such as testing for promotion or graduation) would probably require universal administration of performance tasks; this would be more costly overall, although not more complicated logistically. Further analysis will be necessary to estimate the magnitude of the additional costs under different performance assessment scenarios. The adoption of the Common Core State Standards by more than forty states could also reduce costs by permitting wider use of common tasks. (See chapters 8 and 9, this volume, for discussion of the costs and benefits of multistate assessment development.)
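
As a concrete illustration of the matrix sampling idea mentioned above, the Python sketch below assigns each student in a class one task from a larger set so that, across the class, every task is covered roughly equally while no student bears the full testing burden. The function and the simple round-robin design are hypothetical simplifications for illustration, not a description of any operational program.

    import random
    from collections import Counter

    def matrix_sample(student_ids, task_ids, seed=0):
        """Assign one task per student, cycling through the task list so that
        every task receives roughly the same number of students (matrix sampling)."""
        rng = random.Random(seed)
        students = list(student_ids)
        rng.shuffle(students)                      # randomize which student gets which task
        return {s: task_ids[i % len(task_ids)]     # round-robin over the task list
                for i, s in enumerate(students)}

    # Hypothetical class of 24 students and 6 performance tasks.
    students = [f"S{k:02d}" for k in range(1, 25)]
    tasks = [f"Task {t}" for t in "ABCDEF"]
    plan = matrix_sample(students, tasks)

    print(Counter(plan.values()))   # each task assigned to 4 students
    # Scores would then be aggregated at the classroom or school level, not per student.

Because each student sees only one task, the resulting scores are meaningful only in the aggregate, which is why matrix sampling pairs naturally with classroom- or school-level accountability.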

Messick (1994) distinguishes between task-centered performance assessment, which begins with a specific activity that may be valued in its own right (e.g., an artistic performance) or from which one can score particular knowledge or skills, and construct-centered performance assessment, which begins with a particular construct or competency to be measured and creates a task in which it can be revealed. Research suggests it could be more productive to concentrate performance assessment for accountability on construct-centered tasks derived from academic content standards and to leave for classroom use the more task-centered activities that may be engaging and stimulate student learning but do not represent a clear, state-adopted learning expectation. In addition, it would be wise to eschew the use of unstructured portfolios in large-scale assessment because they are difficult to score reliably and because it is difficult to interpret the scores once obtained.

The Problem of High Stakes

It would also be wise to remember that Campbell’s Law applies to performance assessments as well as multiple-choice tests (Campbell, 1979). When stakes are attached to the scores, teachers will feel pressure to focus narrowly on improving performance on specific tasks, which will undermine the interpretability of scores from those tasks. While performance assessments may be more “worth teaching to” than multiple-choice tests, performance tasks still represent just a sample of behaviors from a larger domain. Strategies that are used to reduce corruption with multiple-choice tests, including changing test forms regularly and varying the format of tasks and the representation of topics, will be equally useful to reduce corruption with performance assessments.

The recent history of performance assessment at the state level raises some concerns about using these tasks in high-stakes contexts. States at the forefront of the performance assessment movement sometimes found that it was difficult to garner and sustain public support for these “new” forms of testing. While some of the problems that states encountered were due to difficulties with scoring, reliability, and validity, others came from energized stakeholder groups that objected to the manner in which they were implemented. In some states, people objected because the assessments were unfamiliar and stretched the boundaries of traditional testing. In others, the assessments were implemented in ways that did not adequately answer parents’ questions or address their concerns. For example, some parents worried that if students were asked their views on a topic, they would be graded on the substance of their opinions rather than on their reasoning or writing skills.

McDonnell (2009) characterized these problems as disputes about “the cultural and curricular values embodied in the standards and assessments” (p. 422). Conflicts over values are not easily resolved, but better communication and dissemination of information might help to forestall them. Educators and policymakers may underestimate the need for efforts to inform the public. For example, despite the endless discussion of NCLB in the education community since 2001, a majority of the general public reports that it is not very familiar with the law (Bushaw & Gallup, 2008). History suggests that educators would be wise to clearly describe the role of performance assessments and make extra efforts to educate parents and the general public about changes in the testing program before they are adopted.

In their 1999 review, Madaus and O’Dwyer concluded that “the prognosis for the feasibility of deploying a predominantly performance-assessment oriented system for high-stakes decisions about large numbers of individual students is not very promising in light of the historical, practical, technical, ideological, instructional and financial issues described above” (p. 694). While it is still the case that a standards-based accountability system relying primarily on performance assessment seems unlikely in the United States, we can be more optimistic about the possibility of using performance assessment as part of a large-scale testing program in combination with less open-ended item formats. The successful use of structured performance assessments on a large scale in low-stakes contexts such as the Programme for International Student Assessment and NAEP suggests that practical and logistical problems can be overcome and that performance tasks can enhance our understanding of student learning.

If such progress is to be made, more research will be needed to establish a “foundation for assessment design that is grounded in a theory of cognition and learning and methodological strategies for establishing the validity of score use and interpretation in terms of the quality and nature of the observed performance” (Baxter & Glaser, 1994, p. 44). Such work may need to be done in each of the major disciplines to be sensitive to discipline-specific ways of knowing, thinking, and acting.

RECOMMENDATIONS

I offer the following recommendations to educators and policymakers considering the future role of performance assessment in large-scale testing in the United States:

  • Set reasonable expectations. Performance assessment is not a panacea for the ills of American education, but it can improve our understanding of what students know and can do, and it can help educators focus their efforts on bolstering critical skills among American youth. Performance assessments are more likely to succeed when operating in tandem with selected-response items than when replacing them completely.
  • Let the standards inform the assessments. The use of performance assessments should be linked clearly to state academic content standards, so that they have a strong warrant for inclusion and a clear reference for inferences. As Linn (2000) observed, “content standards can, and should, if they are to be more than window dressing, influence both the choice of constructs to be measured and the ways in which they are eventually measured” (p. 8). This is not an unreasonable demand, and there is evidence that researchers and practitioners can work together toward this goal. For example, Niemi, Baker, and Sylvester (2007) describe a seven-year collaboration with a large school district to develop and implement performance assessments connected to explicit learning goals and standards.
  • Revise standards so they better support decisions about assessments, and revise test specifications accordingly. Like the curriculum frameworks and syllabi in many countries, standards should include descriptions of how knowledge and skills would be manifest in student performance. The Common Core State Standards in English language arts and mathematics and the Next Generation Science Standards make progress on this agenda. States, individually and in consortia, should revise their test specifications to clearly delineate the role of performance assessment.
  • Clearly delimit the role of performance assessments in ways that help the public understand their relevance and value in making judgments about student performance. Provide adequate information (including sample items) to educate parents about the nature of performance tasks, their role in testing, and the way the results should be interpreted.
  • Invest in the development of a new generation of performance tasks. Previous efforts demonstrated the creativity of researchers and test developers, but they were not always well integrated with standards-based systems. One goal of such efforts should be to develop multiple approaches to measuring particular skills; having more than one format for measuring each construct helps to avoid test preparation focused on incidental aspects of task format. This work might be facilitated by encouraging states to pool their efforts to develop performance tasks. Joint development efforts will reduce unit costs, broaden the applicability of the tasks, and provide information across a larger universe of students.
  • Provide instructional support materials for teachers. When performance assessments are included in statewide testing, it is important to develop and make available support materials for teachers, including descriptions of skills assessed, sample lessons for teaching those skills, and sample tasks to use locally to judge student performance. As NAEP (1987) found, “Teachers need the political, financial and administrative support that will allow them to concentrate on developing ideas and building up the process skills necessary for students to learn to solve problems and accomplish complex tasks” (p. 7).
  • Support research and development to advance the science of performance assessment. This should include efforts to develop performance assessment models to facilitate new task development and research into automated delivery and scoring to reduce costs. A relatively simple but important task is to develop clearer terminology. Having a clearer vocabulary to differentiate among performance tasks with respect to format, cognitive demand, and other aspects will facilitate thoughtful discussion and policymaking and avoid misapplication of lessons from the past.

NOTES
