Chapter 5
Performance Assessment: The State of the Art

Suzanne Lane

This chapter addresses design, scoring, and psychometric advances in performance assessments that allow for the evaluation of twenty-first century skills. Although I discuss performance assessments for classroom purposes, the focus of this chapter is on the use of performance assessments in large-scale assessment programs. I consider advances in the integration of cognitive theories of learning and measurement models with assessment design, scoring, and interpretation.

The chapter begins with a discussion of advances in the design of performance assessments, including a description of the important learning outcomes that can be assessed by performance assessments but not by other assessment formats. The second section discusses advances in the scoring of performance assessments, including both the technical and substantive advances in automated scoring methods that allow for timely scoring of student responses to innovative item types. The third section addresses issues related to the validity and fairness of the use and interpretation of scores derived from performance assessments. The type of evidence needed to support the validity of score interpretations and use, such as content representation, cognitive complexity, fairness, generalizability, and consequential evidence, is discussed. The last section briefly addresses additional psychometric advances in performance assessments, including advances in measurement models used to capture student performance and rater inconsistencies as well as advances in linking performance assessments.

DESIGN OF PERFORMANCE ASSESSMENTS

In the design of any assessment, developers should first delineate the type of score inferences they want to make. This includes deciding whether they want to generalize to a larger construct domain or provide evidence of a particular accomplishment or performance. The former requires sampling tasks from the domain to ensure content representativeness, which will contribute to the validity of the score generalizations. This approach is typically used in the design of large-scale assessments. The latter approach, performance demonstration, requires the specification of a performance assessment that allows the demonstration of a broader ability or performance, which is similar to a “merit badge” approach. This approach is commonly used for classroom purposes such as a high school project or paper.

This section focuses on design issues that need to be considered to ensure that performance assessments are capable of eliciting the cognitive processes and skills that they are intended to measure in order to ensure coherence among curriculum, instruction, and assessment. Advances in the design of computer-based simulation tasks, the use of learning progressions, and the management of expert review and field testing are discussed. Throughout this section, examples of performance assessments are also provided, some of which have been used in large-scale assessment programs.

Goals of Performance Assessments

Performance assessments can measure students’ cognitive thinking and reasoning skills and their ability to apply knowledge to solve realistic, meaningful problems. They are designed to more closely reflect the performance of interest, allow students to construct or perform an original response, and use predetermined criteria to evaluate student work. The close similarity between the performance that is assessed and the performance of interest is the defining characteristic of a performance assessment (Kane, Crooks, & Cohen, 1999). As stated by the Standards for Educational and Psychological Testing, performance assessments attempt to “emulate the context or conditions in which the intended knowledge or skills are actually applied” (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999, p. 137).

As this definition indicates, performance assessments do not have to assess complex reasoning and problem-solving skills. For example, if the targeted domain is the speed and accuracy at which students can keyboard, a measure that captures the accuracy and speed of students’ keyboarding would be considered a performance assessment. Clearly keyboarding is not a high-level thinking skill but a learned, automated procedural skill. However, the performance assessments we treat in this chapter are those designed to assess complex reasoning and problem-solving skills in academic disciplines and that can be used for large-scale assessments.

The Maryland School Performance Assessment Program (MSPAP) was an excellent example of a performance assessment that consisted of interdisciplinary tasks that assessed problem-solving, reasoning, and evaluation skills (Maryland State Board of Education, 1995). As an example for a grade 5 science MSPAP task, students were asked to investigate how a hydrometer can be used to measure different levels of saltiness (salinity) in various water samples, predict how the hydrometer might float in mixtures of freshwater and saltwater, and determine how the hydrometer could be used to establish the correct salinity for an aquarium. This hands-on task allowed students to conduct several investigations, make predictions, evaluate their work, and provide explanations for their responses.

When working on well-designed performance tasks, students may be engaged in applying their knowledge to real-world problems, evaluating different approaches to solving problems, and providing reasoning for their solutions. A prevailing assumption underlying performance assessments is that they serve as motivators in improving student achievement and learning and that they encourage instructional strategies that foster reasoning, problem solving, and communication (Frederiksen & Collins, 1989; National Council on Education Standards and Testing, 1992; Resnick & Resnick, 1982). Performance assessments can not only link school activities to real-world experiences (Darling-Hammond, Ancess, & Falk, 1995); they can include opportunities for self-reflection and collaboration as well as student choice, such as choosing a particular topic for a writing assignment (Baker, O’Neil, & Linn, 1993; Baron, 1991).

Collaborative efforts were required on the MSPAP (Maryland State Board of Education, 1995) in that students worked together on conducting science investigations and evaluated each other’s essays. Collaboration is required on many performance assessments in other countries (see chapter 4, this volume). It can be argued that these types of collaborations on performance assessments better reflect skills required in the twenty-first century.

Performance assessments may also allow a particular task to yield multiple scores in different content domains, which has practical as well as pedagogical appeal. Tasks that are designed to elicit scores in more than one content domain may not only reflect a more integrated approach in instruction, but also motivate a more integrated approach to learning. As an example, MSPAP tasks were integrated and would yield scores in two or more domains (Maryland State Board of Education, 1995). Practical implications for tasks that afford multiple scores may be reduced time and cost for task development, test administration, and scoring by raters (Goldberg & Roswell, 2001). It is important, however, to provide evidence that each score represents the construct it is designed to measure and does not include construct-irrelevant variance (Messick, 1989).

Furthermore, Mislevy (1996) pointed out that performance assessments may enable better measurements of change, both quantitative and qualitative. A hypothetical example of a quantitative measure of change for a mathematics performance assessment would be that a student used a recursive strategy only two out of ten times during the first administration of a math assessment, but six out of ten times during the second administration. An example of a qualitative evaluation of change would be that the student switched from a less effective strategy to the more sophisticated recursive strategy after instruction.
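To make this hypothetical example concrete, the brief sketch below (in Python; the strategy labels and counts are invented for illustration, not drawn from an actual assessment) shows one way strategy-use records from two administrations could be summarized both quantitatively and qualitatively.

```python
# Hypothetical sketch: summarizing change in strategy use across two
# administrations of a mathematics performance assessment.
# All task data and strategy labels are invented for illustration.

from collections import Counter

first_admin = ["guess-and-check"] * 8 + ["recursive"] * 2   # 2 of 10 tasks
second_admin = ["guess-and-check"] * 4 + ["recursive"] * 6  # 6 of 10 tasks

def strategy_profile(strategies):
    """Proportion of tasks on which each strategy was used."""
    counts = Counter(strategies)
    total = len(strategies)
    return {name: count / total for name, count in counts.items()}

# Quantitative change: the proportion of tasks solved with the recursive strategy.
before = strategy_profile(first_admin).get("recursive", 0.0)
after = strategy_profile(second_admin).get("recursive", 0.0)
print(f"Recursive strategy use rose from {before:.0%} to {after:.0%}")

# Qualitative change: whether the student's dominant strategy shifted
# from a less effective strategy to a more sophisticated one after instruction.
dominant_before = Counter(first_admin).most_common(1)[0][0]
dominant_after = Counter(second_admin).most_common(1)[0][0]
print(f"Dominant strategy: {dominant_before} -> {dominant_after}")
```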

Performance assessments, in particular, writing assessments, have been included in some large-scale assessment programs in the United States for monitoring students’ progress toward meeting national or state content standards, promoting educational reform, and holding schools accountable for student learning. High-stakes decisions are typically required, as well as an evaluation of changes in performance over time, which requires a level of standardization of the content to be assessed, the administration of the assessment, and the scoring of student performances over time. Thus, extended time periods, collaborative work, choice of task, and use of ancillary material may challenge the standardization of the assessment and, consequently, the accuracy of score interpretations. Large-scale performance assessment programs such as MSPAP, however, have included these attractive features of performance assessments while ensuring the quality and validity of the score interpretations at the school level.

Another consideration in the design of assessments of complex skills is whether a portfolio approach will be used. The Advanced Placement (AP) Studio Art portfolios provide an excellent example of a large-scale portfolio assessment that has been sustained over time (Myford & Mislevy, 1995). As an example, in the three-dimensional design portfolio, students are required to submit a specified series of images of their three-dimensional artworks, which are evaluated independently according to their quality (demonstration of form, technique, and content), breadth (demonstration of visual principles and material techniques), and concentration (demonstration of depth of investigation and process of discovery). Using a well-delineated scoring rubric for each of these three areas, from three to seven artist-educators evaluate the submitted images. The portfolios that are submitted are standardized in that specific instructions are provided to students that specify what type of artwork is appropriate, along with detailed scoring rubrics that delineate what is expected for each of the dimensions being assessed.

Performance assessments that are aligned with curriculum and instruction can provide valuable information to guide the instructional process. Thus, it is imperative to ensure that both classroom and large-scale assessments are aligned to the curriculum and instruction. A rich, contextualized curriculum-embedded performance assessment that is used for classroom purposes does not require the same level of standardization as the typical large-scale performance assessment. These curriculum-embedded assessments allow teachers to more fully discern the ways in which students understand the subject matter and can help guide day-to-day instruction. Large-scale assessments can also inform instruction, but at a broader level for both individual students and groups of students (e.g., classrooms).

Cognitive Theories in the Design of Performance Assessments

The need for models of cognition and learning and quantitative psychometric models to be used together to develop and interpret achievement measures has been widely recognized (Embretson, 1985; Glaser, Lesgold, & Lajoie, 1987; National Research Council, 2001). When we understand how individuals acquire and structure knowledge and cognitive skills and how they perform cognitive tasks, we are better able to assess their reasoning and obtain information that will lead to improved learning. Substantial theories of knowledge acquisition are needed in order to design assessments that can be used in meaningful ways to guide instruction and monitor student learning.

Several early research programs that have had a direct impact on the assessment of achievement studied the difference between experts’ and novices’ knowledge structures (Simon & Chase, 1973; Chi, Feltovich, & Glaser, 1981). Much of what is known in the development of expertise is based on studies of students’ acquisition of knowledge and skills in content domains. For example, Chi and her colleagues (1981) demonstrated that an expert’s knowledge in physics is organized around central principles of physics, whereas a novice’s knowledge is organized around the surface features represented in the problem description.

Other early approaches that link cognitive models of learning and psychometrics have drawn on work in the area of artificial intelligence (e.g., Brown & Burton, 1978). As an example, Brown and Burton (1978) represented the complex procedures underlying addition and subtraction as a set of component procedural skills, and described proficiency in terms of these procedural skills. Using artificial intelligence, they were able to uncover procedural errors or bugs in students’ performance that represented misconceptions in their understanding.

Cognitive task analysis using experts’ talk-alouds (Ericsson & Smith, 1991) has also been used to design performance assessments in the medical domain (Mislevy, Steinberg, Breyer, Almond, & Johnson, 2002). Features of the experts’ thinking, knowledge, procedures, and problem posing are considered to be indicators of developing expertise in the domain (Glaser et al., 1987) and can be used systematically in the design of assessment tasks. These features can then be used in the design of the scoring rubrics by embedding them in the criteria at each score level. Experts need not be professionals in the field. Instead, in the design of K–12 assessments, the experts are typically students who have attained competency within the content domain.

While many researchers recognize that theories of cognition and learning should ground the design and interpretation of assessments, widespread use of cognitive models of learning in assessment design has not been realized. As summarized by Bennett and Gitomer (2009), there are three primary reasons for this: (1) the disciplines of psychometrics and of cognition and learning that have developed separately are just beginning to merge, (2) theories of the nature of proficiency and learning progressions are not fully developed, and (3) economic and practical constraints affect what is possible.

Some promising assessment design efforts, however, are taking advantage of what has been learned about the acquisition of student proficiency. A systematic approach to designing assessments that reflect theories of cognition and learning is embodied in Mislevy and his colleagues’ (Mislevy, Steinberg, and Almond, 2003) evidence-centered design (ECD), in which evidence observed in student performances on complex problem-solving tasks (that have clearly articulated cognitive demands) is used to make inferences about student proficiency. Some of these design efforts are discussed in this chapter.

A Conceptual Framework for Design

A well-designed performance assessment begins with the delineation of the conceptual framework. The extent to which the conceptual framework considers cognitive theories of student proficiency and is closely aligned to the relevant curriculum will affect the validity of score interpretations. The delineation of the conceptual framework includes a description of the construct to be assessed, the purpose of the assessment, and the intended inferences to be drawn from the assessment results (Lane & Stone, 2006). Construct theory as a guide to the development of an assessment provides a rational basis for specifying features of assessment tasks and scoring rubrics, as well as for expecting certain empirical evidence, such as the extent of homogeneity of item responses and the relationship between scores with other measures (Messick, 1994; Mislevy, 1996; National Research Council, 2001).

Two general approaches to designing performance assessments have been proposed: a construct-centered approach and a task-centered approach (Messick, 1994). We focus here on the construct-centered approach, in which developers start by identifying the set of knowledge and skills that are valued in instruction and need to be assessed and then identify the performances or responses that should be elicited by the assessment. Thus, the construct guides the development of the tasks as well as the specification of the scoring criteria.

This process allows the designer to pay attention to both construct underrepresentation and construct-irrelevant variance, which may have an impact on the validity of score inferences (Messick, 1994). Construct underrepresentation occurs when the assessment does not fully capture the targeted construct, and therefore the score inferences may not be generalizable to the larger domain of interest. Construct-irrelevant variance occurs when one or more irrelevant constructs is being assessed in addition to the intended construct. For example, students’ writing ability may have an unwanted impact on performance on a mathematics assessment. (This is further treated in the section in this chapter on validity and fairness.) Scores derived from a construct-centered approach may be more generalizable across tasks, settings, and examinee groups than scores derived from a task-centered approach because of the attention to reducing construct-irrelevant variance and increasing the representation of the construct (Messick, 1994).

As an example, a construct-centered approach was taken in the design of a mathematics performance assessment that required students to show their solution processes and explain their reasoning (Lane, 1993; Lane et al., 1995). Cognitive theories of student mathematical proficiency provided a foundation for defining the construct domain. The conceptual framework that Lane and her colleagues proposed then guided the design of the performance tasks and scoring rubrics. Four components were specified for the task design: cognitive processes, mathematical content, mode of representation, and task context. To reflect the construct domain of mathematical problem solving, reasoning, and communication, for example, a range of cognitive processes was specified, including discerning mathematical relations, using and discovering strategies and heuristics, formulating conjectures, and evaluating the reasonableness of answers. Performance tasks were then developed to assess one or more of these skills.
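As a hedged illustration of how the four task-design components might be recorded so that each task's intended demands are explicit, the sketch below uses a simple Python structure; the field names and the example task are assumptions for illustration, not taken from Lane's actual specification documents.

```python
# Illustrative sketch of a construct-centered task specification following
# the four design components described above. Field values are invented.

from dataclasses import dataclass
from typing import List

@dataclass
class TaskSpecification:
    task_id: str
    cognitive_processes: List[str]   # e.g., discerning relations, formulating conjectures
    mathematical_content: str        # content strand the task samples
    representation: str              # mode of representation (verbal, graphical, symbolic)
    context: str                     # real-world setting of the task

pattern_task = TaskSpecification(
    task_id="M5-017",
    cognitive_processes=["discerning mathematical relations",
                         "formulating conjectures",
                         "evaluating the reasonableness of answers"],
    mathematical_content="patterns and functions",
    representation="tabular and verbal",
    context="planning seating arrangements for a school event",
)

# An explicit specification supports checks of content representativeness:
# tasks can be tallied by process and content to evaluate domain coverage.
print(pattern_task.cognitive_processes)
```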

Test specifications need to clearly articulate the cognitive demands of the tasks, the problem-solving skills and strategies that can be employed, and the criteria used to judge performance. This includes the specification of knowledge and strategies that are not only linked closely to the content domain but also are content domain independent (Baker, 2007). Carefully crafted and detailed test specifications are even more important for performance assessments than for multiple-choice tests because there are fewer performance tasks and typically each is designed to measure something relatively unique (Haertel & Linn, 1996). The use of detailed test specifications can also help ensure that the content of the assessment is comparable across forms within a year and across years so as to support measurement of change over time. The performance tasks and scoring rubrics are then developed iteratively based on a well-delineated conceptual framework and test specifications (Lane & Stone, 2006).

The use of conceptual frameworks in designing performance assessments leads to assessments that are linked to educational outcomes and provide meaningful information to guide curriculum and instructional reform. For large-scale educational assessments, the conceptual framework is typically defined by content standards delineated at the state or national level. The grain size at which the content standards are specified affects whether narrow bits of information or broader, more contextualized understanding of the content domain will be assessed. This is because the content standards guide the development of the test specifications that include the content, cognitive processes and skills, and psychometric characteristics of the tasks.

Degree of Structure in Task Design

Underlying performance assessments is a continuum that represents different degrees of structure versus open-endedness in the response (Messick, 1996). The degree of structure for the problem posed and the response expected should be considered in the design of performance assessments. Baxter and Glaser (1998) characterized performance assessments along two continua with respect to their task demands. One continuum represents the task demand for cognitive processes, ranging from open to constrained, and the other continuum represents the task demand for content knowledge, from rich to lean. A task is process open if it promotes opportunities for students to develop their own procedures and strategies, and a task is content rich if it requires substantial content knowledge for successful performance. These two continua are crossed to form four quadrants so that tasks can be designed to fit one or more of these quadrants or considered along a progression within a content area.

This model allows cognitive and content targets to be clearly articulated in task design and allows tasks to be evaluated in terms of their alignment with these targets (Baxter & Glaser, 1998). In the design of performance assessments that assess complex cognitive thinking skills, design efforts can be aimed primarily at the quadrant that reflects tasks that are process open and content rich; however, familiarity with these types of tasks in instruction and the age of the student need to be considered in design efforts.
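The sketch below illustrates the crossing of the two continua, simplified here to binary categories; the task descriptions are invented, and the simplification is mine rather than Baxter and Glaser's.

```python
# Sketch of Baxter and Glaser's (1998) two task-demand continua, treated here
# (as a simplification) as binary categories that cross to form four quadrants.
# Task entries are invented for illustration.

def quadrant(process_open: bool, content_rich: bool) -> str:
    """Classify a task by its process and content demands."""
    process = "process open" if process_open else "process constrained"
    content = "content rich" if content_rich else "content lean"
    return f"{process} / {content}"

tasks = {
    "design an experiment on plant growth": (True, True),
    "follow a given lab procedure": (False, True),
    "invent a sorting game with blocks": (True, False),
    "recall a vocabulary definition": (False, False),
}

for name, (p_open, c_rich) in tasks.items():
    print(f"{name}: {quadrant(p_open, c_rich)}")
```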

Task Models

Task models, sometimes called templates or task shells, help ensure that the cognitive skills that are of interest are assessed. These models can enable the design of tasks that assess the same cognitive processes and skills, and a scoring rubric can then be designed for the tasks that can be generated from a particular task model. The use of task models for task design allows an explicit delineation of the cognitive skills to be assessed and can improve the generalizability of the score inferences.

Baker (2007) proposed a model-based assessment approach that uses task models. The major components of the model are the cognitive demands of the task, criteria for judging performance derived from analyses of competent performance, and a content map that describes the subject matter, including the interrelationships among concepts and the most salient features of the content. The cognitive demands of the tasks can then be represented in terms of families of tasks such as reasoning, problem solving, and knowledge representation tasks (Baker, 2007).

As an example, the explanation task model asks students to read one or more texts that require some prior knowledge of the subject domain, including concepts, principles, and declarative knowledge, in order to understand them and to evaluate and explain important issues introduced in the text (Niemi, Baker, & Sylvester, 2007). Following is a task from the explanation family that was developed for assessing student proficiency in Hawaii:

Imagine you are in a class that has been studying Hawaiian history. One of your friends, who is a new student in the class, has missed all the classes. Recently, your class began studying the Bayonet Constitution. Your friend is very interested in this topic and asks you to write an essay to explain everything that you have learned about it.

Write an essay explaining the most important ideas you want your friend to understand. Include what you have already learned in class about Hawaiian history and what you have learned from the texts you have just read. While you write, think about what Thurston and Liliuokalani said about the Bayonet Constitution, and what is shown in the other materials.

Your essay should be based on two major sources:

  1. The general concepts and specific facts you know about Hawaiian history, and especially what you know about the period of Bayonet Constitution
  2. What you have learned from the readings yesterday. (Niemi et al., 2007, p. 199)

Prior to receiving this task, students were required to read the primary source documents referred to in the prompt. This task requires students to not only make sense of the material from multiple sources, but to integrate material from these sources in their explanations. This is just one example of a task that can be generated from the explanation task model. Task models can also be used to design computer-based simulation tasks.

Design of Computer-Based Simulation Tasks

Computer-based simulations have made it possible to assess complex thinking skills that cannot be measured well by more traditional assessment methods. Using extended, integrated tasks, a large problem-solving space with various levels of complexity can be provided in an assessment (Vendlinski, Baker, & Niemi, 2008). Computer-based simulation tasks can assess students’ competency in formulating, testing, and evaluating hypotheses; selecting an appropriate solution strategy; and, when necessary, adapting strategies based on the degree to which a solution has been successful. An attractive feature of computer-based simulation tasks is that they can include some form of immediate feedback to the student based on the course of actions he or she takes.

Other important features of computer-based simulations are the variety of interactions that a student can have with tools in the problem-solving space and the monitoring and recording of how a student uses these tools (Vendlinski et al., 2008). Technology used in computer-based simulations allows assessments to provide more meaningful information by capturing students’ processes and strategies, as well as their products. Information on how a student arrived at an answer or conclusion can be valuable in guiding instruction and monitoring the progression of student learning (Bennett, Persky, Weiss, & Jenkins, 2007). The use of automated scoring procedures for evaluating student performances on computer-based simulation tasks can sometimes address the cost and time demands of human scoring.
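A minimal sketch of the kind of interaction log such monitoring might produce, and of simple process features that could be derived from it, follows; the event names and features are assumptions for illustration rather than those of any operational simulation system.

```python
# Minimal sketch of an interaction log a simulation environment might keep,
# so that process (not just the final product) can be examined.
# Event names and summary features are assumptions for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    time_sec: float   # seconds since the task began
    action: str       # e.g., "set_variable", "run_trial", "view_graph"
    detail: str

log: List[Event] = [
    Event(12.0, "set_variable", "mass=50g"),
    Event(20.5, "run_trial", "trial 1"),
    Event(41.0, "view_graph", "mass vs. distance"),
    Event(55.2, "set_variable", "mass=100g"),
    Event(63.8, "run_trial", "trial 2"),
]

# Simple process features that might inform instruction: how many trials the
# student ran, and whether evidence (the graph) was inspected before revising
# an experimental variable.
n_trials = sum(e.action == "run_trial" for e in log)
viewed_graph_before_revision = any(
    e.action == "view_graph" and e.time_sec < f.time_sec
    for e in log for f in log if f.action == "set_variable"
)
print(n_trials, viewed_graph_before_revision)
```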

Issues in Computer-Based Assessment

Like all other assessments, computer-based tasks have the potential to measure factors that are irrelevant to the construct that is intended to be assessed, and therefore the validity of the score interpretations can be compromised. It is important to ensure that examinees are familiar with the computer interface and have had the opportunity to practice with the interface and navigation system. It is also important to ensure that the range of cognitive skills and knowledge assessed is not narrowed to those that are more easily assessed using computer technology. Furthermore, the automated scoring procedures need to reflect important features of proficiency so as to ensure that the generated scores provide accurate interpretations (Bennett, 2006; Bennett & Gitomer, 2009). The use of test specifications that outline the cognitive skills and knowledge to be assessed by the computer-based simulations will help ensure representation of the assessed content domain in both the tasks and scoring procedures so as to permit valid score interpretations. Furthermore, task models can be used to ensure that the tasks and scoring rubrics embody the important cognitive demands.

Examples of Simulations

Advances in computer technology have made it possible to use performance-based simulations that assess problem-solving and reasoning skills in large-scale, high-stakes assessment programs. The most prominent large-scale assessments that use computer-based simulations are licensure examinations in medicine, architecture, and accountancy. As an example, computer-based case simulations have been designed to measure physicians’ patient management skills, providing a dynamic, interactive simulation of the patient care environment (Clyman, Melnick, & Clauser, 1995). The examinee is first presented with a description of the patient and then must manage the case by selecting history and physical examination options or making entries on the patient’s chart to request tests, treatments, and consultations. The patient’s condition changes in real time based on the disease and the examinee’s course of action. The computer-based system generates a report that displays each action taken and when it was ordered. The examinee’s performance is then scored by a computerized system for the appropriateness of the sequence of actions. The intent of this examination is to capture essential and relevant problem-solving, judgment, and decision-making skills required of physicians.
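As a loose illustration of scoring the appropriateness of an action sequence, the sketch below credits beneficial actions and penalizes harmful ones; it is not the actual licensure scoring algorithm, and the action lists and weights are invented.

```python
# Simplified sketch of scoring an examinee's sequence of case-management
# actions for appropriateness. This is NOT the actual licensure scoring
# system; the action sets and scoring weights are invented.

beneficial = {"order_history", "order_ecg", "order_cardiac_enzymes", "give_aspirin"}
harmful = {"discharge_patient", "order_unnecessary_surgery"}

def score_sequence(actions):
    """Credit beneficial actions (slightly more if taken early), penalize harmful ones."""
    score = 0.0
    for position, action in enumerate(actions):
        if action in beneficial:
            score += 1.0 + max(0.0, 1.0 - 0.1 * position)  # earlier = slightly more credit
        elif action in harmful:
            score -= 2.0
    return score

examinee = ["order_history", "order_ecg", "give_aspirin", "order_cardiac_enzymes"]
print(round(score_sequence(examinee), 2))
```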

Research on a computer-based exam for architecture candidates demonstrated how the format of a task can affect the problem-solving and reasoning skills that examinees use (Martinez & Katz, 1996). Test takers used different cognitive skills when they were asked to manipulate icons representing parts of a building site (e.g., a parking lot, a library, a playground) to meet particular criteria than when they responded to multiple-choice items. For the figural response items, candidates devised a strategy, generated a response, and evaluated it based on the criteria, whereas on the multiple-choice items, they simply examined each alternative with respect to the criteria. The cognitive demands of the item formats were clearly different, with the skills that candidates used on the figural response items better aligned to the skills of interest than those used on the multiple-choice items.

Simulations for K–12 students have been developed in several domains. For example, science tasks developed in the context of the National Assessment of Educational Progress (NAEP) were designed to represent exploration features of real-world problem solving and incorporated “what-if” tools that students used to uncover underlying scientific relationships (Bennett et al., 2007). To assess scientific inquiry skills, students were required to design and conduct experiments, interpret results, and formulate conclusions. As part of the simulations, students needed to select values for independent variables and make predictions as they designed their experiments. To interpret their results, students developed tables and graphs and formulated conclusions. In addition, tasks were developed to assess students’ search capabilities on a computer.

One eighth-grade computer-based simulation task required students to investigate why scientists use helium gas balloons to explore space and the atmosphere. One item within this task required students to search a simulated World Wide Web.

This item assesses students’ research skills using a computer, which is typical of what is expected in their instructional experiences. A related scientific-inquiry task required students to work on a simulated experiment, record data, evaluate their work, form conclusions, and provide rationales after designing and conducting a scientific investigation.

These kinds of simulation tasks are based on models of student cognition and learning and allow the assessment of problem-solving, reasoning, and evaluation skills that are valued within the science discipline.

Designing Assessments That Measure Learning Progressions

Some recent advances in assessment design efforts reflect learning progressions, defined as “descriptions of successively more sophisticated ways of reasoning within a content domain based on research syntheses and conceptual analyses” (Smith, Wiser, Anderson, & Krajcik, 2006, p. 1). Such progressions should be organized around central concepts or big ideas within a content domain.

Assessments that reflect learning progressions can identify where students are on a continuum of knowledge and skill development within a particular domain and the knowledge they need to acquire to become more competent. Empirically validated models of cognition and learning can be used to design assessments that monitor students’ learning as they develop understanding and competency in the content domain. These models of student cognition and learning across grade levels can be reflected in a coherent set of content standards across grade levels. This can help ensure the continuity of student assessment across grades, supporting monitoring of student understanding and competency and informing instruction and learning. Furthermore, this may lead to more meaningful scaling of assessments that span grade levels and thus more valid score interpretations regarding student growth.

An issue in the design of learning progressions is that there may be multiple paths to proficiency; however, some paths typically are followed by students more often than others (Bennett & Gitomer, 2009; National Research Council, 2006). Learning progressions that inform the design of assessments should be based on cognitive models of learning supplemented by teacher knowledge of student learning within content domains. Assessments can then be designed to elicit evidence that can support inferences about student achievement at different points along the learning progression (National Research Council, 2006).

Wilson and his colleagues have designed an assessment system, referred to as the BEAR Assessment System, that incorporates information from learning progressions and advances in both technology and measurement (Wilson, 2005; Wilson & Sloane, 2000). One application of this assessment system measures a student’s progression on one of the three “big ideas” in the domain of chemistry: matter, which is concerned with describing molecular and atomic views of matter. The two other big ideas are change and stability, the former concerned with kinetic views of change and the conservation of matter during chemical change, and the latter concerned with the system of relationships in the conservation of energy. Table 5.1 illustrates the construct map for the matter big idea for two of its substrands, visualizing and measuring.

Table 5.1 BEAR Assessment System Construct Map for the Matter Strand in Chemistry

Source: Adapted from Wilson (2005).

Levels of Success | Matter Substrand: Visualizing Matter (Atomic and Molecular Views) | Matter Substrand: Measuring Matter (Measurement and Model Refinement)
1—Describing | Properties of matter | Amounts of matter
2—Representing | Matter with chemical symbols | Mass with a particulate view
3—Relating | Properties and atomic views | Measured amounts of models
4—Predicting | Phase and composition | Limitations of models
5—Integrating | Bonding and relative reactivity | Models and evidence

Level 1 in the table is the lowest level of proficiency and reflects students’ lack of understanding of atomic views of matter, reflecting only their ability to describe some characteristics of matter, such as differentiating between a solid and a gas (Wilson, 2005). At level 2, students begin to use a definition or simple representation to interpret chemical phenomena, and at level 3 they begin to combine and relate patterns to account for chemical phenomena. Items are designed to reflect the differing achievement levels of the learning progression, or construct map, and empirical evidence is then collected to validate the construct map.
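The sketch below illustrates, in a deliberately crude way, how scored responses might be mapped onto the construct-map levels in table 5.1; the placement rule and item scores are invented, and the operational BEAR system relies on item response models rather than a simple summary like this.

```python
# Sketch of locating a student on the matter construct map from scored
# responses. The placement rule and item scores are invented; in practice
# the BEAR system uses item response models, not a simple median.

MATTER_LEVELS = {
    1: "Describing",
    2: "Representing",
    3: "Relating",
    4: "Predicting",
    5: "Integrating",
}

def estimated_level(item_levels):
    """Crude placement: the median level the student reaches across items."""
    item_levels = sorted(item_levels)
    median = item_levels[len(item_levels) // 2]
    return median, MATTER_LEVELS[median]

scores = [2, 3, 3, 2, 3, 4]      # level reached on six matter items
print(estimated_level(scores))   # -> (3, 'Relating')
```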

A task designed to assess the lower levels of the construct map depicted in table 5.1 asks students to explain why two solutions with the same molecular formula have two very different smells. The task presents students with the two solutions, butyric acid and ethyl acetate; their common molecular formula, C4H8O2; and a pictorial representation depicting that one smells good and the other bad. The students are required to respond in writing to the following prompt: “Both of the solutions have the same molecular formulas, but butyric acid smells bad and putrid while ethyl acetate smells good and sweet. Explain why these two solutions smell different” (Wilson, 2005, p. 11).

By delineating the learning progressions within each of the “big ideas” of chemistry based on models of cognition and learning, assessments can be designed so as to provide evidence to support inferences about student competency at different achievement levels along the learning progressions. Performance assessments are well suited for capturing student understanding and thinking along these learning progressions. Smith and her colleagues (2006) proposed a learning progression around three key questions and six big ideas within the scientific topic of matter and atomic-molecular theory and provided examples of performance tasks that can assess different points along the continuum of understanding and inquiry within this domain.

BioKIDS Project

Learning progressions are also considered in the BioKIDS project, based on the Principled Assessment Designs for Inquiry (PADI) system. Within this system, three main design patterns for assessing scientific inquiry were identified: formulating scientific explanations from evidence, interpreting data, and making hypotheses and predictions (Gotwals & Songer, 2006).

Tasks based on a specific design pattern have many features in common. As an example, the Formulating Scientific Explanations from Evidence design pattern has two dimensions that are crossed: the level of inquiry skill required for the task and the level of content knowledge required for the task. This allows the design of assessment tasks in nine cells, each cell representing a task model. There are three inquiry skill steps:

  • Students match relevant evidence to a given claim.
  • Students choose a relevant claim and construct a simple explanation based on given evidence (construction is scaffolded).
  • Students construct a claim and explanation that justifies the claim using relevant evidence (construction is unscaffolded). (Gotwals & Songer, 2006, p. 13)

The level of content knowledge required for the task is classified as simple, moderate, or complex, ranging from requiring minimal content knowledge and no interpretation to applying extra content knowledge and interpretation of evidence.
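The sketch below simply enumerates the nine cells formed by crossing the three inquiry skill steps with the three content knowledge levels; the cell labels and paraphrased descriptions are mine, not official BioKIDS or PADI identifiers.

```python
# Sketch of the nine task-model cells formed by crossing the three inquiry
# skill steps with the three content knowledge levels described above.
# Labels and descriptions are paraphrased shorthand for illustration.

inquiry_steps = {
    1: "match evidence to a given claim",
    2: "choose a claim and construct a scaffolded explanation",
    3: "construct a claim and an unscaffolded explanation",
}
content_levels = {
    "simple": "minimal content knowledge, no interpretation",
    "moderate": "some content knowledge or interpretation of evidence",
    "complex": "additional content knowledge and interpretation of evidence",
}

# Each (step, level) cell corresponds to one task model.
task_models = {
    (step, level): f"inquiry step {step} x {level} content"
    for step in inquiry_steps
    for level in content_levels
}

print(len(task_models))             # 9 cells, each a task model
print(task_models[(3, "complex")])  # cell of the second task in figure 5.1
```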

This conceptualization is similar to Baxter and Glaser’s (1998) four quadrants that differ in terms of content richness and level of inquiry skills, except that crossing the two dimensions at three levels each yields nine cells rather than four quadrants. To better reflect scientific inquiry, Gotwals and Songer (2006) have proposed such a matrix for each of the three design patterns (formulating scientific explanations from evidence, interpreting data, and making hypotheses and predictions).

In designing performance tasks, the amount of scaffolding needs to be considered. Scaffolding is a task feature that is manipulated explicitly in the BioKIDS design patterns (Gotwals & Songer, 2006; Mislevy & Haertel, 2006). For example, figure 5.1 presents two scientific inquiry assessment tasks that require scientific explanations from the BioKIDS project.


Figure 5.1 BioKIDS Assessment Tasks for Formulating Scientific Explanations Using Evidence

The first task requires more scaffolding than the second one. The amount of scaffolding built into a task depends on the age of the students and the extent to which they have had the opportunity in instruction to solve tasks that require explanations and complex reasoning skills. The first task in the figure represents the second step in the level of inquiry skills and the second level of content knowledge (moderate) in that evidence is provided (pictures of invertebrates that must be grouped together based on their characteristics), but the students need to choose a claim and construct the explanation. They must also interpret evidence or apply additional content knowledge, or both (need to know which characteristics are relevant for classifying animals).

The second task is at step 3 of the level of inquiry skill and the third level of content knowledge (complex): students need to construct a claim and an explanation that require the interpretation of evidence and application of additional content knowledge (Gotwals & Songer, 2006). More specifically, in this second task, “Students are provided a scenario, and they must construct (rather than choose) a claim and then, using their knowledge of food web interactions, provide evidence to back up their claim” (Gotwals & Songer, 2006, p. 16).

Review and Field Testing of Performance Assessments

Systems of appraisal are needed to evaluate the quality and comprehensiveness of the content and processes being assessed, as well as to guard against potential bias in task content, language, and context. The review process is an iterative one: when tasks are developed, they may be reviewed by experts and modified a number of times prior to and after being field-tested. This involves logical analyses of the tasks to help evaluate whether they are assessing the intended content and processes, worded clearly and concisely, and free from anticipated sources of bias. The development process also includes field-testing the tasks and scoring rubrics to ensure they elicit the processes and skills intended.

Testing Items

It is important to field-test items individually as well as in a large-scale administration. For example, protocol analysis in which students are asked to think aloud while solving a task or to describe retrospectively the way in which they solved the task can be conducted to examine whether the intended cognitive processes are elicited by the task (Chi, Glaser, & Farr, 1988; Ericsson & Simon, 1984). These individual pilots afford rich information from a relatively small number of students regarding the degree to which the tasks evoke the content knowledge and complex thinking processes that they were intended to evoke. The individual piloting of tasks also provides an opportunity for additional probing regarding the processes underlying student performance: the examiner can pose questions to students regarding their understanding of task wording and directions and thus evaluate the task’s appropriateness for different subgroups of students, such as students whose first language is not English.

Large-scale field testing provides additional information regarding the quality of the tasks, including the psychometric characteristics of the items. Student work from constructed-response items or essays can also be analyzed to ensure that the tasks evoke the content knowledge and cognitive processes that they are intended to evoke and that the directions and wording are as clear as possible. Multiple variants of tasks can also be field-tested to further examine the best way to phrase and format tasks so that all students have the opportunity to display their reasoning and thinking. Any one of these analyses may point to needed modifications to the tasks, the scoring rubrics, or both.

Maintaining Security

Large-scale field testing of performance tasks poses a risk to security because these experiences tend to be memorable to students. To help ensure the security of performance assessments, some state assessment programs have field-tested new tasks in other states. As an example, the initial field testing of writing prompts for the Maryland Writing Test (MWT) occurred in states other than Maryland (Ferrara, 1987). However, the state’s concern about the comparability of the out-of-state sample with respect to demographics, motivation, and writing instruction led to an in-state field-test design.

While security issues such as students sharing the field-test prompts with other students were considered problematic, the benefits of the field-test data outweighed security concerns (Ferrara, 1987). For example, in 1988, twenty-two new prompts were field-tested on a sample of representative ninth-grade students in Maryland, with each student receiving two prompts. The anchor prompts were spiraled with the field-test prompts in the classrooms, and each prompt was exposed to only approximately 250 students (Maryland State Department of Education, 1990). Field-test prompts that were comparable to the anchor prompts (e.g., similar means and standard deviations) were selected for future operational administrations (Ferrara, personal communication, July 30, 2009), and sophisticated equating procedures were not used. Enough prompts produced similar mean scores and standard deviations so as to be considered interchangeable.
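A minimal sketch of this kind of comparability screen follows; the score data and tolerance values are invented, and the actual criteria used for the MWT are not specified beyond similarity of means and standard deviations.

```python
# Sketch of a simple comparability screen: a field-test prompt is retained if
# its score mean and standard deviation fall close to those of an anchor
# prompt. Tolerances and score data are invented for illustration.

from statistics import mean, stdev

anchor_scores = [3, 4, 2, 3, 3, 4, 2, 3, 4, 3]
field_test_prompts = {
    "prompt_A": [3, 3, 4, 2, 3, 4, 3, 3, 2, 4],
    "prompt_B": [1, 2, 2, 1, 2, 3, 1, 2, 2, 1],
}

def comparable(scores, anchor, mean_tol=0.25, sd_tol=0.25):
    """Flag a prompt whose mean and SD are within tolerance of the anchor's."""
    return (abs(mean(scores) - mean(anchor)) <= mean_tol
            and abs(stdev(scores) - stdev(anchor)) <= sd_tol)

for prompt, scores in field_test_prompts.items():
    status = "comparable" if comparable(scores, anchor_scores) else "not comparable"
    print(prompt, status)
```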

To help maintain the security of the MWT prompts, a number of procedures were implemented (Ferrara, personal communication, July 30, 2009). First, the number of students who were exposed to any one prompt was small (approximately 250), and the number of teachers involved in the field test was relatively small. Second, the prompts were field-tested two to three years before they were administered on an operational test. Third, there was rigorous enforcement of security regulations.

For the field testing of essay topics for the SAT, a number of steps are implemented to help ensure the security of the prompts (Educational Testing Service, 2004). First, approximately seventy-eight topics are pretested each year with a representative sample of juniors and seniors in high schools across the country. No more than 175 students are involved in the field test in a participating school, and only three prompts are administered in any one school, with each student receiving only one prompt. Each prompt is field-tested in approximately six schools, so only about three hundred students are administered a given prompt. Second, prompts are field-tested at least two years prior to appearing on an operational form of the SAT. Third, rigorous security procedures are used for shipping and returning the field-test prompts. Finally, several security procedures are implemented during the pretest readings, such as the requirement of signed confidentiality statements by all prescreened readers who have served on College Board writing committees.

Security issues need to be considered for assessment programs for which the intent is to generalize from the score to the broader content domain. If the task becomes known, some scores will be artificially inflated, which will have an impact on the validity of the score interpretations. However, prior exposure to the task is not a security issue for performance demonstrations that require students to demonstrate competency within a discipline. Like the driver’s test, knowing what will be required does not invalidate the performance because candidates must be able to demonstrate that they can accomplish the performance. Other issues do need to be considered for performance demonstrations, such as ensuring the demonstration reflects the examinee’s work.

Scoring Performance Assessments

As in the design of performance tasks, the design of scoring rubrics is an iterative process and involves coordination across grades as well as across content domains to ensure a cohesive approach to student assessment (Lane & Stone, 2006). Much has been learned about the design of quality scoring rubrics for performance assessments.

First, it is critical that criteria embedded in rubrics are aligned to the processes and skills that are intended to be measured by the assessment tasks. Unfortunately, it is not uncommon for performance assessments to be accompanied by scoring rubrics that focus on lower levels of thinking rather than on the more complex reasoning and thinking skills that the tasks are intended to measure; therefore, the benefits of the performance tasks are not fully realized. Typically scoring rubrics should not be developed to be unique to specific tasks or generic to the entire construct domain; rather, they should be reflective of the “classes of tasks that the construct empirically generalizes or transfers to” (Messick, 1994, p. 17). Thus, a scoring rubric can be designed for a family of tasks or a particular task model. The underlying performance on a task is a continuum that represents different degrees of structure versus open-endedness in the response, and this needs to be considered in the design of the scoring rubric and criteria (Messick, 1996).

The design of scoring rubrics requires the specification of the criteria for judging the quality of performances, the choice of a scoring procedure (e.g., analytic or holistic), ways for developing criteria, and procedures used to apply the criteria (Clauser, 2000). The ways for developing criteria include the process used for specifying the criteria and who should be involved in developing them. For large-scale assessments in K–12 education, the scoring criteria typically are developed by a group of experts as defined by their knowledge of the content domain and experience as educators. Often these experienced educators have been involved in the design of the performance tasks and have knowledge of how students of differing levels of proficiency would perform on the task.

Criteria may also be developed by analyzing experts’ thinking and reasoning when solving tasks. Cognitive task analysis using experts’ talk-alouds (Ericsson & Smith, 1991) has been used to design performance tasks and scoring criteria in the medical domain (Mislevy et al., 2002). Features of experts’ thinking, knowledge, procedures, and problem posing are considered to be indicators of developing expertise in a domain (Glaser et al., 1987) and can be used systematically in the design of assessment tasks and scoring criteria.

Two ways in which the criteria can be applied rely on the use of trained raters and computer-automated scoring procedures (Clauser, 2000). The following sections discuss the specification of the criteria, different scoring procedures, research on scoring procedures, and computer-automated scoring systems.

Establishing Criteria

The criteria specified at each score level should be linked to the construct being assessed and depend on a number of factors, including the cognitive demands of the tasks in the assessment, the degree of structure or openness expected in the response, the examinee population, the purpose of the assessment, and the intended score interpretations (Lane & Stone, 2006). Furthermore, the number of scores each performance assessment yields needs to be considered based on how many dimensions are being assessed. Performance assessments are well suited for measuring multiple dimensions within a content domain. For example, a grade 5 mathematics assessment may be designed to yield information on students’ strategic knowledge, mathematical communication skills, and computational fluency. Separate criteria would be defined for each of these dimensions and a scoring rubric then developed for each dimension.

The knowledge and skills reflected at each score level should differ distinctly from those at other score levels. The number of score levels used depends on the extent to which the criteria across the score levels can distinguish among various levels of knowledge and skills. When cognitive theories of learning have been delineated within a domain, the learning progression can be reflected in the criteria. The criteria specified at each score level are then guided by knowledge of how students acquire understanding and competency within a content domain.

A generic rubric may be designed that reflects the skills and knowledge underlying the defined construct. The development of the generic rubric begins in the early stages of the performance assessment design, and then guides the design of specific rubrics for each family of tasks (task model) or a particular task that captures the cognitive skills and content assessed by the family of tasks or the particular task. An advantage of this approach is that it helps ensure consistency across the specific rubrics and is aligned with a construct-centered approach to test design. Typically student responses that cover a wide range of competency are then evaluated to determine the extent to which the criteria reflect the components displayed in the student work. The criteria for the generic or specific rubrics may then be modified, or the task may be redesigned to ensure it assesses the intended content knowledge and processes. This may require several iterations to ensure the linkage among the content domain, tasks, and rubrics.

Scoring Procedures

The design of scoring rubrics has been influenced considerably by efforts in the assessment of writing. There are three major types of scoring procedures for direct writing assessments: holistic, analytic, and primary trait scoring (Huot, 1990; Miller & Crocker, 1990; Mullis, 1984). The choice of a scoring procedure depends on the defined construct, the purpose of the assessment, and the nature of the intended score interpretations. With holistic scoring, the raters make a single, holistic judgment regarding the quality of the writing and assign one score, using a scoring rubric with criteria and benchmark papers anchored at each score level. With analytic scoring, the rater evaluates the writing according to a number of features, such as content, organization, mechanics, focus, and ideas, and assigns a score indicating level of quality to each one. Some analytic scoring methods weight the domains, allowing domains that are assumed to be more pertinent to the construct being measured to contribute more to the overall score.
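As an illustration of a weighted analytic composite, the sketch below combines domain scores using weights that sum to one; the domains and weights are invented and are not those of any operational writing program.

```python
# Sketch of a weighted analytic composite for a writing assessment. The
# feature names and weights are invented; operational programs would set
# weights to reflect the construct definition.

analytic_weights = {
    "content": 0.35,
    "organization": 0.25,
    "focus": 0.15,
    "ideas": 0.15,
    "mechanics": 0.10,
}

def composite(domain_scores, weights=analytic_weights):
    """Weighted sum of domain scores (each on the same rubric scale)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to one
    return sum(weights[d] * domain_scores[d] for d in weights)

student = {"content": 4, "organization": 3, "focus": 4, "ideas": 3, "mechanics": 2}
print(round(composite(student), 2))  # -> 3.4
```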

As Mullis (1984) summarized, “Holistic scoring is designed to describe the overall effect of characteristics working in concert, or the sum of the parts, analytic scoring is designed to describe individual characteristics or parts and total them in a meaningful way to arrive at an overall score” (p. 18). Although the sum of the parts of writing may not be the same as an overall holistic judgment, the analytic method has the potential to provide information regarding the examinee’s potential strengths and weaknesses. Evidence, however, is needed to determine the extent to which the domain scores are able to reliably differentiate aspects of students’ writing ability.

Primary trait scoring was developed for NAEP (Lloyd-Jones, 1977). The primary trait scoring system is based on the premise that most writing is addressed to an audience with a particular purpose (for example, informational, persuasive, or literary) and that levels of success in accomplishing that purpose can be defined concretely (Mullis, 1984). The specific task determines the exact scoring criteria, although criteria are similar across similar kinds of writing (Mullis, 1984). The design of a primary trait scoring system requires identifying one or more traits relevant for a specific writing task. For example, features selected for persuasive writing may include clarity of position and support, whereas characteristics for a literary piece may include plot, sequence, and character development. Thus, the primary trait scoring system reflects aspects of a generic rubric as well as task-specific rubrics. By first using a construct-centered approach, the construct, and in this case the type of writing, guides the design of the scoring rubrics and criteria. The development of primary trait rubrics then allows the general criteria to be tailored to the task, producing more consistency in raters’ application of the criteria to the written response. Thus, in the end, there may be one scoring rubric for each writing purpose. This would be analogous to having one scoring rubric for each family of tasks or task model.

In the design of performance assessments, Baker and her colleagues (Baker, 2007; Niemi, Baker, & Sylvester, 2007) have represented the cognitive demands of the tasks in terms of families of tasks such as reasoning, problem solving, and knowledge representation (Baker, 2007). To ensure a coherent link between the tasks and the score inferences, they have designed a scoring rubric for each of these families of tasks. Lane and her colleagues (Lane, 1993; Lane, Silver, Ankermann, Cai et al., 1995) adopted this construct-driven approach in the design of the holistic scoring rubric for their mathematics performance assessment. They first developed a generic rubric, as shown in table 5.2, that reflects the conceptual framework used in the design of the assessment, including mathematical knowledge, strategic knowledge, and communication (i.e., explanations) as overarching features. These features guided the design of families of tasks: tasks that assessed strategic knowledge, tasks that assessed reasoning, and tasks that assessed both strategic knowledge and reasoning. Mathematical knowledge was assessed across these task families. The generic rubric guided the design of each task-specific rubric that reflected one of these three families. The use of task-specific rubrics helped ensure the consistency with which raters applied the scoring rubric and the generalizability of the score inferences to the broader construct domain of mathematics.

Table 5.2 Holistic General Scoring Rubric for Mathematics Constructed-Response Items

Source: Adapted from Lane (1993).

Score Mathematical Knowledge Strategic Knowledge Communication
4 Shows understanding of the problem’s mathematical concepts and principles; uses appropriate mathematical terminology and notations; executes algorithms completely and correctly. Identifies all the important elements of the problem and shows understanding of the relationships among them; reflects an appropriate and systematic strategy for solving the problem; gives clear evidence of a solution process, and solution process is complete and systematic. Gives a complete response with a clear, unambiguous explanation and/or description; may include an appropriate and complete diagram; communicates effectively to the identified audience; presents strong supporting arguments that are logically sound and complete; may include examples and counterexamples.
3 Shows nearly complete understanding of the problem’s mathematical concepts and principles; uses nearly correct mathematical terminology and notations; executes algorithms completely; computations are generally correct but may contain minor errors. Identifies the most important elements of the problem and shows general understanding of the relationships among them; gives clear evidence of a solution process, and solution process is complete or nearly complete and systematic. Gives a fairly complete response with reasonably clear explanations or descriptions; may include a nearly complete, appropriate diagram; generally communicates effectively to the identified audience; presents strong supporting arguments that are logically sound but may contain some minor gaps.
2 Shows understanding of some of the problem’s mathematical concepts and principles; may contain computational errors. Identifies some important elements of the problem but shows only limited understanding of the relationships among them; gives some evidence of a solution process, but solution process may be incomplete or somewhat unsystematic. Makes significant progress toward completion of the problem, but the explanation or description may be somewhat ambiguous or unclear; may include a diagram that is flawed or unclear; communication may be somewhat vague or difficult to interpret; and arguments may be incomplete or may be based on a logically unsound premise.
1 Shows very limited understanding of some of the problem’s mathematical concepts and principles; may misuse or fail to use mathematical terms; and may make major computational errors. Fails to identify important elements or places too much emphasis on unimportant elements; may reflect an inappropriate strategy for solving the problem; gives incomplete evidence of a solution process; solution process may be missing, difficult to identify, or completely unsystematic. Has some satisfactory elements but may fail to complete or may omit significant parts of the problem; explanation or description may be missing or difficult to follow; may include a diagram, which incorrectly represents the problem situation, or diagram may be unclear and difficult to interpret.
0 Shows no understanding of the problem’s mathematical concepts and principles.

Table 5.3 shows the scoring rubric for the tasks that assess student learning in the matter strand in the chemistry domain (Wilson, 2005) discussed previously. The scoring rubric reflects the construct map, or learning progression, depicted in table 5.1, with students progressing from the lowest level of describe to the highest level of explain. Score levels 1 (describe) and 2 (represent) in the rubric further differentiate students into three levels.

Table 5.3 BEAR Assessment System Scoring Guide for the Matter Strand in Chemistry

Source: Adapted from Wilson (2005).

Level Descriptor Criteria
0 Irrelevant or blank response Response contains no relevant information
1 Describe the properties of matter Relies on macroscopic observation and logic skills. No use of atomic model. Uses common sense and no correct chemistry concepts.
  1−: Makes one or more macroscopic observations and/or lists chemical terms without meaning
  1: Uses macroscopic observation AND comparative logic skills to get a classification, BUT shows no indication of using chemistry concepts
  1+: Makes simple microscopic observations and provides supporting examples, BUT chemical principle/rule cited incorrectly
2 Represent changes in matter with chemical symbols Beginning to use definitions of chemistry to describe, label, and represent matter in terms of chemical composition. Uses correct chemical symbols and terminology.
  2−: Cites definitions/rules about matter somewhat correctly
  2: Cites definition/rules about chemical composition
  2+: Cites and uses definitions/rules about chemical composition of matter and its transformation
3 Relate Relates one concept to another and develops models of explanation
4 Predict how the properties of matter can be changed Applies behavioral models of chemistry to predict transformation of matter
5 Explain the interactions between atoms and molecules Integrates models of chemistry to understand empirical observations of matter

A constructed response that reflects level 2 is: “They smell different because even though they have the same molecular formula, they have different structural formulas with different arrangements and patterns.” This example response is at level 2 because it “appropriately cites the principle that molecules with the same formula can have different arrangements of atoms. But the answer stops short of examining structure-property relationships (a relational, Level 3 characteristic)” (Wilson, 2005, p. 16). A major goal of the assessment system is to be able to estimate, with a certain level of probability, where a student is on the construct map or learning progression. Students and items are located on the same construct map, which allows for student proficiency to have substantive interpretation in terms of what the student knows and can do (Wilson, 2005). The maps can then be used to monitor the progress of an individual student as well as groups of students. Thus, valid interpretations of a student’s learning or progression require a carefully designed assessment system that has well-conceived items and scoring rubrics representing the various levels of the construct continuum, as well as the empirical validation of the construct map, or learning progression.

Students do not necessarily follow the same progression in becoming proficient within a subject domain. Consequently, in the design of assessments, consideration should be given to identifying the range of strategies used for solving problems in a domain, with an emphasis on strategies that are more typical of the student population (Wilson, 2005). This assessment-design effort provides an interesting example of the integration of models of cognition and learning with measurement models in the design of an assessment system that can monitor student learning and inform instruction. Furthermore, a measurement model called the saltus (Latin for leap), developed by Wilson (1989), can incorporate developmental changes (or conceptual shifts in understanding) as well as incremental increases in skill in evaluating student achievement and monitoring student learning.

To assess complex science reasoning in middle and high school, Liu, Lee, Hofstetter, and Linn (2008) adopted a systematic assessment-design procedure. First, they identified an important construct within scientific inquiry, science knowledge integration. Then they developed a comprehensive, integrated system of inquiry-based science curriculum modules, assessment tasks, and a scoring rubric to assess science knowledge integration. They designed the scoring rubric so that the different levels captured qualitatively different kinds of scientific cognition and reasoning that focused on elaborated links rather than individual concepts. Their assessment design is similar to the modeling of construct maps, or stages in learning progressions, described by Wilson (2005) and Wilson and Sloan (2000). The knowledge integration scoring rubric is shown in table 5.4.

Table 5.4 Knowledge Integration Scoring Rubric

Source: Liu et al. (2008).

Link Levels Description
Complex Elaborate two or more scientifically valid links among relevant ideas
Full Elaborate one scientifically valid link between two relevant ideas
Partial State relevant ideas but do not fully elaborate the link between relevant ideas
No Make invalid links or have nonnormative ideas

The rubric is applied to all the tasks that represent the task model for science knowledge integration, permitting score comparisons across different items (Liu et al., 2008). As they indicate, having one scoring rubric that can be applied to the set of items that measure knowledge integration makes it more accessible for teachers to use and provides coherence in the score interpretations. The authors also provided validity evidence for the learning progression reflected in the scoring rubric.

Research on Analytic and Holistic Scoring Procedures

The validity of score interpretation and use depends on the fidelity between the constructs being measured and the derived scores (Messick, 1989). Validation of the scoring rubrics includes an evaluation of the match between the rubric and the targeted construct or content domain, how well the criteria at each score level capture the defined construct, and the extent to which the domains specified in analytic scoring schemes each measure some unique aspect of student cognition. As an example, Roid (1994) used Oregon’s direct-writing assessment to evaluate its analytic scoring rubric in which students’ essays were scored on six dimensions. The results suggested that each dimension may not be unique, in that relative strengths and weaknesses for some students were identified for combinations of dimensions. Thus, some of the dimensions could be combined in the scoring system without much loss of information while simplifying the rubric and the scoring process.

Other researchers have suggested that analytic and holistic scoring methods for writing assessments may not necessarily provide the same relative standings for examinees. Vacc (1989) reported correlations between the two scoring methods ranging from .56 to .81 for elementary school students’ essays. Research examining factors that affect rater judgments of writing quality has shown that holistic scores for writing assessments are most influenced by the organization of the text and important ideas or content rather than by domains related to mechanics and sentence structure (Breland & Jones, 1982; Huot, 1990; Welch & Harris, 1994). Breland and colleagues (Breland, Danos, Kahn, Kubota, & Bonner, 1994) reported relatively high correlations between holistic scores and scores for overall organization (approximately .73), supporting ideas (approximately .70), and noteworthy ideas (approximately .68). Lane and Stone (2006) provide a brief summary of the relative advantages of both analytic and holistic scoring procedures for writing assessments.

In the science domain, Klein et al. (1998) compared analytic and holistic scoring of hands-on science performance tasks for grades 5, 8, and 10. The correlations between the total scores obtained for the two scoring methods were relatively high: .71 for grade 5 and .80 for grade 8. The correlations increased to .90 for grade 5 and .96 for grade 8 when disattenuated for the inconsistency among raters within a scoring method. The authors suggested that the scoring method has little unique influence on the raters’ assessment of the relative quality of a student’s performance. They also suggested that if school performance is of interest, the use of one scoring method over the other probably has little or no effect on a school’s relative standing within a state given the relatively high values of the disattenuated correlations. The time and cost for scoring for both methods were also addressed. The analytic method took nearly three times as long as the holistic method to score for a grade 5 response and nearly five times as long to score for a grade 8 response, resulting in higher costs for scoring using the analytic method.
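The disattenuated values that Klein et al. (1998) report follow the standard correction for attenuation, in which the observed correlation between the two scoring methods is divided by the square root of the product of their reliabilities. As a purely illustrative calculation (the interrater reliabilities of roughly .83 are assumed here, not taken from the study):

\[
\hat{\rho}_{AH} = \frac{r_{AH}}{\sqrt{r_{AA'}\, r_{HH'}}}, \qquad \frac{.80}{\sqrt{(.83)(.83)}} = \frac{.80}{.83} \approx .96,
\]

where \(r_{AH}\) is the observed correlation between the analytic and holistic total scores and \(r_{AA'}\) and \(r_{HH'}\) are the interrater reliabilities within the analytic and holistic methods, respectively.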

The results of these studies suggest that the impact of the choice of scoring method (e.g., analytic versus holistic) may vary depending on the similarity of the criteria reflected in the scoring methods and on the intended use of the scores. The more closely the criteria for the analytic method resemble the criteria delineated in the holistic method, the more likely it is that the relative standings for examinees will be similar. The research also suggests that analytic rubrics typically are capable of providing distinct information for only a small number of domains or dimensions (i.e., two or three); providing scores for a small number of domains thus has the potential for identifying overall strengths and weaknesses in student achievement and informing instruction. As previously suggested, scores derived from performances on computer-based simulation tasks also allow addressing different aspects of students’ thinking.

Human Scoring

Student responses to performance assessments may be scored by humans or computer systems that have been “trained” through analysis of human scoring. Lane and Stone (2006) provide an overview of the training procedures and methods for human scorers and discuss rating sessions that may involve raters spending several days together evaluating student work as well as online rating of student work. As described in chapter 7, this kind of activity can support improved instruction when teachers are brought together to score student work, calibrate their judgments, and discuss teaching implications. And as chapters 2 and 3 illustrate, teacher scoring can reach high levels of reliability.

To reach high levels of interrater reliability, several challenges must be overcome. According to Eckes (2008), raters may differ in the extent to which they implement the scoring rubric, the way in which they interpret the scoring criteria, and the extent to which they are severe or lenient in scoring examinee performance, as well as in their understanding and use of scoring categories and their consistency in rating across examinees, scoring criteria, and tasks (Bachman & Palmer, 1996; McNamara, 1996; Lumley, 2005). Thus, it is critical to attend to raters’ interpretation and implementation of the scoring rubric, as well as to features of the training session itself. For example, the pace at which raters are expected to score student responses may affect their ability to use their unique capabilities to accurately evaluate student responses or products (Bejar, Williamson, & Mislevy, 2006).

Carefully designed scoring rubrics and training procedures can support consistent human scoring. Freedman and Calfee (1983) pointed out the importance of understanding rater cognition and proposed a model for evaluating writing assessments that consisted of three processes: reading the student’s text to build a text image, evaluating the text image, and articulating the evaluation. Wolfe (1997) elaborated on Freedman and Calfee’s model of rater cognition and proposed a cognitive model for scoring essays that included a framework for scoring and a framework for writing. He proposed that understanding the process of rating would allow for better design of scoring rubrics and training procedures. The framework for scoring is a “mental representation of the processes through which a text image is created, compared to the scoring criteria, and used as the basis for generating a scoring decision” (Wolfe, 1997, p. 89).

The framework for writing, which includes the rater’s interpretation of the criteria in the scoring rubrics, recognizes that raters initially have different interpretations of the scoring rubric. Through training, they begin to share a common understanding of the scoring rubric so as to apply it consistently. Wolfe (1997) also observed that proficient scorers were better able to withhold judgment as they read an essay and focused their efforts more on the evaluation process than less proficient scorers did. This shared framework for writing, together with a high level of scoring proficiency, can lead to strong agreement among raters and has implications for the training of raters. Raters can be trained to internalize the criteria in a similar manner and to apply them consistently so as to ensure scores that allow for valid interpretations of student achievement.

Automated Scoring Systems

Automated scoring is defined by Williamson et al. (2006) as “any computerized mechanism that evaluates qualities of performances or work products” (p. 2). Automated scoring systems have supported the use of computer-based performance assessments such as computer-delivered writing assessments and computer-based simulation tasks, as well as paper-and-pencil assessments that are scanned. Automated scoring procedures have a number of attractive features. They apply the scoring rubric consistently and, more important, allow the test designer to control precisely the meaning of scores (Powers, Burstein, Chodorow, Fowles, & Kukich, 2002). In order to accomplish this, they “need to elicit the full range of evidence called for by an appropriately broad definition of the construct of interest” (Bejar et al., 2006, p. 52). Automated scoring procedures may enable an expanded capacity to collect and record many features of student performance from complex assessment tasks that can measure multiple dimensions (Williamson et al., 2006). A practical advantage is that scores can be generated in a timely manner.

Although there are many challenges with automated scoring—including its limitations for discerning the relationships among words or concepts presented and constraints on item types and response modes—a growing number of examples of automated scoring of complex constructed-response tasks have proven effective for large-scale assessments as well as for classroom assessment purposes. For example, Project Essay Grader, developed by Page in the 1960s (Page, 1994, 2003), paved the way for automated scoring systems for writing assessments, including e-rater (Burstein, 2003), Intelligent Essay Assessor (Landauer, Foltz, & Laham, 1998; Landauer, Laham, & Foltz, 2003), and IntelliMetric (Elliot, 2003).

Automated scoring procedures have also been developed to score short constructed-response items, although the scoring of content in these items poses unique challenges and tends to be less reliable than automated scoring of essays. C-rater is an automated scoring method for constructed-response items that elicit verbal responses ranging from one sentence to a few paragraphs; these items have rubrics that explicitly specify the content required in the response, and c-rater does not evaluate the mechanics of writing (Leacock & Chodorow, 2003, 2004). There are a number of successful applications of c-rater. It has been used in Indiana’s state end-of-course grade 11 English assessment, the NAEP Math Online project that required students to provide explanations of their mathematical reasoning, and the NAEP simulation study that required students to use search queries (Bennett et al., 2007; Deane, 2006). C-rater is a paraphrase recognizer in that it can determine when a student’s constructed response matches phrases in the scoring rubric regardless of their similarity in word use or grammatical structure (Deane, 2006). In the NAEP study that used physics computer-based simulations, c-rater models were built using student queries and then cross-validated using a sample of queries that were independently hand-scored. The agreement between human raters and c-rater in the cross-validation study was 96 percent.

Automated scoring procedures have also been developed and used successfully for licensure examinations in medicine, architecture, and accountancy. These exams use innovative computer-based simulation tasks that naturally lend themselves to automated scoring. The assessment that uses computer-based case simulations to measure physicians’ patient management skills (Clyman et al., 1995) and the figural response items in the architecture assessment (Martinez & Katz, 1996) are excellent examples of the feasibility of using automated scoring procedures for innovative item types.

Scoring Algorithms for Writing Assessments

The most widely used automated scoring systems are those that assess students’ writing. Typically the design of these scoring algorithms requires humans to first rate a set of student essays written to a common prompt. The essays and their ratings then serve as calibration data to train the software for scoring. The scoring algorithm is designed to analyze specific features of essays, and weights are assigned for each of these features.
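To make this calibration step concrete, the sketch below fits a simple least-squares model that maps a handful of surface features of an essay to human ratings and then applies the estimated weights to a new essay. The features, essays, and scores are invented placeholders for illustration; operational systems use far richer feature sets and much larger calibration samples.

import numpy as np

def extract_features(essay):
    """Compute a few illustrative surface features of an essay."""
    words = essay.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)
    n_sentences = max(essay.count(".") + essay.count("!") + essay.count("?"), 1)
    return [1.0, n_words, avg_word_len, n_words / n_sentences]  # leading 1.0 is the intercept

# Calibration data: essays already rated by human scorers (invented examples).
calibration_essays = [
    "Plants need light to grow. Without light the seedlings in our experiment died.",
    "Plants good.",
    "Photosynthesis converts light energy into chemical energy, which explains why the covered leaves lost their color.",
]
human_scores = np.array([3.0, 1.0, 5.0])

# Estimate feature weights by least squares, standing in for the model-building step.
X = np.array([extract_features(e) for e in calibration_essays])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

def automated_score(essay):
    """Apply the calibrated weights to the features of a new essay."""
    return float(np.dot(extract_features(essay), weights))

print(round(automated_score("Plants require sunlight to make food through photosynthesis."), 1))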

The fields of computational linguistics, artificial intelligence, and natural language processing have produced a number of methods for investigating the similarity of text content, including latent semantic analysis (LSA) and content vector analysis (CVA) (Deane & Gurevich, 2008). These methods have been applied to automated essay scoring applications. As an example, e-rater, developed by the Educational Testing Service, uses natural language processing techniques and identifies linguistic features of text in the evaluation of the quality of an essay (Burstein, 2003; Attali & Burstein, 2005). The first version of e-rater used over sixty features in the scoring process, whereas later versions use only “a small set of meaningful and intuitive features” (Attali & Burstein, 2005) that better capture the qualities of good writing and thus simplify the scoring algorithm. The scoring system uses a model-building module to analyze a sample of student essays to determine the weights of the features for assigning scores.
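Latent semantic analysis and content vector analysis both quantify how similar a response’s content is to reference text. The following toy sketch computes a bag-of-words cosine similarity, the core idea behind a content-vector comparison; it is a generic illustration, not the CVA component implemented in e-rater.

from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A response whose vector is far from every on-topic reference essay could be flagged as off topic.
print(cosine_similarity("light energy drives photosynthesis in green plants",
                        "green plants use light energy during photosynthesis"))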

Evaluation of Automated Scoring Procedures

As with any other assessment procedure, validation studies are imperative for automated scoring systems so as to provide evidence for appropriate score interpretations. Yang, Buckendahl, Juszkiewicz, and Bhola (2002) identified three categories of validation approaches for automated scoring procedures: comparing scores given by human and computer scorers, comparing test scores with external measures of the construct being assessed, and assessing the scoring process itself.

Most studies have examined the relationship between human- and computer-generated scores, typically finding that the agreement between computer and human scores is similar to the agreement between two human scorers, which suggests the potential interchangeability of human and automated scoring. Few studies, however, focus on the latter two categories. In particular, validation studies focusing on the scoring process for automated scoring procedures are limited.
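Comparisons of human and automated scores in these studies are commonly summarized with exact- and adjacent-agreement rates, correlations, or weighted kappa. The sketch below computes a quadratically weighted kappa for two sets of integer ratings; the scores are invented for illustration.

import numpy as np

def quadratic_weighted_kappa(scores_a, scores_b, min_score, max_score):
    """Quadratically weighted kappa between two sets of integer scores."""
    n_cats = max_score - min_score + 1
    observed = np.zeros((n_cats, n_cats))
    for a, b in zip(scores_a, scores_b):
        observed[a - min_score, b - min_score] += 1
    observed /= observed.sum()
    # Expected proportions if the two sets of scores were statistically independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic weights penalize large score discrepancies more heavily.
    idx = np.arange(n_cats)
    disagreement = (idx[:, None] - idx[None, :]) ** 2 / (n_cats - 1) ** 2
    return 1.0 - (disagreement * observed).sum() / (disagreement * expected).sum()

human_ratings = [4, 3, 5, 2, 4, 3, 1, 5]
machine_ratings = [4, 3, 4, 2, 5, 3, 2, 5]
print(round(quadratic_weighted_kappa(human_ratings, machine_ratings, 1, 5), 2))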

As Bennett (2006) has argued, automated scoring procedures should be grounded in a theory of domain proficiency, using experts to delineate proficiency in a domain rather than having them as a criterion to be predicted. Both construct-irrelevant variance and construct underrepresentation may affect the validity of the scores obtained by automated scoring systems (Powers et al., 2002). With respect to construct-irrelevant variance, automated scoring procedures may be influenced by irrelevant features of students’ writing and assign a higher or lower score than deserved. In addition, they may not fully represent the construct of good writing, which can affect the score assigned (Powers et al., 2002).

Studies have been conducted that require experts to evaluate the relevance of the computer-generated features of the target construct, identify extraneous and missing features, and evaluate the appropriateness of the weights assigned to the features (Ben-Simon & Bennett, 2007). Ben-Simon and Bennett (2007) found that the dimensions that experts in writing believe are most important in the assessment of writing are not necessarily the same as those obtained by automated scoring procedures that statistically optimize weights of the dimensions. As an example, experts in the study indicated that approximately 65 percent of the essay scores should be based on organization, development, and topical analysis, whereas empirical weights gave approximately 21 percent of the emphasis to these dimensions. The opposite pattern occurred for the dimensions related to grammar, usage, mechanics, style, and essay length, with a much lower emphasis assigned by experts and a higher emphasis given by the automated scoring procedure.

As Ben-Simon and Bennett (2007) indicated, the parameters of automated scoring procedures can be adjusted to be more consistent with those that experts believe are features of good writing; however, these adjustments may not be based on the criteria specified in the scoring rubric implemented in the study but rather on the criteria that scorers used in assigning scores. The authors indicated that the rubric employed in their study was missing key features of good writing, leaving experts to apply some of their own criteria in the scoring process. This result illustrates the importance of linking the cognitive demands of the tasks to the criteria specified in the scoring rubric regardless of whether the responses are to be scored by human raters or automated scoring procedures. The authors suggested as well that current theories of writing cognition should be used in assessment design so as to ensure that a more theoretical, coherent model for identifying scoring dimensions and features is reflected in the criteria of the rubrics.

Typically the agreement between the scores assigned by human raters and those assigned by the automated scoring procedure is very high. Some research, however, indicates that scores assigned by human raters and by automatic scoring procedures may differ in some ways associated with student demographics. Bridgeman, Trapani, and Attali (2009) examined whether there were systematic differences in the performance of subgroups using an automated scoring procedure versus human scoring for an eleventh-grade English state assessment. The prompt required students to support an opinion on a proposed topic within a forty-five-minute class period.

The essays were scored holistically using a six-point scale. The results indicated that, on average, Asian American and Hispanic students received higher scores from the automated scoring procedure than from human raters, whereas African American students scored similarly across the two scoring methods. Hypothesizing that the Asian American and Hispanic subgroups have a higher proportion of nonnative English speakers, the authors suggested that this finding may not be due to minority status but instead may be related to the use of English as a second language. This interpretation seems reasonable given that the African American subgroup performed similarly across the two scoring methods. In their conclusions, the authors suggest that “although we treat human scores as the gold standard, we are reluctant to label discrepancies from the human score as bias because it is not necessarily the case that the human score is a better indicator of writing ability than the e-rater score” (Bridgeman et al., 2009, p. 17).

Bridgeman et al. (2009) suggested that additional research needs to examine features that contribute to differential subgroup results for human and automated scores, especially for students for whom English is a second language. An understanding of the features of automated scoring systems that led to differential subgroup patterns will inform future designs of these systems. As Abedi notes in chapter 6 in this volume, there are a number of linguistic features that should be considered in both the design and the scoring of performance assessments for English language learners.

Because they rely on features of text that do not directly represent its meaning, automated scoring systems can be fooled (Winerip, 2012). Thus, scoring systems need to be capable of flagging bad-faith essays, which may include essays that are off topic and written to a different prompt, essays that repeat the prompt, essays that consist of repeated blocks of text, and essays that mix a genuine response with a repetition of the prompt. (Of course, these same kinds of problems could also occur in a nonmalicious but poorly written essay that receives a higher score than warranted.)

Studies have found that some automated scoring procedures can detect bad-faith essays. In an early study, Powers and his colleagues (2002) examined the extent to which an early version of e-rater could be tricked into assigning either too high or too low a score. Writing experts were asked to fabricate essays in response to the writing prompts in the Graduate Record Examination that would trick e-rater into assigning scores that were either higher or lower than deserved. The writing experts were instructed on how e-rater scores student essays and were asked to write one essay for which e-rater would score higher than human readers and one essay for which e-rater would score lower than human readers. E-rater scores on these fabricated essays were then compared with the scores of two human readers.

The predictions that e-rater would score higher than the human readers were upheld for 87 percent of the cases. Some of the essays that were scored higher by e-rater than by the human raters consisted of repeated paragraphs with or without a rewording of the first sentence in each paragraph. E-rater also provided higher scores than human readers did for essays that did not provide a critical analysis but focused on the features that e-rater attends to, such as relevant content words and complex sentence structures. An important result is that only 42 percent of the cases were upheld when the predictions were that e-rater would score lower than the human raters (Powers et al., 2002). Thus, the experts were less able to trick e-rater into providing a lower score than human raters. There have been numerous revisions of e-rater since this study; among other things, more recent versions use a content vector analysis program to detect off-topic essays (Higgins, Burstein, & Attali, 2006).

An evaluation of IntelliMetric for use with the Graduate Management Admission Test (GMAT) found that the system was successful at identifying fabricated essays that were a copy of the prompt, consisted of repeated blocks of text, or combined a copy of the prompt with a partly genuine response. For each detected essay, the system provided specific warning flags for plagiarism, copying the prompt, and nonsensical writing. The system was not successful at detecting off-topic responses; however, as the authors indicated, this version of the system did not include a routine to flag off-topic essays (Rudner, Garcia, & Welch, 2006).

The current versions of automated scoring systems for essays have shown high rates of agreement with human raters in assigning scores, and to some extent they can detect bad-faith essays. Some automated scoring procedures for computerized short constructed-response items and innovative item types have also been used effectively in large-scale assessment programs. Typically, most of the work and cost in designing automated scoring systems occurs before the operational administration of the assessments, in developing the algorithm that will be used for scoring. Depending on the novelty of the task, this can require substantial human scoring of responses to the prompts and analysis of those scores, as well as programming of the computer for scoring, which can be costly in its own right. In circumstances where test volume is sufficiently large, however, cost savings can then be reaped in the scoring process, along with savings in time for scoring and reporting of results.

EVALUATING THE VALIDITY AND FAIRNESS OF PERFORMANCE ASSESSMENT

Assessments are used in conjunction with other information to make important inferences about student, school, and state-level achievement, and therefore it is essential to obtain evidence about the appropriateness of those inferences and any resulting decisions. In evaluating the worth and quality of any assessment, including performance assessments, evidence to support the validity of the score inferences is at the forefront. Validity pertains to the meaningfulness, appropriateness, and usefulness of test scores (Kane, 2006; Messick, 1989). The Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999) state that “validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (p. 9). This requires specifying the purposes and the uses of the assessment, designing the assessment to fit these intentions, and providing evidence to support the proposed uses of the assessment and the intended score inferences.

There are two sources of potential threats to the validity of score inferences: construct underrepresentation and construct-irrelevant variance (Messick, 1989). Construct underrepresentation occurs when the assessment does not fully capture the targeted construct, and therefore the score inferences may not be generalizable to the larger domain of interest. Thus, if the purpose of a performance assessment is to assess complex thinking skills so as to make inferences about students’ problem solving and reasoning, an important validity study would examine the cognitive skills and processes underlying task performance for support of those intended score inferences.

Construct-irrelevant variance occurs when one or more irrelevant constructs is being assessed in addition to the intended construct. Sources of construct-irrelevant variance for performance assessments may include task wording and context, response mode, and raters’ attention to irrelevant features of responses or performances. As an example, in designing a performance assessment that measures students’ mathematical problem solving and reasoning, tasks should be set in contexts that are familiar to the population of students. If one or more subgroups of students are unfamiliar with a particular problem context and this affects their performance, the validity and fairness of the score interpretations for those students are hindered. Similarly, if a mathematics performance assessment requires a high level of reading ability and students who have very similar mathematical proficiency perform differently due to differences in their reading ability, the assessment is measuring in part a construct that is not the target; that is, reading proficiency.

This is of particular concern for English language learners (ELLs). Abedi and his colleagues (Abedi, Lord, & Plummer, 1997; Abedi & Lord, 2001) have identified a number of linguistic features that slow down readers, increasing the chances of misinterpretation. In one study, they used their linguistic modification approach, in which mathematics items were modified to reduce the complexity of sentence structures and to replace unfamiliar vocabulary with familiar vocabulary (Abedi & Lord, 2001). The mathematics scores of both ELL students and non-ELL students in low- and average-level mathematics classes improved significantly when the linguistically modified items were used. (See chapter 6, this volume, for an in-depth discussion of these issues.)

When students are asked to explain their reasoning on mathematics and science assessments, the writing ability of the student could be a source of construct-irrelevant variance. To help minimize the impact of writing ability on math and science assessments, scoring rubrics need to clearly delineate the relevant criteria. Construct-irrelevant variance may also occur when raters score student responses to performance tasks according to features that do not reflect the scoring criteria and are irrelevant to the construct being assessed (Messick, 1994). This can also be addressed by clearly articulated scoring rubrics and the effective training of the raters.

Validity criteria that have been suggested for examining the quality of performance assessments include content representation, cognitive complexity, meaningfulness, transfer and generalizability, fairness, and consequences (Linn, Baker, & Dunbar, 1991; Messick, 1994). The discussion that follows is organized around these validity criteria, which are consistent with the sources of validity evidence proposed by the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999).

Content Representativeness

An analysis of the relationship between the content of the assessment and the construct it is intended to measure provides important validity evidence. Test content refers to the skills, knowledge, and processes that are intended to be assessed by tasks, as well as the task formats and scoring procedures. Performance tasks can be designed so as to emulate the skills and processes reflected in the targeted construct.

Although the performance tasks may be assessing students’ understanding of some concepts or set of concepts at a deeper level, the content of the domain may not be well represented by a relatively small subset of performance tasks. This can be addressed by including other item formats that can appropriately assess certain skills and using performance tasks to assess complex thinking skills that cannot be assessed by the other item formats.

Methods are currently being investigated that will produce accurate student-level scores derived from mathematics and language arts performance assessments that are administered on different occasions throughout the year (Bennett & Gitomer, 2009). This will not only permit content representation across the performance assessments but also allow the assessments to be administered in closer proximity to the relevant instruction, so that information from any one administration can be used to inform future instruction. If school-level scores are of primary interest, matrix-sampling procedures can be used to ensure content representation on the performance assessment, as was done on the MSPAP (Maryland State Board of Education, 1995).

The coherence among the assessment tasks, scoring rubrics and procedures, and the target domain, as well as their representativeness, are other aspects of validity evidence for score interpretations. It is important to ensure that the cognitive skills and content of the target domain are systematically represented in the tasks and scoring procedures. Both logical and empirical evidence can support the validity of the method used for transforming performance to a score.

For a performance demonstration, such as a major project that illustrates the ability to carry out a type of inquiry, we are not interested in generalizing the student performance on the demonstration to the broader domain, so the content domain does not need to be represented fully. The content and skills being assessed by the performance demonstration should nonetheless be meaningful and relevant within the content domain. Performance demonstrations provide the opportunity for students to show what they know and can do on a real-world task, similar to a driver’s license test or the design of a scientific investigation.

Cognitive Complexity

One of the most attractive aspects of performance assessments is that they can be designed to assess complex thinking and problem-solving skills. As Linn and his colleagues (1991) have cautioned, however, it should not be assumed that a performance assessment measures complex thinking skills; evidence is needed to examine the extent to which tasks and scoring rubrics are capturing the intended cognitive skills and processes. The alignment between the cognitive processes underlying task responses and the construct domain needs to be made explicit, because typically the goal is to generalize interpretations of scores to the construct domain (Messick, 1989). The validity of the score interpretations will be affected by the extent to which the design of performance assessments is guided by cognitive theories of student achievement and learning within academic disciplines. Furthermore, the use of task models will support the explicit delineation of the cognitive skills required to perform particular task types.

Several methods have been used to examine whether tasks are assessing the intended cognitive skills and processes (Messick, 1989), and they are particularly appropriate for performance assessments that are designed to tap complex thinking skills. These methods include protocol analysis, analysis of reasons, and analysis of errors. In protocol analysis, students are asked to think aloud as they solve a problem or describe retrospectively how they solved the problem. In the analysis-of-reasons method, students are asked to provide rationales, typically written, to their responses to the tasks. The analysis-of-errors method requires an examination of procedures, concepts, or representations of the problems in order to make inferences about students’ misconceptions or errors in their understanding.

As an example, in the design of a science performance assessment, Shavelson and Ruiz-Primo (1998) used Baxter and Glaser’s 1998 analytic framework, which reflects a content-process space depicting the necessary content knowledge and process skills for successful performance. Using protocol analysis, Shavelson and Ruiz-Primo (1998) compared expert and novice reasoning on the science performance tasks that were content rich and process open. Their results from the protocol analysis confirmed some of their hypotheses regarding the different reasoning skills that tasks were intended to elicit from examinees. Furthermore, the results elucidated the complexity of experts’ reasoning as compared to the novices and informed the design of the tasks and interpretation of the scores.

Meaningfulness and Transparency

An important validity criterion for performance assessments is their meaningfulness (Linn et al., 1991), which refers to the extent to which students, teachers, and other interested parties find value in the tasks at hand. Meaningfulness is inherent in the idea that performance assessments are intended to measure more directly the types of reasoning and problem-solving skills that educators value. A related criterion is transparency (Frederiksen & Collins, 1989); that is, students and teachers need to know what is being assessed, by what methods, the criteria used to evaluate performance, and what constitutes quality performance. It is important to ensure that all students are familiar with the task format and scoring criteria for both large-scale and classroom assessments. Teachers can use performance tasks with their students and engage them in discussions about what the tasks are assessing and the nature of the criteria used for evaluating student work. Teachers can also engage students in using scoring rubrics to evaluate their own work and the work of their peers.

Generalizability of Score Inferences

For many large-scale assessments, the intent is to draw inferences about student achievement in the domain of interest based on scores derived from the assessment. While this is the case for some performance tasks, it is not the intent for performance demonstrations. There are other aspects of generalizability, however, that are relevant.

Generalizability theory provides both a conceptual and statistical framework to examine the extent to which scores derived from an assessment can be generalized to the domain of interest (Brennan, 1996, 2000, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972). It is particularly relevant in evaluating performance assessments that assess complex thinking skills because it examines multiple sources of error that can limit the generalizability of the scores, such as error due to tasks, raters, and occasions. Error due to tasks occurs because only a small number of tasks typically are included in a performance assessment. As Haertel and Linn (1996) explained, students’ individual reactions to specific tasks tend to average out on multiple-choice tests because of the relatively large number of items, but such individual reactions have more of an effect on scores from performance assessments that are composed of relatively few items. Thus, it is important to consider the sampling of tasks; increasing the number of tasks on an assessment enhances the validity and generalizability of the assessment results. Furthermore, this concern with task specificity is consistent with research in cognition and learning that underscores the context-specific nature of problem solving and reasoning in subject matter domains (Greeno, 1989). The use of different item formats, including performance tasks, can improve the generalizability of the scores.

Error due to raters can also affect the generalizability of the scores in that raters may differ in their evaluation of the quality of students’ responses to a particular performance task and across performance tasks. Raters may differ in their overall leniency, or they may differ in their judgments about whether one student’s response is better than another’s response, resulting in an interaction between the student and rater facets (Hieronymus & Hoover, 1987; Lane, Liu, Ankenmann, & Stone, 1996; Shavelson, Baxter, & Gao, 1993). Typically the occasion is an important hidden source of error because performance assessments are given on only one occasion and the occasion is not typically considered in generalizability studies (Cronbach, Linn, Brennan, & Haertel, 1997).

Generalizability theory estimates variance components for the object of measurement (e.g., student, class, school) and for the sources of error in measurement, such as task and rater. The estimated variance components provide information about the relative contribution of each source of measurement error. The variance estimates are then used to design measurement procedures that enable more accurate score interpretations. For example, the researcher can examine the effects of increasing the number of items or number of raters on the generalizability of the scores. Generalizability coefficients estimate the extent to which the scores generalize to the larger construct domain for relative or absolute decisions, or both.
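For a fully crossed person × task × rater design, for example, the generalizability coefficient for relative decisions has the familiar form

\[
E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \dfrac{\sigma^2_{pt}}{n_t} + \dfrac{\sigma^2_{pr}}{n_r} + \dfrac{\sigma^2_{ptr,e}}{n_t\, n_r}},
\]

where \(\sigma^2_{p}\) is the person (true-score) variance, the remaining terms are the interaction variance components that contribute to relative error, and \(n_t\) and \(n_r\) are the numbers of tasks and raters in the intended measurement design. Because the person × task component is typically much larger than the person × rater component in performance assessments, increasing \(n_t\) usually raises the coefficient far more than increasing \(n_r\).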

Generalizability studies have shown that error due to raters for science hands-on performance tasks (e.g., Shavelson et al., 1993) and mathematics-constructed response items (Lane, Liu, et al., 1996) tends to be smaller than for writing assessments (Dunbar, Koretz, & Hoover, 1991). To help achieve consistency among raters, attention is needed in the design of well-articulated scoring rubrics, selection and training of raters, and evaluation of rater performance prior to and throughout operational scoring of student responses (Lane & Stone, 2006).

Researchers have also shown that task-sampling variability is a greater source of measurement error than rater-sampling variability in students’ scores in science, mathematics, and writing performance assessments (Baxter et al., 1993; Gao, Shavelson, & Baxter, 1994; Hieronymus & Hoover, 1987; Lane, Liu, Ankenmann, & Stone, 1996; Shavelson et al., 1993). This indicates that students were responding differently across the performance tasks, whereas error due to raters was negligible. Thus, increasing the number of tasks in an assessment has a greater positive effect on the generalizability of the scores than increasing the number of raters scoring student responses.

Shavelson and his colleagues (Shavelson et al., 1993; Shavelson, Ruiz-Primo, & Wiley, 1999) found that the large task sampling variability in science performance assessments was due to variability in both the person × task interaction and the person × task × occasion interaction. They conducted a generalizability study using data from a science-performance assessment (Shavelson et al., 1993). The person × task variance component accounted for 32 percent of the total variability, whereas the person × task × occasion variance component accounted for 59 percent of the total variability. In this study and another, Shavelson and colleagues (1999) found that students performed differently on each task from occasion to occasion. Interestingly, the variance component for the person × occasion effect was close to zero, suggesting that “even though students approached the tasks differently each time they were tested, the aggregate level of their performance, averaged over the tasks, did not vary from one occasion to another” (Shavelson et al., 1999, pp. 64–65).

In sum, the results from generalizability studies indicate that scoring rubrics and training can be designed so as to minimize rater error. Increasing the number of performance tasks will increase the generalizability of the scores across tasks, and including other item formats on performance assessments will aid in the generalizability of scores to the broader content domain.

Fairness of Assessments

The evaluation of the fairness of an assessment is inherently related to all sources of validity evidence. Bias can be conceptualized “as differential validity of a given interpretation of a test score for any definable, relevant subgroup of test takers” (Cole & Moss, 1989, p. 205). A fair assessment therefore requires evidence to support the meaningfulness, appropriateness, and usefulness of the test score inferences for all relevant subgroups of examinees. Validity evidence for assessments that are intended for students from various cultural, ethnic, and linguistic backgrounds needs to be collected continuously and systematically as the assessment is being developed, administered, and refined. The linguistic demands on items can be simplified to help ensure that ELLs are able to access the task as well as other students.

As Abedi and Lord (2001) have demonstrated through their language modification approach, simplifying the linguistic demands on items can narrow the gap between ELLs and other students. The contexts used in mathematics tasks can be evaluated to ensure that they are familiar to various subgroups and will not have a negative effect on the performance on the task for one or more subgroups. The amount of writing required on mathematics, reading, and science assessments, for example, can be examined to help ensure that writing ability will not unduly influence the ability of the students to demonstrate what they know and can do on these assessments. Scoring rubrics can be designed to ensure that the relevant math, reading, or science skills, not students’ writing ability, are the focus. The use of other response formats, such as graphic organizers, on reading assessments may alleviate the concerns of writing ability confounding student performance on reading assessments (O’Reilly & Sheehan, 2009).

Although researchers have argued that performance assessments offer the potential for more equitable assessments, it is unlikely they could eliminate disparities in achievement among groups. As Linn and his colleagues (1991) note, differences among subgroups most likely occur because of differences in learning opportunities, familiarity, and motivation, and are not necessarily due to item format.

Research that has examined subgroup differences has focused on the impact of an assessment on subgroups by examining mean differences or differential group performance on individual items when groups are matched with respect to ability, that is, differential item functioning (DIF; Lane & Stone, 2006). The presence of DIF may suggest that inferences based on the test score may be less valid for a particular group or groups.

This could occur if tasks measure construct-irrelevant features that contribute to DIF. Gender or ethnic bias could be introduced by the typical contextualized nature of performance tasks or the amount of writing and reading required. In addition, the use of raters to score responses to performance assessments could introduce another possible source of differential item functioning (see, for example, Gyagenda & Engelhard, 2010). Results from DIF studies can be used to inform the design of assessment tasks and scoring rubrics so as to help minimize any potential bias.
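As one concrete illustration of such a matching-based analysis, the sketch below computes the Mantel-Haenszel common odds ratio for a dichotomously scored task, stratifying examinees by total score. The data are invented, and operational DIF analyses of polytomous performance tasks typically rely on extensions of this basic approach.

import numpy as np

def mantel_haenszel_odds_ratio(item, group, total):
    """Mantel-Haenszel common odds ratio for a 0/1 item.

    item  : 0/1 scores on the studied item
    group : 0 = reference group, 1 = focal group
    total : total scores used as the matching (stratifying) variable
    """
    item, group, total = map(np.asarray, (item, group, total))
    numerator, denominator = 0.0, 0.0
    for stratum in np.unique(total):
        in_stratum = total == stratum
        ref = in_stratum & (group == 0)
        foc = in_stratum & (group == 1)
        a = np.sum(item[ref] == 1)  # reference group correct
        b = np.sum(item[ref] == 0)  # reference group incorrect
        c = np.sum(item[foc] == 1)  # focal group correct
        d = np.sum(item[foc] == 0)  # focal group incorrect
        n = in_stratum.sum()
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator  # values far from 1.0 suggest DIF

# Invented example: twelve examinees matched on total score.
item_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
group_codes = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
total_scores = [5, 3, 5, 4, 3, 4, 5, 3, 4, 5, 3, 4]
print(mantel_haenszel_odds_ratio(item_scores, group_codes, total_scores))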

Some researchers have supplemented differential item functioning methods with cognitive analyses of student performances designed to uncover reasons that items behave differently across subgroups of students of approximately equal ability. In a study to detect DIF in a mathematics performance assessment consisting of constructed-response items that required students to show their solution processes and explain their reasoning, Lane, Wang, and Magone (1996) used the analysis-of-reasons method to examine differences in students’ solution strategies, mathematical explanations, and mathematical errors as potential sources of differential item functioning. They reported that for items that exhibited DIF favoring females, females performed better than their matched males because they tended to provide more comprehensive conceptual explanations and displayed their solution strategies more fully. The authors suggest that increasing the opportunities in instruction for students to provide explanations and show their solution strategies may help alleviate these differences.

Ercikan (2002) examined differential item performance among different language groups. In her research, she conducted linguistic comparisons across different language versions of a test to identify potential sources of DIF. Her results suggest that care in item writing is needed to minimize the linguistic demands of items. As Wilson (2005) has suggested, the inclusion of DIF parameters in measurement models would allow direct measurement of construct effects, such as the use of different solution strategies or different types of explanations, or could capture linguistic differences.

Some research studies have shown both gender and ethnic mean differences on performance assessments that measure complex thinking skills. As an example, Gabrielson, Gordon, and Engelhard (1995) observed ethnic and gender differences in persuasive writing. Their results indicated that high school female students wrote higher-quality persuasive essays than male students, and white students wrote higher-rated essays than black students. The scores for conventions and sentence formation were more affected by gender and ethnic characteristics than the scores for content, organization, and style, a pattern consistent with results from Engelhard, Gordon, Walker, and Gabrielson (1994). These differences may be more reflective of differences in learning opportunities and home or community language use patterns than of true differences in ability, suggesting the need for targeted instruction.

Studies have used advances in statistical models to examine subgroup differences so as to better control for student demographic and school-level variables. One study examined the extent to which the potentially heavy linguistic demands of a performance assessment might interfere with the performance of students who have English as a second language (Goldschmidt, Martinez, Niemi, & Baker, 2007). The results revealed that subgroup differences on written essays to a writing prompt were less affected by student background variables than were differences on a commercially developed language arts test consisting of multiple-choice items and some constructed-response items. The performance gaps between white students and English-only students, on the one hand, and traditionally disadvantaged students (e.g., ELLs), on the other, were smaller for the writing performance assessment than for the commercially developed test (Goldschmidt et al., 2007). Thus, in a context where students had opportunities in instruction to craft written essays, the performance of students on the writing assessment used in this study was stronger than on traditional selected- and constructed-response items.

Consequential Evidence

The evaluation of both intended and unintended consequences of any assessment is fundamental to the validation of score interpretation and use (Messick, 1989). Because a major goal of performance assessments is to improve teaching and student learning, it is essential to obtain evidence of any such positive consequences and any potentially negative consequences (Messick, 1994). As Linn (1993) stated, the need to obtain evidence about consequences is “especially compelling for performance-based assessments . . . because particular intended consequences are an explicit part of the assessment systems’ rationale” (p. 6). Furthermore, adverse consequences bearing on issues of fairness are particularly relevant because it should not be assumed that a contextualized performance task is equally appropriate for all students: “. . . contextual features that engage and motivate one student and facilitate his or her effective task performance may alienate and confuse another student and bias or distort task performance” (Messick, 1994, p. 29).

This concern can be addressed by a thoughtful design process in which fairness issues are addressed, including expert analyses of the tasks and rubrics, as well as analyses of student thinking as they solve performance tasks with special attention to examining potential subgroup differences and features of tasks that may contribute to these differences.

Large-scale performance assessments that measure complex thinking skills have been shown to have a positive impact on instruction and student learning (Lane, Parke, & Stone, 2002; Stecher, Barron, Chun, & Ross, 2000; Stein & Lane, 1996; Stone & Lane, 2003; see also chapters 1 to 3, this volume). In a study examining the consequences of Washington’s state assessment, Stecher and his colleagues (2000) found that approximately two-thirds of fourth- and seventh-grade teachers reported that the state standards and the short-answer and extended-response items on the state assessment were influential in promoting better instruction and student learning.

An important aspect of consequential evidence for performance assessments is the examination of the relationship between changes in instructional practice and improved student performance on the assessments. A series of studies examined this relationship for the MSPAP, which consisted entirely of performance tasks that were integrated across content domains (Lane et al., 2002; Parke, Lane, & Stone, 2006; Stone & Lane, 2003). The results revealed that teachers’ reports about features of their instruction accounted for differences in school performance on MSPAP in reading, writing, mathematics, and science, with higher performance among schools with more reform-oriented practices. Increases in these practices over time also accounted for differences in the rate of change in MSPAP school performance in reading and writing over a five-year period. Furthermore, Linn, Baker, and Betebenner (2002) demonstrated that the slopes of the trend lines for the mathematics assessments on both NAEP and MSPAP were similar, suggesting that the performance gains in Maryland were due to deepened mathematical understanding on the part of students rather than merely to teaching to the content and format of the specific state test.

When using test scores to make inferences regarding the quality of education, contextual information is needed to inform the inferences and actions (Haertel, 1999). Stone and Lane (2003) found, for example, that whereas the percentage of students receiving free or reduced-price lunch (a proxy for socioeconomic status) was significantly related to school-level performance on MSPAP in all content areas, it was not significantly related to school-level growth on MSPAP in mathematics, writing, science, and social studies.

Instructional Sensitivity

An assessment concept that can help inform the consequential aspect of validity is instructional sensitivity, which refers to the degree to which tasks are sensitive to improvements in instruction (Popham, 2003; Black & Wiliam, 2007). Performance assessments are considered to be vehicles that can help shape sound instructional practice by modeling to teachers what is important to teach and to students what is important to learn. In this regard, it is important to evaluate the extent to which improved performance on an assessment is linked to improved instructional practices. To accomplish this, the assessments need to be sensitive to instructional improvements. Assessments that are not sensitive to well-designed instruction may be measuring something other than what was taught, such as construct-irrelevant factors or learning that occurs outside of school.

Two methods have been used to examine whether assessments are instructionally sensitive: studies have either examined whether students have had the opportunity to learn the material, or they have examined the extent to which differences in instruction affect performance on the assessment. For example, in a study using a model-based approach to assessment design (Baker, 2007), Niemi, Wang, Steinberg, Baker, and Wang (2007) found that student performance on a language arts performance assessment was sensitive to different types of language instruction and captured improvements in instruction.

This study examined the effects of three types of instruction (literary analysis, organization of writing, and teacher-selected instruction) on student responses to an essay prompt about conflict in a literary work. The results indicated that students who received instruction on literary analysis were significantly better able to analyze and describe conflict in literature than students in the other two instructional groups, and students who received direct instruction on the organization of writing performed significantly better on measures of writing coherence and organization. These results provide evidence that performance assessments can be sensitive to different types of instruction, and they underscore the need to ensure alignment and coherence among curriculum, instruction, and assessment.

ADDITIONAL PSYCHOMETRIC ISSUES

This section briefly discusses additional psychometric issues in the design of performance assessments. First, I briefly present measurement models that have been developed for performance assessments and extended constructed-response items, including those that account for rater effects. These types of models have been used successfully in large-scale assessment programs to account for rater error in the scores obtained when evaluating performance assessments, allowing more valid score interpretations. I then discuss issues related to linking performance assessments.

Measurement Models and Performance Assessments

Item response theory (IRT) models are typically used to scale assessments that consist of performance tasks only and assessments that consist of both performance tasks and multiple-choice items. IRT involves a class of mathematical models that are used to estimate test performance based on characteristics of the items and characteristics of the examinees that are presumed to underlie performance. The models use one or more ability parameters and various item parameters to predict item responses (Embretson & Reise, 2000; Hambleton & Swaminathan, 1985). These parameters, in conjunction with a mathematical function, are used to model the probability of a score response as a function of ability.
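
As an illustration of the general form of these models, consider the two-parameter logistic model for a dichotomously scored item (the polytomous models discussed below extend this idea to ordered score categories):

\[
P(X_{pi} = 1 \mid \theta_p) = \frac{\exp\!\left[a_i(\theta_p - b_i)\right]}{1 + \exp\!\left[a_i(\theta_p - b_i)\right]},
\]

where \(\theta_p\) is the ability of examinee \(p\), \(b_i\) is the difficulty of item \(i\), and \(a_i\) is the item's discrimination. The probability of a correct response increases with ability, at a rate governed by the item parameters.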

The more commonly applied models assume that one underlying ability dimension determines item performance (Allen & Yen, 1979) and accommodate the ordinal response scales that are typical of performance assessments. They include the graded-response model (Samejima, 1969, 1996), the partial-credit model (Masters, 1982), and the generalized partial-credit model (Muraki, 1992). As an example, Lane, Stone, Ankenmann, and Liu (1995) demonstrated the use of the graded-response model with a mathematics performance assessment, and Allen, Johnson, Mislevy, and Thomas (1994) discussed the application of the generalized partial-credit model to NAEP, which consists of both selected- and constructed-response items.
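
For a polytomously scored task with score categories \(k = 0, 1, \ldots, m_i\), the generalized partial-credit model (Muraki, 1992), for example, specifies

\[
P(X_{pi} = k \mid \theta_p) = \frac{\exp\!\left[\sum_{v=1}^{k} a_i(\theta_p - b_{iv})\right]}{\sum_{c=0}^{m_i} \exp\!\left[\sum_{v=1}^{c} a_i(\theta_p - b_{iv})\right]},
\]

where the empty sum for \(k = 0\) is defined to be zero, \(a_i\) is the discrimination of task \(i\), and the step parameters \(b_{iv}\) reflect the relative difficulty of moving from score category \(v-1\) to category \(v\). The partial-credit model is the special case in which \(a_i\) is fixed to the same value for all tasks.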

Performance assessment data may be best modeled by multidimensional item response theory (MIRT), which allows the estimation of student proficiency on more than one skill area (Reckase, 1997). The application of MIRT models to assessments that are intended to measure student proficiency on multiple skills can provide a set of scores that profile student proficiency across the skills. These scores can then be used to guide the instructional process so as to narrow gaps in student understanding. Profiles can be developed at the student level or the group level (e.g., class) to inform instruction and student learning. MIRT models are particularly relevant for performance assessments because these assessments can capture complex performances that draw on multiple dimensions within the content domain, such as procedural skills, conceptual skills, and reasoning skills.
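
A compensatory MIRT formulation, sketched here for a dichotomous response to keep the notation simple, replaces the single ability with a vector of proficiencies:

\[
P(X_{pi} = 1 \mid \boldsymbol{\theta}_p) = \frac{\exp\!\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_p + d_i\right)}{1 + \exp\!\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_p + d_i\right)},
\]

where \(\boldsymbol{\theta}_p\) might contain, for example, procedural, conceptual, and reasoning proficiencies, the elements of \(\mathbf{a}_i\) indicate how strongly task \(i\) draws on each dimension, and \(d_i\) is an intercept related to overall task easiness. The estimated \(\boldsymbol{\theta}_p\) vectors provide the proficiency profiles described above.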

Modeling Rater Effects Using IRT Models

Performance assessments require either human scorers or automated scoring procedures in evaluating student work. When human scorers are used, performance assessments are considered to be “rater mediated,” since information is mediated through interpretations by raters (Engelhard, 2002). Engelhard (2002) provided a conceptual model for performance assessments in which the obtained score is a function not only of the domain of interest (e.g., writing ability), but also of rater severity, difficulty of the task, and the structure of the rating scale (e.g., analytic versus holistic, number of score levels). Test developers exert control over the task difficulty and the nature of the rating scale; however, other potential sources of construct-irrelevant variance are introduced by the raters, including differential interpretation of score scales, halo effects, and bias (Engelhard, 2002).
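
This conceptualization is commonly formalized with a many-facet Rasch model; the version below, written for adjacent rating categories, is one standard formulation offered as an illustration rather than a model prescribed by the sources cited here:

\[
\log\!\left(\frac{P_{pijk}}{P_{pij(k-1)}}\right) = \theta_p - \delta_i - \lambda_j - \tau_k,
\]

where \(P_{pijk}\) is the probability that rater \(j\) assigns examinee \(p\) a rating of \(k\) rather than \(k-1\) on task \(i\), \(\theta_p\) is examinee ability, \(\delta_i\) is task difficulty, \(\lambda_j\) is the severity of rater \(j\), and \(\tau_k\) is the threshold for rating category \(k\). Within such a model, a severe rater shifts ratings downward by a constant amount, and that shift can be estimated and removed from the examinee's score.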

Models have been developed that account for rater variability in scoring performance assessments. As an example, Patz and his colleagues (Patz, 1996; Patz, Junker, Johnson, & Mariano, 2002) developed a hierarchical rating model to account for the dependencies between rater judgments. A parameter was introduced into the model that could be considered an “ideal rating” or expected score for an individual, and raters could vary with respect to how close their rating is to this ideal rating. This variability reflects random error (e.g., lack of consistency) as well as systematic error (e.g., rater tendencies such as leniency). According to Bejar et al. (2006), this modeling of rater variability may reflect an accurate modeling of rater cognition in that under operational scoring conditions, raters may try to predict the score an expert rater would assign based on the scoring rubric and benchmark papers. In addition, covariates can be introduced into the model to predict rater behaviors such as rater features (e.g., hours of scoring) and item features. The modeling of rater variability is a way to account for error in the scores obtained when evaluating performance assessments and enables more valid interpretations of the scores.
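
A simplified sketch of the hierarchical rating idea, with notation chosen here for illustration, has two layers: an ideal rating \(\xi_{pi}\) for examinee \(p\) on task \(i\), generated by an IRT model such as the partial-credit model, and an observed rating from rater \(r\) that varies around that ideal rating,

\[
P(Y_{pir} = y \mid \xi_{pi}) \propto \exp\!\left[-\frac{\bigl(y - (\xi_{pi} + \phi_r)\bigr)^2}{2\psi_r^2}\right],
\]

where \(\phi_r\) captures rater \(r\)'s systematic severity or leniency and \(\psi_r\) captures his or her inconsistency. Covariates such as hours of scoring can then be used to model \(\phi_r\) and \(\psi_r\).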

Equating and Linking Issues

Equating helps ensure the comparability of interpretations of results from assessment forms administered at one time or across time; however, equating an assessment that consists only of performance tasks is complex (Kolen & Brennan, 2004). One way to equate forms so that they can be used interchangeably is to include a common set of items, typically called anchor items, on each of the forms. The anchor items are then used to adjust for any differences in difficulty across forms. An important issue in using performance tasks as anchor items is that rater teams may change their scoring standards over time; if so, the application of standard equating practices would lead to bias in the equating process and, consequently, inaccurate scores (Bock, 1995; Kim, Walker, & McHale, 2008a; Tate, 1999). As a solution to this problem, Tate (1999, 2000) suggested an initial linking study in which any changes in rater severity and discrimination across years could be identified.

To accomplish the equating, a large representative sample of anchor-item papers (i.e., trend papers) from year 1 is rescored by raters in year 2, the same raters who score the new constructed responses in year 2. These trend papers thus have one set of scores from the year 1 raters and another set of scores from the year 2 raters, which makes it possible to examine the extent to which the two rater teams differ in severity and then to make adjustments so that the two tests are placed on the same scale. Tate contends that, rather than item parameters alone, there are item-and-rating-team parameters: if the rating team changes across years, any differences due to the change in rating teams will be reflected in the item parameters. Another way to conceptualize this is that the item parameters are confounded with rater-team effects, so the rating team needs to be considered in the equating.
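
The logic of a trend-scoring check can be illustrated with a small simulation. The sketch below is a deliberate simplification (a mean-shift comparison rather than Tate's IRT-based linking), and all variable names and values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Latent quality of 500 year-1 "trend papers" scored on a 0-5 rubric.
paper_quality = rng.normal(0.0, 1.0, size=500)

# Year-1 rater team's scores (with some rating noise).
year1_scores = np.clip(np.round(2.5 + paper_quality + rng.normal(0, 0.4, 500)), 0, 5)

# Year-2 rater team rescores the SAME trend papers but is slightly more severe.
year2_scores = np.clip(np.round(2.2 + paper_quality + rng.normal(0, 0.4, 500)), 0, 5)

# Because both teams scored identical papers, the mean difference estimates
# the shift in rater severity across years.
severity_shift = year2_scores.mean() - year1_scores.mean()
print(f"Estimated severity shift (year 2 minus year 1): {severity_shift:.2f}")

# If a meaningful shift is found, the year-2 operational scores should be
# adjusted for it (here, a simple additive correction for illustration).
year2_operational = np.clip(np.round(2.2 + rng.normal(0, 1, 1000)
                                     + rng.normal(0, 0.4, 1000)), 0, 5)
year2_adjusted = year2_operational - severity_shift
```

In operational practice, the shift would be absorbed into the item and rating-team parameters of the measurement model rather than applied as a simple additive correction.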

The effectiveness of this IRT linking method using trend score papers was established by Tate and his colleague (Tate, 2003; Kamata & Tate, 2005). Kim, Walker, and McHale (2008a, 2008b) have also shown the use of trend score papers to be effective in non-IRT equating methods. They compared the effectiveness of equating for a design that required anchor items and a design that did not, each with and without trend score papers. The design that does not incorporate anchor items alleviates the concern about the content representativeness of anchor items. Their results indicated that both designs were more effective in equating the scores when trend score papers were used than when they were not.

More important, their results suggest that changes in rater severity can be examined and that the equating of test forms across years should adjust for differences in rater severity if the trend scoring indicates that a rater shift has occurred (Kim et al., 2008a, 2008b). Trend scoring should be implemented for any assessment program that uses constructed-response items for equating across years, to control for equating bias caused by a scoring shift over time. Kim and colleagues (2008b) point out that the trend-scoring method requires additional rating of student papers, which increases cost, and that it can be cumbersome to implement. The use of image-based and online scoring methods, however, can ease the implementation of the trend-scoring method.

CONCLUSION

Performance assessments have been an integral part of educational systems in many countries; however, they have not been fully used in the United States. There is evidence that the format of the assessment affects the type of thinking and reasoning skills that students use, with performance assessments being better suited to assessing high-level, complex thinking skills. Recent advances in the design and scoring of performance assessments by human scorers and computers support their increased use in large-scale assessment programs. In addition, computer simulations now allow the design of meaningful, real-world tasks that require students to problem-solve and reason.

Well-specified content standards that reflect high-level thinking and reasoning skills can guide the design of performance assessments so as to ensure the alignment among curriculum, instruction, and assessment. Task models can be designed so as to ensure that tasks embody the intended cognitive demands rather than irrelevant constructs. Task models also have the potential to increase the production of tasks and help ensure comparability across forms within a year or across years. Various task design strategies have proven useful in helping to ensure the validity and fairness of performance-assessment results, including language-modification approaches and opportunities for students to demonstrate their understanding graphically as well as linguistically.

When human raters are used, well-articulated scoring rubrics, carefully crafted training material, and rigorous training procedures for raters can minimize rater error. Measurement models and procedures have been designed to model rater errors and inconsistencies so as to control them in the estimation of student scores on performance assessments.

The educational benefit of using performance assessments has been demonstrated by many studies. When students are given the opportunity to work on meaningful, real-world tasks in instruction, they have demonstrated improved performance on performance assessments. Moreover, research has shown that school-level growth on performance assessments is not necessarily related to socioeconomic status variables. Sound educational practice calls for alignment among curriculum, instruction, and assessment, and there is ample evidence to support the use of performance assessments in both instruction and assessment to improve student learning for all students.
