CHAPTER 31

Review of Science Education Program Evaluation

Frances Lawrenz

University of Minnesota

The purpose of this chapter is to review the field of science education program evaluation. To accomplish that purpose the chapter begins by defining evaluation, outlining the broad types of evaluation that can be undertaken and the philosophies that underlie the different approaches, and distinguishing evaluation from research. The chapter goes on to explicate the relationships between science education program funding and science education program evaluation through a historical approach, where the history of science education is juxtaposed with the history of evaluation and examples of the types of science education program evaluation that were implemented. Following this historical examination of the development process, different models of evaluation are discussed, and examples of how these apply to science education program evaluation are provided. This discussion contrasts the strengths and limitations of the different approaches and specifies the types of questions that the different models are able to address. Within the discussion of models, different methodological approaches are also contrasted. Finally the chapter concludes with some thoughts about the future of science education program evaluation.

With the passage of the No Child Left Behind Act our nation is committed to accountability for its schools. Accountability, being held answerable for accomplishing goals, can be considered a subset of the larger concept of evaluation. Evaluation is based on the notion of valuing and includes a variety of perspectives in addition to accomplishing goals. The idea of evaluation is not a new one. Perhaps one of the first implementations of evaluation in the United States was Joseph Rice's 1897–1898 comparative study of students' spelling performance (Rice, 1900). The next landmark was “The Eight-Year Study” by Tyler and Smith (1942). This longitudinal study of thirty high schools made use of a wide variety of tests, scales, inventory questionnaires, checklists, pupil logs, and other measures to gather information about the achievement of curricular objectives. The major impetus for science education program evaluation, however, arose out of the National Science Foundation funding of large curriculum development projects and teacher institutes and the Elementary and Secondary Education Act of 1965. This was the first time legislated programs were required to have evaluations. This cemented the reciprocal arrangement between funding and evaluation and led to significant development of the field of educational evaluation.

Extensive literature searches were conducted to provide the background for this chapter. There is an enormous amount of material available, because science education program evaluation crosses discipline areas. For example, keyword searches for documents published between the years 1980 and 2003 on “Program Evaluation” produced 51,660 hits; “Science Education” produced 65,796 hits; and “Science Program Evaluation” 4,213 hits. The majority of articles are geared toward teacher assessment of content with a teacher and administrator audience in mind, not necessarily technical researchers or policy makers. Testing conducted within classrooms by teachers is only one component of science program evaluation, so the majority of the articles are not relevant to this review. The available materials about science education program evaluation are generally descriptions of the evaluation processes and the programs evaluated, again not directly useful for a review of science program evaluation. Also the range of programs is very wide, with references running the gamut from health science programs in universities to effects of using incubators in a fourth-grade classroom. Ultimately there is little direct research about how to conduct program evaluation. The references included in this chapter are meant to be exemplary of the range of material available, not an exhaustive set. Examples are drawn mainly from NSF-funded projects and programs.

WHAT IS EVALUATION?

The Joint Committee on Standards for Educational Evaluation first presented standards for educational evaluation in 1981, and the second edition of the standards (1994) defined evaluation as the systematic investigation of the worth or merit of an object. Objects of evaluations include educational and training programs, projects, and materials and are sometimes described as evaluands. Michael Scriven in his Evaluation Thesaurus (1991) agreed with this definition and went on to say that the process normally involves “some identification of relevant standards of merit, worth, or value; some investigation of the performance of evaluands on these standards; and some integration or synthesis of the results to achieve an overall evaluation” (p. 139).

One of the first definitions of educational evaluation was provided by Daniel Stufflebeam and the Phi Delta Kappa National Study Committee on Evaluation in Educational Evaluation and Decision-Making (1971). In this book the authors say “the purpose of evaluation is not to prove but to improve” (p. v). They define evaluation as the systematic process of delineating, obtaining, and providing useful information for judging decision alternatives. This definition is particularly useful in that it highlights that evaluation includes determining what type of information should be gathered, how to gather the determined information, and how to present the information in usable formats. Evaluation is often divided into summative and formative aspects. A summative evaluation approach identifies how valuable an object is by demonstrating whether it is successful or not to various stakeholders. Formative evaluation is designed to help improve the object.

The field of evaluation and educational evaluation in particular has expanded quite a bit since the 1970s, although some of the original texts have kept pace through new editions and additional authors and continue to be the leading sources of evaluation information. The first edition of Weiss's Evaluation Research was published in 1972, the second, Evaluation, in 1998. Rossi and Freeman's Evaluation: A Systematic Approach (1979) is now in its sixth edition, with an additional author (Rossi, Freeman, & Lipsey, 1999). The original Worthen and Sanders text, Educational Evaluation: Theory and Practice (1973), is now in a fourth rendition with an additional author as Program Evaluation: Alternative Approaches and Practical Guidelines (Fitzpatrick, Sanders, & Worthen, 2003). The National Science Foundation has also provided an evolving set of User Friendly Handbooks to help principal investigators evaluate their projects. These began with the Stevens, Lawrenz, Ely, and Huberman (1993) text and culminated recently in the 2002 rendition by Frechtling.

As part of the expanding definitions of evaluation, Patton in his book Utilization-Focused Evaluation (1997) reiterates and extends usefulness by making it clear that the receivers of the evaluation information need to be substantively involved in the evaluation process so that the resulting information will be used effectively. Fetterman's empowerment evaluation was introduced in 1994 (Fetterman, 1994) and expanded in his text Foundations of Empowerment Evaluation (2001). Fetterman views empowerment evaluation as a shift from the previously exclusive focus on merit and worth alone to a commitment to self-determination and capacity building. In other words, empowerment evaluation is evaluation conducted by participants with the goal of continual improvement and self-actualization.

The different approaches to evaluation are grounded in different philosophies. House (1983) has categorized these differing philosophies along two continua: the objectivist-subjectivist epistemologies and the utilitarian-pluralist values. Objectivism requires evidence that is reproducible and verifiable. It is derived largely from empiricism and related to logical positivism. Subjectivism is based in experience and related to phenomenologist epistemology. The objectivists rely on reproducible facts, whereas the subjectivists depend upon accumulated experience. In the second continuum, utilitarians assess overall impact, whereas pluralists assess the impact on each individual. In other words, the greatest good for utilitarians is that which will benefit the most people, whereas pluralism requires attention to each individual's benefit. Often utilitarianism and objectivism operate together, and pluralism and subjectivism operate together, although other combinations are possible.

Collectively, approaches that engage the evaluand in the evaluation process are identified as participatory approaches to evaluation. Different approaches include stakeholder evaluation (Mark & Shotland, 1985), democratic evaluation (McTaggart, 1991), and developmental evaluation (Patton, 1994). Cousins and Whitmore's (1998) framework categorizes participatory evaluation along three dimensions: control of the evaluation process, selection of participants, and the depth of participation. Positions along these different dimensions are indicative of different approaches to participatory evaluation.

Cousins and Whitmore (1998) suggest that there are two philosophies undergirding participatory evaluation. Practical participatory evaluation is one that is common in the United States and Canada and “has as its central function the fostering of evaluation use with the implicit assumption that evaluation is geared toward program, policy, or organizational decision making” (p. 6). Transformative participatory evaluation “invokes participatory principles and actions in order to democratize social change” (p. 7).

Another underlying movement in evaluation is termed responsive evaluation. The legacy of the philosophy of responsiveness in evaluation is discussed by Greene and Abma (2001). This philosophical perspective informed Guba and Lincoln's (1989) influential book, Fourth Generation Evaluation. This is an approach to evaluation that rests on a relativist rather than a realist ontology and on a monistic subjective, rather than dualistic objective, epistemology. It therefore recommends evaluations that are important and meaningful within the context and frames of reference of the people involved. In terms of the discussion above, responsive evaluation is in the pluralism and subjectivism sphere.

Another way to help define something is to explain what it is not. Therefore it is important to point out that assessment can be considered a subset of evaluation and that although there are many similarities, evaluation and research are not the same. Generally assessment is considered the process of measuring an outcome, whereas evaluation employs assessment information in its determination of merit or worth.

Weiss (1998) suggests that what distinguishes evaluation from other research is not method or subject matter, but intent. In other words, evaluations are conducted for different purposes than other research. Worthen, Sanders, and Fitzpatrick (1997) expand the distinction to point out that evaluation and research differ in the motivation of the inquirer, the objective of the inquiry, the outcome of the inquiry, the role played by explanation, and generalizability. In terms of motivation, evaluators are almost always asked to conduct their evaluations and therefore are constrained by the situation. On the other hand, although researchers may apply for grants to conduct their research, they are generally the ones who make the decisions about why and how to conduct it. The objectives and outcomes in the two types of inquiry are also slightly different. Research is generally conducted to determine generalizable laws governing behavior or to form conclusions. Evaluation, on the other hand, is more likely to be designed to provide descriptions and inform decision making. Finally, evaluation is purposefully tied to a specific object in time and space, whereas research is designed to span these dimensions.

These distinctions are important because they affect the type and appropriateness of evaluation designs. Because of their tie to specific situations, evaluations are both less and more constrained than research. They are less constrained because their results do not have to be universally generalizable, but they are more constrained because the results have to address a specific context.

THE RELATIONSHIP OF EVALUATION TO SCIENCE EDUCATION

Evaluation is applied, disciplined inquiry. As suggested above, evaluations are generally commissioned in response to a specific need and operate across various power structures and in different contexts. In science education, evaluations have closely followed the funding priorities and requirements of the federal government. For example, as the government provided funds for science curriculum development, evaluations of science curricula were conducted. Naturally not all evaluation was directly tied to federal funding, but since the federal agendas reflect the priorities of the citizens, the issues of interest to the general public and science educators were included. To help clarify this relationship, this section includes a look at the history of funding in science education with a parallel history of science education evaluation.

Many reviews have been provided on science education research (e.g., Welch, 1985; Finley, Heller, & Lawrenz, 1990); however, these reviews may include only one type of evaluation or may not include evaluations at all (Welch, 1977).

As explained earlier, assessment is a subset of evaluation; a comprehensive review of the history of assessment in science is provided by Doran, Lawrenz, and Helgeson (1994). That review highlights work in instrument development and validity. As they point out, the 1960s laid the groundwork for science education program evaluation with the beginning of the National Assessment of Educational Progress (NAEP) and the passage of the Elementary and Secondary Education Act. The 1970s were a time of concern about fairness in testing and of advances in testing procedures, such as matrix sampling and item response theory. The first international science study was conducted during the 1969–70 school year by the International Association for the Evaluation of Educational Achievement. The 1980s showed consolidation and growing interest in gathering data on various indicators. Interest in international comparisons and authentic testing grew through the 1990s along with interest in trends analysis because longitudinal data were now becoming available. Currently the emphasis in assessment is on assessing students' in-depth understanding of science content through authentic measures (Newman, 1996; Wiggins, 1998).

Status data provide a unique type of evaluation evidence. Generally status data are not tied to specific situations or stakeholders except in a very general way, so they are not evaluations in their own right. They are used, however, as comparison data for many individual programs and can be used to examine trends over time. NAEP, TIMSS, and the science and mathematics teaching surveys by Weiss at Horizon Research, Inc. are classic examples of status-type evaluation evidence. The Weiss reports began in 1977 and have continued at intervals through 2002. These reports contain information on the amount of time spent on science and mathematics, the objectives for science and mathematics, classroom activities, types of students in various types of classes, teachers’ views of education, and their opinions about the environments in which they teach. NAEP and TIMSS show state, national, and international levels of student achievement in various areas of science, as well as some data on the teachers and classroom environments. These types of data have been synthesized into large reports outlining indicators of science education (Suter, 1993, 1996) as well as additional pieces such as comparing indicators with standards (Weiss, 1997).

Considering the history of the NAEP science tests provides some insight into the contextual and political changes that have worked to shape science education program evaluation. NAEP was born in a sea of controversy over whether or not the federal government should be collecting any national data. In response to the desire for local autonomy and privacy, the first NAEP reports provided information about only large regions of the country and were very careful to not suggest that any national standards or requirements were included in the data. Today, the pendulum has swung to acceptance of national standards and openness in reporting, so not only are individual state data reported, but they are expected. Furthermore, they are tied to standards, and the collection of ever more data is being required (No Child Left Behind Act, 2001). These types of status data are also being collected internationally; the TIMSS-R (U.S. National Research Center, 2003) is but the latest step in this process.

Another historical example is the evolution of the Joint Dissemination and Review Panel of the Department of Education, which was established in 1972. It began as a group of research experts who judged the quality of educational programs requesting dissemination funds based on evidence the program provided. The panel was strict in its assessment of causality, and very few programs were designated as “programs that work.” In 1987 the panel was reconstituted and renamed the Program Effectiveness Panel (Cook, 1991). This panel was instructed to include a variety of evidence in its deliberations, including qualitative data and implementation costs. Educational programs were expected to provide proof that their claims were met, and these claims were judged based on the data provided. In 1994 the Educational Research, Development, Dissemination and Improvement Act directed the establishment of panels of appropriately qualified experts and practitioners to evaluate educational programs and designate them as promising or exemplary. Multiple panels were created (e.g., the Math and Science Education Expert Panel), with several different criteria and several subpanels. Subject matter experts and users determined the quality of the materials. Evaluation experts commented on only one of the criteria—the extent to which the program made a measurable difference in student learning. Programs received overall ratings of promising or exemplary. These panels were phased out recently, but new panels for determining quality programs are being formed.

As an exemplar of evolution at the federal level, Table 31.1 outlines the major activities of the Science Education Directorate at the National Science Foundation, along with evaluation emphases. The accompanying Fig. 31–1 shows the history of funding at the National Science Foundation by directorate. Figure 31–1 clearly shows the changing levels of funding supporting the Education Directorate activities described in Table 31.1 and the proportion of the funding that was allotted to education.

In the 1960s, in the wake of Sputnik, the National Science Foundation began to focus on curriculum development. People believed that the United States would win the race for space if our children had better science and mathematics curricula. Many different curricular projects were undertaken. Evaluation concentrated on the effectiveness of these curricula in helping students learn science.

In the 1970s curriculum development continued, but the emphasis shifted toward how to get these new curricula implemented in the schools. Evaluation was focused on delivery systems and accomplishing change within classrooms, schools, and districts. There was also an emphasis on enhancing teachers’ science knowledge so that they would be better prepared to deliver the new curricula. Evaluations of these teacher institutes focused on perceived quality. The “Man: A Course of Study” curriculum caused a great furor across the country. Many people did not believe their children should be studying about different cultural practices, such as Eskimo elders going out on the ice to die. As shown in Fig. 31–1, funding for the Science Education Directorate was essentially wiped out by the political aftermath. It was rebuilt, however, through funding of smaller local programs, often summer institutes, designed to enhance teacher understanding of science and mathematics and teacher pedagogical skills. The small local nature of the programs guaranteed local acceptance. Evaluation of these programs was individualized to the needs of the program and their stakeholders. This time also witnessed a growth in commitment to diversity in the pool of science and mathematics professionals. Evaluation of these programs focused more on social activism and facilitation of movement across power barriers.

TABLE 31.1
History of NSF Education Directorate Funding Initiatives and Science Education Program Evaluation

Each entry lists the decade, the major NSF Education Directorate funding initiatives, and the science education program evaluation emphases of that period.

1960s. Funding initiatives: Curriculum development. Evaluation: Focused on improving the individual curricula being developed. National Assessment of Educational Progress begins.

1970s. Funding initiatives: Comprehensive curriculum implementation and teacher institutes. Evaluation: Extent of implementation and gains in teacher content knowledge in individual projects. An Office of Program Integration related to evaluation of programs was established. First international science study and first status survey of science teaching.

1980s. Funding initiatives: Precipitous drop in funding and then rebuilding; funding of small and varied projects in graduate and undergraduate education and K–12 teacher inservice. Evaluation: Evaluation tied to individual projects. Focus on science teacher perceptions of professional development. Status studies continue.

1990s. Funding initiatives: Systemic initiatives. Evaluation: Comprehensive evaluations of many aspects of projects. Centralized requirements for projects to meet program evaluation needs. An office of evaluation was formed and expanded into the Division of Research, Evaluation and Dissemination. Government Performance and Results Act required agency accountability.

2000s. Funding initiatives: Partnerships. Evaluation: Evaluations focusing on K–12 student achievement and institutional climate. RFPs to conduct program evaluations, provide expert assistance to large project evaluators, expand evaluation capacity, and conduct research about evaluation.

The 1990s were characterized by systemic initiatives. The idea was that all parts of a system needed to be focused on the same goals in order to achieve success. The systemics included statewide programs, urban programs, and local programs. Evaluation was much more complex and assessed how to change cultures as well as interactions and what results those changes might produce. This produced the beginnings of national databases to track the status information, centralized or pooled approaches to conducting evaluations, and the realization that this sort of evaluation takes a good deal of time and money. The systemics met with mixed success, and in particular it was difficult for the larger initiatives to “go to scale.” The systemics were successful in some ways or at some locations, but that success did not seem to spread across the entire system.


FIGURE 31–1. Funding levels for the NSF directorates over time (in millions of constant FY 2003 dollars).

The present emphasis in funding is partnerships. These focus on changing various institutions so that they will better interact with others. Evaluation of partnerships is complex, as it was for the systemics, but the approaches are more restricted in some ways. There is a heavy emphasis on accountability and direct ties to state-based testing systems. At the same time, measuring organizational change and promoting interaction are recognized as complex tasks, and several technical assistance evaluation projects are being funded to assist the partnerships with their evaluations.

Today science education evaluation is quite complex. It functions at both the individual project and the state or national program level. The terms project and program are often used by federal or state funding agencies and their evaluators in a distinct way. The term program is used to mean the overall funding initiative across the state or nation, and the term project is used to mean the sites that were funded. Therefore program evaluation would be of all of the projects related to or funded under a particular plan or funding initiative. Project evaluation is a smaller and more coherent endeavor with fewer categories of stakeholders. Take, for example, the NSF materials development program, which funds several individual curriculum development projects. Each project would be responsible for conducting an evaluation of itself. The program evaluation would examine the value of all of the projects as a set. Projects can range in size from a single school's attempts to improve its science curriculum to a large math science partnership working with several school districts and multiple institutions, including higher education, informal science settings, and business and industry. With these sorts of large projects the distinctions between program and project evaluations blur somewhat. Sometimes smaller types of project evaluations can be combined in a technique called cluster evaluation, where a cluster of projects work together to obtain evaluation help and comparable data that might be used in a program evaluation (Barley & Jenness, 1993).

Project evaluations are generally quite varied and unique to the specific project. They can follow diverse philosophies and use the full range of evaluation methods, so they are difficult to characterize. The following examples help to illustrate the range of project evaluations. One example is the evaluation of the teaching and learning of Hispanic students in a solar energy science curriculum, which showed that the approach increased the student retention rate (Hadi-Tabassum, 1999). Another example is a school district that was concerned about the quality of its K–12 science curriculum. The evaluation consisted of a series of focus group sessions with parents; elementary, middle, and high school teachers; and middle school, high school, and recently graduated students. The focus groups discussed visions for excellence and strengths and limitations of the existing curriculum, which revealed a need for stronger communication about goals and closer articulation across schools and grade levels (Huffman & Lawrenz, 2003). Another example is the Desimone, Porter, Garet, Yoon, and Birman (2002) evaluation of the effects of science and mathematics teacher professional development. That evaluation showed that professional development focused on specific instructional practices increases teachers' use of those practices in the classroom. Furthermore, specific features, such as active learning opportunities, increased the effect. It is clear from these examples that the range of issues, styles, and goals is very broad.

The Science Education Directorate at the National Science Foundation provides an exemplar of the current state of affairs in program evaluation. Evaluations being funded can be categorized along a continuum based on the level of participation of the projects: exterior evaluation, centrally prescribed evaluation, consensus evaluation, and pooled evaluation (Lawrenz & Huffman, 2003). At one end of the continuum are evaluations conducted by an entity separate from the projects, with the external entity collecting the data and making the decisions to address the needs of NSF. This type of evaluation is exemplified by the evaluation of the Advanced Technological Education (ATE) program. The evaluation is funded through a different division of NSF than the program and employs its own instruments and collects its own data independently from the funded projects and centers. It is directly tied to the needs and questions of NSF through matching of survey and site visit instruments to a logic model. This evaluation project produces yearly reports, which are posted on the program evaluation web site (Hanssen, Gullickson, & Lawrenz, 2003). The results have shown that the projects engage in a wide variety of activities related to technician education, that significant amounts of cost sharing have been provided to the projects, that a random subset of the materials developed by the projects is perceived by external experts as of good quality, and that the number of technicians produced has increased.

Another example of an external program evaluation is COSMOS Corporation's evaluation of the Statewide Systemic Initiatives (SSI) (Yin, 2002). This evaluation examines the status of statewide reform in seven states and one territory through intensive case studies that provide an inventory of district policies and curriculum practices and an analysis of student achievement trends over time. The interim findings suggest that across the eight sites much reform progress occurred only in the late 1990s and is still occurring. Interestingly, it appears that school finance rulings by state supreme courts may be an important reform influence. Successful reform processes do not necessarily follow a top-down format. The case studies show that successful states may need to make a sustained commitment to the reform process. Finally, it appears that SSIs can pursue both catalytic and direct service roles in support of reform.

Moving on toward the middle of the continuum are program evaluations with mandated procedures that each of the sites must follow but which allow the sites to collect their own data and turn it in to the central external evaluator. These are exemplified by the Local Systemic Change (LSC) program evaluation. In that program each funded project is required to gather specific information with the use of pre-designed evaluation instruments. Each project may also add its own evaluation components. The data from all of the different projects are synthesized into program evaluation reports. An extension of this type of evaluation is the status data collection required of projects in several NSF education programs. This status information includes information such as numbers and characteristics of participants.

The LSC evaluation reports are provided yearly (Weiss, Banilower, Crawford, & Overstreet, 2003). The outcomes for 2003, like those for other years, show mixed results. Thirty-five percent of the teachers participating nationally rated the LSC professional development they received as excellent or very good. Professional development sessions received high ratings for the appropriateness of the mathematics/science content, for the climate of respect and collegial interaction among participants, and for encouraging active participation. Weaknesses include questioning aimed at conceptual understanding and providing adequate wrap-up. Survey data show significant positive impact on teacher attitudes and beliefs about mathematics/science education. In addition, participants are becoming more confident in their content knowledge and more likely to use standards-based instructional strategies. Classroom observations show that the quality of the lessons taught improved with increased participation in LSC activities.

Toward the participatory end of the continuum are evaluations where the projects determine the evaluation procedures and what data to collect. The Collaboratives for Excellence in Teacher Preparation (CETP) program evaluation is an example. The CETP program evaluation is one where sites collect some similar data, using centrally developed instruments. The procedures and instruments were developed by the projects, and projects decide which data they wish to provide. The program evaluation or core team provides leadership, a communication hub, instruments, data analysis, and incentives for collecting core data. Evaluation reports are provided yearly (Lawrenz, Michlin, Appeldoorn, & Hwang, 2003). The results show that the CETP projects have had a positive impact on the establishment and institutionalization of reformed courses and on interactions within and among STEM and education schools and K–12 schools. Over time, it appears that all higher education classes used more standards-based instructional strategies. Results from the K–12 classrooms show that CETP and non-CETP teachers were reporting the same frequencies of use of instructional strategies in their classrooms. Students, however, reported that their CETP teachers more frequently used real-world problems, technology, and more complicated problems in their teaching. Students of non-CETP teachers, on the other hand, were more likely to report doing activities involving writing, making presentations, and using portfolios. Moreover, external observers rated CETP classes higher.

MODELS FOR SCIENCE EDUCATION PROGRAM EVALUATION

Stufflebeam (2001) provides a descriptive and evaluative review of the different evaluation models that have been used over the past 40 years. He describes 22 different approaches and then recommends nine for continued use. He bases this recommendation on how well these approaches meet the Program Evaluation Standards of utility, feasibility, propriety, and accuracy. These nine include three improvement or accountability-oriented approaches, four social agenda or advocacy-oriented approaches, and two method-oriented approaches. The models are defined below and listed in Table 31.2, along with an example of how each could be operationalized in science education.

The three accountability-oriented approaches are Decision/Accountability, Consumer Orientation, and Accreditation. Decision/Accountability evaluation provides information that can be used to help improve a program as well as to judge its merit and worth (Stufflebeam, 1971). Consumer Orientation evaluation provides conclusions about the various aspects of the quality of the objects being evaluated, so that the consumers will know what will be of use in their situations (Scriven, 1974). Accreditation evaluation studies institutions, institutional programs, and personnel to determine the fit with requirements of a given area and what needs to be changed in order to meet these requirements.

TABLE 31.2
Science Education Examples of Evaluation Models

Each entry lists a model of evaluation followed by a science education example.

Decision/accountability: Determining the strengths and weaknesses of a science teacher training program to make decisions about what to do in the coming year

Consumer-oriented: Rating all of the existing high school science curricula using a specific set of criteria

Accreditation: Certifying that a middle school's science program was acceptable

Utilization-focused: Providing a timely report to a school district contrasting the two science curricula they were considering, using criteria they felt were important

Client-centered/responsive: Working with a school district as they develop a new science curriculum, gathering different data about different questions as needs evolve

Deliberative democratic: Having all of the science teachers in the school district debate and vote on the evaluation questions, the data, and the interpretation

Constructivist: Providing descriptions of the different perspectives different groups of science teachers have of a new assessment procedure

Case study: Providing a school board with an in-depth description of the AP chemistry class

Outcome/value-added assessment: Looking at school-level student science assessment results over time and examining the change of slope in schools using a new science program

The four social agenda approaches are Utilization-Focused, Client-centered/Responsive, Deliberative Democratic, and Constructivist. Utilization-Focused evaluation is a process for making choices about an evaluation study in collaboration with a targeted group of priority users, in order to focus effectively on their intended uses of the evaluation (Patton, 2000). Client-centered evaluation requires that the evaluator interact continuously with the various stakeholders or clients and be responsive to their needs (Stake, 1983). Deliberative Democratic evaluation operates within a framework where democratic principles are used to reach conclusions (House & Howe, 2000). Constructivist evaluation works within a subjectivist framework and requires that the evaluators advocate for all participants, particularly the disenfranchised, to help emancipate and empower everyone (Guba & Lincoln, 1989; Fetterman, 2001).

The two method-oriented approaches are Case Study and Outcome/Value Added Assessment. Case study evaluations are in-depth, holistic descriptions and analyses of evaluands (Merriam, 1998). Outcome/Value Added Assessment evaluation involves determining what changes have occurred in the patterns of data collected over time as the result of program or policy changes (Sanders & Horn, 1994).
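To make the outcome/value-added idea concrete, the brief sketch below is a hypothetical illustration, not drawn from any evaluation discussed in this chapter: it fits a simple segmented regression to fabricated yearly school mean scores and estimates the change in slope after a program is adopted. It assumes Python with the numpy, pandas, and statsmodels libraries, and all variable names and values are invented.

```python
# A minimal, hypothetical sketch of an outcome/value-added style analysis:
# a segmented regression estimating the change in the slope of yearly mean
# science scores after a (fabricated) program is adopted.
# Assumes numpy, pandas, and statsmodels; all names and values are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

years = np.arange(1994, 2004)              # ten years of fabricated data
program_start = 1999                       # hypothetical adoption year

post = (years >= program_start).astype(int)            # 1 after adoption
years_since = np.where(post == 1, years - program_start, 0)

# Fabricated scores: a gentle pre-adoption trend plus a steeper post-adoption trend.
scores = (140 + 0.5 * (years - years[0])
          + 1.5 * years_since
          + rng.normal(0, 1.0, len(years)))

df = pd.DataFrame({
    "year_centered": years - years[0],
    "post": post,
    "years_since": years_since,
    "score": scores,
})

# The coefficient on years_since estimates the post-adoption change in slope;
# the coefficient on post estimates any immediate level shift.
model = smf.ols("score ~ year_centered + post + years_since", data=df).fit()
print(model.params)
```

In an actual evaluation the analogous coefficients would be estimated from district or state assessment records, and the plausibility of a simple linear trend, along with cohort and test-form changes, would need to be examined.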

Stufflebeam's categorization of evaluation models is in addition to other models that have been proposed for the evaluation of science education. Early models were provided by Welch (1979a,b), Welch (1985), and Knapp, Shields, St. John, Zucker, and Stearns (1988). A more recent model was provided by Altschuld and Kumar in the 1995 issue of New Directions for Program Evaluation edited by Rita O'Sullivan. Their model is a synthesis of models of evaluation applied to science programs before 1994. They review several different types of science evaluation at what they call the micro developmental or formative level and the macro system or contextual level. They then provide a model with the main stages of program or product development, defined as Need, Conceptualization, Development, Tryout, Formal Use, and Long-Term Use and Impact. These stages are informed by contextual and supportive factors and intermediate outcomes. Their intent is that the main stages are embedded in specific contexts and that the contexts will have profound effects on the implementation and results of the stages. Synthesis of the stages and factors results in an evaluation of overall effectiveness. They say, “Carefully evaluating development, studying process variables, evaluating outcomes along the way rather than just at the end of product development, and analyzing supportive and contextual variables generates a comprehensive understanding of the overall effectiveness of science education programs and to a degree, the interface between levels” (p. 13).

In 2001 the National Research Council's Committee on Understanding the Influence of Standards in K–12 Science, Mathematics and Technology Education presented A Framework for Research in Mathematics, Science and Technology Education (NRC, 2001). The committee's Framework used two main questions: How has the system responded to the introduction of nationally developed mathematics, science, and technology standards? and What are the consequences for student learning? The Framework “provides conceptual guideposts for those attempting to trace the influence of nationally developed mathematics, science, and technology standards and to gauge the magnitude or direction of that influence on the education system, on teachers and teaching practice and on student learning” (p. 3). Undergirding the first question are contextual forces such as politicians and policy makers, the public, business and industry members, and professional organizations. These forces are viewed as operating through channels of influence, including curriculum, teacher development, and assessment and accountability. The context funneled through the channels results in teachers and teaching practice in the classroom and the school context, which ultimately leads to student learning. The Framework goes on to provide examples of hypothetical studies that could address the evaluation questions. In conjunction with this, the Council of Chief State School Officers (1997) has a Tool Kit for evaluating the development and implementation of standards.

METHODS FOR SCIENCE EDUCATION PROGRAM EVALUATION

Despite the differences between research and evaluation, they use similar methodologies and are subject to the considerations of rigor applied to all forms of disciplined inquiry. There is an ongoing debate about what constitutes rigor, which was highlighted by the National Research Council (2003) report Scientific Research in Education. The report articulates the nature of scientific research in education and offers a framework for the future of a federal educational research agency charged with supporting high-quality scientific work. The report considers several different methodological approaches to research, and consequently evaluation, with an emphasis on rigor and on matching methods to questions. Many of the research questions in education are evaluation-oriented, such as whether a new curriculum is better than the one in place, what is happening, or how it is happening. A special class of questions concerns causality; the NRC report suggests these are best answered by randomized experiments. Various researchers have raised several issues, such as how to establish a culture of rigor and how to educate professionals to operate within such a culture (Pellegrino & Goldman, 2002). Additionally, issues such as random assignment, the uniqueness of each site, and the potential endorsement of an evidence-based social engineering approach to educational improvement need to be considered (Berliner, 2002; Erickson & Gutierrez, 2002).

There is a good deal of information available on what constitutes rigor in quantitative evaluations. There are fairly clear guidelines on how to calculate appropriate sample sizes, which statistical tests are appropriate for what types of data, and what constitutes a meaningful result. The guidelines for rigor in more interpretive or qualitative evaluation are no less strong, but they are different. Lincoln and Guba (1985) provide an informative contrast of terms related to rigor from the quantitative and qualitative perspectives. They say that in quantitative work there is validity as determined by an evaluation's internal validity, external validity, reliability, and objectivity. In contrast, in qualitative work there is trustworthiness, as determined by an evaluation's credibility, transferability, dependability, and confirmability. Anfara, Brown, and Mangione (2002) suggest making the qualitative research process more public to help ensure rigor. They discuss different techniques that would help to increase trustworthiness (e.g., triangulation, member checks, prolonged field work) and go on to suggest that all qualitative research (or evaluation) should include documentation tables that show how the different techniques were used. Suggestions include specifically linking interview questions to research questions, providing tables of how individual codings of narratives were synthesized, and having matrices of triangulation showing findings and sources.
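As one small illustration of such a guideline, the sketch below computes an approximate sample size for a two-group comparison. The inputs (a medium standardized effect size, alpha of .05, and power of .80) are arbitrary choices for illustration, and the Python statsmodels library is assumed to be available.

```python
# A minimal sketch of a sample-size (power) calculation for a two-group
# comparison, assuming the statsmodels library. The inputs are illustrative,
# not recommendations: a medium standardized effect (d = 0.5),
# alpha = .05, and 80% power.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Approximate participants needed per group: {n_per_group:.0f}")
# With these inputs the estimate is roughly 64 per group.
```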

Another rich methodological source for science education program evaluation is design experiments. This type of evaluation attempts to support arguments constructed around the results of active innovation and intervention in classrooms (Kelly, 2003). It is aimed at understanding learning and teaching processes when the researcher or evaluator is active as an educator. This approach ties into the participatory evaluation literature as well as the evaluation capacity-building literature, because the people involved in the innovation are the evaluators, and through the experience they gain expertise, which will enhance the educational enterprise (Stockdill, Baizerman, & Compton, 2002). It also dovetails with the action research or teacher-as-researcher movements.

Most science education evaluations employ mixed methods. In other words, a variety of data-gathering and interpretation techniques are incorporated into a single evaluation. The issues involved in mixing methods are complex because the methods are often embedded in an overarching philosophy that informs how the method should be interpreted. For example, Greene and Caracelli (1997) suggest three stances to mixing methods: purist, pragmatic, and dialectical. The purist stance uses methods embedded within a philosophical paradigm. The pragmatic stance puts methods together in ways that produce an evaluation result that is the most useful to the stakeholders in the evaluation. The dialectical stance is synergistic in that it plays the different methodologies off against each other to produce an evaluation that transcends any of the individual methods. Caracelli and Greene (1997) go on to discuss how these different stances can be formulated into different mixed method evaluation designs. These include component designs where different methods could be used to triangulate findings, to complement findings from another more dominant method, or to address different aspects of the science education program being evaluated. There are also integrated, mixed-method designs where the use of different methods could be iterative, nested, holistic, or transformative (giving primacy to the value and action oriented aspects of the program). Lawrenz and Huffman (2003) combined these ideas into another mixed-method model they termed the archipelago approach.

The mixed approaches to evaluation are grounded in methodological advances across the qualitative and quantitative continuum. There have been significant advances in quantitative analyses, especially in the area of modeling, such as linear models (Moore, 2002), hierarchical models (Bryk & Raudenbush, 1992), longitudinal models (Moskowitz & Hershberger, 2002), and structural equation modeling (Maruyama, 1998). There has also been significant work in measurement and sampling theory, including matrix sampling and Rasch modeling (Wright, 1979). In the middle of the quantitative-qualitative continuum there have been advances in survey design to encompass the new capabilities inherent in web-based settings (Dillman, 2002). On the qualitative side there have been new approaches and new insights, including, for example, interpretive interactionism (Denzin, 2001) and interpreting the unsayable (Budick, 1989).
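As a brief, hypothetical illustration of the hierarchical modeling mentioned above, the sketch below fits a two-level model with students nested in schools to fabricated data. The variable names (pretest, posttest, treated, school) are invented for this example, and Python with the numpy, pandas, and statsmodels libraries is assumed.

```python
# A minimal, hypothetical sketch of a two-level hierarchical (mixed-effects)
# model: students nested within schools, with a random intercept for school.
# The data and variable names are fabricated; assumes numpy, pandas, statsmodels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_students = 20, 25

school = np.repeat(np.arange(n_schools), n_students)
treated = np.repeat(rng.integers(0, 2, n_schools), n_students)   # school-level program flag
school_effect = np.repeat(rng.normal(0, 2.0, n_schools), n_students)
pretest = rng.normal(50, 10, n_schools * n_students)
posttest = (10 + 0.8 * pretest + 3.0 * treated + school_effect
            + rng.normal(0, 5.0, n_schools * n_students))

df = pd.DataFrame({"school": school, "treated": treated,
                   "pretest": pretest, "posttest": posttest})

# Fixed effects for pretest and program participation; random intercept by school.
model = smf.mixedlm("posttest ~ pretest + treated", data=df, groups=df["school"])
result = model.fit()
print(result.summary())
```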

Table 31.3 lists some science education areas, selected methods that could be used, and the evaluation questions each method would allow the evaluator to answer. The intent is to show the direct tie between the evaluation questions and the methods. Methods are not good or bad (assuming they are implemented proficiently), but they can only be used to answer specific types of questions.

TABLE 31.3
Science Education Area by Evaluation Method and Questions

Each entry lists a science education area, selected evaluation methods that could be used, and the questions each method addresses.

A science education program focused on institutional culture
- Case study: How can the nature of the institution be described?
- Opinion surveys, with retrospective items, of institutional members and of persons from other institutions who interact with the primary institution: What do people within the institution think the culture is? How do they think it has changed? What do people who interact with the institution think the culture is? How do they think it has changed?
- Pre and post observations by external experts or participant observers: How do observers characterize the culture of the institution before and after the program?
- Artifact analysis of policies, procedures, and public statements during the course of the program: What changes have occurred in the policies, procedures, and expressed public image during the program?

Effect of a teacher development workshop on participating teachers
- Pre and post testing of content knowledge, attitudes, and teaching philosophies: What immediate changes have occurred in teacher content knowledge, attitudes, and teaching philosophies?
- Phenomenological study: How do the teachers perceive the lived experience of the workshop?
- Observations of the workshop by external experts or participant observers: What are observers' opinions of the quality of the workshop?

Effect of a new curriculum on the science classroom environment
- Ethnography: What is the culture of the classroom and how is it evolving?
- Pre and post assessment of student perceptions of the classroom environment: How do students perceive the classroom environment before and after using the curriculum?
- Observations of the classroom by experts: What are observers' opinions of the characteristics of the classroom environment?
- Discourse analysis: What verbal interaction patterns are heard within the classroom and how do these reflect on the classroom environment?

Effect of standards-based science instruction on students throughout a district
- Phenomenological studies: How do selected students view the lived experience of participating in standards-based instruction?
- Application of HLM analyses: Which student variables are predictive of student achievement and attitude, and how much do classrooms and schools contribute to this relationship?
- Value-added analysis of student scores over time: What changes have occurred in the longitudinal patterns of student achievement and attitudes across the district since the implementation of the new science instruction?

CONCLUSION

This review has documented the significant growth experienced by the field of science program evaluation since its solidification in the early 1960s. Science program evaluation is also shown as closely tied to political agendas through public and private funding initiatives. The literature reveals that most published work has focused on practical applications and descriptions of approaches. Philosophical underpinnings have become more clearly articulated, and diverse models and approaches have been developed and implemented. The variety of methods available to use has expanded, and connections between methods and questions have become more explicit. Despite this growth there has been very little research on science program evaluation practices per se. The work has been mainly theoretical. The literature provides extensive examples of procedures and approaches and suggestions of when to use them, but little direct research about which of these might be more effective.

Research comparing the strengths and weaknesses of different science program evaluation approaches and methods would be beneficial to the field. This type of research, however, would be expensive because more than one evaluation would have to be funded for any project. Research comparing different approaches would also have to be carried out in multiple settings, such as different grade levels, different content areas, or different types of students. Moreover, as is clear from the existing theoretical and practical work, different types of evaluation provide different types of information. This information might be differentially valuable to different stakeholders in determining the merit or worth of a science education program. Therefore the value of science program evaluation approaches would have to be determined in terms of the needs and opinions of different stakeholders. One possible cost-effective manner of addressing some of these issues would be through consideration of the evaluation results of similar programs obtained through different methods. This idea is similar to the one suggested by the National Research Council committee to assess standards-based reform. However, those recommendations are related to using evaluation results to understand standards-based reform, not to answering questions about how to conduct evaluation. The distinction between researching the value of evaluation and researching the value of the programs being evaluated needs to be kept in mind.

Experience and expertise in science education program evaluation is growing but is still scarce. Educational programs designed to provide qualified science education program evaluators should be continued and perhaps expanded. Increasing the capacity of the evaluation field to deal with science program evaluation has pursued several avenues, such as direct grants, indirect training programs, graduate-level programs, short courses, and intensive workshops. Additionally, although there has been work identifying the essential competencies required of an evaluator, there is no clear indication of what skills might be expressly needed for science program evaluation (King, Stevahn, Ghere, & Minnema, 2001). Furthermore, there is little evidence available about which sort of educational programs would produce the best science program evaluators. Capacity building then is another fruitful area for research. Currently the Research Evaluation and Communication division of NSF has a yearly competition for proposals to increase the capacity of science program evaluators. Coordinating and pooling evidence from these projects might provide valuable information about improving capacity.

The field of science education program evaluation continues to mature and expand. At present evaluations encompass a variety of methodologies, underlying social values, and philosophies. Recent emphases in funding tend toward large projects that require complex evaluations. It is likely that new techniques and devices designed to evaluate system-wide reforms, partnerships, and collaborations will be developed to meet this need. The diversity within science program evaluation contributes to a rich literature and the opportunity for the field to debate and discuss issues and perspectives. These interactions provide fertile ground for the field of science education evaluation to grow and evolve. Without the various perspectives the field could easily become sterile and barren.

The new federal emphasis on accountability, specifically student assessment, may significantly narrow the diversity of methods and perspectives existing in science education evaluation. Currently evaluations use a variety of information to judge program value. Student achievement, as defined by a score on a particular test, is only one of many valuable outcomes hoped for in science education. Science education evaluations are designed to serve the many stakeholders involved in the object being evaluated. This responsiveness, evaluating a program so that it addresses the needs and ideals of all people concerned about or affected by it, appears to be the process most likely to produce the most valid indication of the value of science education programs. Science educators should advocate for diversity of perspectives and methods, as well as high quality and rigor, in evaluations of science education projects.

ACKNOWLEDGMENTS

Thanks to James Altschuld and William Boone, who reviewed this chapter.

REFERENCES

Altschuld, J. W., & Kumar, D. (1995). Program evaluation in science education: The model perspective. In R. O'Sullivan (Vol. Ed.), New directions for program evaluation: No. 35. Emerging roles of evaluation in science education reform (Spring, pp. 5–18). San Francisco: Jossey-Bass.

Anfara, V. A., Brown, K. M., & Mangione, T. L. (2002, October). Qualitative analysis on stage: Making the research process more public. Educational Researcher, 31(7), 28–38.

Barley, Z. A., & Jenness, M. (1993, June). Cluster evaluation: A method to strengthen evaluation in smaller programs with similar purposes. Evaluation Practice, 14(2), 141–147.

Berliner, D. C. (2002, November). Educational research: The hardest science of all. Educational Researcher, 31(8), 18–20.

Budick, S., & Iser, W. (Eds.). (1989). Languages of the unsayable. New York: Columbia University Press.

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage.

Caracelli, V. J., & Greene, J. C. (1997). Crafting mixed-method evaluation designs. In J. C. Greene & V. J. Caracelli (Vol. Eds.), New directions for evaluation: No. 74. Advances in mixed-method evaluation: The challenges and benefits of integrating diverse paradigms (Summer, pp. 19–32). San Francisco: Jossey-Bass.

Cook, N. R., Dwyer, M. C., & Stalford, C. (1991). Evaluation and validation: A look at the program effectiveness panel. New Hampshire: U.S. Government (ERIC Document Reproduction Service no. ED333045).

Council of Chief State School Officers. (1997). Tool kit: Evaluating the development and implementation of standards. Washington, DC: Author.

Cousins, J. B., & Whitmore, E. (1998). Framing participatory evaluation. In E. Whitmore (Vol. Ed.), New directions for evaluation: No. 80. Understanding and practicing participatory evaluation (Winter, pp. 5–24). San Francisco: Jossey-Bass.

Denzin, N. K. (2001). Interpretive interactionism (2nd ed.). Thousand Oaks, CA: Sage.

Desimone, L., Porter, A., Garet, M., Yoon, K., & Birman, B. (2002). Effects of professional development on teachers’ instruction: Results from a three-year longitudinal study. Educational Evaluation and Policy Analysis, 24(2), 81–112.

Dillman, D. A. (2002). Mail and internet surveys: The tailored design method (2nd ed.). New York: John Wiley & Sons.

Doran, R., Lawrenz, F., & Helgeson, S. (1994). Research on assessment in science. In D. Gabel (Ed.), Handbook of research on science teaching and learning (pp. 388–442). New York: Macmillan.

Educational Research, Development, Dissemination, and Improvement Act of 1994. Public Law 103-227, H.R. 856, 103rd Congress, March 31, 1994.

Elementary and Secondary Education Act of 1965. Public Law 89-10, 89th Congress, 1st Session, April 11, 1965.

Erickson, F., & Gutierrez, K. (2002, November). Culture, rigor, and science in educational research. Educational Researcher, 31(8), 21–24.

Fetterman, D. M. (1994). Empowerment evaluation. Evaluation Practice, 15(1), 1–15.

Fetterman, D. M. (2001). Foundations of empowerment evaluation. Thousand Oaks, CA: Sage.

Finley, F., Heller, P., & Lawrenz, F. (1990). Review of research in science education. Pittsburgh: Science Education.

Fitzpatrick, J. L., Sanders, J. R., & Worthen, B. R. (2003). Program evaluation: Alternative approaches and practical guidelines (3rd ed.). New York: Pearson Allyn & Bacon.

Frechtling, J. (2002). User friendly handbook for project evaluations. Prepared under contract REC99-12175. Arlington, VA: National Science Foundation, Directorate for Education and Human Resources, Division of Research, Evaluation and Communication.

Greene, J. C., & Abma, T. A. (Eds.). (2001). Editor's notes. In New directions for evaluation: No. 92. Responsive evaluation (Winter, pp. 1–5). San Francisco: Jossey-Bass.

Greene, J. C., & Caracelli, V. J. (1997). Defining and describing the paradigm issue in mixed-method evaluation. In J. C. Greene & V. J. Caracelli (Vol. Eds.), New directions for evaluation: No. 74. Advances in mixed-method evaluation: The challenges and benefits of integrating diverse paradigms (Summer, pp. 5–18). San Francisco: Jossey-Bass.

Guba, E. G., & Lincoln, Y. S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.

Hadi-Tabassum, S. (1999). Assessing students’ attitudes and achievements in a multicultural and multilingual science classroom. Multicultural Education, 7(2), 15–20.

Hanssen, C., Gullickson, A., & Lawrenz, F. (2003). Assessing the impact and effectiveness of the Advanced Technological Education (ATE) Program. Kalamazoo, MI: The Evaluation Center.

House, E. R. (1983). Assumptions underlying evaluation models. In G. F. Madaus, M. Scriven, & D. L. Stufflebeam (Eds.), Evaluation models. Boston, MA: Kluwer-Nijhoff.

House, E. R., & Howe, K. R. (2000). Deliberative democratic evaluation in practice. In D. L. Stufflebeam, G. F. Madaus, & T. Kellaghan (Eds.), Evaluation models: Viewpoints on educational and human services evaluation (2nd ed.). Boston, MA: Kluwer Academic.

Huberman, A. M., & Miles, M. B. (Eds.). (2002). The qualitative researcher's companion. Thousand Oaks, CA: Sage.

Huffman, D., & Lawrenz, F. (2003). Vision of science education as a catalyst for reform. Journal for Elementary/Middle Level Science Teachers, 36(2), 14–22.

Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of educational programs, projects and materials (1st ed.). New York: McGraw-Hill.

Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards: How to assess evaluations of educational programs (2nd ed.). Thousand Oaks, CA: Sage.

Kelly, A. E. (2003). Research as design. Educational Researcher, 32(1), 3–4.

King, J., Stevahn, L., Ghere, G., & Minnema, J. (2001). Toward a taxonomy of essential evaluator competencies. American Journal of Evaluation, 22(2), 229–247.

Knapp, M. S., Shields, P. M., St. John, M., Zucker, A. A., & Stearns, M. S. (1988). Recommendations to the National Science Foundation. An approach to assessing initiatives in science. (ERIC Document Reproduction Service no. ED299145). Menlo Park, CA: SRI International.

Lawrenz, F., & Huffman, D. (2002). The archipelago approach to mixed method evaluation. American Journal of Evaluation, 23(3), 331–338.

Lawrenz, F., & Huffman, D. (2003). How can multi-site evaluations be participatory? American Journal of Evaluation, 24(4), 331–338.

Lawrenz, F., & Jeong, I. (1993). Science and mathematics curricula. In Indicators of science and mathematics education. Washington, DC: National Science Foundation.

Lawrenz, F., Michlin, M., Appeldoorn, K., & Hwang, E. (2003). CETP core evaluation: 2001–2002 results. Minneapolis, MN: Center for Applied Research, University of Minnesota.

Lawrenz, F., Weiss, I., & Queitzsch, M. (1996). The K–12 learning environment. In Indicators of science and mathematics education. Washington, DC: National Science Foundation.

Lincoln, Y., & Guba, E. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.

Mark, M. M., & Shotland, R. L. (1985). Stakeholder-based evaluation and value judgments: The role of perceived power and legitimacy in the selection of stakeholder groups. Evaluation Review, 9, 605–626.

Maruyama, G. (1998). Basic structural equation modeling. Thousand Oaks, CA: Sage.

McTaggart, R. (1991b). When democratic evaluation doesn't seem democratic. Evaluation Practice, 12(1), 168–187.

Merriam, S. B. (1998). Qualitative research and case study applications in education (rev. ed.). San Francisco: Jossey-Bass.

Moore, D. S. (2002). The basic practice of statistics (2nd ed.). New York: W. H. Freeman.

Moskowitz, D. A., & Hershberger, S. L. (Eds.). (2002). Modeling intraindividual variability with repeated measures data. Mahwah, NJ: Lawrence Erlbaum Associates.

National Research Council. (2001). I. Weiss, M. Knapp, K. Hollweg, & G. Burrill (Eds.), Investigating the influence of standards: A framework for research in mathematics, science, and technology education. Washington, DC: Committee on Understanding the Influence of Standards in K–12 Science, Mathematics, and Technology Education.

National Research Council. (2002). R. J. Shavelson & L. Towne (Eds.), Scientific research in education. Washington, DC: Committee on Scientific Principles of Educational Research, National Academy Press.

Newmann, F. M. (1996). Authentic achievement: Restructuring schools for intellectual quality (1st ed.). San Francisco: Jossey-Bass.

No Child Left Behind Act of 2001. Public Law 107-110. H.R. 1. 107th Congress, 2nd Session (2001).

Patton, M. Q. (1978). Utilization-focused evaluation (1st ed.). Beverly Hills, CA: Sage.

Patton, M. Q. (1986). Utilization-focused evaluation (2nd ed.). Beverly Hills, CA: Sage.

Patton, M. Q. (1994). Developmental evaluation. Evaluation Practice, 15(3), 311–319.

Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (2000). Utilization-focused evaluation. In D. L. Stufflebeam, G. F. Madaus, & T. Kellaghan (Eds.). Evaluation models: Viewpoints on educational and human services evaluation (2nd ed.). Boston: Kluwer Academic.

Pellegrino, J. W., & Goldman, S. R. (2002, November). Be careful what you wish for—you may get it: Educational research in the spotlight. Educational Researcher, 31(8), 15–17.

Rice, J. M. (1898). The rational spelling book. New York, NY: American Book Company.

Rossi, P. H., & Freeman, H. E. (1985). Evaluation: A systematic approach (3rd ed.). Beverly Hills, CA: Sage.

Rossi, P. H., & Freeman, H. E. (1989). Evaluation: A systematic approach (4th ed.). Newbury Park, CA: Sage.

Rossi, P. H., & Freeman, H. E. (1993). Evaluation: A systematic approach (5th ed.). Newbury Park, CA: Sage.

Rossi, P. H., Freeman, H. E., & Lipsey, M. W. (1999). Evaluation: A systematic approach (6th ed.). Thousand Oaks, CA: Sage.

Rossi, P. H., Freeman, H. E., & Rosenbaum, S. (1982). Evaluation: A systematic approach (2nd ed.). Beverly Hills, CA: Sage.

Rossi, P. H., Freeman, H. E., & Wright, S. R. (1979). Evaluation: A systematic approach (1st ed.). Beverly Hills, CA: Sage.

Sanders, W. L., & Horn, S. P. (1994). The Tennessee value-added assessment system (TVAAS): Mixed model methodology in educational assessment. Journal of Personnel Evaluation in Education, 8(3), 299–311.

Scriven, M. (1974). Evaluation perspectives and procedures. In W. J. Popham (Ed.), Evaluation in education: Current applications. Berkeley, CA: McCutchan.

Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.

Smith, E. R., & Tyler, R. W. (1942). Appraising and recording student progress. New York: McGraw-Hill.

Stake, R. E. (1967). The countenance of educational evaluation. Teachers College Record, 68, 523–540.

Stake, R. E. (1983). Program evaluation, particularly responsive evaluation. In G. F. Madaus, M. Scriven, & D. L. Stufflebeam (Eds.), Evaluation models: Viewpoints on educational and human services evaluation (pp. 253–300). Boston: Kluwer-Nijhoff.

Stevens, F., Lawrenz, F., Ely, D., & Huberman, M. (1993). The user friendly handbook for project evaluation. Washington, DC: National Science Foundation.

Stockdill, S., Baizerman, M., & Compton, D. (2002). Toward a definition of the ECB process: A conversation with the ECB literature. In D. W. Compton, M. Baizerman, & S. H. Stockdill (Vol. Eds.), New directions for evaluation: No. 93. The art, craft, and science of evaluation capacity building (Spring, pp. 27–26). San Francisco: Jossey-Bass.

Stufflebeam, D. L. (1971). The relevance of the CIPP evaluation model for educational accountability. Journal of Research and Development in Education, 5(1), 19–25.

Stufflebeam, D. L. (Vol. Ed.). (2001). Evaluation models. In New directions for evaluation (No. 89, Spring). San Francisco: Jossey-Bass.

Stufflebeam, D. L., Foley, W. J., Gephart, W. J., Guba, E. G., Hammond, R. L., Merriman, H. O., & Provus, M. M. (1971). Educational evaluation and decision-making in education. Itasca, IL: Peacock (copyright 1971 by Phi Delta Kappa, Bloomington, IN).

Stufflebeam, D. L., & Welch, W. W. (1986). Review of research on program evaluation in United States school districts. Educational Administration Quarterly, 22(3), 150–170.

Suter, L. (1993). Indicators of science and mathematics education, 1992 (1st ed.). Washington, DC: National Science Foundation.

Suter, L. (1996). Indicators of science and mathematics education, 1995 (2nd ed.). Washington, DC: National Science Foundation.

U.S. National Research Center for the Third International Mathematics and Science Study (TIMSS). Retrieved May 27, 2003, from Michigan State University, College of Education Web site: http://ustimss.msu.edu/

Weiss, C. H. (1972). Evaluation research: Methods for assessing program effectiveness (1st ed.). Englewood Cliffs, NJ: Prentice-Hall.

Weiss, C. H. (1998). Evaluation research: Methods for studying programs and policies (2nd ed.). Upper Saddle River, NJ: Prentice-Hall.

Weiss, I. R. (1997). The status of science and mathematics teaching in the United States: Comparing teacher views and classroom practice to national standards. ERS Spectrum, (Summer), 34–39.

Weiss, I., Banilower, E., Crawford, R., & Overstreet, C. (2003). Local systemic change through teacher enhancement: Year eight cross-site report. Chapel Hill, NC: Horizon Research.

Welch, W. W. (1969). Curriculum evaluation. Review of Educational Research, 39(4), 429–443.

Welch, W. W. (1972). Review of research 1968–69, secondary level science. Journal of Research in Science Teaching, 9(2), 97–122.

Welch, W. W. (1979a). Five years of evaluating federal programs: Implications for the future. Science Education, 63(2), 335–344.

Welch, W. W. (1979b). Twenty years of science curriculum development: A look back. In D. C. Berliner & R. M. Gagne (Eds.), Review of research in education (pp. 282–306). Washington, DC: American Educational Research Association.

Welch, W. W. (1985). Research in science education: Review and recommendations. Science Education, 69(3), 421–448.

Welch, W. W. (1995). Student assessment and curriculum evaluation. In B. J. Fraser & H. J. Walberg (Eds.), Improving science education: What do we know. Chicago: National Society for the Study of Education.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance (1st ed.). San Francisco: Jossey-Bass.

Worthen, B. R., & Sanders, J. R. (1973). Educational evaluation: Theory and practice. Worthington, OH: Charles A. Jones.

Worthen, B. R., & Sanders, J. R. (1987). Educational evaluation: Alternative approaches and practical guidelines (1st ed.). New York: Addison Wesley Longman.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

Yin, R. (2002). Study of statewide systemic reform in science and mathematics education: Interim report. Bethesda, MD: Cosmos.
