Handbook of Labor Economics, Vol. 4, No. Suppl PA, 2011

ISSN: 1573-4463

doi: 10.1016/S0169-7218(11)00408-4

Chapter 2: Field Experiments in Labor Economics

John A. List*1, Imran Rasul**2


* Department of Economics, University of Chicago, 1126 East 59th Street, Chicago, IL 60637, USA

** Department of Economics, University College London, Drayton House, 30 Gordon Street, London WC1E 6BT, United Kingdom

E-mail address: [email protected]

E-mail address: [email protected]

Abstract

We overview the use of field experiments in labor economics. We showcase studies that highlight the central advantages of this methodology, which include: (i) using economic theory to design the null and alternative hypotheses; (ii) engineering exogenous variation in real world economic environments to establish causal relations and learn about the underlying mechanisms; and (iii) engaging in primary data collection and often working closely with practitioners. To highlight the potential for field experiments to inform issues in labor economics, we organize our discussion around the individual life cycle. We therefore consider field experiments related to the accumulation of human capital, the demand and supply of labor, and behavior within firms, and close with a brief discussion of the nascent literature of field experiments related to household decision making.

JEL classification

• C93 • J01

Keywords

• Field experiments • Labor economics

1 Introduction

This chapter overviews the burgeoning literature in field experiments in labor economics. The essence of this research method involves researchers engineering carefully crafted exogenous variation into real world economic environments, with the ultimate aim of identifying the causal relationships and mechanisms underlying them. This chapter describes this approach and documents how field experiments have begun to yield new insights for research questions that have long been studied by labor economists.

Given our focus on such long-standing questions, in no way do we attempt to do justice to the enormous literature to which field experiments are beginning to add. Our aim is rather to showcase specific field experiments that highlight what we view to be the central advantages of this methodology: (i) using economic theory to design the null and alternative hypotheses; (ii) engineering exogenous variation in real world economic environments to establish causal relations and learn the mechanisms behind them; and (iii) engaging in primary data collection and often working closely with practitioners.

As with any research methodology in economics, of course, not every question will be amenable to field experiments. Throughout our discussion, we will bring to the fore areas in labor economics that remain relatively untouched by field experiments. In each case, we try to distinguish whether this is simply because researchers have not had opportunities to design field experiments related to such areas, or whether the nature of the research question implies that field experiments are not the best approach to tackle the problem at hand. Finally, even among those questions where field experiments can and have provided recent insights, we will argue that such insights are most enhanced when we are able to combine them with, or take inspiration from, other approaches to empirical work in economics. For example, a number of studies we describe take insights from laboratory environments to show the importance of non-standard preferences or behaviors in real world settings. A second class of papers combine field experimentation with structural estimation to measure potential behavioral responses in alternative economic environments, and ultimately to help design optimal policies to achieve any given objective.

The bulk of the chapter is dedicated to documenting the use and insights from field experiments in labor economics. To emphasize the relevance of field experiments to labor economics, we organize this discussion by following an individual over their life cycle. More specifically, we begin by considering field experiments related to early childhood interventions and the accumulation of human capital in the form of education and skills. We then consider research questions related to the demand and supply of labor, and labor market discrimination. We then move on to consider research questions related to behavior within firms: how individuals are incentivized within firms, and other aspects of the employment relationship. Finally, we conclude with a brief discussion of the nascent literature on field experiments related to household decision making.

We have chosen these topics selectively on the basis of where carefully designed field experiments have already been conducted. Within each field, we have decided to discuss a small number of papers using field experiments that, in our opinion, showcase the best of what field experiments can achieve. In no way is our discussion meant to be an exhaustive survey of the literature on field experiments in labor economics. The literature arising just in the past decade has grown too voluminous for even a tome to do it justice.

In each stage of the life cycle considered, wherever appropriate, we try to discuss: (i) the link between the design of field experiments and economic theory; (ii) the benefits of primary data collection that is inherent in field experimentation to further probe theory and distinguish between alternative hypotheses; and (iii) how reduced form effects identified from a field experiment can be combined with theory and structural modelling to make out-of-sample predictions and inform policy design.

In the remainder of this section, we place our later discussion into historical context by describing the experimental approach in science, and arguing how among economists, labor economists have for decades been at the forefront of exploiting and advancing experimental approaches to identify causal relationships. We then lay the groundwork for the discussion in later sections. We define the common elements at the heart of all field experiments, and then present a more detailed typology to highlight the subtle distinctions between various sub-types of field experiments. This approach allows us to discuss the advantages and disadvantages of field experiments over other forms of experimentation, such as large-scale social experiments and laboratory based experiments.3

Our final piece of groundwork is to identify key trends in published research in labor economics over the past decade. This allows us to organize our later discussion more clearly in two dimensions. First, we think of nearly all research questions in labor economics as mapping to particular stages of an individual’s life cycle. We therefore roughly organize research questions in labor economics into those relating to the accumulation of human capital, labor market entry and labor supply choices, behavior within firms, and household decision making. Second, we are able to focus on those sub-fields in labor economics where extant field experiments have already begun to make inroads and provide new insights. In turn, this helps us make precise the types of research question field experiments are most amenable to, areas in which field experiments have been relatively undersupplied, and those research questions that are better suited to alternative empirical methods.

1.1 The experimental approach in science

The experimental approach in scientific inquiry is commonly traced to Galileo Galilei, who pioneered the use of quantitative experiments to test his theories of falling bodies. Extrapolating his experimental results to the heavenly bodies, he pronounced that the services of angels were not necessary to keep the planets moving, enraging the Church and disciples of Aristotle alike. For his efforts, Galileo is now viewed as the Father of Modern Science. Since the Renaissance, fundamental advances making use of the experimental method in the physical and biological sciences have been fast and furious.4

Taking the baton from Galileo, in 1672 Sir Isaac Newton used experimentation to show that white light is not pure but a mixture of colors, again challenging the teachings of Aristotle. The experimental method has produced a steady stream of insights. Watson and Crick used data from Rosalind Franklin’s X-ray diffraction experiment to construct a theory of the chemical structure of DNA; Rutherford’s experiments shooting charged particles at a piece of gold foil led him to theorize that atoms have massive, positively charged nuclei; Pasteur rejected the theory of spontaneous generation with an experiment that showed that micro-organisms grow in boiled nutrient broth when exposed to the air, but not when exposed to carefully filtered air. Yet even though the experimental method produced a steady flow of important facts for roughly 400 years, the proper construction of a counterfactual control group was not given foundations until the early twentieth century.

1.1.1 An experimental cornerstone

In 1919, Ronald Fisher was hired at Rothamsted Manor to bring modern statistical methods to the vast experimental data collected by Lawes and Gilbert (Levitt and List, 2009). The data collection methods at Rothamsted Manor were implemented in the standard way to provide practical underpinnings for the ultimate purpose of agricultural research: to provide management guidelines. For example, one of the oldest questions in the area of agricultural economics relates to agricultural yields: what is the optimal application rate of fertilizer, seed, and herbicides?

In an attempt to modernize the experimental approach at Rothamsted, Fisher introduced the concept of randomization and highlighted the experimental tripod: the concepts of replication, blocking, and randomization were the foundation on which the analysis of the experiment was based (Street, 1990). Of course, randomization was the linchpin, as the validity of tests of significance stems from randomization theory.

Fisher understood that the goal of any evaluation method is to construct the proper counterfactual. Without loss of generality, define $Y_i^1$ as the outcome for observational unit $i$ with treatment, and $Y_i^0$ as the outcome for unit $i$ without treatment. The treatment effect for plot $i$ can then be measured as $\tau_i = Y_i^1 - Y_i^0$. The major problem, however, is one of a missing counterfactual: plot $i$ is not observed in both states. Fisher understood that methods to create the missing counterfactual, and thereby achieve identification of the treatment effect, were invaluable, and his idea was to use randomization.
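Fisher's logic can be illustrated with a minimal simulation (our sketch, not from the chapter; all parameter values are arbitrary). We generate hypothetical potential outcomes for each unit, observe only one of the two states per unit, and show that randomized assignment nonetheless recovers the average treatment effect:

```python
import random

random.seed(0)

N = 100_000
# Hypothetical potential outcomes: y0 without treatment, y1 = y0 + tau_i with it.
units = []
for _ in range(N):
    y0 = random.gauss(10, 2)
    tau = random.gauss(3, 1)  # true average treatment effect is 3
    units.append((y0, y0 + tau))

# Randomization creates the missing counterfactual on average: a fair coin
# assigns each unit to treatment or control, and only one state is observed.
treated, control = [], []
for y0, y1 in units:
    if random.random() < 0.5:
        treated.append(y1)   # observed outcome under treatment
    else:
        control.append(y0)   # observed outcome without treatment

ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
print(round(ate_hat, 2))  # close to the true average effect of 3
```

Without randomization (for instance, if units with larger gains selected into treatment), the same difference in means would no longer estimate the average effect in the population.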

As Levitt and List (2009) discuss, Fisher’s fundamental contributions were showcased in agricultural field experiments, culminating with the landmark 1935 book, The Design of Experiments, which was a catalyst for the actual use of randomization in controlled experiments. At the same time, Jerzy Neyman’s work on agricultural experimentation showcased the critical relationship between experiments and survey design and the pivotal role that randomization plays in both (Splawa-Neyman, 1923a,b). Neyman’s work continued in the area of sampling and culminated in his seminal paper, published in 1934. As Rubin (1990) notes, it is clear that randomization was “in the air” in the early 1920s, and the major influences of the day were by scholars doing empirical research on agricultural related issues. Clearly, such work revolutionized the experimental approach and continues to shape experimental designs in all sciences today.

As emphasized throughout, we view field experimenters as being engaged in data generation, primary data collection, and data evaluation. Labor economists in particular have been at the forefront of the use of experimental designs, as highlighted by the following two historic examples.

1.1.2 Early labor market field experiments at the Hawthorne plant

In the 1920s the Western Electric Company was the monopoly supplier of telephone equipment to AT&T. Western opted to have its main factory, the Hawthorne plant, located in the suburbs of Chicago, be the main supplier for this important contract. The Hawthorne plant was considered to be one of the most advanced manufacturing facilities in America at the time, and employed roughly 35,000 people, mainly first- and second-generation immigrants (Gale, 2004). Always open to new techniques to improve efficiency and profitability, officials of Western were intrigued when the National Academy of Sciences expressed interest in a hypothesis put forth by electrical suppliers, who claimed that better lighting improved productivity.

The experimental exercises that resulted have few parallels within social science. The indelible footprint of these experiments laid the groundwork for a proper understanding of social dynamics of groups and employee relations in the workplace. Indeed, the data drawn from this research became the thrust of the human relations movement of the twentieth century, and represent the underpinnings of contemporary efforts of industry to motivate and deal with workers. In academia, the Hawthorne data spawned the development of a new field of study—Industrial Psychology—and remains an important influence on the manner in which scientists conduct experimental research today. Many of the issues raised in these studies are considered part of mainstream personnel economics, as discussed in the chapter in this Handbook on Human Resource Management practices by Bloom and Van Reenen (2011). In a later section of this chapter we review how a new generation of field experiments have provided new insights into these age old questions of behavior within firms.5

The first experiments executed at the Hawthorne plant have been famously denoted the “illumination experiments” because they varied the amount of light in the workplace. More specifically, between 1924 and 1927 the level of lighting was systematically changed for experimental groups in different departments (Mayo, 1933, pp. 55-56, Roethlisberger and Dickson, 1939, pp. 14-18, provide a more complete account). Workers in these departments were women who assembled relays and wound coils of wire, and their output was measured as units completed per unit of time.6

Discussions of these data have been widespread and have been an important influence in building the urban legend. For instance, Franke and Kaul (1978, p. 624) note that “Inexplicably worker output…generally increased regardless of increase or decrease in illumination.” Yet the only account of these experiments published at the time, Snow (1927), which appeared in an engineering newsletter, argues that “The corresponding production efficiencies by no means followed the magnitude or trend of the lighting intensities. The output bobbed up and down without direct relation to the amount of illumination.” Unfortunately, the article presents no data or statistical analysis. Ever since, the question has remained open, because the data were thought to be lost. Indeed, an authoritative voice on this issue, Rice (1982), notes that “the original research data somehow disappeared.” Gale (2004, p. 439) expresses similar thoughts concerning the illumination experiments: “these particular experiments were never written up, the original study reports were lost, and the only contemporary account of them derives from a few paragraphs in a trade journal” (Roethlisberger and Dickson, 1939; Gillespie, 1991).7

Using data preserved in two library archives, Levitt and List (2010) dug up the original data from the illumination experiments, long thought to be destroyed. Their analysis of the newly found data reveals little evidence to support the existence of a Hawthorne effect as commonly described: there is no systematic evidence that productivity jumped whenever changes in lighting occurred. They do, however, uncover some weak evidence consistent with more subtle manifestations of Hawthorne effects. In particular, output tends to be higher when experimental manipulations are ongoing relative to when there is no experimentation. Also consistent with a Hawthorne effect, productivity is more responsive to experimenter-induced manipulations of light than to naturally-occurring fluctuations.

As mysterious and legendary as the illumination experiments have become, it is fair to say that the second set of experiments conducted at the plant—the relay assembly experiments—have kept academics busy for years. Using an experimental area constructed for the illumination experiments, beginning in April 1927, researchers began an experiment meant to examine the effect of workplace changes upon productivity. In this case, the task was relay assembly.

The researchers began by secretly observing the women in their natural environment for two weeks, and then implemented various treatments, including manipulating the environment in such a way as to increase and decrease rest periods over different temporal intervals. While the design certainly did not allow clean treatment effects to be easily assessed, the experimenters were puzzled by the observed pattern: output seemingly rose regardless of the change implemented. When output remained high after the researchers returned conditions to the baseline (output had risen from 2400 relays per week to nearly 3000 relays per week), management became interested in identifying the underlying mechanisms at work.

Western Electric subsequently brought in academic consultants, including Elton Mayo, in 1928. With Mayo’s assistance the experiments continued, and by February 1929 productivity had reached the startling rate of a new relay dropping down the chute every 40-50 seconds. The company besieged the five women with attention: besides “a new test room supervisor, an office boy, and a lady who helped with the statistics”, there was “an intermittent stream of other visitors or consultants: industrialists, industrial relations experts, industrial psychologists, and university professors” (Gale, 2004, p. 443). The experiment lasted until June 1932, when the women in the test room received their notices (except the exceptional worker, Jennie Sirchio, who worked in the office for a few months before being let go) in the aftermath of the stock market crash of October 24, 1929. The ensuing downturn induced one in ten US phones to be disconnected in 1932, leading to a decrease in Western Electric’s monopoly rents of more than 80%.

The five-year experiment provided a wealth of data, and much of the Hawthorne Effect’s statistical underpinnings are a direct result of the relay assembly experiment. Mayo (1933) concluded that individuals would be more productive when they knew they were being studied.8 For this insight, Mayo came to be known as the “father of the Hawthorne effect”, and his work led to the understanding that the workplace was, first and foremost, a social system composed of several interdependent parts. When we present a detailed typology of field experiments later in this section, we make precise a distinction between those field experiments in which agents are aware of their participation in an experiment, and those in which they are unaware of the exogenous manipulation of their economic environment.

Mayo stressed that workers are not merely at work to earn an honest wage for an honest day’s effort; rather, they are more prominently influenced by social demands, their need for attention and input into decision making, and by the psychological factors of the environment. The notion that workers’ effort and behavior are driven by more than the monetary rewards of work is an idea that has received close scrutiny among the most recent generation of field experiments in firms, as reviewed later.

Clearly, Mayo argued, being the object of attention in the study induced a sense of satisfaction among workers that made them feel proud and part of a cohesive unit, generating greater productivity levels than could ever be imagined. Mayo’s disciple Fritz Roethlisberger, together with William Dickson, an engineer at Western Electric, produced a detailed assessment that focused mainly on the relay experimental data (Roethlisberger and Dickson, 1939) and reached similar conclusions. Industrial psychology would soon find an important place in undergraduate and graduate curricula. Again in later sections, we provide examples of where field experiments have taken insights from psychology and laboratory environments to check for the existence and quantitative importance of such behaviors that are not encompassed within neoclassical economic models.

It is difficult to overstate the importance of these findings, as they have served as the paradigmatic foundation of the social science of work (Franke and Kaul, 1978), providing a basis for an understanding of the economics of the workplace, and dramatically influenced studies in organizational development and behavior, leadership, human relations, and workplace design. The results also provide an important foundation for experimental work within the social sciences, including economics, where one must constantly be aware of the effects argued to be important in the Hawthorne relay experiment.9

1.1.3 Large-scale social experiments

A second period of interest directly related to field experiments in labor economics is the latter half of the twentieth century, during which government agencies conducted a series of large-scale social experiments. 10 In the US, social experiments can be traced to Heather Ross, an MIT economics doctoral candidate working at the Brookings Institution. As Levitt and List (2009) discuss, Ross wrote a piece titled “A Proposal for Demonstration of New Techniques in Income Maintenance”, in which she suggested a randomly assigned social experiment to lend insights into the policy debate.

The experiment that resulted began in 1968 in five urban communities in New Jersey and Pennsylvania (Trenton, Paterson, Passaic, and Jersey City in NJ, and Scranton in PA), and eventually became Ross’s dissertation research (“An Experimental Study of the Negative Income Tax”); the experiment cost more than $5 million, exceeding $30 million in today’s dollars. The idea behind the experiment was to explore the behavioral effects of negative income taxation, a concept first introduced by Milton Friedman and Robert Lampman, who was at the University of Wisconsin’s poverty institute.11 The experiment, which targeted roughly 1300 male-headed households who had at least one employable person, experimentally varied both the guaranteed level of income and the negative tax rate (Ross, 1970). The guaranteed level of income ranged from 50% to 125% of the estimated poverty line income level for a family of four ($1650-$4125 in 1968 dollars) while the negative income tax rate ranged from 30% to 70%.12 The experiment lasted three years. Families in both the control and treatment groups were asked to respond to questionnaires every three months during this time span, with the questions exploring issues such as family labor supply, consumption and expenditure patterns, general mobility, dependence on government, and social integration.
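The transfer rule being varied can be sketched as follows. This is our stylized illustration, not the experiment's actual payment formula: the function name is hypothetical, and the specific $3300 guarantee (100% of the poverty line for a family of four) and 50% rate are our choices drawn from the ranges quoted above. A family's transfer equals the income guarantee less earnings taxed away at the benefit-reduction rate:

```python
def nit_payment(earnings, guarantee, tax_rate):
    """Stylized negative-income-tax transfer: the income guarantee less
    earnings taxed away at the benefit-reduction rate, floored at zero."""
    return max(0.0, guarantee - tax_rate * earnings)

# Illustrative 1968 parameters drawn from the ranges cited in the text:
# a $3300 guarantee and a 50% benefit-reduction rate.
g, t = 3300.0, 0.50
print(nit_payment(0, g, t))     # a family with no earnings gets the full guarantee
print(nit_payment(4000, g, t))  # the transfer shrinks as earnings rise
print(nit_payment(6600, g, t))  # zero at the breakeven point, guarantee / tax_rate
```

The labor supply question at the heart of the experiment is how the implicit marginal tax rate below the breakeven point changes hours worked.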

The most interesting outcome for labor economists involved labor supply. Strong advocates of the negative income tax program argued that the program would provide positive, or at least no negative, work incentives. Many economists, however, were skeptical, hypothesizing that the results would show some negative effect on work effort. Early experimental results, discussed in Ross (1970), suggested that work effort did not decline for the treatment groups. In fact, as Ross (1970, p. 568) indicates, “there is, in fact, a slight indication that the participants’ overall work effort increased during the initial test period.”

Since this initial exploration, other scholars have re-examined the experimental design and data, coming to a less optimistic appraisal. An excellent elucidation is Ashenfelter (1990), who notes that because of attrition it is not actually possible simply to tabulate the results. In this sense, and from the experimenters’ point of view, the experiments were flawed in part because the design took little advantage of the inherent advantages of randomization. Of course, the ultimate policy test is whether the income maintenance programs increased work incentives relative to the existing welfare system, which, as Moffitt (1981) notes, at that time had large benefit-reduction rates that may have discouraged work. In certain cases the new approach outperformed existing incentive schemes; in others it did not.

More importantly for our purposes, the New Jersey income maintenance experiment is generally considered to be the first large-scale social experiment conducted in the US, for which Ross is given credit (Greenberg et al., 1999; Greenberg and Shroder, 2004).13 The contribution of Ross, along with the excellent early summaries of the virtues of social experimentation (Orcutt and Orcutt, 1968), appears to have been instrumental in stimulating the explosion in social experiments in the ensuing decades.14,15

Such large-scale social experiments have continued in the US, and have included employment programs, electricity pricing, and housing allowances (see Hausman and Wise, 1985, for a review). While this early wave of social experiments tended to focus on testing new programs, more recent social experiments tended to be “black box” in the sense that packages of services and incentives were proffered, and the experiments were meant to test incremental changes to existing programs. 16 This generation of social experiments had an important influence on policy, contributing, for instance, to the passage of the Family Support Act of 1988, which overhauled the AFDC program. Indeed, as Manski and Garfinkel (1992) note, in Title II, Section 203, 102 Stat. 2380, the Act even made a specific recommendation on evaluation procedures: “a demonstration project conducted … shall use experimental and control groups that are composed of a random sample of participants in the program.”

Much like the experimental contributions of the agricultural literature of the 1920s and 1930s, the large-scale social experiments conducted in the twentieth century influenced the economics literature immensely. Since the initial income maintenance social experiment, there have been more than 235 known completed social experiments (Greenberg and Shroder, 2004), each exploring public policies in health, housing, welfare, and the like. The early social experiments were voluntary experiments typically designed to measure basic behavioral relationships, or deep structural parameters, which could be used to evaluate an entire spectrum of social policies. Optimists even believed that the parameters could be used to evaluate policies that had not even been conducted. As Heckman (1992) notes, this was met with deep skepticism among economists and non-economists alike, and ambitions have since become much more modest.

As Manski and Garfinkel (1992) suggest, this second wave of social experiments had a methodological influence within academic circles, as it provided an arena for the 1980s debate between experimental advocates and those favoring structural econometrics using naturally-occurring data. Manski and Garfinkel (1992) provide an excellent resource that includes insights on the merits of the arguments on both sides, and discusses some of the important methodological issues. Highlighting some of the weaknesses of social experiments helps to clarify important distinctions we draw between social experiments and the generation of field experiments which has followed.

1.1.4 Potential shortcomings of social experiments

One potential problem arising in social experiments is “randomization bias”, a situation wherein the experimental sample is different from the population of interest because of randomization. It is commonly known in the field of clinical drug trials that persuading patients to participate in randomized studies is much harder than persuading them to participate in non-randomized studies (Kramer and Shapiro, 1984). The same problem applies to social experiments, as evidenced by the difficulties that can be encountered when recruiting decentralized bureaucracies to administer the random treatment (Hotz, 1992).17

Doolittle and Traeger (1990) provide a description of the practical importance of randomization bias when describing their experience in implementing the Job Training Partnership Act. Indeed, in almost any social experiment related to job training programs, it is a concern that those most likely to benefit from the program select into the program. Moreover, as Harrison and List (2004) discuss, in social experiments, given the open nature of the political process, it is almost impossible to hide the experimental objective from the person implementing the experiment or the subject, opening up the possibility of such self-selection. As Heckman (1992) puts it, comparing social experiments to agricultural experiments: “plots of ground do not respond to anticipated treatments of fertilizer, nor can they excuse themselves from being treated.”

To see this more formally, we follow the notation above and let $\tau_i = Y_i^1 - Y_i^0$ be the treatment effect for individual $i$. Figure 1 shows a hypothetical density of $\tau_i$ in the population, a density assumed to have mean $\bar{\tau}$. In this case, the parameter $\bar{\tau}$ is equivalent to the average treatment effect; this is the treatment effect of interest if the analyst is pursuing an estimate of the average effect in this population.


Figure 1 Simple illustration of the selection problem.

The concern is that selection into the experiment is not random, but might occur with a probability related to $\tau_i$. Using this notion to formulate the selection rule leads to positive selection: subjects with higher $\tau_i$ values are more likely to participate if offered the chance. In Fig. 1, we denote the cutoff value of $\tau_i$ as $\tau^*$: people above $\tau^*$ participate, those below do not.

In this case, the treatment effect on the treated, $\bar{\tau}_{TT}$, is what is measured in the social experiment. $\bar{\tau}_{TT}$ is equal to $E[\tau_i \mid \tau_i \geq \tau^*]$, which represents the estimate of the treatment effect for those who select to participate. A lack of recognition of selection causes the analyst to mis-measure the treatment effect for the population of interest. Figure 1 also shows the treatment effect on the untreated, $\bar{\tau}_{TUT}$, which is equal to $E[\tau_i \mid \tau_i < \tau^*]$, the unobserved treatment effect for those who chose not to participate. Therefore, the population parameter of interest, $\bar{\tau}$, is a mixture of these two effects: $\bar{\tau} = p\,\bar{\tau}_{TT} + (1-p)\,\bar{\tau}_{TUT}$, where $p$ represents the probability that $\tau_i \geq \tau^*$. Even if one assumes that the density of $\tau_i$ among participants is isomorphic to the population density of inferential interest, such selection frustrates proper inference. A related concern is whether the density of $\tau_i$ in the participant population exactly overlaps with that in the population of interest.
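The direction of the bias and the mixture identity can be checked with a small simulation (our illustration; the normal distribution of gains and the cutoff value are arbitrary choices). Individuals with gains above the cutoff select in, so the measured effect on participants exceeds the population average:

```python
import random

random.seed(1)

# Hypothetical individual gains tau_i drawn from N(2, 1); the true ATE is 2.
taus = [random.gauss(2, 1) for _ in range(200_000)]

tau_star = 2.5  # illustrative participation cutoff: positive selection on gains
participants = [t for t in taus if t >= tau_star]
non_participants = [t for t in taus if t < tau_star]

tt = sum(participants) / len(participants)           # effect on the treated
tut = sum(non_participants) / len(non_participants)  # effect on the untreated
p = len(participants) / len(taus)                    # share with tau_i >= cutoff
ate = sum(taus) / len(taus)

# The population effect is the participation-weighted mixture of the two.
assert abs(ate - (p * tt + (1 - p) * tut)) < 1e-6
print(round(tt, 2), round(ate, 2))  # under positive selection, tt exceeds ate
```

An experiment that only observes participants estimates the first of these numbers, not the second.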

A second issue stems from Heckman (1992), Heckman and Smith (1995), and Manski (1995), who contend that participants in small-scale experiments may not be representative of individuals who would participate in ongoing, full-scale programs. Such non-representativeness of the experimental sample could occur because of a lack of information diffusion, the reluctance of some individuals to subject themselves to random assignment, or resource constraints in full-scale programs that result in program administrators restricting participants to people meeting certain criteria. As a result, making inference on how individuals would respond to the same intervention were they to be scaled up is not straightforward.

A third set of concerns stem from the supply side of those implementing the social experiment as it is scaled up. For example, the quality of those administering the intervention might be very different from the quality of personnel selected to take part in the original social experiment. Moreover, the ability of administrative agencies to closely monitor those charged with the actual implementation of the program might also vary as programs are scaled-up. These concerns might also apply to field experiments unless they are explicitly designed to allow for such possibilities. In general, the role played by program implementers in determining program outcomes remains poorly understood and is a rich area for future study both for field experiments and researchers in general.

A fourth concern that arises in social experiments is attrition bias, which refers to systematic differences between the treatment and control groups arising from differential losses of participants. As Hausman and Wise (1979) note, a characteristic of social experiments is that individuals are surveyed before the experiment begins as well as during the experiment, which in many cases lasts several years. This within-person experimental design permits added power compared to a between-person experimental design, because of the importance of individual effects. But there are potential problems, as they note (p. 455): “the inclusion of the time factor in the experiment raises a problem which does not exist in classical experiments—attrition. Some individuals decide that keeping the detailed records that the experiments require is not worth the payment, some move, some are inducted into the military.”18

Beyond sampling and implementation shortcomings, social experiments also run the risk of generating misleading inference out of sample due to the increased scrutiny induced by the experiment. If experimental participants understand that their behavior is being measured in terms of certain outcomes, some of them might attempt to perform well along those measured dimensions. Such effects have been termed “John Henry” effects for the control sample, because such participants work harder to show their worth when they realize that they are part of the control group. More broadly, some studies denote such effects as “Hawthorne” effects; if these effects do not operate equally on the treatment and control groups, bias is induced. 19

Another factor that might lead to incorrect inference in a social experiment is control group members seeking available substitutes for treatment. This is denoted “substitution bias” in the literature, a bias that can result in significant understatement of the treatment effect. Substitution bias can occur if a new program being tested experimentally absorbs resources that would otherwise be available to members of the control group or, instead, if as a result of serving some members of a target group, the new program frees up resources available under other programs that can now be used to better serve members of the control group. Evidence on the practical importance of substitution bias is provided in Puma et al. (1990) and Heckman and Smith (1995).

Although these concerns, as well as others not discussed here, need always to be accounted for, social experiments continue to be an important and valuable tool for policy analysis, as evidenced by two recent and notable large scale undertakings: Moving To Opportunity (Katz et al., 2001) and PROGRESA (Schultz, 2004), as well as the more recent social experiments documented in Greenberg and Shroder (2004).

1.2 Field experiments

Following from the first two periods of experimentation discussed above, the third distinct period of field experimentation is the most recent surge of field experiments in economics. Harrison and List (2004), List (2006) and List and Reiley (2008) provide recent overviews of this literature. The increased use of this approach reflects a long-running trend in labor economics, and applied microeconomics more generally, towards identifying causal effects. This is not surprising, given that nearly all of the central research questions in labor economics are plagued by econometric concerns arising from the simultaneous determination of individual decisions on human capital accumulation and from self-selection into labor markets and careers. Furthermore, many of the key variables that underlie behavior in labor markets—such as motivation or talent—are either simply unmeasured or measured with error in standard surveys.

Field experiments form the most recent addition to the wave of empirical strategies to identify causal effects that have entered mainstream empirical research in labor economics since the mid 1980s. These include strategies based on fixed effects, difference-in-differences, instrumental variables, regression discontinuities, and natural experiments. Comprehensive reviews of these developments are provided in Angrist and Krueger (1999). At the same time as these research strategies have developed, greater emphasis has been placed on econometric methods that are robust to functional form and distributional assumptions, including semi-parametric and non-parametric estimation techniques. Reviews of these developments are provided in Moffitt (1999).

We view the increased use of field experiments to have its origins in the last decade in part because of an acceleration of three long-standing trends in how applied economic research is conducted: (i) the increased use of research designs to uncover credible causal effects; (ii) the increased propensity to engage in primary data collection; and (iii) the formation of ever closer interactions with practitioners and policy makers more generally.

Similar to the experiments at the Hawthorne plant and social experiments, but unlike the first-generation agricultural studies, the most recent field experiments typically apply randomization to human subjects to obtain identification. In contrast to social experiments, however, recent field experiments strive to carry out this randomization on naturally occurring populations in naturally occurring settings, often without the research subjects being aware that they are part of an experiment. As a consequence, these more recent studies tend to be carried out opportunistically, and on a smaller scale than social experiments.20

This current generation of field experiments oftentimes has more ambitious theoretical goals than social experiments, which largely aim to speak to policy makers and identify whether a package of measures leads to some desired change in outcomes. Modern field experiments in many cases are designed to test economic theory, collect facts useful for constructing a theory, and organize primary data collection to make measurements of key parameters, assuming a theory is correct. Field experiments can also help provide the necessary behavioral principles to permit sharper inference from laboratory or naturally-occurring data. Alternatively, field experiments can help to determine whether lab or field results should be reinterpreted, defined more narrowly than first believed, or are more general than the context in which they were measured. In other cases, field experiments might help to uncover the causes and underlying conditions necessary to produce data patterns observed in the lab or the field.

Since nature in most cases does not randomize agents into appropriate treatment and control groups, the task of the field experimental researcher is to develop markets, constructs, or experimental designs wherein subjects are randomized into treatments of interest. The researcher faces challenges different from those that arise either in conducting laboratory experiments or relying on naturally occurring variation. The field experimenter does not exert the same degree of control over real markets as the scientist does in the lab. Yet, unlike an empiricist who collects existing data, the field experimenter is in the data generating business, as opposed to solely engaging in data collection or evaluation. Consequently, conducting successful field experiments demands a different set of skills from the researcher: the ability to recognize opportunities for experimentation hidden amidst everyday phenomena, an understanding of experimental design and evaluation methods, knowledge of economic theory to motivate the research, and the interpersonal skills to manage the often complex set of relationships among parties to an experiment.

1.2.1 What is a field experiment?

Harrison and List (2004) propose six factors that can be used to determine the field context of an experiment: the nature of the subject pool, the nature of the information that the subjects bring to the task, the nature of the commodity, the nature of the task or trading rules applied, the nature of the stakes, and the environment in which the subjects operate. Using these factors, they discuss a broad classification scheme that helps to organize one’s thoughts about the factors that might be important when moving from non-experimental to experimental data.

They classify field experiments into three categories: artefactual, framed, and natural. Figure 2 shows how these three types of field experiments compare and contrast with laboratory experiments and approaches using naturally occurring non-experimental data. On the far left in Fig. 2 are laboratory experiments, which typically make use of randomization to identify a treatment effect of interest in the lab using a subject pool of students. In this Handbook, Charness and Kuhn (2011, 2010) discuss extant laboratory studies in the area of labor economics.


Figure 2 A field experiment bridge.

The other end of the spectrum in Fig. 2 includes empirical models that make the identification assumptions necessary to identify treatment effects from naturally-occurring data. For example, identification in simple natural experiments results from a difference-in-differences regression model: y_it = X_it′β + δD_it + η_it, where i indexes the unit of observation, t indexes years, y_it is the outcome, X_it is a vector of controls, D_it is a binary treatment variable equal to one if unit i is treated in year t and zero otherwise, and η_it is an error term. The treatment effect δ is measured by comparing the difference in outcomes before and after for the treated group with the before and after outcomes for the non-treated group. A major identifying assumption in this case is that there are no time-varying, unit-specific shocks to the outcome variable that are correlated with D_it, and that selection into treatment is independent of a temporary individual-specific effect.
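The difference-in-differences comparison described here can be illustrated with a toy calculation using group means. This is a sketch only; the group means below are invented.

```python
# Sketch of the difference-in-differences comparison: the change in outcomes
# for the treated group minus the change for the control group.
# All group means are hypothetical.

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """DiD estimate: (change for treated) minus (change for controls)."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Treated group improves by 4, controls by 1, so the estimated effect is 3.
effect = diff_in_diff(treated_pre=10.0, treated_post=14.0,
                      control_pre=9.0, control_post=10.0)
print(effect)  # 3.0
```

The control group's change of 1 here stands in for the common time trend; the identifying assumption in the text is precisely that, absent treatment, the treated group would have followed that same trend.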

Useful alternatives include the method of propensity score matching (PSM) developed in Rosenbaum and Rubin (1983). Again, if both states of the world were observable, the average treatment effect, τ, would equal E[Y1] − E[Y0]. However, given that only Y1 or Y0 is observed for each observation, unless assignment into the treatment group is random, generally τ ≠ E[Y1 | D = 1] − E[Y0 | D = 0]. The solution advocated by Rosenbaum and Rubin (1983) is to find a vector of covariates, X, such that (Y1, Y0) ⊥ D | X and 0 < Pr(D = 1 | X) < 1, where ⊥ denotes independence. This assumption is called the “conditional independence assumption” and intuitively means that, given X, the non-treated outcomes are what the treated outcomes would have been had they not been treated. Or, likewise, that selection occurs only on observables. If this condition holds, then treatment assignment is said to be “strongly ignorable” (Rosenbaum and Rubin, 1983, p. 43). To estimate the average treatment effect on the treated, only the weaker condition Y0 ⊥ D | X is required. Thus, the treatment effect is given by τ_TT = E[Y1 | D = 1, X] − E[Y0 | D = 0, X], implying that, conditional on X, assignment to the treatment group mimics a randomized experiment.21
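A minimal nearest-neighbour matching sketch makes the mechanics concrete. The propensity scores below are invented numbers; a real application would first estimate them, for example by a logit of treatment status on the covariates X.

```python
# Minimal nearest-neighbour matching sketch under the conditional
# independence assumption. Propensity scores here are invented; an actual
# application would estimate them from covariates X.

def att_nearest_neighbour(treated, controls):
    """treated/controls: lists of (propensity_score, outcome) pairs.
    For each treated unit, impute the missing untreated outcome with the
    outcome of the control whose score is closest, then average the gaps."""
    gaps = []
    for p_t, y_t in treated:
        _, y_match = min(controls, key=lambda c: abs(c[0] - p_t))
        gaps.append(y_t - y_match)
    return sum(gaps) / len(gaps)

treated = [(0.8, 10.0), (0.6, 8.0)]
controls = [(0.75, 7.0), (0.55, 6.0), (0.2, 3.0)]
print(att_nearest_neighbour(treated, controls))  # (3 + 2) / 2 = 2.5
```

Note how the low-score control (0.2, 3.0) is never used as a match: matching on the score discards controls unlike any treated unit, which is the practical content of the overlap requirement.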

Other popular methods of estimating treatment effects include the use of instrumental variables (Rosenzweig and Wolpin, 2000) and structural modeling. The assumptions underpinning these approaches are well documented and are not discussed further here (Angrist and Krueger, 1999; Blundell and Costa-Dias, 2002). Between the two extremes in Fig. 2 lie various types of field experiment, to which we now turn in more detail.

1.2.2 A more detailed typology of field experiments

Following Harrison and List (2004), we summarize the key elements of each type of field experiment in Table 1. This also makes precise the differences between field and laboratory experiments.

Table 1 A typology of field experiments.


Harrison and List (2004) argue that a first useful departure from laboratory experiments using student subjects is simply to use “non-standard” subjects, i.e. experimental participants drawn from the market of interest. In Table 1 and Fig. 2, these are denoted “artefactual” field experiments. This type of field experiment represents a potentially useful form of exploration outside of traditional laboratory studies because it affords the researcher the control of a standard lab experiment together with the realism of a subject pool composed of the natural actors from the market of interest. In the past decade, artefactual field experiments have been used in financial applications (Alevy et al., 2007; Cipriani and Guarino, 2009), to test predictions of game theory (Levitt et al., 2009), and in applications associated with labor economics (Cooper et al., 1999). 22

Artefactual field experiments can also be used to explain or predict non-experimental outcomes. An example of this usage is Barr and Serneels (2009), who correlate behavior in a trust game experiment with wage outcomes of employees of Ghanaian manufacturing enterprises. They report that a one percent increase in reciprocity in these games is associated with a fifteen percent increase in wages. Another example is Attanasio et al. (2009), who combine household data on social networks with a field experiment conducted with the same households in Colombia to investigate who pools risk with whom when risk-pooling arrangements are not formally enforced. Combining non-experimental and experimental research methods in this way, by conducting an artefactual field experiment among survey respondents, provides an opportunity to study the interplay of risk attitudes, pre-existing networks, and risk-sharing.

Moving closer to how naturally-occurring data are generated, Harrison and List (2004) denote a “framed field experiment” as the same as an artefactual field experiment, except that it incorporates important elements of the context of the naturally occurring environment with respect to the commodity, task, stakes, and information set of the subjects. Yet, it is important to note that framed field experiments, like lab experiments and artefactual field experiments, are conducted in a manner that ensures subjects understand that they are taking part in an experiment, with their behavior subsequently recorded and scrutinized. Framed field experiments include the Hawthorne plant experiments, the social experiments of the twentieth century, and two related experimental approaches.

One related approach might be considered a cousin of social experiments: the collection of studies done in developing countries that use randomization to identify causal effects of interventions in settings where naturally-occurring data are limited. The primary motivation for such experiments is to inform public policy. These studies typically use experimental treatments more bluntly than the controlled treatments discussed above, in that the designs often randomly introduce a package of several interventions. On the other hand, this package of measures is directly linked to a menu of actual public policy alternatives. Notable recent examples of this type of work include Kremer et al. (2009) and Duflo et al. (2006).

Framed field experiments have also been conducted with a greater eye towards testing economic theory. Several such experiments have been published in the economics literature, ranging from further tests of auction theory (List, 2002b, 2004c; List and Price, 2005) to tests of information assimilation among professional financial traders (Alevy et al., 2007).23

Unlike social experiments, this type of framed field experiment avoids many of the shortcomings discussed above. For example, since subjects are unaware that the experiment is using randomization, any randomization bias should be eliminated. These experiments also tend to be short-lived, so attrition bias is not of major importance, and substitution bias should not be a primary concern either. The cost of avoiding these concerns is that the long run effects of experimentally introduced interventions can rarely be assessed. This might therefore limit the appropriateness of such field experiments for questions in labor economics in which there are long time lags between when actions are taken and when outcomes are realized.

As Levitt and List (2007a,b) discuss, the fact that subjects are in an environment in which they are keenly aware that their behavior is being monitored, recorded, and subsequently scrutinized, might also cause generalizability to be compromised. Decades of research within psychology highlight the power of the role obligations of being an experimental subject, the power of the experimenter herself, and the experimental situation (Orne, 1962). This leads to our final type of field experiment—“natural field experiments,” which complete Table 1 and Fig. 2.

Natural field experiments are those experiments completed in cases where the environment is such that the subjects naturally undertake these tasks and where the subjects do not know that they are participants in an experiment. Therefore, they neither know that they are being randomized into treatment nor that their behavior is subsequently scrutinized. Such an exercise is important in that it represents an approach that combines the most attractive elements of the lab and naturally-occurring data: randomization and realism. In addition, it is difficult for people to respond to treatments they do not necessarily know are unusual, and of course they cannot excuse themselves from being treated. Hence, many of the limitations cited above are not an issue when making inference from data generated by natural field experiments. As we document in later sections, natural field experiments have already been used to answer a wide range of traditional research questions in labor economics.

1.2.3 Simple rules of thumb for experimentation

Scholars have produced a variety of rules of thumb to aid in experimental design. Following List et al. (2010), we provide a framework to think through these issues. Suppose that a single treatment T results in (conditional) outcomes Y1i, i = 1, …, n1, if Ti = 1, and Y0i, i = 1, …, n0, if Ti = 0. Since the experiment has not yet been conducted, the experimenter must form beliefs about the variances of outcomes across the treatment and control groups, σ1² and σ0², which may, for example, come from theory, prior empirical evidence, or a pilot experiment. The experimenter also has to make a decision about the minimum detectable difference between mean control and treatment outcomes, δ, that the experiment is meant to be able to detect. In essence, δ is the minimum average treatment effect that the experiment will be able to detect at a given significance level and power. Finally, we assume that the significance of the treatment effect will be determined using a t-test.

The first step in calculating optimal sample sizes requires specifying a null hypothesis and a specific alternative hypothesis. Typically, the null hypothesis is that there is no treatment effect, i.e. that the effect size is zero. The alternative hypothesis is that the effect size takes on a specific value (the minimum detectable effect size). The idea behind the choice of optimal sample sizes in this scenario is that the sample sizes have to be just large enough so that the experimenter: (i) does not falsely reject the null hypothesis that the population treatment and control outcomes are equal, i.e. commit a Type I error; and (ii) does not falsely accept the null hypothesis when the actual difference is equal to δ, i.e. commit a Type II error. More formally, if the observations for control and treatment groups are independently drawn, with H0: μ1 = μ0 and H1: μ1 − μ0 = δ, we need the difference in sample means Ȳ1 − Ȳ0 (which are of course not yet observed) to satisfy the following two conditions related to the probabilities of Type I and Type II errors.

First, the probability α of committing a Type I error in a two-sided test, i.e. a significance level of α, is attained when the difference in sample means just reaches the critical value,


Ȳ1 − Ȳ0 = t_{α/2} √(σ0²/n0 + σ1²/n1)     (1)


where σj² and nj for j = 0, 1 are the conditional variance of the outcome and the sample size of the control and treatment groups. Second, the probability β of committing a Type II error, i.e. a power of 1 − β, in a one-sided test, requires,


Ȳ1 − Ȳ0 = δ − t_β √(σ0²/n0 + σ1²/n1)     (2)


Using (1) to eliminate Ȳ1 − Ȳ0 from (2) we obtain,


δ = (t_{α/2} + t_β) √(σ0²/n0 + σ1²/n1)     (3)


It can easily be shown that if σ0 = σ1 = σ, i.e. the variances of outcomes are equal across groups, then the smallest sample sizes that solve this equality satisfy n0 = n1 = n and then,


n = 2 (t_{α/2} + t_β)² (σ/δ)²     (4)


If the variances of the outcomes are not equal this becomes,

n0 = (t_{α/2} + t_β)² σ0(σ0 + σ1)/δ²,  n1 = (t_{α/2} + t_β)² σ1(σ0 + σ1)/δ²     (5)

where t_{α/2} and t_β are the critical values corresponding to the chosen significance level and power, and the implied allocation satisfies n1/n0 = σ1/σ0.

If sample sizes are large enough that the normal distribution is a good approximation for the t-distribution, then the above equations provide a closed form solution for the optimal sample sizes. If sample sizes are small, the critical values of the t-distribution depend on the sample sizes themselves, and Eqs. (4) and (5) must be solved by successive approximations. Optimal sample sizes increase proportionally with the variance of outcomes, non-linearly with the significance level and the power, and decrease proportionally with the square of the minimum detectable effect. The relative distribution of subjects across treatment and control is proportional to the standard deviation of the respective outcomes. This suggests that if the variances of outcomes under treatment and control are fairly similar (namely, when treatment effects are expected to be homogeneous), there is not a large loss in efficiency from assigning equal sample sizes to each group.
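The closed-form solution in Eqs. (4) and (5) can be sketched numerically. This is a minimal illustration under the normal approximation discussed above (z-values standing in for the t critical values); the numerical inputs are invented.

```python
# Closed-form sample sizes following Eqs. (4)/(5), using the normal
# approximation to the t-distribution (appropriate for large samples).
# The inputs below are hypothetical.

from statistics import NormalDist

def sample_sizes(delta, sigma0, sigma1, alpha=0.05, power=0.80):
    """Return (n0, n1) needed to detect a mean difference of delta
    at two-sided significance alpha with the given power."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n0 = (z ** 2) * sigma0 * (sigma0 + sigma1) / delta ** 2
    n1 = (z ** 2) * sigma1 * (sigma0 + sigma1) / delta ** 2
    return n0, n1

# Equal variances: the familiar n = 2 (z_{a/2} + z_b)^2 (sigma/delta)^2 per arm.
n0, n1 = sample_sizes(delta=0.5, sigma0=1.0, sigma1=1.0)
print(round(n0), round(n1))  # roughly 63 per group for a half-SD effect
```

With unequal variances, the function reproduces the allocation property stated in the text: n1/n0 equals the ratio of standard deviations.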

In cases where the outcome variable is dichotomous, under the null hypothesis of no treatment effect, p1 = p0, one should always allocate subjects equally across treatments. Yet, if the null is of the form p1 = p0 + δ, where δ ≠ 0, then the sample size arrangement is dictated by the ratio of standard deviations, n1/n0 = √(p1(1 − p1)) / √(p0(1 − p0)), in the same manner as in the continuous case. If the cost of sampling subjects differs across treatment and control groups, then the ratio of the sample sizes is inversely proportional to the square root of the relative costs. Interestingly, differences in sampling costs have exactly the same effect on relative sample sizes of treatment and control groups as differences in variances.
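The allocation rule stated above can be sketched as a one-line helper. This is a hypothetical illustration; the cost and variance inputs are invented.

```python
# Sketch of the allocation rule above: the optimal ratio n1/n0 is
# proportional to relative standard deviations and inversely proportional
# to the square root of relative sampling costs. Inputs are hypothetical.

import math

def allocation_ratio(sigma1, sigma0, cost1, cost0):
    """Optimal n1/n0 when a treatment observation costs cost1
    and a control observation costs cost0."""
    return (sigma1 / sigma0) * math.sqrt(cost0 / cost1)

# Equal variances, but treatment is four times as costly to sample:
print(allocation_ratio(1.0, 1.0, 4.0, 1.0))  # 0.5 -> half as many treated
```

The example shows the symmetry noted in the text: quadrupling the cost of treatment observations has the same effect on the allocation as halving their standard deviation.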

In those instances where the unit of randomization differs from the unit of observation, special attention must be paid to the correlation in outcomes between units in the same treated cluster. Specifically, the number of observations required is multiplied by 1 + (m − 1)ρ, where ρ is the intracluster correlation coefficient and m is the size of each cluster. The optimal size of each cluster increases with the ratio of the within to the between cluster standard deviation, and decreases with the square root of the ratio of the cost of sampling a subject to the fixed cost of sampling from a new cluster. Since the optimal number of subjects per cluster is independent of the available budget, the experimenter should first determine how many subjects to sample in each cluster and then sample from as many clusters as the budget permits (or until the optimal total sample size is achieved). 24
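The standard design-effect multiplier, 1 + (m − 1)ρ with m the cluster size and ρ the intracluster correlation, can be illustrated numerically. The figures below are hypothetical.

```python
# Design-effect sketch for cluster-level randomization: the required
# number of observations is inflated by 1 + (m - 1) * rho, where m is
# the cluster size and rho the intracluster correlation. Numbers invented.

def clustered_n(n_individual, m, rho):
    """Inflate an individual-level sample size for clustered assignment."""
    return n_individual * (1 + (m - 1) * rho)

# 100 subjects suffice under individual randomization; clusters of 20
# with a modest within-cluster correlation nearly triple the requirement.
print(clustered_n(100, m=20, rho=0.1))  # approx. 290
```

Even a small ρ bites hard when clusters are large, which is why the text stresses getting the per-cluster size right before deciding how many clusters to sample.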

A final class of results pertains to designs that include several levels of treatment, or more generally when the treatment variable itself is continuous, but we assume homogeneous treatment effects. The primary goal of the experimental design in this case is to simply maximize the variance of the treatment variable. For example, if the analyst is interested in estimating the effect of treatment and has strong priors that the treatment has a linear effect, then the sample should be equally divided on the endpoints of the feasible treatment range, with no intermediate points sampled. Maximizing the variance of the treatment variable under an assumed quadratic, cubic, quartic, etc., relationship produces unambiguous allocation rules as well: in the quadratic case, for instance, the analyst should place half of the sample equally distributed on the endpoints and the other half on the midpoint. More generally, optimal design requires that the number of treatment cells used should be equal to the highest polynomial order of the anticipated treatment effect, plus one.
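The allocation rule for polynomial treatment responses can be sketched as a small helper that returns the treatment levels to sample. The treatment range and polynomial order below are hypothetical inputs; note that the rule in the text also dictates unequal weights (e.g. in the quadratic case, half the sample on the midpoint), which this sketch leaves to the reader.

```python
# Sketch of the design rule above: with an anticipated polynomial treatment
# response of order k, sample k + 1 treatment cells spread over the feasible
# range. The range and order here are hypothetical.

def design_points(t_min, t_max, order):
    """Treatment levels to sample for a polynomial effect of given order."""
    k = order + 1  # number of cells: highest polynomial order plus one
    step = (t_max - t_min) / (k - 1)
    return [t_min + i * step for i in range(k)]

print(design_points(0.0, 10.0, order=1))  # endpoints only: [0.0, 10.0]
print(design_points(0.0, 10.0, order=2))  # adds the midpoint: [0.0, 5.0, 10.0]
```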

1.2.4 Further considerations

In light of the differences between field experimentation and other empirical methods—lab experiments and using observational data—it is important to discuss some perceived differences and potential obstacles associated with this research agenda. One shortcoming of field experiments is the relative difficulty of replication vis-à-vis lab experiments. 25 As Fisher (1926) emphasized, replication is an important advantage of the experimental methodology. The ability of other researchers to quickly reproduce the experiment, and therefore test whether the results can be independently verified, not only serves to generate a deeper collection of comparable data but also provides incentives for the experimenter to collect and document data carefully.

There are at least three levels at which replication can operate. The first and most narrow of these involves taking the actual data generated by an experiment and reanalyzing the data to confirm the original findings. A second notion of replication is to run an experiment which follows a similar protocol to the first experiment to determine whether similar results can be generated using new subjects. The third and most general conception of replication is to test the hypotheses of the original study using a new research design.

Lab experiments and many artefactual and framed field experiments lend themselves to replication in all three dimensions: it is relatively straightforward to reanalyze existing data, to run new experiments following existing protocols, and (with some imagination) to design new experiments testing the same hypotheses.

With natural field experiments, the first and third types of replication are easily done (i.e. reanalyzing the original data or designing new experiments), but the second type of replication (i.e. re-running the original experiment, but on a new pool of subjects) is more difficult. This difficulty arises because, by their very nature, many field experiments are opportunistic: they require the cooperation of outside entities or practitioners, or detailed knowledge of and the ability to manipulate a particular market.

Another consideration associated with field experiments relates to ethical guidelines (Dunford, 1990; Levitt and List, 2009). The third parties that field experimenters often need to work with can be concerned by the need to randomize units of observation into treatments. The benefits of such an approach need to be conveyed, along with a practical sense of how to achieve it. For example, given resource constraints, practitioners are typically unable to roll out interventions to all intended recipients immediately. The field experimenter can then intervene to randomly assign the order in which individuals are treated (or offered treatment), rather than whether they eventually receive the treatment at all.

With the onset of field experiments, new issues related to informed consent naturally arise. Ethical issues surrounding human experimentation are of utmost importance. The topic of informed consent for human experimentation was recognized as early as the nineteenth century (Vollmann and Winau, 1996), but the principal document to provide guidelines on research ethics was the Nuremberg Code of 1947. The Code was a response to the malfeasance of Nazi doctors, who performed immoral acts of experimentation during the Second World War. The major feature of the Code was that voluntary consent became a requirement in clinical research studies, where consent can be voluntary only if subjects: (i) are physically able to provide consent; (ii) are free from coercion; and (iii) can comprehend the risks and benefits involved in the experiment.

What is right for medical trials need not be appropriate for the social sciences. To adopt the Nuremberg Code wholesale for field experiments, without considering the implications, would be misguided. In medical trials, it is sensible to have informed consent as the default because of the serious risk potential in most clinical studies. In contrast, the risks posed in some natural field experiments in economics are small or nonexistent, although such risks are almost certain to become more heterogeneous across field experiments as this research method becomes more prevalent. Hence, while there might be valid arguments for making informed consent the exception rather than the rule in a field experimental context, covert experimentation remains hotly debated in the literature. For more detailed discussions, the interested reader should see Dingwall (1980) and Punch (1985).

There are certain cases in which seeking informed consent directly interferes with the ability to conduct the research (Homan, 1991). For example, for years economists have been interested in measuring and detecting discrimination in the marketplace. Labor market field studies represent perhaps the deepest body of work in the area of discrimination. The work in this area can be parsed into two distinct categories: personal approaches and written applications.

Personal approaches include studies that have individuals either attend job interviews or apply for employment over the telephone. In these studies, the researcher matches two testers that are identical along all relevant employment characteristics except the comparative static of interest (e.g., race, gender, age), and after appropriate training the testers approach potential employers who have advertised a job opening. Researchers “train” the subjects simultaneously to ensure that their behavior and approach to the job interview are similar.

Under the written application approach, which can be traced to Jowell and Prescott-Clarke (1970), carefully prepared written job applications are sent to employers who have advertised vacancies. The usual approach is to choose advertisements in daily newspapers within some geographic area to test for discrimination. Akin to the personal approaches, great care is typically taken to ensure that the applications are similar across several dimensions except the variable of interest.

It strikes us as unusually difficult to explore whether, and to what extent, race or gender influence the jobs people receive, or the wages they secure, if one had to obtain informed consent from the discriminating employer. For such purposes, it makes sense to consider executing a natural field experiment. This does not suggest that in the pursuit of science moral principles should be ignored. Rather, in those cases Local Research Ethics Committees and Institutional Review Boards (IRBs) in the US serve an important role in weighing whether the research will inflict harm, gauging the extent to which the research benefits others, and determining whether experimental subjects selected into the environment of their own volition and are treated justly in the experiment.

1.2.5 Limits of field experiments

Clearly, labor economists rarely have the ability to randomize variables directly related to individual decisions such as educational attainment, the choice to migrate, the minimum wage faced, or retirement ages and benefits. This might in part explain why some active research areas in labor economics have been relatively untouched by field experiments, as described in more detail below. However, field experiments allow the researcher scope to randomize key elements of the economic environment that determine such outcomes. For example, in the context of educational attainment, it is plausible to design field experiments that create random variation in the monetary costs of acquiring education, information on the potential returns to education, knowledge of the potential costs and benefits of education, or the quality of inputs into the educational production function. Given the early and close involvement of researchers, and the fact that primary data collection is part of a field experiment, there is always the potential to mitigate the measurement error and omitted variables problems that are prevalent in labor economics (Angrist and Krueger, 1999).

Social experiments and field experiments are relatively easy for policy makers to understand. When designed around the evaluation of a particular policy or intervention, it is more straightforward to conduct a cost benefit analysis of the policy than would be possible through other empirical methods. As discussed before, a concern with social experiments relates to sample attrition. While such attrition is less relevant in many field experiments, it is important to be clear that this often comes at the cost of field experiments evaluating relatively short run impacts of any given intervention. How outcomes evolve over time in the absence of the close scrutiny of the experimenter, and how interventions should be scaled up to other units and other implementers, remain questions that field experimenters will always have to confront directly. Along these lines, we will showcase a number of field experiments in which researchers have combined random variation they have engineered to identify reduced form causal effects, with structural modeling to make out of sample predictions.

A second broad category of concerns for field experiments relates to sample selection. These concerns can take a number of forms relating to the non-random selection of individuals, organizations, and interventions. At the individual level, and in cases in which written consent is required, as for social and laboratory experiments, the self-selection of individuals into the field experiment needs to be accounted for. Relatedly, the timing of decisions over who is potentially eligible to participate is critical, and potentially open to manipulation or renegotiation.

At the organizational level, there exist concerns over whether we observe a non-random selection of organizations, or whether practitioners self-select to be subject to a field experiment. Similar concerns arise for social experiments, often stemming from political economy considerations.

Finally, at the intervention level, a concern is that practitioners, with whom field experimenters typically need to work, might only be willing to consider introducing interventions along dimensions they a priori expect to have beneficial effects. On the one hand this raises the question of why such practices have not been adopted already. On the other hand, one benefit of field experimentation might be that, through closer ties between researchers and practitioners, the latter are prompted to think and learn about how they might change their behavior in privately optimal ways, and can be assured they will be able to provide concrete evidence of any potential benefits of such changes.

A third category of concerns relates to how unusual the intervention is. Although many parameters can be experimentally varied, it is important to focus on those parameters that would naturally vary across economic environments, and to calibrate the magnitude of induced variations to the range of parameter values actually observed in similar economic environments. It will be hard to generalize from unusual types of variation, variations of implausible magnitude, or variations that do not accord with theory, and such variation will not easily map back to an underlying model. At the very least, care needs to be taken to separately identify whether responses to interventions reflect changes in equilibrium behavior that will persist in the long run, or agents' short-run learning about how to behave in the new or unusual circumstances induced by the experimenter.

Fourth, there can sometimes be concerns that the third parties researchers collaborate with might be under resource constraints that lead the same set of implementers to deal simultaneously or sequentially with treated and control populations. Such implementation might lead to contamination effects and some of the other biases discussed above in relation to social experiments. This can motivate the use of within-subject designs, where the researchers engineer an exogenously timed change to the economic environment, rather than between-subject designs. The field experiments we discuss in later sections utilize both approaches.

Taken together, most of these concerns can be summarized as relating to the “external validity” of any field experiment—namely the ability to extrapolate meaningfully outside of the specific economic environment considered. This concern remains central to assessing the worth of many field experiments. Field experiments almost inevitably face a trade-off between understanding the specifics of a given context and the generalizability of their findings. This trade-off can be eased by implementing a field experiment that considers the sources of heterogeneous effects, or that combines reduced form estimates based on exogenous variation with structural modeling to predict responses to alternative interventions, or to the same intervention in a slightly different economic environment.

Finally, it is worth reiterating that although primary data collection is a key element of field experimentation, it raises the costs of entry and might limit the number of experimenters relative to purely lab-based approaches. As will be apparent in the remainder of this chapter, there remain many issues in labor economics that field experiments have yet to penetrate. In part these limits might be due to a lack of opportunities; in some cases it might be because the activities under study are clandestine or illegal, although we will discuss carefully crafted field experiments that explore issues of racial discrimination, for example. In other cases it is because the nature of the research question is simply not amenable to field experimentation. For example, questions relating to the design of labor market institutions are likely to remain outside the realm of field experimentation. In these and other cases, the controlled environment of the laboratory is the ideal starting point for economic research. Indeed, in this volume, Charness and Kuhn (2011) discuss the large laboratory-based literature on multiple aspects of the design of labor markets—such as market clearing mechanisms and contractual incompleteness. More generally, they discuss in detail the relative merits of laboratory and field experiments. We share their view that no one research method dominates the other, and that in many scenarios a combination of methods is likely to be more informative.

1.3 Research in labor economics

An enormous range of research questions are addressed by labor economists today. While the core issues studied by labor economists have always related to labor supply, labor demand, and the organization of labor markets, to focus our discussion, we limit attention to a select few topics. These reflect long-standing traditional areas of work in labor economics.26

First, since the seminal contributions of Gary Becker and Jacob Mincer, research in labor economics, particularly related to labor supply, has placed much emphasis on understanding individual decision making with regards to the accumulation of human capital. This emphasis has widened the traditional purview of labor economists to include all decision making processes that affect human capital accumulation. These decisions are as broad as those taken in the marriage market, within the household, and those on the formation of specific forms of human capital such as investments into crime. By emphasizing the role of individual decision making, subfields in labor related to the accumulation of human capital might be especially amenable to the use of field experiments.

Second, the empirical study of labor demand has been similarly revolutionized by the rapid increase in the availability of panel data on individuals, the personnel records of firms, and matched employer-employee data.27 This has driven, and fed back into, research on various aspects of labor demand such as labor mobility, wage setting, rent sharing, and more generally, the provision of incentives within organizations. As this set of questions is again motivated by understanding the behavior of individuals and firms, there are rich possibilities to advance knowledge in related subfields through the use of carefully crafted field experiments. Field experiments also offer the potential for researchers to lead data collection efforts.

To cover these broad areas, we loosely organize the discussion so as to follow an individual as they make important labor related decisions over their life cycle. Hence we discuss the role of field experiments in answering questions relating to early childhood interventions and the accumulation of human capital in the form of education and skills. We then consider research questions related to the demand and supply of labor, and labor market discrimination. We then move on to consider research questions related to behavior within firms: how individuals are incentivized within firms, and other aspects of the employment relationship. Finally, we end with a brief discussion of the nascent literature on field experiments related to household decision making.

Table 2 shows the number of published papers in selected subfields of labor economics in the decade prior to the last volume of the Handbook of Labor Economics (1990-99), and over the last decade (2000-09). The table is based on all published papers in the leading general interest journals of The American Economic Review, Econometrica, The Journal of Political Economy, The Quarterly Journal of Economics and the Review of Economic Studies.28 We use the Journal of Economic Literature classifications to place journal articles into one subfield within labor economics.29

Table 2 Published research in labor economics by decade.


Table 2 highlights a number of trends in published research in labor economics. First, the number of labor economics papers published in the top-tier general interest journals has not changed much over time: 278 were published between 1990 and 1999, and 315 between 2000 and 2009. Some of this increase probably reflects an increased number of papers in these journals as a whole, rather than changes in the relative importance of labor economics to economists. Examining the data by subfield, we do see changes in the composition of published papers in labor. There are large increases in the number of papers relating to: (i) education and the formation of human capital; (ii) firm behavior and personnel economics; (iii) household behavior; and (iv) crime. Some of these increases reflect the wider availability of data described above, such as personnel data from firms, matched employer-employee data sets, and primary data collected on households. Field experiments—an important component of which is primary data collection—are well placed to reinforce these trends. Indeed, below we discuss how field experiments have contributed to the first three of these areas in which there has been an increase in labor economics papers.

We observe a decline in papers on the organization of labor markets—an area in which few field experiments have been conducted, in part because these questions are not well suited to field experimentation. Finally, the remaining subfields, on the demand and supply of labor and on ageing and retirement, have remained relatively stable over the last two decades. Field experiments remain scarce here, but there might be particularly high returns from such research designs being utilized.

Second, the balance between theoretical and empirical work has remained relatively constant over the two decades. In both time periods, roughly twice as many empirical as theoretical papers have been published in labor economics. We do not know whether, in other areas of economics, approximately a third of published papers are theoretical, but as will be emphasized throughout, labor economics has no shortage of theories whose empirical relevance carefully designed field experiments can help determine. Within each subfield, more empirical than theoretical papers are nearly always published, with the exception of research into firm behavior and personnel economics, a pattern that holds across both decades. The ratios of theory to empirical papers vary considerably across subfields: areas such as the demand for education and the formation of human capital have four to five times as many empirical as theoretical papers, and the subfield of crime has been largely empirically driven.

1.3.1 How have labor economists used field experiments?

Table 3 presents evidence on the approach used by published papers in labor economics over the last decade.30

Table 3 Published papers 2000-9 by subfield and empirical method.


Three factors stand out. First, field experiments have been widely used in labor economics over the past decade, with 25 published papers using this research methodology in some form. Indeed, despite the surge in papers using laboratory experiments, over the last decade more papers published in the top-tier journals have employed field experiments. However, the number of empirical papers employing field experiments is still dwarfed by other empirical methodologies: 25 papers employ field experiments, compared to 60 utilizing natural experiments and 129 using non-experimental methods.

Second, the use of field experiments has thus far been concentrated in a relatively small number of subfields in labor economics. Of the 25 published field experiments, three framed field experiments have been concerned with investments in education early in the life cycle, three natural field experiments have focused on the evaluation of specific labor market programs, and five natural field experiments have focused on incentives within firms.

In other subfields, such as the determinants of wages and labor market discrimination, only one field experiment has currently been published, in contrast to four laboratory experiments. We view many research questions on discrimination in labor markets as particularly amenable to study using field experiments; hence this is one area in which field experiments have been relatively undersupplied. Finally, the subfield of crime, which as documented in Table 2 has grown almost exclusively through empirical papers, remains completely untouched by field experiments.

The third major fact to emerge from Table 3 is that there is a large supply of theory in labor economics, as previously noted in Table 2, and Table 3 shows that this supply spans all the subfields of labor economics. As we view carefully crafted field experiments as potentially able to test between different theories, many areas of the study of labor—across the life cycle from birth to retirement—would seem amenable to this method, which can in turn give feedback on directions for future theoretical advancement.

To develop this point further, Table 4 provides a breakdown of how theory and evidence have been combined in labor economics, broken down by empirical method.

Table 4 Testing theory by empirical method, published papers 2000-9.


Two factors stand out. First, non-experimental papers are slightly more likely than field experiments to use no theory. Second, testing between more than one theory remains scarce, irrespective of the empirical approach. Although not all empirical papers should necessarily test theory (it is as important to establish facts on which future theory can be built), when testing between theories it is important to establish the power of these tests, to provide refutability or falsification checks, and to present evidence of the internal validity of the results. Natural field experiments might have a comparative advantage along such dimensions. Because such settings relate to real world behaviors, individuals are typically not restricted in how they respond to a change in their economic environment, which opens up the possibility of detecting behavior consistent with multiple theories.
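To fix ideas on what establishing the power of a test involves, the following sketch computes the number of subjects needed per treatment arm to detect a given standardized effect, using the textbook normal approximation for a two-sided two-sample comparison of means. The effect sizes and the conventional significance and power levels below are illustrative choices, not figures drawn from any study discussed here.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(effect_sd, alpha=0.05, power=0.8):
    """Subjects per arm needed to detect a treatment-control difference
    of `effect_sd` standard deviations, two-sided test at level `alpha`,
    using the standard normal approximation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z(power)           # quantile delivering the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_sd) ** 2)

print(sample_size_per_arm(0.2))  # small effect: 393 per arm
print(sample_size_per_arm(0.5))  # medium effect: 63 per arm
```

The calculation makes plain why field experiments testing for subtle differences between theories, rather than for the gross presence of an effect, can require substantially larger samples.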

Mirroring the discussion in Moffitt (1999), a second feature of how best to use field experiments, which we aim to emphasize throughout, is the need to combine field experiments with other research methodologies. For example, they might be combined with structural estimation, utilize a combination of evidence from the laboratory and the field, or draw inspiration from lab findings to establish plausible null and alternative hypotheses to test between.

Applying the full spectrum of approaches in trying to answer a single question can yield extra insights. A first example of such research relates to the importance of social preferences, which have been documented in numerous lab and field settings. To explore social preferences using a variety of approaches, List (2006) conducts artefactual, framed, and natural field experiments analyzing gift exchange. The games have buyers making price offers to sellers, and in return sellers select the quality level of the good provided to the buyer. Higher quality goods are costlier for sellers to produce than lower quality goods, but are more highly valued by buyers. The artefactual field experimental results mirror the typical findings with other subject pools: strong evidence for social preferences was observed through a positive price and quality relationship. Similarly constructed framed field experiments provide similar insights. Yet, when the environment is moved to the marketplace via a natural field experiment, where dealers are unaware that their behavior is being recorded as part of an experiment, little statistical relationship between price and quality emerges.

A second example comes from the series of field experiments presented in List (2004b)—from artefactual to framed to natural—in an actual marketplace to help distinguish between the major theories of discrimination: animus and statistical discrimination. Using data gathered from bilateral negotiations, he finds a strong tendency for minorities to receive initial and final offers that are inferior to those received by majorities in a natural field experiment. Yet, much like the vast empirical literature documenting discrimination that exists, these data in isolation cannot pinpoint the nature of discrimination. Under certain plausible scenarios, the results are consonant with at least three theories: (i) animus-based or taste-based discrimination, (ii) differences in bargaining ability, and (iii) statistical discrimination. By designing allocation, bargaining, and auction experiments, List (2004b) is able to construct an experiment wherein the various theories provide opposing predictions. The results across the field experimental domains consistently reveal that the observed discrimination is not due to animus or bargaining differences, but represents statistical discrimination.

1.4 Summary

We now move to describe, by various stages of the life cycle, how field experiments have been utilized in labor economics and the insights they have provided. Where appropriate, we discuss how these results have complemented or contradicted evidence using alternative research methods, and potential areas for future field experiments. We begin with the individual at birth and the accumulation of human capital before they enter the labor market. We then consider research questions related to the demand and supply of labor, and labor market discrimination. We then move on to consider research questions related to behavior within firms: how individuals are incentivized within firms, and other aspects of the employment relationship. Finally, we end with a brief discussion of the nascent literature on field experiments related to household decision making.

2 Human Capital

The literature associated with human capital acquisition prior to labor market entry is vast, and there is no room here to do it justice. As in the sections that follow, we therefore focus our discussion on a select few strands of this research and describe how field experiments can and have advanced knowledge within these strands. Even within this narrower branch of work, we are limited to focusing on select studies of inputs into the educational production function, where these inputs might be supplied by the school system, students, or their families.31

To see the issues, we follow Glewwe and Kremer’s (2006) presentation of a framework for the education production function with the following reduced form representation,

S = f(C, H, Q, P)     (6)

A = g(S, C, H, Q)     (7)

where S is years of schooling, A is skills learned (achievement), C is a vector of child characteristics (including “innate ability”), H is a vector of household characteristics, Q is a vector of school and teacher characteristics (quality), and P is a vector of prices related to schooling. Q and P are both functions of education policies (EP) and local community characteristics (L), which can be substituted into Eqs (6) and (7) to yield the following reduced forms,

S = f*(C, H, EP, L)     (8)

A = g*(C, H, EP, L)     (9)

Similar to Mincerian human capital earnings functions, this framework yields estimates of the partial equilibrium effects of educational inputs and policies, rather than general equilibrium effects that alter the returns to education and thereby the demand for it. Broadly, there are two approaches to estimating the production function.

The first focuses on measuring the effect of direct inputs, such as per pupil expenditure, class size, teacher quality and family background (i.e., estimating Eqs (6) and (7)). The second examines the effects of educational policies governing the structure of the school system (i.e., estimating Eqs (8) and (9)). In both cases, non-experimental and experimental estimates have shed light on the relationships in the education production function (for a literature survey see Hanushek (1986)). To help place field experiments in this area in a wider context, we now turn to a non-exhaustive discussion of select work using both approaches, but not based on field experiments.

2.1 Measuring the effects of direct inputs

An early measurement study focusing on the effect of direct inputs is the report by Coleman et al. (1966), who explored what fraction of the variation in student achievement could be explained by direct inputs. The Coleman report found only a weak association between school inputs and outputs. Subsequent regression based approaches largely replicated the findings of the Coleman report. Yet, one remarkably consistent result did emerge from these early studies: students situated in classrooms with a larger number of students outperformed children in smaller classes on standardized tests. This result is robust to the inclusion of several conditioning variables, such as key demographic variables.

One aspect that this robust empirical finding highlights is the care that must be taken to ensure that reverse causality and omitted variable bias do not frustrate proper inference. Given that the simple regression approach potentially suffers from biases due to endogeneity of policy placement, omitted variables, and measurement error—i.e., it is almost always the case that some unobserved element of the vectors of child or household characteristics will be correlated with both the outcome and the observed variables of interest—researchers have sought out other means to explore the parameters of the production function.
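The endogeneity concern can be made concrete with a small simulation, hypothetical in all of its parameter values: when unobserved ability is correlated with class size (say, because stronger students cluster in popular schools with larger classes), a naive regression can even reverse the sign of a true negative class size effect, whereas randomized placement recovers it.

```python
import random

random.seed(1)

def ols_slope(x, y):
    """Closed-form OLS slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

n = 5000
ability = [random.gauss(0, 1) for _ in range(n)]  # unobserved input

# Endogenous placement: higher-ability students end up in larger classes.
size_endog = [25 + 5 * a + random.gauss(0, 2) for a in ability]
# Randomized placement: class size is unrelated to ability.
size_rand = [25 + random.gauss(0, 5) for _ in range(n)]

def score(size, abil):
    # True causal effect of class size is negative: -0.1 per student.
    return -0.1 * size + abil + random.gauss(0, 1)

y_endog = [score(s, a) for s, a in zip(size_endog, ability)]
y_rand = [score(s, a) for s, a in zip(size_rand, ability)]

print(ols_slope(size_endog, y_endog))  # biased upward: positive despite a true -0.1
print(ols_slope(size_rand, y_rand))    # close to the true -0.1
```

The biased estimate in the first regression mirrors the counterintuitive positive class size coefficients of the early literature.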

One such approach uses natural experiments. A neat example is the work of Angrist and Lavy (1999), who use legal rules to estimate the effect of class size on student performance. Assume legal limits prevent class size from exceeding 25 students, and consider a particular school with cohorts ranging from 70 to 100. If a cohort includes 100 children, we would have four classrooms of 25, whereas if the cohort includes 76 children, we end up with four classrooms of 19 children each. Angrist and Lavy (1999) compare standardized test scores across students placed in different sized classrooms and find that a ten-student reduction in class size raises standardized test scores by about 0.2 to 0.3 standard deviations.
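The mechanics of such a rule can be sketched as follows. The cap of 25 matches the hypothetical in the text (Angrist and Lavy's actual Israeli rule, Maimonides' rule, uses a cap of 40), and the function is an illustrative reconstruction rather than their exact specification.

```python
import math

def predicted_class_size(cohort, cap=25):
    """Average class size implied by a maximum-class-size rule: the
    school opens the fewest classes consistent with the cap and splits
    the cohort evenly among them."""
    n_classes = math.ceil(cohort / cap)
    return cohort / n_classes

print(predicted_class_size(100))  # 25.0: four classes of 25
print(predicted_class_size(76))   # 19.0: four classes of 19
print(predicted_class_size(75))   # 25.0: three classes of 25
```

The sharp drop in predicted class size as a cohort crosses a multiple of the cap (from 25 at a cohort of 75 to 19 at 76) is the arguably exogenous variation that the natural experiment exploits.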

As Keane (forthcoming) points out, this type of approach has drawbacks similar to those of the simple regression framework in the Coleman report. For instance, incoming cohort sizes might not be determined randomly because high performing schools attract more students. Likewise, cohort size might be affected by parents reacting to large class sizes by sending their kids elsewhere for schooling. Similar issues surround teacher assignment to small and large classrooms, which might not be randomly determined.

In this way, the Angrist and Lavy (1999) estimates should be viewed as a first step in understanding the importance of class size for student performance. The next step is to deepen our understanding by exploring the robustness of these results. One approach is to look for more observational data; another is to use randomization directly—similar to the accidental randomization of natural experiments, purposeful randomization can aid the scientific inquiry.

A central figure in using randomization in the area of education is William McCall, an education psychologist at Columbia University who, at odds with his more philosophical contemporaries, insisted on quantitative measures to test the validity of education programs. For his efforts, McCall is credited as an early proponent of using randomization rather than matching as a means to exclude rival hypotheses, and his work continues to influence the field experiments conducted in education today.32

A landmark social experiment measuring the effects of class size is the Tennessee STAR experiment. In this intervention, more than 10,000 students were randomly assigned to classes of different sizes from kindergarten through third grade. Like the social experiments discussed in Section 1, the STAR experiment suffered from both attrition bias and selection problems, in that some students changed from larger to smaller classrooms after the assignment had occurred. Nevertheless, even after taking these problems into account, Krueger (1999) put together a detailed analysis suggesting that there are achievement gains from studying in smaller classes.
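A small simulation, with all parameter values hypothetical, illustrates why such post-assignment switching matters and one standard way of handling it: comparing students by the class they actually attended ("as treated") is biased when switching is selective, whereas comparing by the original random assignment (intention-to-treat) remains unbiased for the effect of being offered a small class, albeit diluted by the switchers.

```python
import random

random.seed(2)
n = 20000
true_effect = 0.2  # hypothetical SD gain from attending a small class

rows = []
for _ in range(n):
    assigned_small = random.random() < 0.5
    ability = random.gauss(0, 1)
    # Selective switching: advantaged students assigned to large
    # classes are more likely to move into small ones.
    attends_small = assigned_small or (ability > 1 and random.random() < 0.5)
    test_score = true_effect * attends_small + ability + random.gauss(0, 1)
    rows.append((assigned_small, attends_small, test_score))

def mean(xs):
    return sum(xs) / len(xs)

# As-treated: compare by the class actually attended (biased upward,
# since switchers are positively selected on ability).
as_treated = (mean([s for a, t, s in rows if t]) -
              mean([s for a, t, s in rows if not t]))
# Intention-to-treat: compare by the original random assignment
# (unbiased for the effect of the offer, diluted by the switchers).
itt = (mean([s for a, t, s in rows if a]) -
       mean([s for a, t, s in rows if not a]))

print(as_treated, itt)  # as-treated overstates 0.2; ITT stays near it
```

The gap between the two estimates is one reason analyses of STAR-type data anchor on the initial random assignment rather than on realized class attendance.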

Combined, these two examples indicate that there very well might be a statistically meaningful relationship between class size and academic achievement, but the broader literature has not concluded this to be necessarily true. Scanning the entire set of estimates from natural experiments and field experiments, one is left with mixed evidence on the effects of class size at various tiers of the education system (Angrist and Lavy, 1999; Case and Deaton, 1999; Hoxby, 2000a; Kremer, 2003; Krueger, 2003; Hanushek, 2007; Bandiera et al., 2010).

Lazear (2001) theorizes that the effect of class size depends on the behavior of students. As disruptive students are a detriment to the learning of their entire class, he proposes that the optimal class size is larger for better-behaved students. In his model, larger classes may be associated with higher student achievement, which may in part explain the mixed results in previous studies. This is one area where a natural field experiment might help. One can envision that a test of Lazear's (2001) theory is not difficult if the researcher takes the data generation process into her own hands: designing experimental treatments that interact class size with student behavior would permit estimation of the parameters of interest for measures of both class size and peer inputs into the educational production function.
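The logic of Lazear's (2001) result can be seen in a back-of-the-envelope version of his model; this sketch abstracts from the cost side that his full analysis includes. If each of n students is non-disruptive with probability p, and any single disruption halts learning for the whole class, expected learning per class is proportional to n·p^n, which is maximized near n* = -1/ln(p), a quantity increasing in p.

```python
import math

def optimal_class_size(p):
    """Class size maximizing n * p**n, the expected number of
    student-periods of learning when each of n students is
    non-disruptive with probability p and one disruption stops the
    whole class. Continuous approximation: n* = -1 / ln(p)."""
    return -1 / math.log(p)

# Better-behaved student bodies (higher p) imply larger optimal classes.
for p in (0.95, 0.98, 0.99):
    print(p, round(optimal_class_size(p), 1))
```

The comparative static is the testable content: holding other inputs fixed, class size should matter less, or even positively, where student behavior is good, exactly the interaction a field experiment crossing class size with peer composition could estimate.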

The results from this literature, more generally, make it clear how one could move forward with a research agenda based on field experimentation. For instance, are there critical non-linearities in the relationship between class sizes and academic performance, as has been suggested for university class sizes? Given the evidence on peer effects in the classroom (Hoxby, 2000b; Zimmerman, 2003; Angrist and Lang, 2004; Hoxby and Weingarth, 2006; Lavy et al., 2008; De Giorgi et al., 2009; Duflo et al., 2009), it might also be the case that gender balance plays a key role in the classroom.

Even if we were to find strong evidence that class size matters for academic performance, and were to answer the questions posed above, Eqs (6)-(9) highlight other features that we must be aware of before pushing such estimates too far. What is needed is proper measurement of the parameters of the production function, as well as an understanding of the decision rules of school administrators and parents. The next step is to deepen our understanding by exploring whether other, more cost effective, approaches to improving student achievement exist, say by understanding the optimal investment stream in students: at what age are resources most effective in promoting academic achievement?

One line of work that addresses this question is the set of social experiments that explore achievement interventions before children enter school. Given that Fryer (2011) presents a lucid description of such interventions, we only briefly mention them here. The landmark social experiment in this area is the Perry Preschool program, which involved 64 students in Michigan who attended the Perry Preschool in 1962. Since then, dozens of other programs have arisen that explore what works in early childhood intervention, including Head Start, the Abecedarian Project, Educare, Tulsa’s universal pre-kindergarten program, and several others too numerous to list (see Fryer’s Table 5).33

As Fryer (2011) notes, outcomes in these programs exhibit substantial variance. And even in those cases that were met with great success, the achievement gains faded over time. Indeed, in many cases once school started the students in these programs gave back all of their academic gains (Currie and Thomas, 1995, 2000; Anderson, 2008). Another feature of the bulk of these programs is their homogeneity, with most following the general design of the Perry Preschool program. Much has been learned about early childhood development in the past several decades, and this presents the field experimenter with a unique opportunity to make large impacts on children’s lives. As Fryer (2011) notes, incorporating new insights from biology and developmental psychology represents an opportunity for future research.

Such estimates cause us to pause and ask whether resource expenditures affect academic performance at all. In this spirit, there is a large literature that explores how direct school inputs, such as school expenditures, influence student performance. As a whole, the early literature found only a weak relationship between overall school expenditures and student achievement, primarily because resources tend to be allocated inefficiently (see Banerjee et al., 2001; Angrist and Lavy, 2002; Kremer et al., 2002; Glewwe et al., 2003, 2004; Banerjee et al., 2007).

2.2 Teacher quality

While the evidence on the effect of per pupil expenditure, class size, and peer composition is mixed, teacher quality has been found to be clearly important. Hanushek (2007) finds that the differences between schools can be attributed primarily to teacher quality differences. Little of this variation, however, can be explained by either teacher salaries or observable characteristics such as education and experience (Rivkin et al., 2005; Hanushek, 2006). Just as it is difficult to identify high quality teachers, little is known about how to improve teacher quality and performance. Given the evidence that education and professional development are largely ineffective, there is a growing interest in the use of performance-based incentives to improve teacher quality and effort.

The design and implementation of such incentives raises several areas of future study that observational data and field experiments can adequately fill, including: (i) what are the performance effects on the incentivized tasks and how can incentives be designed to cost-effectively maximize these effects; (ii) what are the effects on non-incentivized tasks, and how can incentives be designed to avoid diversion of effort in multitasking; (iii) how do teachers (of different quality) sort into different incentive and pay structures; and (iv) how does sorting affect general equilibrium teacher quality.

Evidence from non-experimental studies, natural experiments, and field experiments suggests that incentives can improve teacher performance (Lavy, 2002; Glewwe et al., 2003; Figlio and Kenny, 2007; Muralidharan and Sundararaman, 2007; Lavy, 2009; Duflo et al., 2009). Clearly, tighter links can be established between this literature and the larger labor literature on incentive design (Prendergast, 1999) to which, as discussed below, field experiments are also beginning to contribute.

More broadly, field experiments exploring mechanism design issues, such as comparing piece rate and tournament incentives, are rare. Also, these programs generally load incentives onto a single performance measure, such as teacher attendance or student test scores, raising concerns that teachers might divert effort away from non-incentivized tasks (Holmstrom and Milgrom, 1991). Here the evidence is mixed, with some studies finding broad improvements in teacher effort (Kremer, 2006; Jacob, 2005).

Similarly, teacher sorting into incentive and pay structures is largely unexplored. Lazear (2001) applies his analysis of performance pay and productivity in a company (discussed in further detail below) to teacher incentives, suggesting that the effects of incentives on sorting could be comparable to effects on teacher effort. A well-designed field experiment could explore whether and how teacher sorting on incentives occurs. In general, field experiments that apply theories about incentive design, sorting and selection from other areas of labor could make a large contribution to the teacher incentives literature. Many of these issues arise in a later section when we discuss the role of field experiments in understanding behavior within firms.

Along with the school inputs discussed above, the primary inputs into the educational production function come from students and their families. A large literature models the effect of individual characteristics, family background, and parental resources on schooling and achievement (Cameron and Heckman, 2001; Cameron and Taber, 2004). While it is impossible to randomly assign characteristics to individuals or to randomly assign children to families, quasi-experimental studies have exploited variation due to adoption in order to separately identify genetic inputs (“nature”) from parental inputs (“nurture”) (Plug and Vijverberg, 2003). Other studies focus on potential barriers to individual investment in human capital production. These include high costs of education, perhaps due to credit constraints or high discount rates, and low marginal returns to education due, for example, to poor health or lack of human capital investment prior to entering school.

Estimates from non-experimental studies and natural experiments suggest that credit constraints are of limited importance in schooling decisions (Cornwell et al., 2006; Barrera-Osorio et al., 2008; Angrist and Lavy, 2009; Maxfield et al., 2003; Kremer et al., 2009). Few of these experiments, however, explore how conditional cash transfers can be most effectively designed.

Berry (2009) develops a model of household education production in which parents’ ability to motivate their children is dampened by moral hazard. He then designs incentives to test several predictions of the model, including the ability of parents to commit and the relative efficacy of incentives awarded to parents or to children based on the relative productivity of the two parties. Similarly, Levitt et al. (2010) implement a field experiment that varies both the incentive recipient (parent or student) and the incentive mechanism (piece rate or lottery). They also compare a year-long, broad-based incentive program that motivates sustained effort to an immediate one-time incentive aimed solely at increasing effort on a single standardized test. This design allows the authors to test a model of family investment, responsiveness to incentive mechanisms, and human capital returns from varying levels of effort. Both of these field experiments illustrate that researchers can design instruments that build on and test economic theory.

While conditional cash transfers aim to induce improvements in achievement by motivating greater effort and investment, a second strand of interventions attempts to directly improve the abilities that underpin achievement. A growing area of interest in this strand focuses on investment in early childhood. Researchers argue that improving the abilities of young children can have long run returns on educational achievement, attainment, and other outcomes such as employment, crime, fertility, and health (Cunha and Heckman, 2009). Evidence from non-experimental studies, natural experiments, and field experiments suggests that early education interventions can have significant effects on lifetime outcomes (Currie and Thomas, 1995; Currie, 2001; Garces et al., 2002; Behrman et al., 2004; Todd and Wolpin, 2006; Ludwig and Miller, 2007; Heckman et al., 2010).

Most of these studies require econometric techniques, such as matching, to correct for lack of valid randomization. And all of them are limited to identifying the effect of the intervention as a whole. They are not able to explore, for example, the relative importance of educational interventions compared to interventions that increase parental investments in early childhood. Given the evidence that early childhood is a key period of development and the relatively sparse body of empirical work, field experiments could address open questions related to: (i) the short and long run returns to the various inputs of the educational production function; and (ii) how to use primary data collection and experimental design to decompose overall changes in outcomes from any given intervention into those arising from the behavioral responses of children, parents, and teachers. Akin to the literature on public and private transfers to households (Albarran and Attanasio, 2003), this second strand of research can help shed light on whether altering some inputs causes other inputs in the educational production function to be crowded in or out.

A final strand of the literature focuses on improving child health as a means of increasing school attendance rates. Estimates from natural experiments and field experiments find that health interventions have a positive and significant effect on school attendance (Bleakley, 2007; Bobonis et al., 2006; Miguel and Kremer, 2004). Miguel and Kremer (2004) expand beyond identification of individual returns to health interventions, modeling the positive externalities of deworming ignored in previous estimations. They use a field experiment randomized over schools to estimate positive externalities on the health and school attendance of untreated children in treated schools and schools neighboring treated schools. They also examine effects on test performance and estimate the health care and educational cost effectiveness of the program. As the authors argue, studies that ignore positive externalities in the comparison groups will underestimate the effect of the intervention by missing the external effects of deworming and underestimating the direct effect in comparison with an inflated baseline, biasing treatment effects towards zero. They point out that this identification problem is well recognized in the labor literature estimating the effects of job training programs on both participants and non-participants. The authors suggest an extension of their study that randomizes treatment at various levels such as within schools, across schools, and within clusters of schools.
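The identification logic here can be illustrated with a small simulation. This is our own stylized sketch, not Miguel and Kremer’s data or code: the effect sizes, sample sizes, and noise level are hypothetical. The design mirrors their argument that a naive within-school comparison both misses the externality and compares treated pupils against an inflated baseline, biasing the estimate towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical effect sizes (in test-score points) -- illustrative only.
DIRECT = 5.0   # direct effect of deworming on treated pupils
SPILL = 3.0    # positive externality on untreated pupils in treated schools
N = 10_000     # pupils per group

# Treated schools contain treated and untreated pupils; pure-control
# schools receive nothing and experience no spillover.
treated_pupils = DIRECT + SPILL + rng.normal(0, 1, N)  # direct effect + externality
untreated_same_school = SPILL + rng.normal(0, 1, N)    # externality only
pure_control = rng.normal(0, 1, N)                     # no treatment, no spillover

# Naive within-school comparison: the control group itself benefits from
# the externality, so the estimate recovers only the direct effect.
naive_estimate = treated_pupils.mean() - untreated_same_school.mean()

# School-level randomization recovers the full effect on treated pupils...
full_effect = treated_pupils.mean() - pure_control.mean()
# ...and identifies the externality itself.
externality = untreated_same_school.mean() - pure_control.mean()

print(f"naive: {naive_estimate:.2f}, full: {full_effect:.2f}, "
      f"spillover: {externality:.2f}")
```

With positive spillovers, the naive estimate falls short of the full effect by exactly the size of the externality, which is why randomizing at the school (or cluster) level is essential in this setting.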

2.3 Measuring the effects of policies governing the system

Recent studies of educational policy exploit natural experiments with randomized lotteries and variation in school district density to estimate the effects of school competition, school choice, school vouchers, school accountability and the presence of relatively autonomous public schools, such as charter schools (Clark, 2009; Cullen et al., 2006; Hoxby, 2000c; Jacob, 2004; Rouse, 1998; Angrist et al., 2002; Abdulkadiroglu et al., 2009). While proponents of expanding school choice argue that, as in other markets, choice and competition will improve overall school quality and efficiency, the empirical studies find somewhat mixed evidence on these educational policies.34

For example, among non-experimental studies, natural experiments, and field experiments, those finding that vouchers improve educational achievement include Peterson et al. (2003), Krueger and Zhu (2004), and Angrist et al. (2002, 2006). On the other hand, using randomized school lotteries, Cullen et al. (2006) find that school choice programs have little or no effect on academic achievement, and they suggest that this result may be due to parents making poor choices. Hastings and Weinstein (2008) explore this hypothesis using both a natural experiment and a natural field experiment to examine how reducing information costs affects parental choices. In the natural experiment, parents listed their preferences for schools within a district both before and after receiving information mandated by No Child Left Behind (NCLB). The natural field experiment randomized distribution of a simplified version of the NCLB information to parents who had also received NCLB information and to parents who had received no information.

This design allows the authors to measure the effect of each piece of information alone as well as their interaction. They find that information on school-level academic performance pushes parents to choose higher scoring schools (with no differences across the types of information received). Using IV estimation, they also argue that these choices lead to increased academic achievement.

Similarly, a growing body of research has begun to identify the right tail of the distribution of treatment effects among heterogeneous charter schools (Dobbie and Fryer, 2009; Hoxby and Murarka, 2009; Angrist et al., 2010). These studies rely on randomized lotteries in oversubscribed schools and can only identify the effect of a school (or school system) as a whole. They have reported suggestive evidence, however, on specific features that correlate with successful schools, such as longer days, longer school years, highly academic environments, and so on. Field experiments could be used to complement this work by separately identifying the effects of charter school innovations, such as the length of the school day and year and general environmental conditions, on the educational production function.

The field experiments discussed in this section highlight several important advantages of their usage for labor economists. For example, they can address biases in previous empirical estimates, including those from non-experimental studies. They are able to build on empirical and theoretical literature from several fields, such as education, health, and labor. Finally, they can be used to identify parameters beyond the direct return of an input into an individual educational production function and to explore mechanism design issues.

In the end, it is clear that empirical explorations into human capital acquisition prior to labor market entry are invaluable, and that there are several approaches that can be used in concert to learn more about the important parameters of interest. We argue that in this area field experiments can usefully add to the knowledge gained from naturally-occurring data, and the many low-hanging apples left to be picked give us great confidence that field experiments will only grow in importance in tackling particulars of the educational production function.

3 Labor Market Discrimination

Philosophers as far removed as Arcesilaus, Heraclitus, and Plato have written of injustice and extolled the virtues of removing it for the betterment of society. Perhaps taking a lead from these scholars, social scientists have studied extensively gender, race, and age based discrimination in the marketplace. In this section we explore the stage of the life cycle where individuals are entering the labor market. We focus mainly on discrimination in labor markets and how field experiments can lend insights into this important social issue.

We begin with a statistical overview of the data patterns in labor market outcomes across minority and majority agents. To make precise how field experiments might be carefully designed, we need to discuss theories for why such discrimination exists. The two major economic theories of discrimination that we discuss are: (i) certain populations having a general “distaste” for minorities (Becker, 1957) or a general “social custom” of discrimination (Akerlof, 1980); (ii) statistical discrimination (Arrow, 1972; Phelps, 1972), which is third-degree price discrimination as defined by Pigou (1920)—marketers using observable characteristics to make statistical inferences about productivity or reservation values of market agents.

Empirically testing for marketplace discrimination has taken two quite distinct paths: regression-based methods and field experiments. The former technique typically tests for a statistical relationship between an outcome measure, such as wage or price, and a group membership indicator. By and large, regression studies find evidence of discrimination against minorities in the marketplace.35 Field experimental studies, which have arisen over the past 35 years, typically use matched pairs of transactors to test for discrimination. Due to the control that field studies offer the experimenter, they have become quite popular and have by now been carried out in at least ten countries (Riach and Rich, 2002). Across several heterogeneous labor markets, as well as product markets as diverse as home insurance and new car sales, field studies have made a strong case that systematic discrimination against minorities is prevalent in modern societies.

While regression-based empirical studies have served to provide an empirical foundation that indicates discrimination is prevalent in the marketplace, they have been less helpful in distinguishing the causes of discrimination. As Riach and Rich (2002) note, findings from field studies appear to be more consistent with the majority white populations having a general “distaste” for minorities in the sense of Becker (1957) or a general “social custom” of discrimination in line with Akerlof (1980); but statistical discrimination (Arrow, 1972; Phelps, 1972), or marketers using observable characteristics to make statistical inference about productivity or reservation values of market agents, for example, cannot be ruled out, ex ante or ex post.

Before one can even begin to discuss social policies to address discrimination, it is critical to understand the causes of the underlying preferential treatment that certain groups receive. As has been emphasized throughout, the potential for field experiments to be explicitly designed to test between theories is a key advantage of this approach over other methodologies. In this section, we provide a framework for how field experiments can be used to advance our understanding of not only the extent of discrimination in the marketplace but also the nature of discrimination observed.

3.1 Data patterns in labor markets

As Altonji and Blank (1999) noted, researchers have observed labor market differences across race and gender lines for decades. Yet the magnitude of market differences, and hence what a new generation of field experiments seeks to explain, has changed substantially over time. For example, there was convergence in the black/white wage gap during the 1960s and early 1970s, but such convergence lost steam in the two decades that followed. In addition, the Hispanic/white wage gap rose among both males and females in the 1980s and 1990s. Of course, the world has not remained stagnant since the 1990s, and this section is meant to update the results in Altonji and Blank (1999).

Table 5 presents a set of labor market outcomes of whites, blacks, and Hispanics by gender in 2009, outcomes that labor economists have studied for decades. The data are based on tabulations from the Current Population Survey (CPS) from May 2009. Row 2 of Table 5 indicates that white men earn 13% (21%) more than white women (black and Hispanic men) on an hourly basis. Black and Hispanic women earn less than minority men and majority women.

Table 5 Labor market data by race and gender.

image

When one focuses on annual earnings, row 3 of Table 5, the differential in favor of white men persists: they earn more than 20% higher wages than minority men. Yet, for women the racial difference becomes markedly higher: 50% between white and black women and 30% between white and Hispanic women. The differentials remain when we focus on full-time employees (rows 7 and 8 of Table 5). In general, Table 5 tells a story that has been told often before: white men earn more money for hours worked than other groups, and white women earn more than their female counterparts.

Figures 3 and 4 complement these wage data by showing, for each gender, the time series of median weekly earnings from 1969 to the present for whites and blacks and from 1986 to the present for Hispanics.36 These figures bring to light some interesting trends. Regardless of racial or ethnic group, wage rates for women continue to grow faster than those for men. Within each gender, though, the 2000s did very little for racial or ethnic differentials. In fact, for both genders any local trend of convergence was reversed by the mid-2000s. In part this could be a function of the well documented rise in wage inequality during the second half of the 2000s.

image

Figure 3 Median weekly earnings of male workers. Year 2000 dollars.

Bureau of Labor Statistics.

image

Figure 4 Median weekly earnings of female workers. Year 2000 dollars.

Bureau of Labor Statistics.

Another important set of data points in Table 5 is the extent to which whites face lower unemployment rates. Figures 5 and 6 extend this information by showing, by gender, the time series of unemployment rates for whites, blacks, and Hispanics. One interesting element is the magnitude of unemployment changes for whites versus blacks and Hispanics. The mid-2000s saw no change in this pattern, with the impact of recessions falling harder on blacks and Hispanics than on whites. Nor does the pattern seem to depend strongly on gender, even though no gender, racial, or ethnic group was immune to the 2009/10 recession.

image

Figure 5 Male unemployment rates (annual averages) for men over 20.

Bureau of Labor Statistics.

image

Figure 6 Female unemployment rates (annual averages) for women over 20.

Bureau of Labor Statistics.

Wages and unemployment rates are a function of labor force participation rates as well. Figure 7 shows the time series of labor force participation. The convergence in participation rates from the 1970s through the 1990s continued into the 2000s, although the pace of that convergence has slowed. Men of every race/ethnicity have been leaving the labor force at a very slow rate, while Hispanic and white women have increased participation. Interestingly, African American women have higher labor force participation than white women.

image

Figure 7 Labor force participation rates, 20 years and older.

Bureau of Labor Statistics

In considering the causes of these labor market disparities, economists have explored whether the workers themselves bring heterogeneous attributes to the workplace. To shed light on this issue, we provide Table 6, which shows educational differences, family differences, and regional composition.

Table 6 Personal characteristics by race and gender.

image image

Rows 2 through 6 in Table 6 show that whites obtain more years of education than blacks and Hispanics. Interestingly, white women are almost uniformly more educated than their ethnic/racial counterparts. This result is also reflected in the years of experience variable. Rows 8 through 10 in Table 6 give a sense of the different family choices (marriage and fertility) made by whites, blacks, and Hispanics—another important input to wages, especially for women. Row 8 shows that whites are more likely to be married (and perhaps enjoy the efficiencies of household trade), but row 10 shows that white women are likely to have fewer children to spend time caring for than black and Hispanic women. Rows 11 through 20 show the geographic breakdown of each race/ethnic group. Local labor market opportunities surely influence wages, and in general whites are from higher earning regions such as New England and the Pacific.

Overall, these data are in line with Altonji and Blank (1999), who find large educational differences among these groups, with race and ethnicity mattering much more than gender. Of course, what these education differences represent is difficult to parse. On the one hand, they might be mostly due to different preferences. Alternatively, they might reflect behavior of agents who expect to face discrimination later on in the labor market—referred to as “pre-market” discrimination. As Altonji and Blank (1999) note, there is evidence that some minorities have been denied market opportunities, perhaps leading to less than efficient levels of schooling investment.

While the labor market outcome disparities observed in Table 5 and Figs 3-7 might mainly reflect individual investment choices, with investment varying because of preferences, comparative advantage, and the like, another hypothesis is that such outcomes are at least partly due to discrimination in the labor market. The remainder of this section briefly discusses theories of discrimination and attempts to test these theories, contrasting regression-based approaches with field experiments.

3.2 Theories of discrimination

We follow the literature and define labor market discrimination as a situation in which persons who provide labor market services and who are equally productive in a physical or material sense are treated unequally in a way that is related to an observable characteristic such as race, ethnicity, or gender. By “unequal” we mean these persons receive different wages or face different demands for their services at a given wage.

We consider two main economic models. In the first, entrepreneurs are willing to forgo profits to cater to their “taste” for discrimination, as first proposed by Becker (1957). The second model is “statistical” discrimination: in an effort to maximize profits, firm owners discriminate based on a set of observables because they have imperfect information. This could be as simple as employers having imperfect information on the relative skills or productivity of minority versus majority agents. The models in both of these literatures are deep and rich with good intuition. We do not have the space to do them justice, but strive simply to provide a sketch of each model to give the reader a sense of how one can test between them. We urge the reader to see Altonji and Blank (1999) for a more detailed presentation of these models and their implications.37

3.2.1 Taste-based discrimination

In his doctoral dissertation, Becker (1957) modeled prejudice or bigotry as a “taste” for discrimination among employers. Becker modeled employers as maximizing a utility function that is the sum of profits plus a disutility term from employing minorities,


U = pF(N_n, N_m) − w_n N_n − w_m N_m − dN_m     (10)


where p is product price and F(N_n, N_m) is the production function, which takes on two arguments: the number of employees that are non-minority, N_n, and minority, N_m. The second term is the wage bill, w_n N_n + w_m N_m, and the final term, dN_m, is the disutility from employing minorities. For prejudiced employers, the marginal cost of employment of a minority worker is w_m + d. Accordingly, d is the “coefficient of discrimination,” or the level of distaste of the employer for employing a minority worker. The higher d, the more likely the employer will hire non-minority workers, even if they are less productive than minority workers.

The Becker model then shows that the wage premium for non-minority workers is determined by the preferences of the least prejudiced employer who hires minority workers. Several extensions to this model have been proposed in the literature, including the possibility that d is a function of the job type, wage level, or the extent of segregation in the labor market. For example, Coate and Loury (1993) develop a model that restricts all employers to have identical preferences, but makes d a factor only when the employer hires minority workers for skilled jobs; an important consideration in the model then becomes the ratio of minority and non-minority people working in skilled jobs.

A logical conclusion of many of the studies in this area is that, under certain assumptions—in many cases free entry, constant returns to scale, segmenting, etc.—the number of non-discriminating employers will in the long run grow to the point that it is no longer necessary for minority workers to work for prejudiced employers, eliminating any wage discrepancies between minority and non-minority workers. This is a testable implication.
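The mechanics of the taste-based model can be sketched in a few lines of code. This is an illustrative sketch rather than the chapter’s formalism: the function names and wage figures are hypothetical, and the two workers are assumed equally productive so that only the distaste term d drives the hiring decision.

```python
# Illustrative sketch of Becker's (1957) taste-based model: a prejudiced
# employer acts as if a minority hire costs w_m + d, where d >= 0 is the
# "coefficient of discrimination". All numbers are hypothetical.

def effective_cost(wage: float, d: float) -> float:
    """Perceived marginal cost of a minority hire for an employer with distaste d."""
    return wage + d

def hires_minority(w_minority: float, w_majority: float, d: float) -> bool:
    """With equal productivity assumed, the employer hires the minority
    worker only if the distaste-adjusted cost is below the majority wage."""
    return effective_cost(w_minority, d) < w_majority

# A minority worker willing to work for less (10.0 vs 12.0)...
print(hires_minority(10.0, 12.0, d=0.0))  # unprejudiced employer hires: True
print(hires_minority(10.0, 12.0, d=3.0))  # prejudiced employer passes: False
```

The long-run competition result in the text follows from this logic: any employer with d below the prevailing wage differential profits by hiring minority workers, so with free entry the marginal (least prejudiced) employer pins down the equilibrium premium.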

3.2.2 Statistical discrimination

Arrow (1972) and Phelps (1972) discuss discrimination that is consistent with the notion of profit-maximization, or Pigou’s (1920) “third-degree price discrimination.” In this class of model, in their pursuit of the most profitable transactions, marketers use observable characteristics to make statistical inference about reservation values of market agents. The underlying premise implicit in this line of work is that employers have incomplete information and use observables to guide their behavior. For example, if they believe that women might be more likely to take time out of the labor force, employers with high adjustment costs might avoid those expected to have higher attrition rates. Firms then have an incentive to use gender to “statistically discriminate” among workers if gender is correlated with attrition.
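A minimal sketch of this expected-cost logic, with hypothetical wages, group-level attrition beliefs, and adjustment costs (none of these figures come from the chapter):

```python
# Illustrative sketch of statistical discrimination via expected costs.
# An employer with turnover (adjustment) costs compares two equally
# productive applicants using believed group-level attrition rates.

def expected_cost(wage: float, attrition_prob: float, adjustment_cost: float) -> float:
    """Expected cost of a hire: wage plus expected turnover cost."""
    return wage + attrition_prob * adjustment_cost

# Same wage, same productivity; only beliefs about attrition differ by group.
cost_group_a = expected_cost(wage=10.0, attrition_prob=0.05, adjustment_cost=20.0)
cost_group_b = expected_cost(wage=10.0, attrition_prob=0.15, adjustment_cost=20.0)

print(cost_group_a, cost_group_b)  # 11.0 vs 13.0: the employer favors group A
```

The point of the sketch is that no distaste term appears anywhere: the differential treatment arises purely from profit maximization under incomplete information, which is what makes the two models observationally hard to separate.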

Of course, employers can discriminate along second moments of observable distributions too. Sobel and Takahashi (1983) develop a model along such lines, and their model is reconsidered in List and Livingstone (2010), which we closely follow here. In this framework, employers look at second moments and use prior beliefs about the productivity of group members to influence hiring and wage outcomes.

In the case where workers approach employers in an effort to sell their labor services, the employer proposes the wage (price) in each period. The worker can accept or reject the offer. If the offer is rejected, the employer makes another offer. If the offer is rejected in the terminal period, no exchange occurs. To keep the analysis simple without losing focus on the critical incentives, we consider a two-period model. The results can all be extended to an n-period model.

Consider the situation where the employer’s reservation value is public information and denoted v, where v > 0, and the employer knows only the distribution from which the worker’s reservation valuation is drawn. An employer confronts a potential worker, who has reservation value w, which is drawn from a distribution F(w) on support [w_l, w_h]. It is assumed that this c.d.f. is continuously differentiable, and that the resulting p.d.f. f(w) is positive for all w in the support. The employer discounts future payoffs by δ_e. Further assume that the worker discounts future payoffs at the rate of δ_w. δ_e and δ_w can be thought of as the costs of bargaining and are known by both players.

The bargaining process proceeds as follows: the employer proposes a price (wage) to the worker in period 1. The worker can accept or reject the offer. If the offer is rejected, the employer proposes a new price. It is assumed that the new proposal must be a wage (price) that is no lower than the original offer. The worker can accept or reject this proposition. If it is rejected, the game ends and no transaction occurs.

Following a no-commitment equilibrium, the employer is assumed to make the period 1 offer at the beginning of period 1, and subsequently chooses the period 2 offer using the information gained from the worker’s rejection of the period 1 offer. Let p_1 be the employer’s offer in period 1, and let p_2 be the employer’s offer in period 2. Also, define, for period t,


image     (11)


A worker whose reservation value is image will prefer accepting in period image to accepting in period image if image, or if image. A worker’s most preferred time to accept is period image if image, so image is the employer’s ex ante probability of hiring in the imageth period and image is the employer’s ex ante undiscounted expected profit in period image.

The employer’s maximization problem can be stated in terms of his choice of the period 2 offer, image, and of image, which implies a choice of image, since image. The employer’s optimal strategies are found via backwards induction, starting with his period 2 decision. If the period 1 offer is rejected, then the employer knows image. The employer chooses an offer image to maximize his expected profits. Let image be this maximum value,


image     (12)


Let image be the unique value of image that solves (12). The first order condition of this problem, which implicitly defines image, implies that,


image     (13)


Since the offer image must be less than the employer’s valuation image, the left-hand side of (13) is positive, so the right-hand side must also be positive. For this to be the case, it must be true that,


image     (14)


In other words, in equilibrium, the second period offer must be greater than the first period offer, and both offers must be less than image.

The no-commitment equilibrium is fully characterized by image and a first period price, image, that solves,


image     (15)


subject to,


image     (16)


Substituting in the constraint and the definition of image, the problem becomes,


image     (17)


If image solves (17), then image and image are the no-commitment equilibrium offers. The first-order condition of (17) implies,


image     (18)


3.2.3 Optimal employer behavior

Within this framework one can analyze differences in how an employer will behave when he confronts members of the various groups. To obtain insights on the impact of changes in the variance of the worker’s reservation value on both offers that the employer may make, we consider a simple example where the worker’s value is drawn from a uniform distribution. There are two groups of potential workers. Members of group 1 draw their reservation value from a uniform distribution with lower bound a_1 and upper bound b_1. Members of group 2 draw their value from a uniform distribution with lower bound a_2 and upper bound b_2. Assume a_2 < a_1 and b_2 > b_1, so the variance of group 2’s distribution is larger than the variance of group 1’s distribution. Without loss of generality, further assume that the bounds are such that the distributions have equivalent means. Now, consider the employer’s equilibrium offers when confronting a worker who is a member of group i, i = 1, 2.

Solving through backwards induction, the employer first calculates his period 2 offer, as in (12). Substituting in the uniform distribution and simplifying, the problem becomes,


image     (19)


The solution of this problem is,


image     (20)


Hence, the period 2 decision is a function of image, which is chosen in period 1. The employer’s period 1 problem is to solve (17). Substituting in the distribution and the solution image, the first order condition is given by,


image     (21)


which implies the solution,


image     (22)


making the optimal period 2 offer,


image     (23)


and the optimal period 1 offer,


image     (24)


Note that the optimal offers image and image are both increasing in image, and therefore decreasing in the variance of reservation values. Group 2’s reservation value is drawn from a distribution with a larger variance, hence image. In this example, then, the analysis shows that when the employer believes he is dealing with a member of a group whose reservation value is widely distributed (group 2), he will offer to hire at a lower wage than he would if the worker were a member of a group with a lower variance (group 1). This is true despite the fact that the first moments of the distributions are identical. This prediction provides one means to test the statistical discrimination model against the taste-based discrimination model. We return to this notion below.
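The comparative static in this example can be checked numerically. The sketch below is a minimal grid-search illustration under assumed functional forms (a worker who accepts any offer at or above her reservation value, employer productivity normalized to 1, and an assumed discount factor); it is not the exact specification above, but it reproduces the prediction that the period 1 offer falls as the variance of the reservation-value distribution rises, holding the mean fixed.

```python
import numpy as np

def optimal_offers(a, b, y=1.0, delta=0.9, grid=2000):
    """Two-period sequential offers to a worker with reservation value
    v ~ U[a, b]. Assumed behavior: the worker accepts any offer w >= v,
    and a period 1 rejection truncates the employer's belief to (w1, b]."""
    w1_grid = np.linspace(a, b, grid)
    best = (-np.inf, a, b)
    for w1 in w1_grid:
        p_accept1 = (w1 - a) / (b - a)
        if w1 < b:
            # Period 2: maximize (y - w2) * P(v <= w2 | v > w1)
            w2_grid = np.linspace(w1, b, grid)
            payoff2 = (y - w2_grid) * (w2_grid - w1) / (b - w1)
            j = int(np.argmax(payoff2))
            w2, v2 = w2_grid[j], payoff2[j]
        else:
            # Everyone accepts in period 1; continuation value is irrelevant
            w2, v2 = b, 0.0
        total = p_accept1 * (y - w1) + delta * (1 - p_accept1) * v2
        if total > best[0]:
            best = (total, w1, w2)
    return best[1], best[2]

# Equal means (0.5), but group 2 draws from the wider support
w1_g1, w2_g1 = optimal_offers(0.4, 0.6)
w1_g2, w2_g2 = optimal_offers(0.2, 0.8)
print(w1_g1 > w1_g2)  # True: the high-variance group receives the lower period 1 offer
```

The grid search also handles the corner case where the employer finds it optimal to offer the upper bound of a narrow support and hire everyone in period 1.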

3.3 Empirical tests

Scholars have concerned themselves primarily with the question “is there discrimination in market X?” and have spent much less time answering the question “why do firms discriminate?” As economists interested in public policy, however, we should be interested not only in the extent of discrimination but also in its source. Conditional on the existence of discrimination, it is imperative to understand its source, since one cannot begin to craft social policies to address discrimination if its underlying causes are ill-understood. We now turn to an overview of a select set of studies that measure discrimination.

3.3.1 Observational data

One of the most important means of empirically testing for discrimination in labor markets is to use regression-based methods. The focus of this approach has ranged from measuring labor force participation to modeling wage determination. Within the line of work that explores wages, the overarching theme is to decompose wage differentials between groups, using an Oaxaca decomposition, into what can be explained by observables and what cannot. More specifically, consider a simple model that determines wages for minorities as follows,


image     (25)


and wages for non-minority agents as,


image     (26)


where image represent wages, image is a vector of individual specific observables that affect wages, and image is a classical error term. The wage difference between minority and non-minority agents can be computed by differencing these equations as follows,


image


The first term in the rightmost expression, image, is the component of the wage difference that is explained: it arises because of differences in the average characteristics of group members, such as region of residence, experience, or education level. The second term, image, is the part of the wage difference that is not explained by the regression model—the differences in the response coefficients of the regression, or the rate-of-return differences across minorities and non-minorities. This last term encompasses differences in wages due to differences in the returns to similar characteristics between groups. For example, returns to education may differ across minorities and non-minorities. The fraction of the wage difference due to this second term is typically called the “share” of wage differences due to discrimination.

Before discussing some of the general results from various regression-based approaches, it is important to qualify the results. First, the approach of assuming that the entire second component, image, is due solely to discrimination is likely not correct. For example, for this to be true the wage equation must be well specified. If omitted variable bias exists, then the response coefficients will be biased. Second, this equation captures only discrimination in the labor market as measured today. That is, even if no discrimination is found in such a model of today’s wages, that does not imply discrimination is unimportant. For example, if women are persistently denied market opportunities for skilled jobs, they might not invest optimally to obtain such positions. In the literature, such underinvestment is denoted “premarket” discrimination. Clearly, it is difficult to parse the effects of years past from the current effects of discrimination, and this should be kept in mind when interpreting the empirical results below—both those from the regression-based model as well as those from field experiments.

The regression models can be applied to the data discussed above from the Current Population Survey (CPS) from May 2009. Yet, given that Altonji and Blank (1999) summarize a series of regression results from such wage equations that do not differ markedly from ours, we simply restate the main results. First, we find white men receive significantly higher wages than black men, even after controlling for education, job experience, region of residence, and occupation. Following the letter of the model, this is evidence of discrimination in the data.

Second, even after controlling for key factors, Hispanic men and minority female workers have lower wages than their non-minority counterparts. Once again, if one sticks to the interpretation of the model, this is suggestive evidence that discrimination exists between these groups. One should highlight, however, that there are certain difficulties in using CPS data for such an exercise—such as the absence of individual ability measures, both cognitive and non-cognitive. Altonji and Blank (1999) extend the CPS results by modeling data from the National Longitudinal Survey of Youth (NLSY). In general, their results with NLSY data confirm that an improved specification reduces the unexplained effects for blacks and for women.

While this line of work is suggestive that discrimination exists in the labor market, due to productivity unobservables the nature of discrimination is not discernible without rather strong assumptions: are minority men receiving lower wages because of tastes or because of statistical discrimination?

Some headway has been made in these regards recently in several clever studies. One such study is the ingenious paper of Goldin and Rouse (2000), who use audition notes from a series of auditions among national orchestras to determine whether blind auditions—those in which musicians auditioned behind a screen—help women relatively more than men. The authors use a panel data set and identify discrimination from the change in hiring practices toward blind auditions that occurred in the 1970s. Goldin and Rouse study the actual audition records obtained from orchestra personnel managers and orchestra archives from eight major symphony orchestras from the late 1950s to 1995. These records contain lists of everyone auditioning (first and last name) with notation around the names of those who advance. There are three rounds of auditions considered: preliminary, semifinal, and final. The gender of the participants is determined by their name (96% of the records are distinctly masculine or feminine).

Eighty-four percent of all preliminary rounds were blind, seventy-eight percent of all semifinal rounds were blind, and seventeen percent of all final rounds were blind. In addition, the authors have personnel rosters that describe final assignments (members of the orchestra). There is variation in hiring practices over time, so that within one orchestra, the same audition may be blind or non-blind over time and across categories (preliminary, semifinal, and final). In addition, since success is rare, the same musician sometimes auditions more than once.

In the authors’ data, 42 percent of individuals competed in more than one round and 24 percent competed in more than one audition. Including musician fixed-effects, the authors identify the effect of a screen to hide gender from those individuals who auditioned both with and without a screen. Without this “ability” control (individual fixed-effects) the data suggests that women are worse off with blind auditions. However, controlling for individual fixed effects, the authors find that for women who make it to the finals, a blind audition increases their likelihood of winning by 33 percentage points.

In their main specification, the authors find that women are significantly less likely to advance from semifinals when auditions are blind, but significantly more likely to advance from preliminary auditions and final auditions when they audition behind a screen. Turning to the final outcome—the effect of the screen on the hiring of women—the authors estimate that women are five percentage points more likely to be hired than men when auditions are completely blind and there is no semifinal round, though the estimate is not statistically significant (the likelihood of winning an audition is less than three percent). There is no difference between the likelihood that women are hired relative to men when there is a semifinal round and auditions are blind.

Ultimately, the effects discussed give pause to reported “traditional” orchestra practices. In particular, “a strong presumption exists that discrimination has limited the employment of female musicians.” Before the implementation of blind auditions, committees were instituted to overthrow the biased hiring practices of conductors (who reportedly hired select males from a small set of well known instructors). However, sex-based biases seemed to dominate hiring, even in the face of “democratization.” As the authors demonstrate, the institution of blind hiring significantly increased the success rate of women in most auditions.

However, it is difficult for the authors to parse whether the discrimination is taste-based or statistical. The authors note that an orchestra is a team, which requires constant improvement and study together. In this sense, female-specific absences—maternity leave—can impact the quality of the orchestra significantly and may motivate statistical discrimination against women. Using their data, the authors note that the average female musician took 0.067 leaves of absence per year, compared to the average male’s 0.061 leaves. The length of leave was negligibly different between genders. These statistics imply that taste-based discrimination, assuming no performance differences between hired males and females, is at least in part the cause of the discrimination against female musicians. Again, without the strong assumption that, conditional on being hired, women and men of the same audition caliber perform indistinguishably in their careers, it is difficult for this innovative work to parse the type of discrimination observed.

A second clever piece of work based on the regression approach is due to Altonji and Pierret (2001), who create a model that generates strict predictions on the effect of race on wages over time under a hypothesis of statistical discrimination based on race by employers. Notably, they conclude that if firms do not statistically discriminate based on race (if they follow the law), but race is negatively related to productivity, then: (i) the race gap will widen with experience, and, (ii) adding a favorable variable that the hiring firm cannot observe will reduce the race difference in the experience profile. The authors find that the data satisfy these predictions: the race gap widens with experience and the addition of a “skill” variable reduces the race gap in experience slopes. Thus, the authors conclude that employers “do not make full use of race as information.”

Fundamentally, the authors’ model studies the differential effect on wages of “easy to observe” image variables and “hard to observe” image variables that predict worker productivity. While image variables such as schooling should have a smaller and smaller effect on wage over time, since an employer’s experience with the worker reveals far more important predictors of productivity, those variables that are difficult to observe such as skill have a relatively larger effect on wages as time goes on. This implies that the authors can identify whether or not the easily observable characteristic of race is acting as an image variable, or if employers are ignoring it. If employers are ignoring race, but race remains negatively correlated with productivity, then race acts as a image variable, appearing more important—more predictive of wage—over time. Again, the authors find support for the latter case.

The authors estimate their model using NLSY 1979 data—a panel study of men and women aged 14-21 in 1978 who have been surveyed annually since 1979. The data on white and black men with eight or more years of education form the basis of their empirical analysis. The authors use AFQT (the Armed Forces Qualification Test) scores as a variable that employers do not observe, but that predicts productivity. In addition, the authors control for the first job held by all subjects in order to ensure that their results are not driven by the effect that a high AFQT may have on a worker’s access to jobs in which skill is observed, rather than “dead-end jobs” where skill is never observed. Because the authors control for secular shifts in the wage structure, their identification of the interactions between time and observable (image) characteristics and unobservable or ignored (image) characteristics comes from variation across age cohorts.

The authors find that the effect of a one standard deviation shift in AFQT rises from zero when experience is zero to an increase in log wages of 0.0692 when experience is 10. This supports the result that employers learn about productivity. The coefficient on education interacted with experience declines from 0.0005 to −0.0269 when the variable AFQT*experience is added. Given the education intercept of 0.0832 after this addition, we can conclude that the effect of an extra year of education declines from 0.0832 to 0.0595 over ten years. This suggests that employers statistically discriminate on the basis of education because they have limited information about labor market entrants. In short, the effect of easy-to-observe variables like education dwindles as hard-to-observe variables like ability become more available—as time goes on and the employer becomes more familiar with the quality of the worker. The authors find similar effects with their other hard-to-observe variables that correlate with productivity, such as father’s education and sibling wage rate: as experience increases, these variables become more and more predictive of higher wages (though the effect of father’s education is never significant).
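The employer-learning mechanism can be illustrated with a small simulation. The setup below is a deliberately simplified stand-in for the Altonji-Pierret model with invented parameters: productivity equals an ability term the employer cannot initially see, schooling is a noisy signal of it, and the wage shifts from the schooling-based prediction toward true ability as experience accumulates. A regression with experience interactions then recovers the signature pattern: a negative schooling-by-experience coefficient and a positive ability-by-experience coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

z = rng.normal(size=n)               # "hard to observe" ability (e.g., AFQT)
s = z + rng.normal(size=n)           # schooling: a noisy signal of ability
t = rng.integers(0, 11, size=n)      # years of labor market experience

# Assumed wage rule: start at the signal-based prediction E[z|s] = 0.5*s,
# then shift weight toward true ability z as experience reveals productivity.
k = t / (t + 3)                      # assumed learning weight, rising in t
wage = (1 - k) * 0.5 * s + k * z + rng.normal(0.0, 0.1, size=n)

X = np.column_stack([np.ones(n), s, z, t, s * t, z * t])
coef = np.linalg.lstsq(X, wage, rcond=None)[0]
b_st, b_zt = coef[4], coef[5]        # interaction coefficients
print(b_st < 0 and b_zt > 0)         # schooling fades, ability emerges, with experience
```

The signs on the two interaction terms mirror the paper's finding that the easy-to-observe variable's effect dwindles while the hard-to-observe variable's effect grows as the employer learns.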

The main analysis concerns whether employers statistically discriminate based on race. If firms use race as information—that is, as an easily observable predictor of performance similar to education—then the effect of race on wages should decline over time as hard-to-observe variables like skill (predicted by the AFQT) become more transparent. If firms ignore race, however, the initial (experience = 0) race gap should be small, and should widen with experience if race is negatively related to productivity. Also, when race is ignored (a image variable), adding another image variable like AFQT*experience will reduce the race gap in experience slopes. The authors note that the effect of a “black” dummy will not necessarily be zero even if firms do not statistically discriminate on the basis of race, since race may be correlated with legally usable information available to the employer but not to the econometrician.

Empirical analysis shows that the effect of adding AFQT*experience decreases the race gap in experience slopes (from −0.1500 to −0.0861); this is the opposite of what we would expect if employers fully used race as a predictor of performance (as they do with schooling—recall, the addition of AFQT*experience increases the amount by which the impact of education changes over time). Using another prediction of their model, that the effect of learning on the image variables will equal the effect of learning on the image variables times the relationship between the image and image variables—that there are spillover effects from learning—the authors are able to reject race as an image variable but not able to reject race as a image variable.

A few points are of note. First, if the quantity of training is influenced by the employer’s beliefs about a worker’s productivity, effects of training cannot be separated from the effects of statistical discrimination with learning. In addition, if taste-based discrimination becomes more prevalent at higher level positions, a widening of the race gap based on experience may be a reflection of increased taste-based discrimination rather than employer learning. Finally, the authors model the effect of statistical discrimination on wages, but not on the extended hiring decision. Based on these considerations, the authors note that any of their results on race-based discrimination should be interpreted cautiously.

To summarize, Altonji and Pierret test for statistical discrimination in a very reasonable way: they argue that if firms statistically discriminate, an observable characteristic such as race will be very important in predicting wages early in the employment history—before productivity is well observed—but will become less important as time goes on and the worker accumulates experience. In the data the opposite is true, suggesting that, under the assumption that the model is well specified, firms attempt to ignore race in their hiring decisions, but that race is correlated with productivity (which is revealed over time) and so becomes more and more predictive of wages as time goes on.

A third innovative regression-based study is due to Charles and Guryan (2008), who use state-level variation in historical wage and survey data to empirically test the impacts of discrimination on the labor market, focusing on tastes for discrimination. The main theoretical result from Becker’s work explored by Charles and Guryan is the assertion that black workers are hired by the least prejudiced employers in the market due to sorting in the labor market. Accordingly, they examine whether racial wage gaps are determined by the prejudice of the marginal employer, not the average. This sorting mechanism provides Charles and Guryan with two empirical regularities to verify Becker’s work: (i) the prejudice of employers in the upper tail of the prejudice distribution should not impact wages; (ii) holding prejudice constant, wages should be lower when more blacks are in the labor market.

Although they do not target the question of taste-based versus statistical discrimination directly, they do include a variable for the skill difference between blacks and whites in regressions run as robustness checks. This and other robustness checks do not alter the main results, which support Becker’s theory that marginal prejudice affects wages: marginal and low-percentile prejudice levels negatively impact the black-white wage gap while higher-percentile and average prejudice levels have no impact; the percent of the population that is black also has a negative impact on the wage gap.

Charles and Guryan (2008) begin by empirically motivating the relationship between the black-white wage gap and prejudice by displaying the correlation between wage data from the CPS and white survey responses to questions concerning racial sentiments from the General Social Survey (GSS). After displaying the positive wage gap to prejudice relationship, Charles and Guryan review the theoretical findings to clarify the hypotheses of interest and then discuss the data. The data used for prejudice come from a non-uniform (the same questions are not asked every year) nationally representative survey with state-level data from 1972-2004. The survey questions used in this analysis are those from white respondents and are vetted to reflect prejudice as closely as possible (for example, a question on whether “the government was obligated to help blacks” was not used because responses may reflect attitudes toward government rather than toward blacks). The survey responses were used to formulate a prejudice index, measured relative to the responses given in 1977, and a prejudice distribution; these prejudice data are combined with the CPS May monthly supplement from 1977 and 1978 and the CPS Merged Outgoing Rotation Group (MORG) for analysis.

The empirical results come from a hedonic wage regression. The regressions are run at the state level under the assumptions that employment markets are at the state level and interstate moves are costly. Because the prejudice measure they have is at the state level, Charles and Guryan take an additional step to allow for more reasonable standard errors than ones that would come from a full regression with observations at the individual level. This additional step comprises removing the prejudice index but including a black dummy variable for each state (state-black dummy interaction) in the first stage wage hedonic, and then using the coefficient from this interaction term as the dependent variable in a second stage regression which includes the prejudice index. Five main measures of the prejudice index are analyzed: average, marginal, 10th percentile, median and 90th percentile. The marginal level of prejudice is calculated as the “imageth percentile of the prejudice distribution, where image is the percentage of the state workforce that is black” and the prejudice distribution is calculated from the GSS data. Additionally, the fraction of the population that is black in the state is included in the second stage.
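The “marginal” prejudice measure is simply an order statistic of the state prejudice distribution. A toy computation, with an invented Beta-distributed prejudice index standing in for the GSS-based one, might look like:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical prejudice index for white employers in one state
prejudice = rng.beta(2, 5, size=500)
frac_black = 0.12          # assumed share of the state workforce that is black

# Becker sorting: black workers match with the least prejudiced employers,
# so the binding ("marginal") prejudice is the frac_black-th quantile.
marginal = np.quantile(prejudice, frac_black)
p10, p50, p90 = np.quantile(prejudice, [0.10, 0.50, 0.90])
print(marginal <= p50 <= p90)  # the marginal employer sits low in the distribution
```

Under this logic, raising prejudice in the upper tail (the 90th percentile) leaves the marginal employer, and hence the predicted wage gap, unchanged, while a larger black workforce share pushes the marginal quantile rightward toward more prejudiced employers.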

The second stage regression results all support Becker’s theory. The estimated impact of the average level of prejudice on black-white relative wages (negative meaning lower wages for blacks) is negative but insignificant, and becomes positive when the marginal level of prejudice is included. The impact of the marginal prejudice measure is always negative and significant, as is the coefficient on the fraction of the state population that is black. These first results are taken as an indication that average prejudice measures fail to explain the wage gap, while the marginal prejudice measure and the fraction black have the relationship predicted by Becker’s work.

The additional prejudice measurements (10th percentile, median, and 90th percentile) provide further support for Becker’s theory. When included together in a regression, both with and without the percent of the state’s population that is black, the 10th percentile is the only significant variable (it is negative). This result is taken as further support of Becker’s theory because it indicates that higher measurements of prejudice do not affect the wage gap (note that when the proportion of the state’s population that is black is included, the 10th percentile coefficient increases in both absolute magnitude and significance).

Various robustness checks are completed, such as the inclusion of variables to indicate skill as mentioned above. Two skill measures are used: (i) separate reading and math variables which measure the difference between black and white test scores at the state level from a National Assessment of Educational Progress-Long Term Trend (NAEP-LTT) test, and (ii) black-white relative school quality measures used in Card and Krueger (1992) (for which they reduce the sample to just southern states). In both cases the results are similar to when the skill proxies are not included. Although this identification strategy does not disentangle the impacts of taste-based and statistical discrimination, the inclusion of skill measures does suggest that the discrimination is taste-based, under the assumption that the skill measures accurately reflect the difference in workplace abilities between races and that these differences in abilities are known by the employers. In a best case scenario, identifying statistical discrimination would require some measure of employee productivity by race and employment.

Further robustness checks investigate other possible endogeneity issues. The proportion black in the state workforce in 1920 was used as an instrument to account for possible endogeneity of the percent of the state’s current population that is black; no difference in results was found. Finally, Charles and Guryan (2008) include a measure from the National Education Longitudinal Survey of 1988 (NELS) to account for the fraction of co-workers that were of the same race. The results again supported Becker’s theory that market sorting matches blacks with less prejudiced employers: accounting for racial prejudice and the black proportion of the population, the wage gap is larger when co-workers are more racially mixed.

The overall result is best restated directly from the last paragraph in the paper: “Our various results suggest that racial prejudice among whites accounts for as much as one-fourth of the gap in wages between blacks and whites…a present discounted loss in annual earnings for blacks between $34,000 and $115,000, depending on the intensity of the prejudice of the marginal white in their states.”

Similar to the above studies, an assumption on the regression specification allows Charles and Guryan (2008) to begin to parse the type of discrimination observed. As all of these insightful studies illustrate, one can go a long way in detecting discrimination and its sources, but pinpointing the extent to which taste-based or statistical discrimination is the underlying motive is only possible with additional assumptions.

3.3.2 Field experiments

A complementary approach to measuring and disentangling the nature of discrimination is to use field experiments. Although a very recent study thoroughly catalogues a variety of field experiments that test for discrimination in the marketplace (Riach and Rich, 2002), a brief summary of the empirical results is worthwhile to provide a useful benchmark. Labor market field studies present perhaps the broadest line of work in the area of discrimination. The work in this area can be parsed into two distinct categories: personal approaches and written applications.

Personal approaches include studies that have individuals either attend job interviews or apply for employment over the telephone. In these studies, the researcher matches two testers who are identical along all relevant employment characteristics except the characteristic of interest (e.g., race, gender, age). Then, after appropriate training, the testers approach potential employers who have advertised a job opening. Researchers “train” the subjects simultaneously to ensure that their behavior and approach to the job interview are similar.

Under the written application approach, which can be traced to Jowell and Prescott-Clarke (1970), carefully prepared written job applications are sent to employers who have advertised vacancies. The usual approach is to choose advertisements in daily newspapers within some geographic area to test for discrimination. Akin to the personal approaches, great care is typically taken to ensure that the applications are similar across several dimensions except the variable of interest.

It is fair to say that this set of studies, including both personal and written approaches, has provided evidence that discrimination against minorities across gender, race, and age dimensions exists in the labor market. But due to productivity unobservables, the nature or cause of discrimination is not discernible. This point is made quite starkly in Heckman and Siegelman (1993, p. 224), who note that “audit studies are crucially dependent on an unstated hypothesis: that the distributions of unobserved (by the testers) productivity characteristics of majority and minority workers are identical.” They further note (p. 255): “From audit studies, one cannot distinguish variability in unobservables from discrimination.” Accordingly, while these studies provide invaluable insights into documenting that discrimination exists, care should be taken in making inference about the type of discrimination observed.

Much like the labor market regression studies discussed above, the literature examining discrimination in product markets has yielded important insights. Again, rather than provide a broad summary of the received results, we point the reader to Yinger (1998) and Riach and Rich (2002), who provide nice reviews of the product market studies. We would be remiss, however, not to at least briefly discuss the flavor of this literature.

One often-cited recent study is the careful work of Bertrand and Mullainathan (2004), who utilize a natural field experiment to determine whether blacks are discriminated against by employers. By sending resumes with randomly assigned white- or black-sounding names to want-ads in Boston and Chicago newspapers, Bertrand and Mullainathan find that white names receive 50% more callbacks for an interview than black names. This racial gap is uniform across occupation, industry, and employer size. Additionally, whites receive greater benefits from a higher-quality resume than blacks. Bertrand and Mullainathan are unable to test the type of discrimination, whether taste-based or statistical, as it is uncertain what information the employer utilizes from the resumes; the authors instead use the results to suggest that alternate theories, such as one based on lexicographic searches, be considered.

To choose names that are distinctly white-sounding or black-sounding, Bertrand and Mullainathan use name frequency data calculated from birth certificates of all babies born in Massachusetts between 1974 and 1979. Distinctiveness of a name is calculated as having a sufficiently high ratio of frequency in one racial group to that of the other racial group. The 9 most distinct male and 9 most distinct female names for each racial group, along with corresponding white- or black-sounding last names, are used. To verify this method of distinction, a brief survey was conducted in Chicago asking respondents to identify each name as “White”, “African-American”, “Other”, or “Cannot Tell.” Names that were not readily identified as white or black were discarded.

The authors sampled resumes posted more than six months prior to the start of the experiment on two job search websites to use as a basis for experimental resumes. The resumes sampled were restricted to people seeking employment in sales, administrative support, clerical services, and customer service in Boston and Chicago, and were purged of the original owner’s name and address. To minimize similarities to actual job seekers, Chicago resumes are used in Boston and Boston resumes are used in Chicago (after names of previous employers and schools are changed appropriately). The resumes were sorted into two quality groups (high and low), with high-quality resumes having some combination of more labor market experience; fewer gaps in employment history; an e-mail address, certification degree, or foreign language skills; or honors of some kind. Education is not varied between high- and low-quality resumes to ensure each resume qualifies for the position offered, and approximately 70% of all resumes included a college degree of some kind.

Fictitious addresses were created and randomly assigned to the resumes based on real streets in Boston and Chicago. The authors selected up to three addresses in each 5-digit zip code in both cities using the White Pages. Virtual phone lines with voice mailboxes were assigned to applicants in each race/sex/city/resume quality cell to track callbacks. The outgoing message for each line was recorded by someone of the appropriate race and gender, and did not include a name. Additionally, four e-mail addresses were created for each city, and were applied almost exclusively to the high-quality resumes.

The field experiment was carried out between July 2001 and January 2002 in Boston and between July 2001 and May 2002 in Chicago. In most cases, two each of the high- and low-quality resumes were sent to each sales, administrative support, and clerical and customer services help-wanted ad in the Sunday editions of The Boston Globe and The Chicago Tribune (excluding ads asking applicants to call or appear in person to apply). The authors logged the name and contact information for each qualifying employer, along with information on the position advertised and any specific requirements applicants must have. Also recorded was whether or not the ad explicitly stated that the employer is an “Equal Opportunity Employer.”

For each ad, one high-quality resume and one low-quality resume were randomly assigned a black-sounding name (with the remaining two resumes receiving white-sounding names). Male and female names were randomly assigned for sales jobs, while primarily female names were used for administrative and clerical jobs to increase the rates of callbacks. Addresses were also randomly assigned, and appropriate phone numbers were added before formatting the resumes (with randomly chosen fonts, layout, and cover letters) and faxing or mailing them to the employer. A total of 4870 resumes were sent to over 1300 employment ads. Of these, 2446 were of high-quality while 2424 were of low-quality.

Results are measured by whether a given resume elicits a callback or an e-mail back for an interview. Resumes with white-sounding names have a 9.65% chance of receiving a callback compared to 6.45% for black-sounding names, a 3.2 percentage point difference that can only be attributed to the name manipulation. According to these results, whites are 49% (50%) more likely to receive a callback for an interview in Chicago (Boston). This gap exists for both males and females, with a larger, though statistically insignificant, racial gap among males in sales occupations. An additional year of workforce experience increases the likelihood of a callback by approximately 0.4 percentage point, so the return to a white name is equivalent to 8 additional years of experience. High-quality resumes receive significantly more callbacks for whites (11% compared to 8.5%), while blacks see only a 0.5 percentage point increase (from 6.2% to 6.7%). Whites are favored (defined as more whites than blacks being called back for a specific job opening) by 8.4% of employers, while blacks are favored by only 3.5% of employers, a highly statistically significant difference. The remaining 88% of employers treat both races equally, with 83% of employers contacting none of the applicants.
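The headline magnitudes can be checked with simple arithmetic. The sketch below reconstructs approximate callback counts from the reported rates, assuming the 4870 resumes were split roughly evenly between white- and black-sounding names; the counts are illustrative reconstructions, not figures taken from the paper:

```python
import math

# Assumed split: roughly half of the 4870 resumes carried each name type.
n_white, n_black = 2435, 2435
cb_white = round(0.0965 * n_white)  # ~235 callbacks for white-sounding names
cb_black = round(0.0645 * n_black)  # ~157 callbacks for black-sounding names

p_w, p_b = cb_white / n_white, cb_black / n_black
gap = p_w - p_b            # ~3.2 percentage points
relative = gap / p_b       # ~50% higher callback rate for white names

# Two-proportion z-test for the difference in callback rates
p_pool = (cb_white + cb_black) / (n_white + n_black)
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n_white))
z = gap / se

# Return to a white name expressed in "years of experience," given the
# reported ~0.4 percentage-point callback gain per extra year:
years_equivalent = gap / 0.004  # ~8 years
```

The resulting z-statistic of roughly 4 confirms that a gap of this size is very unlikely to arise by chance under equal treatment of the two name groups.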

A probit regression of the callback dummy on resume characteristics (college degree, years of experience, volunteer experience, military experience, e-mail address, employment holes, work in school, honors, computer skills, special skills, fraction of high school dropouts in the neighborhood, fraction of the neighborhood attending college or more, fraction of the neighborhood that is white, fraction of the neighborhood that is black, and log median per capita income) is estimated on a random subsample of one-third of the resumes. The remaining resumes are then ranked by predicted callback using the estimated coefficients. Under this classification, blacks do significantly benefit from high-quality resumes, but they benefit less than whites (the ratio of high- to low-quality callback rates is 1.60 for blacks and 1.89 for whites). The presence of an e-mail address, honors, or special skills has a positive, significant effect on the likelihood of a callback. Interestingly, computer skills negatively predict callback and employment holes positively predict callback. Additionally, there is little systematic relationship between job requirements and the racial gap in callback.

Applicants living in whiter, more educated, or higher-income neighborhoods have a higher probability of receiving a callback, and there is no evidence that blacks benefit any more than whites from living in a whiter, more educated zip code. There is, however, a marginally significant positive effect of employer location on black callbacks.

Of all employers, 29% state that they are “Equal Opportunity Employers” and 11% are federal contractors; however, these two groups are associated with a larger racial gap in callback. The positive white/black gap in callbacks was found in all occupation and industry categories except for transportation and communication. No systematic relationship between occupational earnings and the racial gap in callback was found.

Bertrand and Mullainathan did not design their study specifically to test the two theories of discrimination, statistical and taste-based, and do not believe that either of the two can fully explain their findings. While both models can explain the average racial gap, the detailed results do not support animus: there is no evidence of a larger racial gap among jobs that explicitly require communication skills or jobs in which customer or co-worker contact is more likely, as taste-based theory would predict. Further, as blacks’ credentials increase the cost of discrimination should increase, but this does not explain why blacks receive relatively lower returns to a higher-quality resume. This, combined with the uniformity of the racial gap across occupations, casts doubt on statistical discrimination theories as well.

The authors suggest that other models may do a better job than statistical or taste models at explaining these particular findings. For example, a lexicographic search by employers may result in resumes being rejected as soon as they see a black name, thus experience and skills are not rewarded because they are never seen. This theory may explain the uniformity of the race gap if this screening process is similar across jobs. The results could also follow from employers having coarser stereotypes for blacks. In any case, Bertrand and Mullainathan acknowledge the need for a theory beyond statistical discrimination and taste to explain their findings in full.

Another nice example of a natural field experiment is due to Riach and Rich (2006), who extend the literature by using carefully matched written applications made to advertised job vacancies in England to test for sexual discrimination in hiring. They find statistically significant discrimination against men in the “female occupation” and against women in the “male occupation.” This is important evidence to begin to uncover the underlying causes for labor market discrimination. This study is also careful to point out that it is difficult to parse the underlying motivation for why such discrimination exists. Even without such evidence, however, the paper is powerful in that it provides a glimpse of an important phenomenon in a significant market, and provocatively leads to questions that need to be addressed before strong policy advice can be given.

There are a number of other studies that examine discrimination and differential earnings in labor markets based on sexual orientation (Arabsheibani et al., 2005; Weichselbaumer, 2003; Berg and Lien, 2002), but like these two natural field experiments, they also have difficulties parsing the type of discrimination observed.

One might then ask: if field experiments have similar difficulties to regression-based methods in parsing the nature of discrimination, why bother with this approach? Our answer is that field experiments in labor economics have the potential to parse both the nature and the extent of discrimination observed in markets.

As a starting point, consider List (2004b), who made use of several settings in a naturally-occurring marketplace (the sports card market) to show that a series of field experiments can parse the two forms of discrimination. More specifically, after first demonstrating that dealers treat “majority” (white men) and “minority” (older white men, nonwhite men, and white women) buyers and sellers in the marketplace differently, List provides evidence suggesting that sportscard dealers knowingly statistically discriminate. By executing a variety of field experiments, the evidence provided separates statistical discrimination from taste-based discrimination and from an agent’s ability to bargain when interacting with a dealer. The experiments conducted by List demonstrate a framework for parsing the two forms of discrimination that could be applied to inform discrimination debates in other markets.

The first experiment discussed in List (2004b) is similar to an audit study in that dealers are approached by buyers from both majority and minority groups with an offer to buy or sell a sportscard (unlike most audit studies, the subjects do not know that they are part of a study on discrimination, only that it is an economic study). The results from this first experiment are highly suggestive that dealers base offers on group membership: buyers in the minority groups of white women and older white men received initial offers that were 10-13% higher than those received by white male buyers when buying cards, and minority sellers received initial offers 30% lower when selling their cards.

Further, this initial framed field experiment shows that the gap between minority and majority subjects’ offers persists from initial to final offers for inexperienced subjects but largely converges for subjects with experience. This convergence, however, comes at a cost in time: subjects in the minority group must invest a significantly larger amount of time to achieve better final offers.

This result provides support for non-taste-based discrimination, since the gap in offers converges through bargaining, a result that would not hold under a theory of taste-based discrimination in which the dealer would simply hold to one price. Finally, by surveying dealers in addition to the subjects buying and selling, List controls for dealer experience in the marketplace and finds a positive relationship between dealer experience and discrimination, as measured by the difference between a dealer’s average majority and average minority offers. This suggests that statistical discrimination is at work, unless one believes that taste-based discrimination increases with experience as a dealer at sportscard shows.

Although this initial experiment can measure discrimination, more treatments are necessary to parse statistical discrimination from alternative explanations. In total, List runs four experiments in addition to the framed field experiment described in the previous paragraph: (i) a dictator game artefactual field experiment with dealers as dictators and four descriptions as receivers: white men, nonwhite men, white women, and mature white men; (ii) two framed field experiment treatments that are bilateral exchange markets with dealers selling to agents with randomly drawn reservation values, where in one market dealers know that the reservation value is random and in the other this is ambiguous; (iii) a Vickrey second-price auction that is a natural field experiment; and (iv) a framed field experimental game designed to determine dealers’ perceptions of the reservation value distributions of sportscard market participants. Each additional experiment helps parse the two forms of discrimination and the bargaining ability of the subjects, and the results of all the experiments together are necessary for List to conclude that dealers knowingly statistically discriminate.

First, the relatively uniform offers made to receivers across majority and minority groups in the dictator game suggest that dealers do not display taste-based discrimination, at least in artefactual field experiments. Second, the bilateral exchange markets yield three results that each point towards statistical discrimination, by testing hypotheses drawn directly from the theories of taste-based and statistical discrimination. First, experienced dealers are found to lose less surplus than inexperienced dealers. Second, minority and majority buyers perform similarly with randomly set reservation prices but not in the treatment where dealers think that reservation values are “homegrown values.” Finally, experienced dealers perform worse when it is ambiguous whether the reservation value is drawn randomly, suggesting that they rely on inferences that fail when reservation values are in fact set randomly. These two additional experiments point toward statistical discrimination.

Yet it is only through the final two experiments that sufficient evidence is provided for statistical discrimination, through a discovery of variation in the reservation value distributions of sportscard market participants and of dealer knowledge of that variation. The results from the Vickrey second-price auction are used for two purposes: (i) to determine whether the reservation value distributions of the majority and minority are indeed different and (ii) to provide distributions with which to determine the dealers’ ability to accurately assign distributions in the final experimental game. The Vickrey auction results do show that the reservation values for the minority group have a larger variance than those for the majority group, suggesting that statistical discrimination could be utilized for profit maximization (see the model above). Further, when different reservation value distributions are shown to dealers in the final experiment, a majority of all dealers are able to determine which distributions come from which groups, and experienced dealers are able to correctly assign distributions more often than inexperienced dealers.
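To fix ideas, the mechanics of a second-price sealed-bid (Vickrey) auction and the variance comparison can be sketched as follows. The distributions below are purely hypothetical stand-ins, with the same mean but different spreads, for the reservation-value draws List elicited; the specific parameters are assumptions for illustration:

```python
import random
import statistics

random.seed(0)

def vickrey_winner(bids):
    """Second-price sealed-bid auction: the highest bidder wins but pays
    the second-highest bid, so truthful bidding is a dominant strategy."""
    ranked = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    return ranked[0], bids[ranked[1]]  # (winning index, price paid)

# Hypothetical reservation-value draws: same mean, but the minority
# distribution is assumed to have a larger spread, as List reports.
majority = [random.gauss(50, 5) for _ in range(500)]
minority = [random.gauss(50, 12) for _ in range(500)]

# The larger spread is what makes group membership informative to a
# profit-maximizing dealer setting first offers.
assert statistics.pstdev(minority) > statistics.pstdev(majority)

winner, price = vickrey_winner([42.0, 57.5, 51.0])
# winner is index 1 (bid 57.5); the price is the runner-up bid, 51.0
```

Because bids in a Vickrey auction reveal true valuations, the elicited bids give the experimenter a clean estimate of each group's reservation-value distribution, which is exactly what the final experimental game then asks dealers to identify.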

Although List focuses on a market that not every consumer participates in, the framework of using multiple field experiments to move towards identifying the form of discrimination is one that should be considered for use elsewhere. Most importantly, this study highlights that a series of field experiments can be used to uncover the causes and underlying conditions necessary to produce data patterns observed in the lab or in uncontrolled field data.

This study highlights that a deeper economic understanding is possible by taking advantage of the myriad settings in which economic phenomena present themselves. In this case, field experimentation in a small-scale field setting is quite useful in developing a first understanding when observational data are limited or experimentation in more “important” markets is not possible. Yet it is important to extend this sort of analysis to more distant domains.

This is exactly what is offered in Gneezy et al. (2010), who explore the incidence of discrimination against the disabled by examining actual behavior in a well-functioning marketplace—the automobile repair market. This study uses a traditional audit study, but combines it with a specific field experimental treatment to allow the authors to parse the type of discrimination observed.

The audit portion of the study was standard: subjects were given a clear assignment—approach body shop i to receive a price quote to fix automobile j. The authors included subjects from two distinct groups, disabled white males aged 29-45 and non-disabled white males aged 29-45, who each visited six body shops. The disabled subjects in this experiment were all confined to a wheelchair and drove a specialized vehicle. All of the automobiles, which were personally owned by the disabled subjects, had visible body problems. Importantly, both testers in any given pair approached body shops with the identical car.

The authors find that, overall, the disabled received considerably higher average price quotes ($1425) than the non-disabled ($1212). Why this disparate treatment exists is, of course, an open issue. Several clues point to potential factors at work: (i) access—many body shops are not easily approachable by wheelchair, which considerably restricts the set of price offers the disabled can receive; and (ii) time—while the non-disabled can easily park and proceed to the front desk, the process is much more complex for the disabled. First, a disabled driver must find a suitable parking place: designated disabled spaces are very uncommon at body shops, yet he needs a space that permits the use of a wheelchair and that will be unoccupied when he returns to pick up the repaired vehicle. After finding an appropriate space, the disabled must commit much more effort and time to reach the service desk. A further problem that raises the expected search cost for the disabled is that in some cases it is necessary to leave the car for the day in order to obtain a price quote, and using a taxi is much more complex for the disabled than for the non-disabled.

To investigate the search cost explanation further, the authors obtained data on search effort at the tester level and perceived search effort at the body shop level, to examine whether realizations of these variables are consistent with the pattern of discrimination observed. From this survey, the authors find that the disabled typically consult far fewer body shops: on average, the non-disabled visit 3.5 different mechanics whereas the disabled visit only 1.67, a difference that is statistically significant. Concerning the supply side, the authors asked body shops about their perceptions of the degree of search among the disabled and non-disabled. The results are consonant with the consumer-side evidence: the disabled are believed to approach 1.85 body shops for price quotes while the non-disabled are expected to approach 2.85, a difference of more than 50 percent and one that is statistically significant. This evidence is consistent with statistical discrimination based on mechanics’ beliefs about relative search costs and how they map into reservation value distributions. Yet the survey evidence alone is only suggestive, and further investigation is necessary to pinpoint the underlying mechanism at work.
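A quick back-of-the-envelope check of the reported magnitudes (all figures are taken directly from the study; nothing new is assumed):

```python
# Average price quotes received by each group
quote_disabled, quote_nondisabled = 1425, 1212
quote_gap = (quote_disabled - quote_nondisabled) / quote_nondisabled
# -> roughly 17.6%, i.e. the near-20% premium quoted to disabled testers

# Body shops' beliefs about how many shops each group visits
search_disabled, search_nondisabled = 1.85, 2.85
search_ratio = search_nondisabled / search_disabled - 1
# -> roughly 54%, the "more than 50 percent" perceived search difference
```

The two magnitudes line up in the direction statistical discrimination predicts: the group believed to search less is quoted systematically higher prices.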

For these purposes, the authors sharpen the focus on the underlying reason for discrimination by running a complementary field experiment. In this experiment, the authors not only replicated the initial results with new testers and new vehicles in need of repair, but in another treatment had these exact same agents approach body shops while explicitly noting that they were “getting a few price quotes” when inquiring about the damage repair estimate. If differential search costs cause discrimination, then the authors should observe the offer discrepancies disappearing in this treatment.

This is exactly the result they observe. Although in the replication treatment the disabled continued to receive higher quotes, when both agent types noted that they were “getting a few price quotes,” the disabled agents secured offers that were not statistically distinguishable from those received by the non-disabled. Fig. 8 supports this insight, highlighting the discrepancies observed when search is believed to be heterogeneous across the disabled and non-disabled: the first two bars show that the difference is nearly 20%. Yet when both agents clearly signal that this particular mechanic visit is just one part of their entire search process, these disparities are attenuated and indeed change sign.


Figure 8 Complementary experiment I summary.

While these two examples are not directly related to labor market outcomes, they display the power of the field experimental method to test important theories within labor economics, and especially the theories of discrimination discussed earlier. In this regard, we believe that similar treatments can be carried out in labor markets to explore wage differences, job offer differences, and other labor market outcomes.
