© Tobias Baer 2019
Tobias Baer, Understand, Manage, and Prevent Algorithmic Bias, https://doi.org/10.1007/978-1-4842-4885-0_3

3. How Algorithms Debias Decisions

Tobias Baer, Kaufbeuren, Germany

In the previous chapter, you took a crash course in psychology to understand why humans are sometimes biased in their decision-making and what some of the most prevalent biases are. In this chapter, aimed primarily at readers who do not have experience in building algorithms themselves, I will explain how an algorithm works. More specifically, I will show how a good algorithm works and how it can thereby alleviate human bias; this will make it easier, in later chapters, to understand how algorithms can go awry (i.e., show bias) and what the different options are for addressing this problem.

A Simple Example of an Algorithm

In the introduction, I mentioned that algorithms are statistical formulas that aim to make unbiased decisions. How exactly do they manage to do this?

One of the simplest statistical algorithms is the linear regression. You estimate a number—for example, the number of hairs on a person’s head—with an equation that looks like this:

y = c + β1 • x1 + β2 • x2 + β3 • x3

The dependent variable (i.e., the number you want to estimate, here the number of hairs on a person’s head) is often denoted as y. It is estimated as a linear combination of independent variables (also called predictors ). Here the data scientist has chosen three predictors: x1, x2, and x3. x1 might be the surface area of the scalp of the person (larger heads have more hair), x2 might be the age of the person (as people get older, they may lose some hair), and x3 may be the gender (men seem to be bald more often than women).
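For readers who like to see things in code, the prediction step of a linear regression is nothing more than the following one-line computation. This is a minimal Python sketch for illustration only (the function name is my own and not tied to any statistical library):

def linear_prediction(c, betas, xs):
    # y = c + beta1*x1 + beta2*x2 + beta3*x3 (works for any number of predictors)
    return c + sum(beta * x for beta, x in zip(betas, xs))

# Example: constant 0, three coefficients, three predictor values
print(linear_prediction(0.0, [2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))  # prints 9.0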

How does this equation work? Let’s assume that this morning you marked one square centimeter on your Mom’s head and counted 281 hairs. You therefore may start off by multiplying the surface area of the scalp (denoted as x1 and measured in square centimeters) by 281. Statisticians call 281 a coefficient and denote it as β1. The subscript of 1 simply indicates that β1 belongs to x1.

Extrapolating from the number of hairs you found this morning in your bed, you also may believe that every year people lose on average 1,000 hairs. You therefore put –1,000 into β2.

Finally, you guess that men on average have 50,000 fewer hairs than women. But here you run into a problem: gender is a qualitative attribute, but your equation needs numbers. How can you solve that? The solution is what data scientists call variable transformation or feature generation: they create new numerical values that measure qualitative attributes (or other things; transforming qualitative values into numbers is just one example of feature generation). In our example, they would define x3 as a numerical value indicating “maleness.” The simplest way is a dummy variable: x3 can be defined as a binary variable that is 1 for men and 0 otherwise. In this case, β3 could be set to –50,000.

A different data scientist, however, might argue that a binary definition of gender is outdated and too crude and therefore suggest that “maleness” is measured by the testosterone level in the blood. In this case, the model might suggest that for every 1 ng/dL testosterone the number of hairs decreases by 70. Note that you have encountered the first example of how the beliefs of the data scientist—here on whether gender is binary or not—can shape an algorithm.

By looking at a very limited amount of data—one square centimeter of your Mom’s head, the hair found in your bed this morning, and some estimate on gender differences you pretty much pulled out of thin air—you therefore have come up with an algorithm:

hair = 281 • x1 – 1,000 • x2 – 70 • x3
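As a quick sketch (with made-up inputs), here is how this hand-built formula could be applied in Python, using the testosterone-based definition of x3; with the dummy-variable definition of x3 instead, the last coefficient would be –50,000 rather than –70:

def estimate_hair(scalp_cm2, age_years, testosterone_ng_dl):
    # Hand-built coefficients from the text: hairs per square centimeter of scalp,
    # hairs lost per year of age, and hairs lost per ng/dL of testosterone
    return 281 * scalp_cm2 - 1000 * age_years - 70 * testosterone_ng_dl

# Hypothetical person: 600 cm2 of scalp, 40 years old, 500 ng/dL of testosterone
print(estimate_hair(600, 40, 500))  # 281*600 - 1000*40 - 70*500 = 93,600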

The problem with this algorithm is that it is pretty wrong. Your Mom might be rather exceptional; your count might have been flawed; and conceptually, you ignored the fact that men have larger heads than women, so there is what statisticians call some correlation between x1 and x3 (and when you chose the coefficient for x3, you didn’t think about what x1 does at the other end of your equation). Equipped with the knowledge of the previous chapter, you can see how your cognitive biases might have helped to create this mess: you fell victim to the availability bias by basing your data collection on your immediate family members and exhibited extreme anchoring by not even considering your neighbor. And you exhibited gross overconfidence by believing this approach actually made any sense!

Luckily for you, statistics can come to the rescue: if you measure the number of hairs as well as x1, x2, and x3 for a sample of people (your statistician friend might suggest at least 100-200 people, although a million people would be much better), statistics will optimize the values of your coefficients in such a way that the estimation error is minimized. The statistical estimation procedure will “play around” with four parameters and find their optimal estimates: β1, β2, and β3 as well as the constant parameter c.

Let me make a technical note here: linear regression is statistically still so simple that the coefficients can actually be calculated directly using matrix algebra. For more complex algorithms, however, “playing around” is indeed the only way to find a good solution. Techniques such as maximum likelihood estimation find optimal solutions for the parameters iteratively. This hints at why the proliferation of advanced statistical techniques depended so much on the ever-increasing speed of computers, and why one element of a data scientist’s skill (or the skill of the software tool she uses) lies in computational efficiency, such as knowing clever ways to speed up the search process of a maximum likelihood estimation.
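To give a flavor of what “playing around” versus direct calculation means, here is a rough sketch in Python using NumPy and a simulated (entirely made-up) sample in place of real measurements. For linear regression, the coefficients can be computed in one step from the so-called normal equations; more complex techniques would instead have to search for good parameter values iteratively:

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated sample (made up for illustration): constant, scalp area, age, testosterone
X = np.column_stack([
    np.ones(n),                # constant term c
    rng.normal(600, 50, n),    # x1: scalp area in square centimeters
    rng.uniform(18, 80, n),    # x2: age in years
    rng.normal(300, 150, n),   # x3: testosterone in ng/dL
])
true_beta = np.array([75_000, 160.0, -0.3, -25.0])
y = X @ true_beta + rng.normal(0, 5_000, n)   # observed hair counts with noise

# Closed-form OLS solution via the normal equations: beta = (X'X)^(-1) X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # estimates of c, beta1, beta2, beta3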

One of the greatest things about algorithms developed by statistics is that they speak to us—sort of. By reviewing the statistical outputs, we can learn not only a lot about the data and the phenomena we try to predict but also how the algorithm “thinks.” I will illustrate this briefly in the next sections.

What an Algorithm Will Give You

If you were to collect data on the hair of 200 people (I doubted my publisher would foot the bill for that, so I had to completely make up the following numbers), your statistical software might produce the following equation:

hair = 75,347 + 159 • x1 – 0.3 • x2 – 23 • x3

This is an exceedingly interesting result! There are at least three particularly noteworthy aspects of this equation:
  1.

    This new equation makes far fewer errors than the original equation (you don’t see that but take my word for it). In fact, the estimation technique guarantees that it is the best estimate you can get (remember the concept of BLUE (best linear unbiased estimate) from Chapter 1?). If your livelihood depends on knowing people’s number of hairs (e.g., because you are a mad barber charging by the number of hairs cut but your customers reasonably require a quote before you get started), you will be much better off; very possibly you could not even run your business without this equation. However, there is a caveat: “error” is defined in one particular way, namely the squared difference between the algorithm’s estimate of the number of hairs and the actual number of hairs of each person in the sample (this is why statisticians also call this technique ordinary least squares, or OLS). That means that if you think differently about weighing errors (because OLS squares the errors, it penalizes large errors but discounts small errors; you might disagree with this), you might prefer a different set of coefficients.

     
  2.

    You may have noticed that there is a very large number in the constant term while β1 is much lower than in the original equation. Essentially the algorithm grounds itself in an anchor (getting it halfway to the average) and limits the amount of variation caused by individual attributes. This smacks of the stability bias and echoes the discussion in the previous chapter, where I observed that nature had a point when it hardwired stability biases into our brains; some amount of anchoring does improve the estimates. Empirically we can observe that the poorer the predictors or the overall structure of the model, the more the equation will ground itself in the population average (in the extreme case where the independent variables are all meaningless, the equation will be a constant function with zero coefficients for all independent variables and thus estimate the population average for everyone, which is not a bad idea under such circumstances).

     
  3.

    Finally, you might be struck by the coefficient of –0.3 for age (yep, that’s not even half a hair!). If a statistician examined this estimate more closely, she might tell you that it is “insignificant” (i.e., that most likely the coefficient is zero). A zero coefficient is how statistics tells you that a variable doesn’t make sense. Upon reflection, you can see why: in early age, you grow more hair, not less; and even as an adult, you not only lose hair (which then can be found in your bed in the morning) but also grow new hair. Most people therefore encounter a net hair loss only rather late in their life. The linear model simply assumes that every year you lose the same amount of hair, and therefore it really doesn’t make much sense. It is also important to understand that because of small sample sizes, estimated coefficients are rarely exactly zero; coefficients usually contain a little bit of noise. The concept of significance unfortunately is a little bit complicated, but what a statistician means by saying that a coefficient is insignificant is the following: if in reality the coefficient were zero (i.e., the variable were meaningless), then because of the noise in the data and the sample size you have, you should expect the calculated coefficient to fall in a range from here to there (in this example, maybe –0.5 to +0.5). As the actual coefficient you got happens to be within this range, you should assume that in reality it is zero, and therefore that the variable is meaningless. Is this certain? No, it isn’t! However, statisticians have tools to quantify the probability of their statements being correct; in this context, we call this probability the confidence level. If a variable is said to be insignificant at the 99.9% confidence level, it means that if the true coefficient is zero, estimates have a 99.9% probability of falling within a specific range (which is calculated by the statistical software), and the coefficient you got for the variable actually is within that range. In fact, you also could ask the statistician, “If the true coefficient was -1,000, what is the probability that I would get the result we have here?” The statistician might look puzzled because this is not the way people usually think about it, but after some grumbling, he might tell you that it’s 0.0000417%. This means that it’s really unlikely that we lose on average 1,000 hairs per year, but based on your sample of 200 people, it’s not entirely impossible. (The code sketch after this list shows what such a significance test looks like in the output of statistical software.)

     
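As promised above, here is a small sketch of what such a significance test looks like in practice, using Python’s statsmodels library on simulated (made-up) data in which age truly has no effect. The summary output reports, for each coefficient, a standard error, a p-value, and a confidence interval; these are exactly the numbers a statistician looks at before calling a coefficient “insignificant”:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

scalp = rng.normal(600, 50, n)           # x1: scalp area in square centimeters
age = rng.uniform(18, 80, n)             # x2: age in years
testosterone = rng.normal(300, 150, n)   # x3: testosterone in ng/dL

# Simulate a world in which the true coefficient of age is exactly zero
hair = 75_000 + 160 * scalp - 25 * testosterone + rng.normal(0, 5_000, n)

X = sm.add_constant(np.column_stack([scalp, age, testosterone]))
result = sm.OLS(hair, X).fit()   # OLS minimizes the sum of squared errors
print(result.summary())          # coefficients, standard errors, p-values, confidence intervals

In this simulated example, the coefficient reported for age should come out close to zero with a large p-value, which is the software’s way of telling you that the data do not support the hypothesis that age matters.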

The last thing mentioned—telling us that a variable most likely is meaningless—is one of the most important ways a statistical algorithm helps us to debias our decision logic. Statisticians often say that statistical testing “rejects” our hypothesis: it teaches us that based on the data at hand, our belief appears to be wrong.

I spent many years as a consultant debiasing judgmental credit and insurance underwriting. In order to do this, I spent many hours with underwriters to list out all the information they look at; typically we arrived at long lists with 200-400 factors. I then worked with them to prioritize and short-list the 40-70 factors they believed to be most important. And then I applied statistics to validate these factors—and every single time, in over 100 such studies, about half of the prioritized factors turned out to be insignificant. Underwriters suffered from the whole list of cognitive biases I listed in the previous chapter. For example, if a Chinese credit officer once had a German borrower who defaulted in a spectacular bankruptcy, the bizarreness effect might cause this credit officer to reject all German applicants as exceedingly risky. As a German, I can assure you that this would be insane!

Based on the statistical test results, I then redesigned the assessment logic. While judgmental risk assessment typically happens in an underwriter’s head, I replaced it with a statistical algorithm. When the banks and insurance companies I worked for tested the new algorithms, they found that decisions were consistently better to the tune of reducing credit and insurance losses by 30-50% (sometimes more) while at the same time approving more deals (and hence enjoying faster growth of their business)! This is obviously worth a lot of money to financial institutions (and made paying for my services a really great investment).

For this reason, it is fair to say that statistical algorithms are an important tool in fighting biases. As you will see in subsequent chapters, however, this sadly does not mean that algorithms are perfect or cannot fall victim to biases themselves.

Summary

In this chapter, you explored what an algorithm is and identified a few important properties as far as biases are concerned:
  • In principle, statistical algorithms aim to make unbiased predictions; they do this by objectively analyzing all data points given to them.

  • Inspecting algorithms can offer important information—we not only learn more about the data and the phenomena we want to predict but we can also understand how the algorithm “thinks.” This is valuable because it allows us to then ask if that thinking might be flawed.

  • Where we enter specific variables in a model (as manifestations of beliefs on underlying causal relationships), statistical significance allows us to test whether our hypotheses are supported by the data. If not, chances are that we were biased and the algorithm helped us to detect this bias.

  • Empirically, statistical algorithms often perform better than subjective human judgment because they succeed in eliminating many biases.

  • Statistical algorithms are anchored in the population average, and the worse the model structure and the predictive power of the independent variables, the stronger this stability bias is.

In order to understand how biases can still sneak into a statistical algorithm, it helps to know in greater detail how data scientists actually go about developing these algorithms. This is what we will consider next.
