© Tobias Baer 2019
Tobias Baer, Understand, Manage, and Prevent Algorithmic Bias, https://doi.org/10.1007/978-1-4842-4885-0_21

21. How to Marry Machine Learning with Traditional Methods

Tobias Baer
Kaufbeuren, Germany

When given a choice between keeping a cake and eating it too, I always try to find a way to do both. And I did succeed with that in the realm of machine learning!

You may have been surprised, maybe even upset, by the negative tone I took on machine learning in the previous chapter. This is not for lack of respect and admiration for machine learning—all I wanted to do was open your eyes to the limitations of machine learning. Giving machine learning tools to a data scientist who is not skilled in managing the risks of algorithmic bias can be as dangerous as giving a Porsche to a novice driver. And given what we learned about the overconfidence bias, you hopefully will agree that I must shout really loud in order for my warnings to have any chance of being heeded by all readers!

The truth is, machine learning does give you a lot of power. And often the best algorithms live in the best of all worlds—an artisanal model carefully crafted by a watchful data scientist who employs a wide range of machine learning techniques as mere tools in her work. In this chapter, I want to introduce four specific techniques to incorporate machine learning techniques into artisanal model designs that allow for plenty of oversight by data scientists in order to prevent algorithmic bias:
  • Feature-level machine learning

  • Segmentation informed by machine learning

  • Shift effects informed by machine learning

  • Second opinion provided by machine learning

In the following sections, I will briefly explain each of them.

Feature-Level Machine Learning

Machine learning has a particular advantage over other techniques with very granular (and hence big) data, such as a situation where for each unit of observation (e.g., a patient or loan applicant) there are a large number of transactions, such as readings from a Continuous Glucose Monitoring system or credit and debit card transactions. It also has distinct disadvantages, such as a tendency to overfit (i.e., become unstable) if categorical variables have very rare categories or if there is a hindsight bias in the data. Why not, then, limit the use of machine learning to specific features derived from exactly those data sources where machine learning is at its best?

If you build a stand-alone estimate from a particular data input using machine learning, you obtain a complex and hopefully highly predictive feature that you can then carefully embed in an artisanal equation—which means that you can apply any number of modifications or constraints to safeguard the algorithm from biases. In fact, this approach even opens up the possibility of using federated machine learning, an approach where the data used to develop the algorithm sits on distributed machines (e.g., many different users’ mobile phones or Internet of Things refrigerators) and is never combined into one large database open to the data scientist’s scrutiny. Instead, a stand-alone algorithm is estimated on each device, and only the algorithm itself is sent to a central server; the server aggregates all of these algorithms to continuously come up with an optimized version that is distributed back to all devices.
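To make this concrete, here is a minimal sketch in Python of the feature-level idea, assuming a hypothetical credit DataFrame `df` with granular transaction aggregates, two artisanal predictors, and a binary `default` target (all column names and model choices are illustrative): a gradient boosting model digests only the transaction fields, and its score then enters a transparent logistic regression as a single feature.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical columns: granular transaction aggregates (txn_*),
# artisanal predictors, and a binary default flag.
txn_cols = ["txn_count_30d", "txn_avg_amount", "txn_share_cash"]
artisanal_cols = ["income", "tenure_months"]

train, test = train_test_split(df, test_size=0.3, random_state=42)
train, test = train.copy(), test.copy()

# Step 1: the machine learning model digests only the granular
# transaction data, producing a single, tightly scoped feature.
gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(train[txn_cols], train["default"])
train["txn_score"] = gbm.predict_proba(train[txn_cols])[:, 1]
test["txn_score"] = gbm.predict_proba(test[txn_cols])[:, 1]

# Step 2: embed that feature in a transparent artisanal equation whose
# every term the data scientist can inspect, constrain, or switch off.
logit = LogisticRegression()
logit.fit(train[artisanal_cols + ["txn_score"]], train["default"])
print(dict(zip(artisanal_cols + ["txn_score"], logit.coef_[0])))
```

In a production setting, you would typically generate the `txn_score` feature out-of-fold so that the downstream regression does not overfit to the feature model’s in-sample predictions.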

How easily you can keep such features in check depends on the complexity of your chosen approach. An approach called genetic algorithms will create and test all kinds of variable transformations and present you with the best ones it found; these often are still sufficiently transparent for a subject matter expert to judge whether a transformation makes sense and is safe from a bias perspective.

Other approaches will render the feature a black box. For example, consider a recruiting process for branch sales staff via video chat. Machine-learning-powered video analytics could measure what percent of time the applicant smiles—a probably useful indicator of an applicant’s ability to build rapport with a customer, especially if the algorithm is able to distinguish fake smiles (where only one facial muscle moves—the one that can be consciously operated) from real smiles (which require two muscles to contract, one of which cannot be manipulated and hence truly reflects an empathetic emotional state). Here you cannot inspect the algorithm itself—instead, the presence of algorithmic bias in this feature needs to be established through the kind of analyses discussed in Chapter 19.

Let’s now assume that back-testing has revealed that this algorithm works much better for Zeta Reticulans than for Martians—as a result, roughly half of the time that a Martian smiles, the algorithm fails to pick it up, therefore giving Martians a systematically lower friendliness score.

A data scientist who becomes aware of this issue thanks to the X-ray she ran on the complex smiling feature (e.g., she might have observed a correlation between propensity to smile and race) could now solve the problem by converting the original smile feature (which measures percent of total talk time) to a rank variable. How can this eliminate racial bias? If the rank is calculated within a race, the “top 20 percent of smilers” will always contain both 20% of all Martians and 20% of all Zeta Reticulans, even if your machine learning algorithm claims that the most smiling Martian smiles less than half as much as the most smiling Zeta Reticulan.
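In code, this fix can be a one-liner; the sketch below assumes a hypothetical DataFrame `df` holding the raw machine-learning-derived feature `smile_pct` and the group variable `race`, which is used only for the normalization itself:

```python
# Percentile rank computed separately within each group: the "top 20
# percent of smilers" now contains the top 20% of each group by
# construction, even if the raw detector under-measures one group.
df["smile_rank"] = df.groupby("race")["smile_pct"].rank(pct=True)
```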

Furthermore, if a sudden problem arises during day-to-day operations (e.g., your model monitoring reveals that the machine learning algorithm calculating your smiling feature has a huge bias in favor of people wearing clothes in warm desert and sunset hues, which popped up as the latest fashion trend), there is a stop-gap solution available to switch off this single feature without halting the entire algorithm (which is much harder to do if everything is baked into a single black box).
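One way such a kill switch might look is sketched below, assuming the artisanal model is a fitted scikit-learn `LogisticRegression` and that you kept the training means of all features; the flag dictionary and names are illustrative. Muting a feature by imputing its training mean turns its term into a constant while leaving the rest of the score intact:

```python
FEATURE_FLAGS = {"smile_rank": True}  # operations can flip this to False

def score(logit, X, train_means):
    """Score with the artisanal model, muting any disabled features."""
    X = X.copy()
    for feat, enabled in FEATURE_FLAGS.items():
        if not enabled:
            # The training mean makes the coefficient contribute a
            # constant, neutralizing the feature without retraining.
            X[feat] = train_means[feat]
    return logit.predict_proba(X)[:, 1]
```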

Segmentation

Another source of outperformance of machine learning algorithms over artisanally derived ones is their ability to detect subsegments that require a different set of predictors. Many artisanal workhorses such as logistic regression apply the same set of predictors to everyone, and it is often difficult for a data scientist to notice that there is a subsegment requiring a totally different approach (and hence a separate model).

In order to have the best of both worlds, I start by building both an artisanal model and a machine learning challenger model. I then calculate for each observation in my sample the estimation error of each model, and from that I derive the difference in error between the two models. A positive difference implies that for this observation the machine learning model was better; a negative difference implies that the artisanal model was better.

Now you can run a CHAID tree with your PCA-prioritized shortlist of predictors (see Step 3 discussed in Chapter 19) to predict the difference in errors. Find the end nodes with the largest positive error difference (i.e., those where the average error of the artisanal model is much larger than the average error of the machine learning model) and trace back the variables and cutoffs that defined these subsegments. Do these subsegments make business sense? Are they a proxy for something else (possibly even a proxy for a segment defined by a variable not included in the modeling dataset)?
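A sketch of this analysis is shown below; scikit-learn has no CHAID implementation, so a shallow CART regression tree stands in for it here, and `artisanal`, `ml_model`, `X`, `X_short`, and `y` are hypothetical stand-ins for your two fitted models, the full predictor matrix, the PCA-prioritized shortlist, and the binary target:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Per-observation absolute errors of both models on the binary target.
err_artisanal = np.abs(y - artisanal.predict_proba(X)[:, 1])
err_ml = np.abs(y - ml_model.predict_proba(X)[:, 1])
diff = err_artisanal - err_ml  # positive: the ML model did better here

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=200)
tree.fit(X_short, diff)

# Leaves with a large positive mean are candidate subsegments where the
# artisanal model falls short; trace their splits for business sense.
print(export_text(tree, feature_names=list(X_short.columns)))
```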

Classic examples coming out of such an analysis could be self-employed customers contained within a retail credit card sample where most customers are salaried, or a sizeable segment with missing credit bureau information. As always, discussing these results with the front line can yield invaluable insights—sometimes the CHAID tree only approximates what should be the “correct” segment definition based on business insight.

Note that sometimes the segmentation amounts to the reflection of an external bias—for example, if in a highly discriminatory environment Martians tend not to be admitted to universities, you might realize that the CHAID tree’s recommendation is to build a separate model for Martians because the whole set of features relating to the specifics of the applicant’s university education only works for Zeta Reticulans. You see how you might very quickly get into some very tricky tradeoffs—but the whole advantage of this hybrid approach is that as the artisanal data scientist, you are firmly in the driver’s seat to decide how to deal with this pattern in the data.

Once you have decided on the one or two segments that require a fundamentally different model, you can build separate artisanal algorithms for them. The result can be stunning—often enough, through this technique I achieve a predictive power that is not just equivalent to but higher than that of the machine learning benchmark model (e.g., for binary outcomes, the outperformance may be 1-2 Gini points, which for some uses can be a lot of money—if your credit portfolio has an annual loss of $500 million, reducing that by even just a few percentage points would be enough for you to eat in a gourmet restaurant with 3 Michelin stars every night for the rest of your life…).

Shift Effects

Another source of insight from machine learning algorithms often missed by artisanal approaches is interaction effects: situations where only the combination of multiple attributes has meaning. For example, if your records indicate that the customer is female but the voice interacting with your automated voice response unit sounds male, you have a high likelihood of a fraudster impersonating your customer. In this case, there is a signal that should adjust the estimated probability of fraud upward. The perfect artisanal model would capture that signal but otherwise could still use the same set of predictive variables as for any other case in the population. Here the creation of subsegments (e.g., separate fraud models for female and male customers) would introduce unnecessary complexity (and effort)—instead, the shift effect technique simply adds additional variables to the artisanal model.

By far the easiest (and often sufficient) approach is to add a binary indicator (a so-called dummy) for each shift effect. The model could be “patched up” by introducing a new variable that is 1 if the customer on record is female and the voice sounds male, and 0 otherwise. You could also consider more complex adjustments, such as introducing an interaction effect of the customer’s gender and a “maleness” score of the voice (“tuning down” the warning signal if the sound of the voice is rather inconclusive).
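Here is a minimal sketch of both variants, assuming a hypothetical DataFrame `df` with the customer’s recorded gender and a continuous `voice_maleness` score in [0, 1] produced by a separate voice model (the names and the 0.5 cutoff are illustrative):

```python
# Binary shift effect (dummy): record says female, voice sounds male.
df["gender_voice_mismatch"] = (
    (df["gender_on_record"] == "F") & (df["voice_maleness"] > 0.5)
).astype(int)

# Graded alternative: an interaction that "tunes down" the signal when
# the voice score is inconclusive (close to 0.5 rather than 1.0).
df["mismatch_strength"] = (
    (df["gender_on_record"] == "F").astype(int) * df["voice_maleness"]
)

# Either variable is simply added to the artisanal model alongside the
# predictors used for every other case in the population.
```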

The trickiest to identify (but often very powerful) opportunity lies in normalizing independent variables by dividing them by a contextual benchmark. For example, when I built a model to predict the revenue of small businesses in an emerging market, I used predictors such as credit card receipts, electricity consumption, and floor space. The model had a couple of biases because in some industries, sales per square meter were particularly high, while in rural areas, credit cards were a lot less common than in cities. I therefore normalized the predictors by the median of their peers (e.g., rural versus urban pharmacies) and obtained a much more powerful (and equitable) model.
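The normalization itself is straightforward; the sketch below assumes a hypothetical small-business DataFrame `df` with raw predictors and peer-defining columns (all names are illustrative):

```python
raw_cols = ["card_receipts", "electricity_kwh", "floor_space_sqm"]
peer_cols = ["industry", "urban_rural"]  # e.g., rural vs. urban pharmacies

for col in raw_cols:
    peer_median = df.groupby(peer_cols)[col].transform("median")
    # A value of 2.0 now means "twice the typical peer," regardless of
    # whether the business is a rural pharmacy or an urban restaurant.
    df[col + "_vs_peers"] = df[col] / peer_median
```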

Just like the segmentation approach, the shift effect technique can also raise the performance of the hybrid model above that of the benchmark machine learning model. And it goes without saying that both techniques (segmentation and shift effects) can even be combined, giving pure machine learning a run for its money!

The Second Opinion Technique

When you compare the errors of an artisanal model and a machine learning model case by case, you will realize that the outperformance of the machine learning model only materializes on average—the number of cases where the machine learning model’s estimate is worse than the artisanal model’s is often almost as large as the number of cases where the machine learning model does better.

A natural interpretation of this situation is that cases where the two models disagree are in some way exceptional and therefore would benefit from the expertise of a human. The second opinion approach thus runs a machine learning model in parallel to an artisanal model and flags cases where the two show a large discrepancy for manual review. The manual review often can be greatly improved by rules that point to the likely source of the discrepancy (e.g., by flagging attributes of the case that may be an anomaly; in many cases, all that is needed for the two models to align is for the human reviewer to adjust some input data) and by prescribing a framework or even specific steps for the manual review, such as manually collecting specific additional information. In fact, in many situations I have created a full-fledged separate qualitative scorecard that not only systematically collects maybe 10-25 additional data points through the human reviewer but also actively eliminates judgmental bias by what I call psychological guardrails.
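The flagging step itself is simple; here is a minimal sketch, assuming two fitted models exposing `predict_proba` and a review threshold set at the top 5% of discrepancies (an illustrative choice that in practice would be calibrated to review capacity):

```python
import numpy as np

p_artisanal = artisanal.predict_proba(X)[:, 1]
p_ml = ml_model.predict_proba(X)[:, 1]

# Cases where the two models disagree most are routed to a human.
discrepancy = np.abs(p_artisanal - p_ml)
threshold = np.quantile(discrepancy, 0.95)

flag_for_review = discrepancy > threshold
print(f"{flag_for_review.sum()} cases flagged for manual review")
```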

Strictly speaking, flagging a portion of cases for manual review by comparing two competing models is not a modeling technique. However, by including it in my list I want to reiterate my belief that the ultimate objective of the data scientist is to optimize a decision problem, and the ideal architecture of the decision process may very well include steps outside of a statistical algorithm. Often the data scientist is uniquely positioned to advise business owners of such a possibility and can create tremendous value by advocating it.

Summary

In this chapter, you learned that data scientists can indeed have the best of both worlds by using machine learning to extract valuable insights from the data while still using artisanal techniques to keep biases out of the model. Key takeaways are:
  • Where big and complex data can only be harnessed with machine learning, rather than building one big black box model, you could consider using machine learning to build a set of complex features (typically using just a particular data source or set of data fields for each feature, giving it a tightly defined business meaning).

  • When a benchmark machine learning model performs better than your artisanal model, you can use a CHAID tree to understand the types of cases that drive this outperformance.

  • If you find that the outperformance of the machine learning benchmark model comes from specific subsegments for which your artisanal model is inappropriate, you can consider building separate models for such subsegments. Often the problem is really concentrated in just one or two such segments.

  • If, by contrast, the outperformance comes from interaction effects, you can capture these effects through additional variables (e.g., indicator variables) that you incorporate in your artisanal model.

  • Artisanal models enhanced with machine learning in this way often perform better than the machine learning benchmark model.

  • If, however, even the artisanal approach does not succeed in removing a bias or the aforementioned techniques fail to replicate the insights garnered by the machine learning benchmark model, you should also consider revisiting the overall decision process architecture; it may be best to run artisanal and machine learning models in parallel and to have cases with strongly conflicting predictions of the two models reviewed by a human.

I disproved the adage that you cannot keep a cake and eat it, too. There is another piece of wisdom, though, that still holds: there’s no free lunch—so you must pay for your miracle cake. The hybrid approaches laid out here require time because they are essentially manual. In some situations, however, that time for manual model tuning simply is not available—and nowhere is this as acute as in the case of self-improving machine learning algorithms. By the time a data scientist has updated an artisanal model, a self-improving machine learning algorithm has already gone through several new generations, making the data scientist’s work obsolete before it is even finished.

In the next chapter, we therefore will discuss how best to keep biases out of self-improving machine learning algorithms.
