When Machine Learning Goes Off the Rails

by Boris Babic, I. Glenn Cohen, Theodoros Evgeniou, and Sara Gerke

WHAT HAPPENS WHEN machine learning—computer programs that absorb new information and then change how they make decisions—leads to investment losses, biased hiring or lending, or car accidents? Should businesses allow their smart products and services to autonomously evolve, or should they “lock” their algorithms and periodically update them? If firms choose to do the latter, when and how often should those updates happen? And how should companies evaluate and mitigate the risks posed by those and other choices?

Across the business world, as machine-learning-based artificial intelligence permeates more and more offerings and processes, executives and boards must be prepared to answer such questions. In this article, which draws on our work in health care law, ethics, regulation, and machine learning, we introduce key concepts for understanding and managing the potential downside of this advanced technology.

What Makes Machine Learning Risky

The big difference between machine learning and the digital technologies that preceded it is the ability to independently make increasingly complex decisions—such as which financial products to trade, how vehicles react to obstacles, and whether a patient has a disease—and continuously adapt in response to new data. But these algorithms don’t always work smoothly. They don’t always make ethical or accurate choices. There are three fundamental reasons for this.

One is simply that the algorithms typically rely on the probability that someone will, say, default on a loan or have a disease. Because they make so many predictions, and every prediction carries some chance of being off, some decisions will inevitably be wrong. The likelihood of errors depends on a lot of factors, including the amount and quality of the data used to train the algorithms, the specific type of machine-learning method chosen (for example, deep learning, which uses complex mathematical models, versus classification trees that rely on decision rules), and whether the system uses only explainable algorithms (meaning humans can describe how they arrived at their decisions), which may not allow it to maximize accuracy.

Second, the environment in which machine learning operates may itself evolve or differ from what the algorithms were developed to face. While this can happen in many ways, two of the most frequent are concept drift and covariate shift.

With the former, the relationship between the inputs the system uses and its outputs isn't stable over time or may be misspecified. Consider a machine-learning algorithm for stock trading. If it has been trained using data only from a period of low market volatility and high economic growth, it may not perform well when the economy enters a recession or experiences turmoil—say, during a crisis like the Covid-19 pandemic. As the market changes, the relationship between the inputs and outputs—for example, between how leveraged a company is and its stock returns—also may change. Similar misalignment may happen with credit-scoring models at different points in the business cycle.
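
To make the idea concrete, here is a minimal sketch of one way a team might watch for concept drift: track the model's rolling prediction error in production and raise a flag when it climbs well above the error observed during validation. The function names, window size, and tolerance here are illustrative assumptions, not a prescription.

```python
# A minimal sketch of a concept-drift flag: compare the rolling prediction
# error seen in live use with the error measured during validation.
from collections import deque

import numpy as np


def make_drift_monitor(baseline_error, window=250, tolerance=2.0):
    """Return a function that records squared errors and flags possible drift
    when the rolling mean error exceeds `tolerance` times the baseline."""
    recent = deque(maxlen=window)

    def record(prediction, actual):
        recent.append((prediction - actual) ** 2)
        rolling_error = float(np.mean(recent))
        drifting = len(recent) == window and rolling_error > tolerance * baseline_error
        return rolling_error, drifting

    return record


# Usage: baseline_error comes from validation; each live (prediction, actual)
# pair then updates the monitor.
monitor = make_drift_monitor(baseline_error=0.04)
rolling_error, drifting = monitor(prediction=0.02, actual=0.35)
if drifting:
    print("Rolling error is well above its validation baseline; the "
          "relationship between inputs and outputs may have shifted.")
```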

In medicine, an example of concept drift is when a machine-learning-based diagnostic system that uses skin images as inputs in detecting skin cancers fails to make correct diagnoses because the relationship between, say, the color of someone’s skin (which may vary with race or sun exposure) and the diagnosis decision hasn’t been adequately captured. Such information often is not even available in electronic health records used to train the machine-learning model.

Covariate shifts occur when the data fed into an algorithm during its use differs from the data that trained it. This can happen even if the patterns the algorithm learned are stable and there’s no concept drift. For example, a medical device company may develop its machine-learning-based system using data from large urban hospitals. But once the device is out in the market, the medical data fed into the system by care providers in rural areas may not look like the development data. The urban hospitals might have a higher concentration of patients from certain sociodemographic groups who have underlying medical conditions not commonly seen in rural hospitals. Such disparities may be discovered only when the device makes more errors while out in the market than it did during testing. Given the diversity of markets and the pace at which they’re changing, it’s becoming increasingly challenging to foresee what will happen in the environment that systems operate in, and no amount of data can capture all the nuances that occur in the real world.
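
One rough way to look for covariate shift is to compare the distribution of each input feature in live data with its distribution in the training data. The sketch below assumes the inputs are held in pandas DataFrames and uses a two-sample Kolmogorov-Smirnov test; the significance threshold and the choice to test features one at a time are simplifying assumptions for illustration.

```python
# A rough covariate-shift check: flag input columns whose live distribution
# no longer resembles the distribution in the development data.
import pandas as pd
from scipy.stats import ks_2samp


def covariate_shift_report(train_df, live_df, alpha=0.01):
    """Run a two-sample Kolmogorov-Smirnov test on each shared numeric column
    and report the columns whose live distribution differs from training."""
    shifted = {}
    for col in train_df.columns.intersection(live_df.columns):
        if not pd.api.types.is_numeric_dtype(train_df[col]):
            continue  # skip non-numeric columns in this simple check
        stat, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        if p_value < alpha:
            shifted[col] = {"ks_statistic": float(stat), "p_value": float(p_value)}
    return shifted


# Usage, with DataFrames holding the development inputs and the inputs seen in use:
# drifted = covariate_shift_report(train_inputs, live_inputs)
# if drifted:
#     print("Live inputs no longer resemble the development data:", drifted)
```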

The third reason machine learning can make inaccurate decisions has to do with the complexity of the overall systems it’s embedded in. Consider a device used to diagnose a disease on the basis of images that doctors input—such as IDx-DR, which identifies eye disorders like diabetic retinopathy and macular edema and was the first autonomous machine-learning-based medical device authorized for use by the U.S. Food and Drug Administration. The quality of any diagnosis depends on how clear the images provided are, the specific algorithm used by the device, the data that algorithm was trained with, whether the doctor inputting the images received appropriate instruction, and so on. With so many parameters, it’s difficult to assess whether and why such a device may have made a mistake, let alone be certain about its behavior.

But inaccurate decisions are not the only risks with machine learning. Let’s look now at two other categories: agency risk and moral risk.

Agency Risk

The imperfections of machine learning raise another important challenge: risks stemming from things that aren’t under the control of a specific business or user.

Ordinarily, it’s possible to draw on reliable evidence to reconstruct the circumstances that led to an accident. As a result, when one occurs, executives can at least get helpful estimates of the extent of their company’s potential liability. But because machine learning is typically embedded within a complex system, it will often be unclear what led to a breakdown—which party, or “agent” (for example, the algorithm developer, the system deployer, or a partner), was responsible for an error and whether there was an issue with the algorithm, with some data fed to it by the user, or with the data used to train it, which may have come from multiple third-party vendors. Environmental change and the probabilistic nature of machine learning make it even harder to attribute responsibility to a particular agent. In fact, accidents or unlawful decisions can occur even without negligence on anyone’s part—as there is simply always the possibility of an inaccurate decision.

Executives need to know when their companies are likely to face liability under current law, which may itself also evolve. Consider the medical context. Courts have historically viewed doctors as the final decision-makers and have therefore been hesitant to apply product liability to medical software makers. However, this may change as more black-box or autonomous systems make diagnoses and recommendations without the involvement of (or with much weaker involvement by) physicians in clinics. What will happen, for example, if a machine-learning system recommends a nonstandard treatment for a patient (like a much higher drug dosage than usual) and regulation evolves in such a way that the doctor would most likely be held liable for any harm only if he or she did not follow the system’s recommendation? Such regulatory changes may shift liability risks from doctors to the developers of the machine-learning-enabled medical devices, the data providers involved in developing the algorithms, or the companies involved in installing and deploying the algorithms.

Moral Risk

Products and services that make decisions autonomously will also need to resolve ethical dilemmas—a requirement that raises additional risks and regulatory and product development challenges. Scholars have now begun to frame these challenges as problems of responsible algorithm design. They include the puzzle of how to automate moral reasoning. Should Tesla, for example, program its cars to think in utilitarian cost-benefit terms or Kantian ones, where certain values cannot be traded off regardless of benefits? Even if the answer is utilitarian, quantification is extremely difficult: How should we program a car to value the lives of three elderly people against, say, the life of one middle-aged person? How should businesses balance trade-offs among, say, privacy, fairness, accuracy, and security? Can all those kinds of risks be avoided?

Moral risks also include biases related to demographic groups. For example, facial-recognition algorithms have a difficult time identifying people of color; skin-lesion-classification systems appear to have unequal accuracy across race; recidivism-prediction instruments give Blacks and Hispanics falsely high ratings, and credit-scoring systems give them unjustly low ones. With many widespread commercial uses, machine-learning systems may be deemed unfair to a certain group on some dimensions.

The problem is compounded by the multiple and possibly mutually incompatible ways to define fairness and encode it in algorithms. A lending algorithm can be calibrated—meaning that its decisions are independent of group identity after controlling for risk level—while still disproportionately denying loans to creditworthy minorities. As a result, a company can find itself in a “damned if you do, damned if you don’t” situation. If it uses algorithms to decide who receives a loan, it may have difficulty avoiding charges that it’s discriminating against some groups according to one of the definitions of fairness. Different cultures may also accept different definitions and ethical trade-offs—a problem for products with global markets. A February 2020 European Commission white paper on AI points to these challenges: It calls for the development of AI with “European values,” but will such AI be easily exported to regions with different values?
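
The tension is easy to see with a small, made-up example. In the sketch below, a hypothetical lending model looks roughly calibrated, in the sense that approved applicants repay at the same rate in both groups, yet it approves one group at twice the rate of the other. All of the numbers and group labels are invented for illustration.

```python
# An illustrative example (with invented numbers) of how two fairness notions
# can diverge: equal repayment rates among approved applicants, but very
# different approval rates across groups.
import numpy as np


def group_metrics(approved, repaid, group):
    """Per group: approval rate, and repayment rate among those approved
    (a rough, calibration-style check)."""
    metrics = {}
    for g in np.unique(group):
        mask = group == g
        approved_mask = mask & (approved == 1)
        metrics[g] = {
            "approval_rate": float(approved[mask].mean()),
            "repayment_given_approval": float(repaid[approved_mask].mean()),
        }
    return metrics


# Hypothetical outcomes for two groups, A and B.
group = np.array(["A"] * 10 + ["B"] * 10)
approved = np.array([1] * 8 + [0] * 2 + [1] * 4 + [0] * 6)
repaid = np.array([1, 1, 1, 1, 1, 1, 0, 0,   # group A, approved (6 of 8 repay)
                   0, 0,                      # group A, denied
                   1, 1, 1, 0,                # group B, approved (3 of 4 repay)
                   0, 0, 0, 0, 0, 0])         # group B, denied

# Both groups repay at 75% among approved applicants, yet group A is approved
# at twice the rate of group B.
print(group_metrics(approved, repaid, group))
```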

Finally, all these problems can also be caused by model instability. This is a situation where inputs that are close to one another lead to decisions that are far apart. Unstable algorithms are likely to treat very similar people very differently—and possibly unfairly.
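
A simple way to probe for this kind of instability is to perturb an input slightly and see how often the decision flips. The sketch below assumes a generic binary classifier with a scikit-learn-style predict_proba method; the noise scale and number of probes are arbitrary placeholders a team would need to set for its own data.

```python
# A minimal local-stability probe: nudge an input slightly and check whether
# the model's decision changes. `model` is a placeholder for any classifier
# exposing a scikit-learn-style predict_proba.
import numpy as np


def decision_flip_rate(model, x, noise_scale=0.01, n_probes=100, threshold=0.5, seed=0):
    """Fraction of small random perturbations of x that flip the decision."""
    rng = np.random.default_rng(seed)
    base = model.predict_proba(x.reshape(1, -1))[0, 1] >= threshold
    flips = 0
    for _ in range(n_probes):
        x_near = x + rng.normal(scale=noise_scale, size=x.shape)
        near = model.predict_proba(x_near.reshape(1, -1))[0, 1] >= threshold
        flips += int(near != base)
    return flips / n_probes


# A high flip rate means near-identical applicants could receive different
# decisions, a warning sign for both accuracy and fairness.
```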

All these considerations, of course, don’t mean that we should avoid machine learning altogether. Instead, executives need to embrace the opportunities it creates while making sure they properly address the risks.

To Lock or Not to Lock?

If leaders decide to employ machine learning, a key next question is: Should the company allow it to continuously evolve or instead introduce only tested and locked versions at intervals? Would the latter choice mitigate the risks just described?

This problem is familiar to the medical world. The FDA has so far typically approved only “software as a medical device” (software that can perform its medical functions without hardware) whose algorithms are locked. The reasoning: The agency has not wanted to permit the use of devices whose diagnostic procedures or treatment pathways keep changing in ways it doesn’t understand. But as the FDA and other regulators are now realizing, locking the algorithms may be just as risky, because it doesn’t necessarily remove the following dangers:

Inaccurate decisions

Locking doesn’t alter the fact that machine-learning algorithms typically base decisions on estimated probabilities. Moreover, while feeding a system more data usually improves its performance, it doesn’t always, and the size of the improvement varies from system to system and with the volume of data. Though it’s difficult to understand how the accuracy (or inaccuracy) of decisions may change when an algorithm is unlocked, it’s important to try.
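
One rough way to try is to backtest: replay successive batches of data and compare a model that is trained once and locked with one that is retrained as new data arrives. The sketch below does this on synthetic data whose input-output relationship gradually shifts; the data generator and the choice of logistic regression are purely illustrative assumptions.

```python
# An illustrative backtest comparing a locked model with one that is
# periodically retrained, on synthetic data whose relationship drifts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def make_batch(n, drift):
    """Synthetic binary outcomes whose relationship to the single feature
    weakens and then reverses as `drift` grows from 0 toward 1."""
    x = rng.normal(size=(n, 1))
    logits = (1.0 - 2.0 * drift) * x[:, 0]
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return x, y


x0, y0 = make_batch(2000, drift=0.0)
locked = LogisticRegression().fit(x0, y0)    # trained once and never updated
updated = LogisticRegression().fit(x0, y0)   # retrained on each new batch

for period, drift in enumerate([0.0, 0.25, 0.5, 0.75]):
    x, y = make_batch(2000, drift=drift)
    print(f"period {period}: locked accuracy {locked.score(x, y):.2f}, "
          f"updated accuracy {updated.score(x, y):.2f}")
    updated = LogisticRegression().fit(x, y)  # retrain on the latest data
```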

Environmental changes

It also matters whether and how the environment in which the system makes decisions is evolving. For example, car autopilots operate in environments that are constantly altered by the behavior of other drivers. Pricing, credit scoring, and trading systems may face a shifting market regime whenever the business cycle enters a new phase. The challenge is ensuring that the machine-learning system and the environment coevolve in a way that lets the system make appropriate decisions.

Agency risks

Locking an algorithm doesn’t eliminate the complexity of the system in which it’s embedded. For example, errors caused by using inferior data from third-party vendors to train the algorithm or by differences in skills across users can still occur. Liability can still be challenging to assign across data providers, algorithm developers, deployers, and users.

Moral risks

A locked system may preserve imperfections or biases unknown to its creators. When analyzing mammograms for signs of breast cancer, a locked algorithm would be unable to learn from new subpopulations to which it is applied. Since average breast density can differ by race, this could lead to misdiagnoses if the system screens people from a demographic group that was underrepresented in the training data. Similarly, a credit-scoring algorithm trained on a socioeconomically segregated subset of the population can discriminate against certain borrowers in much the same way that the illegal practice of redlining does. We want algorithms to correct for such problems as soon as possible by updating themselves as they “observe” more data from subpopulations that may not have been well represented or even identified before. Conversely, devices whose machine-learning systems are not locked could harm one or more groups over time if they’re evolving by using mostly data from a different group. What’s more, identifying the point at which the device gets comparatively worse at treating one group can be hard.

A Tool Kit for Executives

So how should executives manage the existing and emerging risks of machine learning? Developing appropriate processes, increasing the savviness of management and the board, asking the right questions, and adopting the correct mental frame are important steps.

Treat machine learning as if it’s human

Executives need to think of machine learning as a living entity, not an inanimate technology. Just as cognitive testing of employees won’t reveal how they’ll do when added to a preexisting team in a business, laboratory testing cannot predict the performance of machine-learning systems in the real world. Executives should demand a full analysis of how employees, customers, or other users will apply these systems and react to their decisions. Even when not required to do so by regulators, companies may want to subject their new machine-learning-based products to randomized controlled trials to ensure their safety, efficacy, and fairness prior to rollout. But they may also want to analyze products’ decisions in the actual market, where there are various types of users, to see whether the quality of decisions differs across them. In addition, companies should compare the quality of decisions made by the algorithms with those made in the same situations without employing them. Before deploying products at scale, especially but not only those that haven’t undergone randomized controlled trials, companies should consider testing them in limited markets to get a better idea of their accuracy and behavior when various factors are at play—for instance, when users don’t have equal expertise, the data from sources varies, or the environment changes. Failures in real-world settings signal the need to improve or retire algorithms.

Think like a regulator and certify first

Businesses should develop plans for certifying machine-learning offerings before they go to market. The practices of regulators offer a good road map. In 2019, for example, the FDA published a discussion paper that proposed a new regulatory framework for modifications to machine-learning-based software as a medical device. It laid out an approach that would allow such software to continuously improve while maintaining the safety of patients, which included a complete assessment of the company—or team—developing the software to ensure it had a culture of organizational excellence and high quality that would lead it to regularly test its machine-learning devices. If companies don’t adopt such certification processes, they may expose themselves to liability—for example, for performing insufficient due diligence.

Many start-ups provide services to certify that products and processes don’t suffer from bias, prejudice, stereotypes, unfairness, and other pitfalls. Professional organizations, such as the Institute of Electrical and Electronics Engineers and the International Organization for Standardization, are also developing standards for such certification, while companies like Google offer AI ethics services that examine multiple dimensions, ranging from the data used to train systems, to their behavior, to their impact on well-being. Companies might need to develop similar frameworks of their own.

Monitor continuously

As machine-learning-based products and services and the environments they operate in evolve, companies may find that their technologies don’t perform as initially intended. It is therefore important that they set up ways to check that these technologies behave within appropriate limits. Other sectors can serve as models. The FDA’s Sentinel Initiative draws from disparate data sources, such as electronic health records, to monitor the safety of medical products and can force them to be withdrawn if they don’t pass muster. In many ways companies’ monitoring programs may be similar to the preventive maintenance tools and processes currently used by manufacturing or energy companies or in cybersecurity. For example, firms might conduct so-called adversarial attacks on AI like those used to routinely test the strength of IT systems’ defenses.
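
As a simple illustration, a monitoring “guardrail” can be expressed as a check that runs on each window of labeled live outcomes and raises an alert when overall error, or the error gap between demographic groups, exceeds a preset limit. The thresholds and inputs below (NumPy arrays of outcomes, predictions, and group labels) are placeholder assumptions, not recommendations.

```python
# A minimal monitoring "guardrail": check one window of labeled live outcomes
# against error and fairness-gap thresholds set in advance.
import numpy as np


def guardrail_check(y_true, y_pred, group, max_error=0.10, max_group_gap=0.05):
    """Return a list of alerts for one monitoring window."""
    alerts = []
    errors = y_true != y_pred
    overall_error = float(errors.mean())
    if overall_error > max_error:
        alerts.append(f"overall error {overall_error:.3f} exceeds {max_error}")
    group_errors = {g: float(errors[group == g].mean()) for g in np.unique(group)}
    gap = max(group_errors.values()) - min(group_errors.values())
    if gap > max_group_gap:
        alerts.append(f"error gap across groups {gap:.3f} exceeds {max_group_gap}: {group_errors}")
    return alerts


# Usage: run after each monitoring window; any alert can trigger human review,
# a rollback to a previous model version, or withdrawal of the offering.
```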

Ask the right questions

Executives and regulators need to delve into the following:

Accuracy and competitiveness. How much is the performance of the machine-learning-based system likely to improve with the volume of new data from its use if we don’t lock the algorithm? What will such improvements mean for the business? To what extent will consumers understand the benefits and drawbacks of locked versus unlocked systems?

Biases. What data was used to train the algorithm? How representative is it of the population on which the algorithm will ultimately operate? Can we predict whether an unlocked algorithm will produce less-biased results than a locked one if we allow it to learn over time? Do the algorithm’s errors affect minorities or other groups in particular? Can a continuous monitoring approach establish “guardrails” that stop the algorithm from becoming discriminatory?

The environment. How will the environment in which the offering is used change over time? Are there conditions under which machine learning should not be allowed to make decisions, and if so, what are they? How can we ensure that the offering’s behavior evolves appropriately given how the environment itself is changing? When should we withdraw our offering because the gap between the environment and our offering’s behavior has become too big? What are the boundaries of the environment within which our offering can adapt and operate? How robust and safe are our machine-learning systems throughout their life cycles?

Agency. On which third-party components, including data sources, does the behavior of our machine-learning algorithms depend? How much does it vary when they’re used by different types of people—for example, less-skilled ones? What products or services of other organizations use our data or machine-learning algorithms, possibly exposing us to liability? Should we allow other organizations to use machine-learning algorithms that we develop?

Develop principles that address your business risks

Businesses will need to establish their own guidelines, including ethical ones, to manage these new risks—as some companies, like Google and Microsoft, have already done. Such guidelines often need to be quite specific (for example, about what definitions of fairness are adopted) to be useful and must be tailored to the risks in question. If you’re using machine learning to make hiring decisions, it would be good to have a model that is simple, fair, and transparent. If you’re using machine learning to forecast the prices of commodity futures contracts, you may care less about those values and more about the maximum potential financial loss allowed for any decision that machine learning makes.

Luckily, the journey to develop and implement principles doesn’t need to be a lonely one. Executives have a lot to learn from the multiyear efforts of institutions such as the OECD, which developed the first intergovernmental AI principles (adopted in 2019 by many countries). The OECD principles promote innovative, trustworthy, and responsibly transparent AI that respects human rights, the rule of law, diversity, and democratic values, and that drives inclusive growth, sustainable development, and well-being. They also emphasize the robustness, safety, security, and continuous risk management of AI systems throughout their life cycles.

The OECD’s recently launched AI Policy Observatory provides further useful resources, such as a comprehensive compilation of AI policies around the world.

Machine learning has tremendous potential. But as this technology, along with other forms of AI, is woven into our economic and social fabric, the risks it poses will increase. For businesses, mitigating them may prove as important as managing the adoption of machine learning itself, and possibly more critical. If companies don’t establish appropriate practices to address these new risks, they’re likely to have trouble gaining traction in the marketplace.
