Chapter 1. Sometimes Machines Make Bad Decisions

Why do humans so frequently manage, supervise, step in to assist, or override automated systems? What are the qualities of human decision-making that automated systems cannot replicate? Comparing these strengths and weaknesses helps us design useful AI and avoid the fallacies so common in popular discussions.

Math, Menus, and Manuals: How Machines Make Automated Decisions

If you want to design brains well, you first need to understand how automated systems make decisions. You will use this knowledge to compare techniques and decide when Autonomous AI will outperform existing methods. You will also combine automated decision making with AI to design more explainable, reliable, deployable brains. Though there are many subcategories and nuances, automated systems rely on three primary methods to make decisions: math, menus, and manuals.

Control theory uses math to calculate decisions

Control theory uses well-understood mathematical relationships to calculate the next control action, usually based on feedback. When you do this, you must trust that the math describes the system dynamics well enough to use it to calculate what to do next. Let me give you an example. There’s an equation that describes how much space a gas like air will take up based on its temperature and pressure. It’s called the ideal gas law. So, if we wanted to design a brain that controls a valve that inflates party balloons, we could use this equation to calculate how much to open and close the valve to inflate balloons to a particular size.

In the example above, we rely on math to describe what will happen so completely and accurately that we don’t even need feedback. Controlling based on equations like this is called open loop control because there is no closed feedback loop telling us whether our actions achieved the desired results. But what if the equation doesn’t completely describe all of the factors that affect whether we succeed or not? For example, the ideal gas law doesn’t model high-pressure gases, low-temperature gases, dense gases, or heavy gases well. Even when we control in response to feedback, limitations of the mathematical model might lead to bad decisions.
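As a minimal sketch of the open loop calculation described above, the ideal gas law (PV = nRT) can be rearranged to estimate how much air a balloon needs. The function name, the balloon volume, and the room conditions are illustrative assumptions, not values from a real inflation system.

```python
# Open loop sketch: compute how much air to deliver from the ideal gas law,
# PV = nRT, with no feedback about the balloon's actual size.
R = 8.314  # universal gas constant, J/(mol*K)

def moles_of_air_needed(volume_m3, pressure_pa, temperature_k):
    """Rearranged ideal gas law: n = PV / (RT)."""
    return pressure_pa * volume_m3 / (R * temperature_k)

# Hypothetical target: a 14-liter balloon at about 1 atmosphere and 20 degrees C.
target_volume = 0.014   # cubic meters
pressure = 101_325.0    # pascals
temperature = 293.15    # kelvin

n = moles_of_air_needed(target_volume, pressure, temperature)
print(f"Deliver about {n:.2f} mol of air, then close the valve")
```

Nothing in this calculation checks the result: if the room is colder than assumed, the balloon simply ends up the wrong size.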

I see the history of control theory as an evolution of capability where each control technology can do things that previous control technologies could not. The US Navy invented the PID controller to automatically steer ship rudders and control ship headings (the direction that it’s pointing). Imagine that a ship is pointing in one direction and the captain wants to change headings to point the ship in a different direction.

Figure 1-1. Ship changing heading

The controller uses math to calculate how much to move the rudder based on feedback it gets from its last action. There are three numbers that determine how the controller will behave: the P, the I, and the D constant. The P constant moves you toward the target, so for the ship the P constant makes sure that the rudder action causes the ship to move toward its new heading. But what if the controller keeps turning the rudder and the ship sweeps right past the target heading? This is what the I constant is for. The I constant tracks how much total error you have in the system and keeps you from overshooting or undershooting the target. The D constant ensures that you arrive at the target destination smoothly instead of abruptly. So, a D in the ship controller would make it more likely that the ship will decelerate and arrive more precisely at its destination heading.

Figure 1-2. Example behaviors of various controllers

A ship with a P controller (no I and no D) might overshoot the target, sweeping the boat past the new heading. After the ship sweeps past the target, you’ll need to turn it back. The leftmost diagram in Figure 1-2 shows what this might look like. The horizontal line represents the destination heading. A PI controller will more quickly converge on the target because the I term makes sure that you overshoot the target as little as possible. The PD controller approaches the target more smoothly, but takes longer to reach it.
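To make the three constants concrete, here is a minimal sketch of a textbook PID calculation steering a toy ship. The class name, the gains, the rudder limit, and the one-line ship model are all assumptions chosen for illustration; they are untuned and do not describe any real vessel.

```python
class PIDController:
    """Textbook PID: output = Kp*error + Ki*sum(error*dt) + Kd*d(error)/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, target, measured, dt):
        error = target - measured
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no rate information on the first step
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy ship (an assumption): the rudder angle sets how fast the heading changes.
controller = PIDController(kp=0.8, ki=0.05, kd=0.2)
heading, target_heading, dt = 0.0, 90.0, 0.1
for _ in range(600):  # 60 simulated seconds
    rudder = controller.update(target_heading, heading, dt)
    rudder = max(-30.0, min(30.0, rudder))  # rudder limited to +/-30 degrees
    heading += 0.5 * rudder * dt            # hypothetical turn-rate constant
print(f"Heading after 60 s: {heading:.1f} degrees")
```

Setting `ki` or `kd` to zero turns the same class into a P, PI, or PD controller, which is one way to experiment with the behaviors the figure compares.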

The PID controller can be very effective and you can find it in almost every modern factory and machine, but it can mistake disturbances and noise for events that it needs to respond to. For example, if a PID controller controlled the gas pedal on your car, it might mistake a speed bump (which is a minor and temporary disturbance) for a hill, which requires significant acceleration. In this case, the controller might overaccelerate and exceed the commanded speed, then need to slow down.

The feedforward controller separately measures and responds to disturbances, not just the variable you are controlling. In contrast to PID feedback control, feedforward control acts the moment a disturbance occurs, without having to wait for the disturbance to affect the variable you are controlling (speed of the car in our case). This enables a feedforward controller to quickly and directly cancel out the effect of a disturbance. To do this, a feedforward controller produces its control action based on a measurement of the disturbance. Feedforward control is almost always implemented as an add-on to feedback control. The feedforward controller takes care of the major disturbance, and the feedback controller takes care of everything else that might cause the variable you are controlling (the speed we set in cruise control, for example) to deviate from its target destination. This allows the controller to better tell the difference between a speed bump and a hill by measuring and responding to the disturbance (the change in road elevation) instead of measuring and responding only to the change in the vehicle’s speed. See Figure 1-3 for an example showing how much better feedback with feedforward control responds to a disturbance than feedback control alone.

Figure 1-3. Comparing how a car controlled by PID feedback control responds to a speed bump versus one controlled by feedforward: the fluctuation in speed is much smaller with feedforward control.
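The difference between the two approaches can be sketched with a toy cruise control simulation. Everything here is an assumption made for illustration: the simple car model, the gains, the size of the hill, and a feedforward term that cancels only 80% of the measured disturbance so that feedback still has work to do.

```python
def simulate(use_feedforward, steps=400, dt=0.1):
    """Return the largest speed error seen while driving over a hill."""
    target, speed = 25.0, 25.0   # meters per second
    drag, kp = 0.1, 0.5          # hypothetical car model and feedback gain
    worst_error = 0.0
    for k in range(steps):
        # Measured disturbance: deceleration from the hill, in m/s^2.
        hill = 1.0 if 100 <= k < 300 else 0.0
        # Feedback: respond to the speed error after it appears.
        throttle = drag * target + kp * (target - speed)
        if use_feedforward:
            # Feedforward: respond to the hill itself, before speed drops.
            throttle += 0.8 * hill
        speed += dt * (throttle - drag * speed - hill)
        worst_error = max(worst_error, abs(target - speed))
    return worst_error

feedback_only = simulate(use_feedforward=False)
with_feedforward = simulate(use_feedforward=True)
print(f"Worst speed error, feedback only:    {feedback_only:.2f} m/s")
print(f"Worst speed error, with feedforward: {with_feedforward:.2f} m/s")
```

In this sketch the feedback loop is kept in both cases, mirroring the point that feedforward is usually an add-on rather than a replacement.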

The more sophisticated feedforward controller has limitations too. Both PID and feedforward controllers can only control one variable for one goal per feedback loop. So you’d need two feedback/feedforward loops, for example, if you needed to control both the gas pedal and the steering wheel of the car. And neither of those loops can both maximize gas mileage and maintain constant speed at the same time.

So what happens if you need to control for more than one variable or pursue more than one goal? There are ways to solve this, but in real life we often see people create separate feedback loops that can’t talk to each other or coordinate actions. So in the same way that humans duplicate work and miscalculate what to do when we don’t coordinate with each other, separate control loops don’t manage multiple goals well and often waste energy.

Enter the latest in the evolution of widely adopted control systems: Model Predictive Control (MPC). MPC extends the capability of PID and feedforward to control for multiple inputs and multiple outputs. Now, the same controller can be used to control multiple variables and pursue multiple goals. The MPC controller uses a very accurate system model to try various control actions in advance and then choose the best action. This control technique actually borrows from the second type of automated decision making (menus). It has many attractive characteristics, but lives or dies by the accuracy of the system model, or the equations that predict how the system will respond to your actions. But real systems change: machines wear, equipment is replaced, the climate changes, and this can make the system model inaccurate over time. Many of us have experienced this in our vehicles. As the brakes wear, we need to apply the brakes earlier to stop. As the tires wear, we can’t drive as fast or turn as sharply without losing control. Since MPC uses the system model to look ahead and try potential actions, an inaccurate model will mislead it to decide on actions that won’t work well on the real system. Because of this, many MPC systems were installed, then later decommissioned when the system drifted from the system model.
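The look-ahead idea can be sketched in a few lines: try every short sequence of candidate actions on the model, score each one, and apply only the first action of the best sequence. The model equation, the action menu, the horizon, and the cost weights below are hypothetical, and for simplicity the "real" car here is the model itself, which is exactly the assumption that breaks down when a system drifts.

```python
from itertools import product

def model(speed, throttle, dt=0.5, drag=0.1):
    """Assumed system model: predicts the next speed for a throttle setting."""
    return speed + dt * (throttle - drag * speed)

def mpc_action(speed, target, actions=(0.0, 1.0, 2.0, 3.0, 4.0), horizon=3):
    """Try every action sequence on the model; return the best first action."""
    best_cost, best_first = float("inf"), None
    for sequence in product(actions, repeat=horizon):
        predicted, cost = speed, 0.0
        for throttle in sequence:
            predicted = model(predicted, throttle)
            # Penalize distance from the target and, lightly, throttle use.
            cost += (target - predicted) ** 2 + 0.01 * throttle ** 2
        if cost < best_cost:
            best_cost, best_first = cost, sequence[0]
    return best_first

speed = 20.0
for _ in range(20):
    # Re-plan at every step, then apply only the first planned action.
    speed = model(speed, mpc_action(speed, target=25.0))
print(f"Speed after 20 steps: {speed:.2f} m/s")
```

If `model` stopped matching the real car (worn brakes, worn tires), the controller would keep choosing actions that look best only on paper.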

In 2020, McKinsey’s QuantumBlack built an Autonomous AI to help steer the Emirates Team New Zealand sailing team to victory by controlling the rudder of their boat. This AI brain can take in many variables, including ones that math-based controllers can’t, like video feeds from cameras and categories (like forward, backward, left, right). It learns by practicing in simulation and acquires creative strategies to pursue multiple goals simultaneously. For example, in its experimentation and self-discovery, it tried to sail the boat upside down because for a while, during practice in simulation, it seemed like a promising approach to accomplish sailing goals.

Control theory uses math to calculate what to do next, and the techniques for doing this continuously evolve. Autonomous AI is simply an extension of these techniques that offers some really attractive control characteristics, like the ability to control for multiple variables and track toward multiple goals.

Optimization algorithms use menus of options to evaluate decisions

Optimization algorithms search a list of options and select a control action using objective criteria. Optimization works like selecting options from a menu. For example, an optimization system might list all possible routes to deliver a package from manufacturing point A to delivery point B, then select sequential routing decisions by sorting the shortest route to the top of the list of options. You might come up with different control actions if you sort for the shortest trip duration. In this example, the route distance and the trip durations are the objective criteria (the goals of the optimization). Imagine playing tic-tac-toe this way. Tic-tac-toe is a simple two-player game played on a grid where you place your symbol, an X or an O, in squares on the grid, and you win when you are the first player to occupy three squares in a row with your symbol. If you want to play the game like an optimization algorithm, you could use the following procedure:

  1. Make a list of all squares (there are 9, see Figure 1-4 for an example).

  2. Cross out (from the list) the squares that already have an X or an O in them.

  3. Choose an objective to optimize for. For example, you might decide to make moves based on how many blank squares there are adjacent to each square. This objective gives you the most flexibility for future moves. This is why many players choose the center square for their first move (there are 8 squares adjacent to the middle square).

  4. Sort your options based on the objective criteria.

  5. The top option is your next move. If there are multiple moves with the exact same objective criteria score, choose one randomly.

Figure 1-4. Diagram of a tic-tac-toe board showing X making the first move, O making the second move, the number of adjacent squares that are open or under blue control for each option available for blue’s next move, and a list that tracks the attributes of each square
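The five-step procedure above can be sketched as follows. The board representation (a list of nine entries numbered 0 to 8, row by row) and the function names are assumptions for illustration, and the objective counts only blank adjacent squares, as described in step 3. The sketch assumes the board has at least one blank square.

```python
import random

def adjacent(square):
    """Squares touching `square` (0-8, row by row), including diagonals."""
    row, col = divmod(square, 3)
    return [r * 3 + c
            for r in range(max(0, row - 1), min(3, row + 2))
            for c in range(max(0, col - 1), min(3, col + 2))
            if (r, c) != (row, col)]

def next_move(board, rng=random):
    """board is a list of 9 entries: 'X', 'O', or None for a blank square."""
    # Steps 1 and 2: list all squares, then cross out the occupied ones.
    options = [s for s in range(9) if board[s] is None]

    # Step 3: the objective is the number of blank squares adjacent to each option.
    def score(s):
        return sum(board[n] is None for n in adjacent(s))

    # Step 4: sort the options by the objective, best first.
    options.sort(key=score, reverse=True)
    # Step 5: take the top option, breaking ties randomly.
    top = [s for s in options if score(s) == score(options[0])]
    return rng.choice(top)

empty_board = [None] * 9
print(next_move(empty_board))  # 4, the center square, with 8 blank neighbors
```

Notice that nothing in `next_move` knows what three in a row means; it only sorts a menu of options by a score.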

This exercise shows the first limitation of optimization algorithms. They don’t know anything about the task. This is why you need to choose a square randomly if there are multiple squares with the same objective score at the top of your search. Claude Shannon, one of the early pioneers of AI, talked about this in his famous 1950 paper about Chess-playing AI. He observed that there were two ways to program a chess AI. He labeled them System A and System B. System A, which is actually the third method of automated decision making (manuals), programs chess strategies. These rules and exceptions are difficult to manage and update, but they express understanding of the game. System B, which is optimization, searches possible legal chess moves with a single easy-to-maintain algorithm, but has no actual understanding of the concepts or strategies of chess.

Solutions are like points on a map

Optimization algorithms are like explorers searching the surface of the earth for the highest mountain or the lowest point. The solutions to problems are points on the map where, if you arrive, you achieve some good outcome. If your goal is altitude, you are looking for the peak of Mt. Everest, at 8,848 meters above sea level (see Figure 1-6 for a map of Earth colored by altitude in relation to sea level). If your goal is finding the location that is most packed with people (population density), you are looking for the Chinese territory of Macau, which has a population density of 21,081 people/km² (see Figure 1-7 for a map of Earth colored by population density). If you’re looking for the coldest place on earth on average, then you’re looking for Vostok Station, Antarctica (see Figure 1-8 for a map of Earth colored by average temperature).

Figure 1-6. World map colored by altitude
Figure 1-7. World map colored by population density
Figure 1-8. World map colored by annual average temperature

Now, imagine that you are an optimization algorithm searching the earth for the highest peak. One way to ensure that you find the highest peak is to set foot on every square meter of earth, take measurements at each point and then, when you are done, sort your measurements by altitude. The highest point on earth will now be at the top of your list.

Since the earth has about 510 million square kilometers of surface area, it would take many, many lifetimes to get your answer. This method is called brute force search and is only feasible when the geographical search area of possible decisions is very small (like in the game tic-tac-toe). For more complex problem geographies, we need another method.

A more efficient way to search the earth for the highest peak is to walk the earth and only take steps in the direction that slopes upward the most. Using this method, you can avoid exploring much of the geography by only traveling uphill. In optimization, this class of methods is called gradient-based methods because the slope of a hill is called a grade or a gradient. There are two challenges with this method. The first is that depending on where you start your search, you could end up on a tall mountain that is not the highest point on earth. If you start your search in Africa, you could end up on Mt. Kilimanjaro (Africa’s tallest peak, but not the world’s). If you start in North America, you could end up on top of one of the mountains in the Rocky Mountain range. You could even end up on a much lesser peak because once you ascend any peak using this search method, you cannot descend back down any hill. Figure 1-9 demonstrates how this works.

Figure 1-9. A gradient-based algorithm will stop searching in relation to the highest point on a curve.
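The contrast between brute force search and uphill-only search can be sketched on a tiny one-dimensional landscape. The altitude values and function names are made up for illustration.

```python
terrain = [1, 3, 5, 4, 2, 6, 9, 12, 8, 7]  # hypothetical altitudes along a path

def brute_force(terrain):
    """Visit every point and return the index of the highest one."""
    return max(range(len(terrain)), key=lambda i: terrain[i])

def hill_climb(terrain, start):
    """Step to the higher neighbor until no neighbor is higher."""
    position = start
    while True:
        neighbors = [i for i in (position - 1, position + 1)
                     if 0 <= i < len(terrain)]
        best = max(neighbors, key=lambda i: terrain[i])
        if terrain[best] <= terrain[position]:
            return position  # stuck on a peak, possibly not the highest one
        position = best

print(brute_force(terrain))    # 7: the true highest point
print(hill_climb(terrain, 0))  # 2: a lesser peak, reached from the left
print(hill_climb(terrain, 5))  # 7: the highest point, reached from a better start
```

The uphill-only search visits far fewer points than brute force, but where it ends up depends entirely on where it starts.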

The second limitation of this method is that it can only be used in situations where you can calculate the slope of the ground where you walk. If there are gaps in the terrain (think vertical drops or bottomless pits), it is not possible to calculate the slope (technically it’s infinite) at the vertical drops, so you can’t use gradient-based optimization methods to search for solutions in that space. See Figure 1-10 for examples of solution terrain that gradient algorithms cannot search.

Figure 1-10. These examples of function curves cannot be searched using gradient based methods.

Now imagine if you employed multiple explorers to start at different places in the landscape and search for the highest point. After each step, the explorers compare notes on their current position and elevation and use their combined knowledge to better map the earth. That might lead to a quicker search and keep the whole exploration from getting stuck in a high spot that is not the peak of Mt. Everest. These and other innovations allow optimization algorithms to more efficiently and effectively explore more kinds of landscapes, even the ones shown in Figure 1-10. Many of these algorithms are inspired by processes in nature. Nature has many effective ways of exploring thoroughly, much like water flowing over a patch of land. Here are a few examples:

Evolutionary Algorithms

Inspired by Darwin’s theory of natural selection, evolutionary algorithms spawn a population of potential solutions, test how well each solution in the population achieves the process goals, kill off the ineffective solutions, then mutate the population to continue exploring.

Swarm Methods

Inspired by how ants, bees, and particles swarm, move, and interact, these optimization methods explore the solution space with many explorers that move along the landscape and communicate with each other about what they find.

Tree Methods

These methods treat potential solutions as branches on trees. Imagine a choose-your-own-adventure novel (or other interactive fiction) that asks you to decide which direction to take at a certain point in the story. The decisions proliferate with the number of options at each decision point. Tree-based methods use various techniques to search the tree efficiently (not having to visit each branch) for solutions. Some of the more well-known tree methods are branch and bound and Monte Carlo Tree Search (MCTS).

Simulated Annealing

Inspired by the way that metal cools, Simulated Annealing searches the space using different search behavior over time. Annealing is the process of heating a material until it reaches an annealing temperature and then cooling it slowly in order to change the material to a desired structure. When the material is hot, the molecular structure is weaker and more susceptible to change. When the material cools down, the molecular structure is harder and less susceptible to change. In the same way, Simulated Annealing casts a wide search net at first (exploring more), then becomes more sure that it’s zeroing in on the right area (exploring less) over time.

Figure 1-11. Some optimization algorithms that use multiple coordinated explorers
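As one example from the list above, here is a minimal sketch of simulated annealing on a made-up one-dimensional landscape. The terrain function, the linear cooling schedule, the starting temperature, and the step count are all assumptions chosen for illustration, not recommended settings.

```python
import math
import random

def simulated_annealing(altitude, start, steps=2000, seed=0):
    """Search for a high point of altitude(x) over integers; return the best x seen."""
    rng = random.Random(seed)
    current = best = start
    for step in range(steps):
        temperature = 10.0 * (1 - step / steps)  # cools linearly toward zero
        candidate = current + rng.choice((-1, 1))
        change = altitude(candidate) - altitude(current)
        # Always accept uphill moves; accept downhill moves with a
        # probability that shrinks as the temperature drops.
        if change >= 0 or rng.random() < math.exp(change / temperature):
            current = candidate
        if altitude(current) > altitude(best):
            best = current
    return best

def altitude(x):
    """Hypothetical terrain: a small hill at x=10 and a taller one at x=40."""
    return max(0, 5 - abs(x - 10)) + max(0, 12 - abs(x - 40))

best = simulated_annealing(altitude, start=10)
print(f"Best point found: x={best}, altitude={altitude(best)}")
```

The search starts on top of the small hill, where an uphill-only method would stop immediately. Because downhill steps are sometimes accepted while the temperature is high, the explorer can leave that hill and may find the taller one, although nothing guarantees it.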

Solving the Game of Checkers: There Is No Spoon

There is a great scene in the movie “The Matrix” where the main character, Neo, is exploring an imaginary world that rogue machine AI conjured to entrap humans in their minds. He finds a child, dressed as a Shaolin monk, bending a spoon with his mind. Seeing that Neo is perplexed, the child explains that there is no spoon (he’s manipulating the imaginary code-based world that their human minds occupy).

I feel similarly about making the perfect sequential decision while performing a complex task. Optimization experts call these “best possible decisions” (like Mt. Everest or the highest points in Figure 1-9 and Figure 1-11) global optima. Optimization algorithms promise the possibility of reaching global optima for every decision, but this is not how it works in real, complex systems. For example, there is no “perfect move” during a chess match. There are strong moves, creative moves, weak moves, surprising moves, but no perfect moves for winning a particular game. That is, unless you’re playing checkers.

In 2007, after almost 20 years of continuously searching the space with optimization algorithms on powerful computers, researchers declared checkers “solved.” This means that no matter what the board looks like, or which player’s move it is, we know exactly what the optimal checkers move is. Checkers is roughly 1 million times more complex than Connect Four, with 500 billion billion possible positions (5 × 10²⁰). So why don’t we just solve our industrial problems like checkers? Not so fast. Besides the fact that it took optimization algorithms 20 years to solve checkers, most real problems are even more complex than that. Remember the tree-based optimization methods above? Well, one way that computer scientists devised to measure the complexity of tasks is to count the number of possible options at the average branch of the tree. This is called the branching factor. The branching factor for checkers is 2.8, which means on average there are about 3 possible moves for any turn during a checkers game. The branching factor for chess is 31 and the branching factor for the Chinese game of Go is 250.
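To get a feel for why the branching factor matters, here is a rough sketch that estimates how many move sequences a search would face when looking ten turns ahead. It uses the branching factors quoted above and the simplifying assumption that every turn offers the average number of options.

```python
branching_factors = {"Checkers": 2.8, "Chess": 31, "Go": 250}

def move_sequences(branching_factor, depth):
    """Approximate count of distinct move sequences `depth` turns ahead."""
    return branching_factor ** depth

for game, b in branching_factors.items():
    print(f"{game}: about {move_sequences(b, 10):.1e} sequences 10 turns ahead")
```

Because the count grows as the branching factor raised to the depth, even a modest increase in options per turn quickly puts exhaustive search out of reach.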

Table 1-1. Branching factors for common games
Game            Branching factor
Connect Four    [value missing]
Checkers        2.8
Chess           31
Go              250

Then there’s uncertainty. One very convenient aspect of games like checkers and chess is that things always happen exactly the way you want them to. For example, if I want to move my bishop all the way across the board (the chess term for putting the bishop on this cross-board highway of the longest diagonal is fianchetto), I can be certain that the bishop will make it all the way to G7 as I intended. But in the real-life war campaigns that chess was modeled after, an offensive to take a certain hill is not guaranteed to succeed, so our bishop might actually land at F6 or E5 instead of our intended G7. This kind of uncertainty about the success of each move will likely change our strategy.

Figure 1-12. Bishop moving from A1 to G7 on a chess board

So, for real problems, like bending spoons in the Matrix, when it comes to finding optimal solutions for each move you need to make, there is indeed no spoon.

You can spend an entire professional or academic career learning about optimization methods, but this overview should provide the context you need to design brains that incorporate optimization methods and outperform optimization methods for decision-making about specific tasks and processes. If you’d like to learn more about optimization methods, I recommend Numerical Optimization by Nocedal and Wright.

Expert Systems use stored expertise, like manuals

Expert systems look up control actions from a database of expert rules—essentially a complex manual. This provides compute-efficient access to effective control actions, but creating that database requires existing human expertise—after all, you have to know how to bake a cake in order to write the recipe. Expert systems leverage understanding of the system dynamics and effective strategies to control the system, but they require so many rules to capture all the nuanced exceptions that they can be cumbersome to program and manage.

Let’s use an example from a heating, ventilation, and air conditioning (HVAC) system like the one that might control the temperature in your office building. The system uses a damper valve that can be opened to let in fresh air or closed to recycle air. The system can save energy by recycling air at times of day when the price of energy is high or when the air is very cold and needs to be heated up. However, recycling too much air, especially when there are many people in the building, decreases air quality as carbon dioxide builds up.

Let’s say we implement an expert system with two simple rules that set the basic structure:

  • Close the damper to recycle air when energy is expensive or when air is very cold (or very hot).

  • Open the damper to let in fresh air when the air quality is reaching legal limits.

These rules represent the two fundamental strategies for controlling the system. See Figure 1-13 for a diagram of how the damper works.

Figure 1-13. How a damper valve in a commercial HVAC system recycles and freshens air

The first control strategy is perfect for saving money when energy is expensive and when temperatures are extreme. It works best when building occupancy is lower. The second strategy works well when energy is less expensive and building occupancy is high.

But we’re not done yet. Even though the first two rules in our expert system are simple to understand and manage, we need to add many more rules to execute these strategies under all possible conditions. The real world is fuzzy, and every rule has hundreds of exceptions that would need to be codified into an expert system. For example, the first rule tells us that we should recycle air when the energy is expensive and when the air temperature is extreme (really hot or cold). How expensive should the energy be in order to justify recycling air? And how much should you close the damper valve to recycle air? Well, that depends on the carbon dioxide levels in the rooms and on the outdoor temperature. It’s fuzzy—and the right answer depends on the surface of the landscape defined by the relationships between energy prices, outdoor air temperatures, and number of people in the building.

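# Example readings. Units are assumed: degrees Fahrenheit and dollars per kWh.
temp, price = 5, 0.25
valve = 1.0  # assumed default: damper fully open

if (temp > 90 or temp < 20) and price > 0.20:  # Recycle air
    # The colder the outdoor air, the further the damper closes.
    if temp < 0:
        valve = 0.1
    elif temp < 10:
        valve = 0.2
    elif temp < 20:
        valve = 0.3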

The code above shows some of the additional rules required to effectively implement two HVAC control strategies under many different conditions.
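One way to keep a growing rule set manageable is to store the rules as data and look them up, much like finding the right page in a manual. The sketch below is only an illustration: the table layout, the function name, and the fully open default are assumptions, and the thresholds simply reuse the temperature and price values from the rules above.

```python
# Each rule: (applies when temp is below this, and price is above this, damper opening).
# Rules are ordered from most specific (coldest) to least specific.
RULES = [
    (0, 0.20, 0.1),
    (10, 0.20, 0.2),
    (20, 0.20, 0.3),
]

def damper_opening(temp, price, default=1.0):
    """Return the damper opening from the first rule whose conditions match."""
    for temp_below, price_above, valve in RULES:
        if temp < temp_below and price > price_above:
            return valve
    return default  # assumed fallback: no rule applies, damper fully open

print(damper_opening(temp=5, price=0.25))   # 0.2
print(damper_opening(temp=15, price=0.10))  # 1.0, energy is cheap so no rule fires
```

Adding an exception then means adding a row rather than another nested branch, although the number of rows still grows with every new condition the expert wants to cover.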

Expert systems are like maps of the geographic terrain: a record of exploration drawn from previous expeditions. They hold a special place in the history of AI: in fact, they comprised most of its second wave. The term artificial intelligence was coined in 1956 at a conference of computer scientists at Dartmouth College. The first wave of AI used symbolic (human-readable) representations of problems, logic, and search to reason and make decisions. This approach is often called Symbolic AI.

The second wave of AI primarily comprised expert systems. For some time, the hope was that the expert system would serve as the entire intelligent system, even to the point of intelligence comparable to the human mind. Folks as famous as Marvin Minsky (1927-2016), regarded as the “Godfather of AI,” claimed this. From a research perspective, much of the exploration of what an expert system was and could be was considered complete. Even so, widespread disappointment in the capability of these systems was recorded.

A long AI “winter” descended amid disappointment that the expert system was not sufficient to replicate the intelligence of the human mind (the field of AI research calls this Artificial General Intelligence or AGI). Perception was missing, for example. Expert systems can’t perceive complex relationships and patterns in the world the way we see (identify objects based on patterns of shapes and colors), hear (detect and communicate based on patterns of sounds), and predict (forecast outcomes and probabilities based on correlations between variables). We also need a mechanism to handle the fuzzy exceptions that trip up expert systems. So, expert systems silently descended underground to be used in finance and engineering, where they shine at making high-value decisions to this day.

The current “summer” of AI swung the pendulum all the way in the opposite direction. The expert system was shunned; the baby was, in my opinion, thrown out with the bath water in favor of first perception, then learning algorithms that make sequential decisions.


We now have an opportunity to combine the best of expert systems with perception and learning agents in next-generation Autonomous AI!

Figure 1-14. Timeline of the history of AI

If you would like to learn more about how expert systems fit into the history of AI, I highly recommend Thinking Machines: The Quest for Artificial Intelligence, an accessible and relevant survey of the history of AI. If you would like to read details about a real expert system, I recommend this paper on DENDRAL, widely recognized as the first “real” expert system.

Fast forward to today and we find expert systems embedded in even the most advanced Autonomous AI. An expert once described to me Autonomous AI that was built to control self-driving cars. Deep in the logic that orchestrates learning AI to perceive and act while driving are expert rules that take over during safety-critical situations. The learning AI perceives and makes fuzzy, nuanced decisions, but the expert system components do what they are really good at, too: taking predictable action to keep the vehicle safe. This is exactly how we will use math, menus, and manuals when we design brains. We will assign decision-making technology to best execute each decision-making skill.
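The layering described above can be sketched as a learned component that proposes an action and an expert rule that overrides it in a safety-critical situation. Every name, threshold, and value here is hypothetical; the stand-in policy returns a fixed action where a real system would call a trained model.

```python
def learned_policy(observation):
    """Stand-in for a trained model; a real one would depend on the observation."""
    return {"throttle": 0.6, "brake": 0.0}

def safety_rules(observation, action):
    """Expert rule (hypothetical): brake hard if an obstacle is closer than 10 meters."""
    if observation["obstacle_distance_m"] < 10.0:
        return {"throttle": 0.0, "brake": 1.0}
    return action

def decide(observation):
    """The learned proposal passes through the expert rules before it is applied."""
    return safety_rules(observation, learned_policy(observation))

print(decide({"obstacle_distance_m": 50.0}))  # the learned action is used as proposed
print(decide({"obstacle_distance_m": 5.0}))   # the expert rule takes over
```

The rule is predictable by construction: whatever the learned component proposes, the response to a close obstacle is always the same.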

Now that I’ve discussed each method that machines use to make automated decisions, you can see that each method has strengths and weaknesses. In some situations, one method might be a clear and obvious choice for automated decisions. In other applications, another method might perform much better. Now, we can even consider mixing methods to achieve better results, the way that Model Predictive Control (MPC) does. It makes better control decisions by mixing math with menus in the form of a constrained optimization algorithm. But first, let’s take a look at the capabilities of Autonomous AI.
