10.1 Intuition vs. Cognition
Any visit to the business book section of a local bookstore will reveal plenty of titles that emphasize that managers need to trust their gut feeling and follow their instincts (e.g., Robinson 2006), indicating that their judgment should play an essential role in organizational decision-making processes. The last three decades of academic research have, however, also produced a counter-movement to this view, which mostly centers on the study of cognitive biases (Kahneman, Lovallo, and Sibony 2011). In this line of thought, human judgment is seen as inherently fallible. Intuition, as a decision-making system, has evolved in order to help us quickly make sense of the surrounding world, but its purpose is not necessarily to process all available data and carefully weigh alternatives. Managers can easily fall into the trap of trusting their initial feeling and thereby biasing decisions, rather than carefully deliberating and reviewing all available data and alternatives. A key for effective human judgment in forecasting is to be able to reflect upon initial impressions and to let further reasoning and information possibly overturn one’s initial gut feeling (Moritz, Siemsen, and Kremer 2014).
The presence of cognitive biases in time series forecasting is well documented. In particular, forecasters tend to be strongly influenced by recent data in a series, and they neglect to interpret this recent data in the context of the whole time series that generated it. This pattern is also called system neglect (Kremer, Moritz, and Siemsen 2011). It implies that forecasters tend to overreact to short-term shocks in the market and underreact to real and massive long-term shifts. Further, forecasters are easily misled by visualized data into “spotting” illusory trends. Simple random walks (such as stock market data) have a high likelihood of creating sequences of observations that seem to consistently increase or decrease over time by pure chance. This mirage of a trend is quickly seen as a real trend, even if the past series provides little indication that actual trends exist in the data. Using such illusory trends for prediction can be highly misleading. Finally, if actual trends do exist in the data, human decision makers tend to dampen these trends as they extrapolate into the future; that is, their longer range forecasts tend to exhibit a belief that these trends are temporary in nature and will naturally degrade over time (Lawrence and Makridakis 1989). Such behavior may be beneficial for long-term forecasts, where trends usually require dampening, but may decrease performance for shorter-term forecasts.
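To see how easily such illusory trends arise, consider the following minimal simulation sketch. The series length, the number of series, and the slope threshold used to declare a trend “visible” are illustrative assumptions, not taken from the studies cited above:

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the sketch is reproducible
n_series, n_periods = 1000, 24    # e.g., two years of monthly data

# Pure random walks: cumulative sums of i.i.d. standard-normal shocks (no real trend).
walks = rng.standard_normal((n_series, n_periods)).cumsum(axis=1)

# Fit a straight line to each series and count how many show a sizeable slope anyway.
t = np.arange(n_periods)
slopes = np.array([np.polyfit(t, w, 1)[0] for w in walks])
share_apparent_trends = np.mean(np.abs(slopes) > 0.1)   # 0.1 is an arbitrary cutoff
print(f"Share of trendless random walks with an apparent trend: {share_apparent_trends:.0%}")
```

Even though none of the simulated series contains a real trend, a large share of them exhibit a fitted slope that would look convincing when plotted.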
This discussion also enables us to point out another important judgment bias in the context of time series. Consider series 1 in Figure 5.2. As mentioned earlier, the best forecast for this series is a long-run average of observed demand. Thus, plotting forecasts for multiple future periods would result in a flat line—the forecast would remain the same from month to month. A comparison of the actual demand series with the series of forecasts made reveals an odd picture: the demand series shows considerable variation, whereas the series of forecasts is essentially a straight line. Human decision makers tend to perceive this as odd and thus introduce variation into their series of forecasts, so that the series of forecasts more closely resembles the series of actual demands (Harvey, Ewart, and West 1997). Such behavior is often called “demand chasing” and can be quite detrimental to forecasting performance.
Another important set of biases relates to how people deal with the uncertainty inherent in forecasts. One key finding in this context is overprecision—human forecasters tend to underestimate the uncertainty in their own forecasts (Mannes and Moore 2013). This bias likely stems from a tendency to ignore or discount extreme cases. The result is that prediction intervals based on human judgment tend to be too narrow—people are too confident in the precision of their predictions. While this bias has proven very persistent and difficult to remove, recent research in this area has provided some promising results by showing that overconfidence can be reduced if decision makers are forced to assign probabilities to extreme outcomes as well (Haran, Moore, and Morewedge 2010). A related bias is the so-called hindsight bias: decision makers tend to believe ex post that their forecasts were more accurate than they actually were (Biais and Weber 2009). This highlights the importance of constantly calculating, communicating, and learning from the accuracy of past judgmental forecasts.
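One straightforward way to detect overprecision is to track how often realized demand actually falls inside the prediction intervals that forecasters state. The sketch below assumes a set of judgmental 90 percent intervals and the corresponding realized demands; all numbers are made up for illustration:

```python
import numpy as np

def interval_coverage(lower, upper, actuals):
    """Fraction of realized demands that fall inside the stated prediction intervals."""
    lower, upper, actuals = map(np.asarray, (lower, upper, actuals))
    return np.mean((actuals >= lower) & (actuals <= upper))

# Hypothetical 90% intervals stated by a forecaster, plus what actually happened.
lower  = [ 80,  95, 100, 110,  90]
upper  = [120, 125, 140, 150, 130]
actual = [118, 130, 101, 160,  85]

coverage = interval_coverage(lower, upper, actual)
print(f"Nominal coverage: 90%, empirical coverage: {coverage:.0%}")
```

If the empirical coverage is well below the nominal level, the stated intervals are too narrow and the forecaster is overprecise.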
Thus, while cognitive biases relate more generally to organizational decision making, they are also very relevant in our context of demand forecasting. However, to see judgment as only biased is a limited perspective. Statistical algorithms do not know what they do not know, and human forecasters may have domain-specific knowledge that enables better forecasting performance than any algorithm can achieve. Recent research shows that, with properly designed decision-making processes, the forecasting performance of human judgment can be extraordinary (Tetlock and Gardner 2015).
10.2 Domain-Specific Knowledge
One important reason why human judgment is still prevalent in organizational forecasting processes is the role of domain-specific knowledge in forecasting (Lawrence, O’Connor, and Edmundson 2000). Human forecasters may have information about the market that is not (or only imperfectly) incorporated into their current forecasting models, or the information may be so difficult to quantify that it cannot be included in these models. Such information in turn enables them to create better forecasts than any statistical forecasting model could accomplish. From this perspective, the underlying forecasting models appear incomplete: key variables that influence demand are not included in the forecasting models used. In practice, forecasters will often note that their models do not adequately factor in promotions, which is why their judgment is necessary to adjust any statistical model. Yet, in the age of business analytics, such arguments seem increasingly outdated. Promotions are quantifiable in terms of discount, length, advertisement, and so forth. Good statistical models that incorporate such promotions into sales forecasts are now available (e.g., Cooper et al. 1999) and have been successfully applied in practice.
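As a simple illustration of how quantified promotion information can be folded into a statistical forecast, consider the following sketch of a regression on promotion attributes. The variable names, the data, and the use of an ordinary linear regression are illustrative assumptions; production promotion models such as the one in Cooper et al. (1999) are considerably richer:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical weekly sales history with quantified promotion attributes.
data = pd.DataFrame({
    "discount_pct": [0, 10, 0, 20, 0, 15, 0, 25],
    "promo_weeks":  [0,  1, 0,  2, 0,  1, 0,  2],
    "ad_spend":     [0,  5, 0, 12, 0,  8, 0, 15],
    "sales":        [100, 140, 105, 190, 98, 160, 102, 210],
})

X, y = data[["discount_pct", "promo_weeks", "ad_spend"]], data["sales"]
model = LinearRegression().fit(X, y)

# Forecast a future week with a planned 15% discount, one promo week, and ad spend of 10.
next_week = pd.DataFrame({"discount_pct": [15], "promo_weeks": [1], "ad_spend": [10]})
print(f"Promotion-aware forecast: {model.predict(next_week)[0]:.0f} units")
```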
Besides incompleteness, forecasters may have noncodifiable information, that is, a highly tacit and personal understanding of the market. Sales people may, for example, be able to subjectively assess and interpret the mood of their customers during their interactions and incorporate such information into their forecasts. They may also get an idea of the customers’ own estimate of how their business is developing, even if no formal forecast information is shared. The presence of such information hints at model incompleteness as well, yet, unlike promotions, some of this information may be difficult to quantify and include in any forecasting model.
Another argument in favor of human judgment in forecasting is that such judgment can be adept at identifying interactions among predictor variables (Seifert et al. 2015). An interaction effect means that the effect of one particular variable on demand depends on the presence (or absence) of another variable. While human judgment is quite good at discerning such interaction effects, identifying the right interactions can be a daunting task for any statistical model, due to the underlying dimensionality. The number of possible j-way interaction terms among k variables is given by the binomial coefficient $\binom{k}{j}$. For example, the number of possible two-way interactions among 10 variables is $\binom{10}{2} = 9 + 8 + \cdots + 1 = 45$, and the number of possible three-way interactions among 10 variables is $\binom{10}{3} = 120$. Including that many interaction terms in a regression equation makes the estimation and interpretation of any statistical model very challenging. Human judgment may be able to preselect meaningful interactions more easily.
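The combinatorics can be verified directly; the short sketch below simply counts the possible interaction terms for a given number of candidate variables:

```python
from math import comb

k = 10  # number of candidate predictor variables
print(f"Possible two-way interactions among {k} variables:   {comb(k, 2)}")   # 45
print(f"Possible three-way interactions among {k} variables: {comb(k, 3)}")   # 120
print(f"Possible three-way interactions among 20 variables:  {comb(20, 3)}")  # grows quickly
```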
The presence of domain-specific knowledge among forecasters is, overall, a valid argument for why human judgment should play a role in organizational forecasting processes. There are, however, other arguments that may explain the presence of human judgment in such processes but do not point to clear performance advantages. One such argument is the “black box” argument: statistical models are often difficult for human decision makers to understand; thus, they trust the method less and are more likely to discount it. In laboratory experiments, users of forecasting software were more likely to accept the forecast from the software if they could select the model to be used from a number of alternatives (Lawrence, Goodwin, and Fildes 2002), and they tend to be more likely to discount a forecast if its source is indicated as a statistical model rather than a human decision maker (Önkal et al. 2009). Quite interestingly, human decision makers tend to forgive other human experts who make forecast errors, but quickly lose their trust in an algorithm that makes prediction errors (Dietvorst, Simmons, and Massey 2015). Since the noise inherent in time series means that both humans and algorithms will eventually make prediction errors, algorithms will end up being trusted less than humans over time. Yet whether the user understands and trusts a statistical model does not, by itself, mean that the model yields bad forecasts! Since we care about the accuracy of our forecasts, not the popularity of our model, the “black box” argument, by itself, should not bias us against a statistical model. Another and potentially more damaging argument is the presence of political concerns around the forecast.
10.3 Political and Incentive Aspects
Dividing firms into functional silos is often a necessary aspect of organizational design to achieve focus in decision making. Such divisions usually go hand-in-hand with incentives—marketing and sales employees may, for example, be paid a bonus depending on the realized sales of a product, whereas operations employees may be provided with incentives based on costs. The precise key performance indicators used vary significantly from firm to firm. While such incentives may provide an impetus for action and effort within the corresponding functions, they necessarily create differing organizational objectives within the firm. Such differing objectives are particularly troublesome for cross-functional processes such as forecasting and sales and operations planning.
Since the forecast is a key input for many organizational decisions, decision makers try to influence the forecast to achieve their organizational objectives. Sales people could, for example, attempt to inflate the forecast in order to influence their manufacturing counterparts to provide more available inventory, or they may reduce the forecast in order to overachieve sales and get a big bonus; representatives from the finance department may alter the forecast so that it conforms more with how they want to represent the company and its prospects to investors; operations may again inflate the forecast to increase their safety margin (instead of cleanly identifying a safety stock). Relying on a statistical model will eliminate the ability to influence decision making through the forecast; as such, any organizational change in that direction will encounter resistance.
A recent article provides a fascinating description of seven different ways in which ineffective processes can lead forecasters to play games with their forecasts (Mello 2009). Enforcing occurs when forecasters try to maintain a higher forecast than they actually anticipate, with the purpose of reducing the discrepancy between forecasts and company financial goals. If senior management creates a climate in which goals simply have to be met, forecasters may acquiesce and adjust their forecasts accordingly to reduce any dissonance between their forecasts and these goals. Filtering occurs when forecasters lower their forecasts to reflect supply or capacity limitations. This often happens when the forecasts are driven by operations personnel who use the opportunity to mask their inability to meet the actual predicted demand. If forecasts are strongly influenced by sales personnel, hedging can occur, where forecasts overestimate demand in order to drive operations to make more products available. A similar strategy can occur if forecasts are influenced by downstream supply chain partners that anticipate a supply shortage and want to secure a larger proportion of the resulting allocation.1 In contrast, sandbagging involves lowering the sales forecast so that actual demand is likely to exceed it; this strategy becomes prevalent if forecasts and sales targets are not effectively differentiated within the organization. Second guessing occurs when powerful individuals in the forecasting process override the forecast with their own judgment. This is often a symptom of general mistrust in the forecast or the process that delivered it. Spinning occurs when lower-level employees or managers deliberately alter (usually increase) the forecast in order to influence the responses of higher-level managers. This is a result of higher-level management “killing the messenger”: if forecasters get criticized for delivering forecasts that are low, they will adjust their behavior and deliver “more pleasant” forecasts instead. Finally, withholding occurs when members of the organization fail to share critical information related to the forecast. This is often a deliberate ploy to create uncertainty about demand among other members of the organization.
In summary, there are good and bad reasons why human judgment is used in modern forecasting processes. The question of whether its presence improves forecasting is ultimately an empirical one. In practice, most companies will use a statistical forecast as a basis for their discussion, but often adjust this forecast based on the consensus of the people involved. Forecasting is ultimately a statement about reality, and thus the quality of a forecast can be easily judged ex post (see Chapter 11). One can therefore compare over time whether the adjustments made in this consensus process actually improved or degraded the accuracy of the initial statistical forecast, a comparison that has been called “Forecast Value Added” analysis (Gilliland 2013). In a study of over 60,000 forecasts across four different companies, such a comparison revealed that, on average, judgmental adjustments to the statistical forecast increased accuracy (Fildes et al. 2009). However, a more detailed look also revealed that smaller adjustments (which were more frequent) tended to decrease performance, whereas larger adjustments increased performance. One interpretation of this result is that larger adjustments were usually based on model incompleteness, that is, promotions and foreseeable events that the statistical model did not consider, whereas the smaller adjustments represent the remaining organizational quibbling, influencing behavior, and distrust of the forecasting software. One can thus conclude that a good forecasting process should only be influenced by human judgment in exceptional circumstances and with clear indication that the underlying model is incomplete; organizations are otherwise well advised to limit the influence of human judgment in the process.
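A Forecast Value Added analysis of this kind can be set up with very little code. The sketch below assumes a table of past statistical forecasts, the final (judgmentally adjusted) forecasts, and realized demand; the column names, the accuracy metric (MAPE), and the threshold separating “small” from “large” adjustments are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def mape(actual, forecast):
    """Mean absolute percentage error."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual))

# Hypothetical history: statistical forecast, judgmentally adjusted forecast, realized demand.
df = pd.DataFrame({
    "stat_fc":  [100, 120,  90, 150, 200,  80],
    "final_fc": [105, 118, 130, 155, 160,  82],
    "actual":   [110, 115, 125, 148, 170,  78],
})

# Forecast Value Added: error of the statistical forecast minus error of the adjusted forecast.
fva = mape(df["actual"], df["stat_fc"]) - mape(df["actual"], df["final_fc"])
print(f"Overall Forecast Value Added: {fva:+.1%}")

# Split by adjustment size (the 20-unit threshold is arbitrary).
df["adjustment"] = np.where((df["final_fc"] - df["stat_fc"]).abs() >= 20, "large", "small")
for size, grp in df.groupby("adjustment"):
    delta = mape(grp["actual"], grp["stat_fc"]) - mape(grp["actual"], grp["final_fc"])
    print(f"{size} adjustments: FVA = {delta:+.1%}")
```

A positive value means the judgmental adjustments added accuracy; splitting the result by adjustment size mirrors the more detailed analysis in Fildes et al. (2009).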
10.4 Correction and Combination
If judgmental forecasts are used in addition to statistical forecasts, two methods may help to improve the performance of these judgmental forecasts: combination and correction. Combination methods simply combine judgmental forecasts with statistical forecasts in a mechanical way, as described at the end of Chapter 8; even the simple average of two forecasts—whether judgmental or statistical—can outperform either one (Clemen 1989).
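A minimal sketch of such a mechanical combination, assuming one judgmental and one statistical forecast per period (the example numbers and the equal weights are illustrative assumptions):

```python
import numpy as np

def combine_forecasts(judgmental, statistical, weight=0.5):
    """Weighted combination of two forecasts; weight=0.5 gives the simple average."""
    judgmental = np.asarray(judgmental, float)
    statistical = np.asarray(statistical, float)
    return weight * judgmental + (1 - weight) * statistical

judgmental_fc = [120, 135, 150]   # e.g., the consensus from the planning meeting
statistical_fc = [100, 128, 160]  # e.g., an exponential smoothing forecast
print(combine_forecasts(judgmental_fc, statistical_fc))       # simple average
print(combine_forecasts(judgmental_fc, statistical_fc, 0.3))  # weight judgment less heavily
```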
Correction methods, on the other hand, attempt to de-bias a judgmental forecast before use. Theil’s correction method is one such attempt, and it follows a simple procedure. A forecaster starts by running a simple regression between his/her past forecasts and past demand in the following form:

$$\text{Demand}_t = a_0 + a_1 \times \text{Forecast}_t + \varepsilon_t \qquad (26)$$
Results from this regression equation are then used to de-bias all forecasts made after this estimation by calculating

$$\text{Corrected Forecast}_t = a_0 + a_1 \times \text{Forecast}_t \qquad (27)$$
where $a_0$ and $a_1$ in equation (27) are the estimated regression intercept and slope parameters from equation (26). There is some evidence that this method works well in de-biasing judgmental forecasts and leads to better performance of such forecasts (Goodwin 2000). A cautionary note, however, is to examine whether the sources of bias change over time. For example, the biases human forecasters experience when forecasting a time series for the first time may be very different from the biases they are subject to once they have more experience with the series; thus, early data may not be valid for the estimation of equation (26). Further, if forecasters know that their forecasts will be bias-corrected in this fashion, they may show a behavioral response to overcome and outsmart this correction mechanism.
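The two-step procedure in equations (26) and (27) amounts to an ordinary least-squares fit followed by a plug-in correction. The sketch below uses NumPy and made-up forecast and demand histories:

```python
import numpy as np

# Past judgmental forecasts and the demand that actually materialized.
past_forecasts = np.array([100, 120, 150, 130, 160, 140], dtype=float)
past_demand    = np.array([ 90, 105, 130, 118, 140, 125], dtype=float)

# Equation (26): regress past demand on past forecasts to estimate a0 and a1.
a1, a0 = np.polyfit(past_forecasts, past_demand, 1)  # polyfit returns (slope, intercept)

def theil_correct(new_forecast):
    """Equation (27): de-bias a new judgmental forecast with the estimated coefficients."""
    return a0 + a1 * new_forecast

print(f"a0 = {a0:.2f}, a1 = {a1:.2f}")
print(f"Corrected value for a judgmental forecast of 150: {theil_correct(150):.1f}")
```

In this made-up example the forecaster systematically overforecasts, so the correction pulls new forecasts downward.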
The essence of forecast combination has also been discussed in the so-called Wisdom of Crowds literature (Surowiecki 2004). The key observation in this line of research is more than 100 years old: Francis Galton, a British polymath and statistician, famously observed a weight-judging competition at a county fair, where fairgoers would estimate the weight of an ox. Individual estimates could be far off the actual weight, but the average of all estimates was spot on and even outperformed the estimates of cattle experts. In general, estimates provided by groups of individuals tended to be closer to the true value than estimates provided by individuals. An academic debate ensued over whether this phenomenon was due to group decision making, that is, groups being able to identify the more accurate opinions through discussion, or due to statistical aggregation, that is, group decisions representing a consensus that was far from the extreme opinions within the group, thus canceling out error. Decades of research established that the latter explanation is more likely to apply. In fact, group consensus processes to discuss forecasts can be highly dysfunctional because of underlying group pressure and other behavioral phenomena. Structured group processes that limit this dysfunctionality, such as the Delphi method and the nominal group technique, exist, but their benefits for forecasting compared to simple averaging are not clear. In fact, the simple average of opinions seems to work well (Larrick and Soll 2006). Further, the average of experts in a field is not necessarily better than the average of amateurs (Surowiecki 2004). In other words, decision makers in forecasting are often well advised to skip the process of group meetings to find a consensus; rather, they should prepare their forecasts independently. The final consensus can then be established by a simple or weighted average of these independent forecasts, effectively filtering out the random error inherent in human judgment. The benefit of team meetings in forecasting should therefore be seen more in stakeholder management and accountability than in actually improving the quality of the forecast.
This principle of aggregating independent opinions to create better forecasts is powerful, but also counterintuitive. We have an inherent belief that experts should be better judges and that reaching consensus in teams should lead to better decisions. The wisdom-of-crowds argument contradicts some of these beliefs, since it implies that seeking consensus in a group may not lead to better outcomes and that experts can be beaten by a group of amateurs. The latter effect has recently been demonstrated in the context of predictions in the intelligence community (Spiegel 2014). As part of the Good Judgment Project, random individuals from across the United States have been preparing probability judgments on global events. Their pooled predictions often beat the predictions of trained CIA analysts with access to confidential data. If amateurs can beat trained professionals in a context where the professionals clearly have domain knowledge, the power of the wisdom of crowds becomes quite apparent. The implication is that, for key forecasts in an organization, having multiple forecasts prepared in parallel (and independently) and then simply taking their average may be a simple yet effective way of increasing judgmental forecasting accuracy.
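A small simulation sketch makes the noise-canceling logic concrete. The true demand, the noise level, and the number of forecasters are purely illustrative, and the individual judgments are assumed to be independent and unbiased:

```python
import numpy as np

rng = np.random.default_rng(7)
true_demand = 500
n_trials, n_forecasters = 2000, 20

# Each forecaster's judgment = truth + independent noise.
judgments = true_demand + rng.normal(0, 50, size=(n_trials, n_forecasters))

individual_error = np.mean(np.abs(judgments[:, 0] - true_demand))
crowd_error = np.mean(np.abs(judgments.mean(axis=1) - true_demand))
print(f"Typical individual absolute error:   {individual_error:.1f}")
print(f"Absolute error of the crowd average: {crowd_error:.1f}")
```

Under these assumptions the error of the average shrinks roughly with the square root of the number of forecasters; correlated or systematically biased judgments would reduce this benefit.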
10.5 Key Takeaways
• Human judgment can improve forecasts, especially if humans possess information that is hard to consider within a statistical forecasting method.
• Cognitive biases imply that human intervention will often make forecasts worse. That a statistical method is hard to understand does not mean that a human forecaster will be able to improve the forecast.
• Incentive structures may reward people for making forecasts worse. People will try to influence the forecast, since they have an interest in influencing the decision that is based on the forecast.
• Measure whether and when human judgment actually improves forecasts. It may make sense to restrict judgmental adjustments to only those contexts where concrete information of model incompleteness is present (e.g., the forecast does not factor in promotions, etc.).
• When relying on human judgment in forecasting, get independent judgments from multiple forecasters and then average these opinions.
______________
1Note that such order gaming behavior can be avoided if the supplying company uses a uniform instead of a proportional allocation rule (Cachon and Lariviere 1999). With uniform allocation, limited supply is divided equally among all customers; if a customer is allocated more supply than he/she ordered, the excess units are allocated equally among the remaining customers.