Important concept and some new information that challenged me. Went on a bit, but mandatory reading if you manage in an industry that routinely applies judgement.

Wherever there is judgment, there is noise, and more of it than you think.

Reducing noise can be just as effective at improving decision quality as reducing bias: they affect the math in the same way, and noise is often a larger contributor and easier to address.

bias and noise are interchangeable in the error equation, and the decrease in overall error will be the same, regardless of which of the two is reduced. […] In terms of overall error, noise and bias are independent: the benefit of reducing noise is the same, regardless of the amount of bias.
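
To make the error equation concrete, here is a minimal Python sketch. It assumes the usual form of the equation, mean squared error = bias² + noise², where bias is the average error and noise is the standard deviation of the errors. The five error values and the function names are made up for illustration; they are not from the book.

```python
import statistics


def decompose_error(errors):
    """Return (mse, bias, noise) for a list of judgment errors.

    bias is the mean error; noise is the population standard deviation
    of the errors around that mean.
    """
    mse = statistics.fmean([e * e for e in errors])
    bias = statistics.fmean(errors)
    noise = statistics.pstdev(errors)
    return mse, bias, noise


def overall_error(bias, noise):
    """Error equation: MSE = bias^2 + noise^2."""
    return bias ** 2 + noise ** 2


# Hypothetical errors (judgment minus true value) from five judges.
errors = [4.0, 6.0, 1.0, 9.0, 5.0]
mse, bias, noise = decompose_error(errors)

# Remove the same amount of squared bias or squared noise:
# the overall error drops by the same amount either way.
cut = 4.0
less_bias = overall_error((bias ** 2 - cut) ** 0.5, noise)
less_noise = overall_error(bias, (noise ** 2 - cut) ** 0.5)

if __name__ == "__main__":
    print(mse, overall_error(bias, noise), less_bias, less_noise)
```

Note that the interchangeability holds for the squared terms: cutting noise² by some amount helps exactly as much as cutting bias² by that amount, whatever the other term happens to be.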

Computers are better than humans

Variation between individuals is usually the dominating noise factor:

In every case we reviewed in which the share of occasion noise in total system noise could be measured, occasion noise was a smaller contributor than were differences among individuals

The noise introduced by this human variation is often so high that reducing it, even by using a random but consistent model, can lead to better outcomes(!)

Decades later, a review of fifty years of research concluded that models of judges consistently outperformed the judges they modeled.

Their striking finding was that any linear model, when applied consistently to all cases, was likely to outdo human judges in predicting an outcome from the same information. In one of the three samples, 77% of the ten thousand randomly weighted linear models did better than the human experts. In the other two samples, 100% of the random models outperformed the humans. Or, to put it bluntly, it proved almost impossible in that study to generate a simple model that did worse than the experts did.
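
To get a feel for why consistency alone can beat a knowledgeable but noisy judge, here is a toy simulation in Python. It is a sketch under invented assumptions, not a reproduction of the study: the cue weights, the noise levels, and the choice to draw random weights that are all positive (so every cue points in the right direction) are mine, and the data is synthetic.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def dot(weights, cues):
    """Weighted sum of cues: a linear model's prediction for one case."""
    return sum(w * c for w, c in zip(weights, cues))


def simulate(n_cases=2000, n_random_models=200, seed=0):
    rng = random.Random(seed)
    true_w = [0.5, 0.4, 0.3, 0.2]   # assumed "real" cue weights
    judge_w = [0.4, 0.5, 0.2, 0.3]  # assumed judge policy: roughly right
    cases = [[rng.gauss(0, 1) for _ in true_w] for _ in range(n_cases)]

    # Outcome = signal from the cues plus irreducible uncertainty.
    outcome = [dot(true_w, c) + rng.gauss(0, 0.5) for c in cases]
    # The judge applies a sensible policy, but inconsistently (noise).
    judge = [dot(judge_w, c) + rng.gauss(0, 1.0) for c in cases]
    # The "model of the judge" is the same policy with the noise removed.
    model_of_judge = [dot(judge_w, c) for c in cases]

    r_judge = pearson(judge, outcome)
    r_model = pearson(model_of_judge, outcome)

    # Random positive weights, applied identically to every case.
    wins = 0
    for _ in range(n_random_models):
        w = [rng.random() for _ in true_w]
        preds = [dot(w, c) for c in cases]
        if pearson(preds, outcome) > r_judge:
            wins += 1
    return r_judge, r_model, wins / n_random_models


if __name__ == "__main__":
    r_judge, r_model, share = simulate()
    print(f"judge r={r_judge:.2f}, model of judge r={r_model:.2f}, "
          f"random models beating judge: {share:.0%}")
```

In this setup the judge's weights are close to the true ones, so the judge loses to the models only because of inconsistency. How large the gap is depends entirely on the noise level I picked.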

That said, computerized models don’t outperform calibrated humans by that much.

Models are consistently better than people, but not much better. There is essentially no evidence of situations in which people do very poorly and models do very well with the same information.

Why do we keep preferring human judgement, though? The two reasons that stood out to me were:

We expect machines to be perfect. If this expectation is violated, we discard them. Because of this intuitive expectation, however, people are likely to distrust algorithms and keep using their judgment, even when this choice produces demonstrably inferior results. This attitude is deeply rooted and unlikely to change until near-perfect predictive accuracy can be achieved.

And also the intuition that models can’t capture nuance … but that doesn’t actually matter:

Why do complex rules of prediction harm accuracy, despite the strong feeling we have that they draw on valid insights? For one thing, many of the complex rules that people invent are not likely to be generally true. But there is another problem: even when the complex rules are valid in principle, they inevitably apply under conditions that are rarely observed.

The authors conclude that rather than try to convince everyone to use models more (a losing game), we should instead do what we can to improve human judgement:

This observation has an important implication for the improvement of judgment. Despite all the evidence in favor of mechanical and algorithmic prediction methods, and despite the rational calculus that clearly shows the value of incremental improvements in predictive accuracy, many decision makers will reject decision-making approaches that deprive them of the ability to exercise their intuition. As long as algorithms are not nearly perfect—and, in many domains, objective ignorance dictates that they will never be—human judgment will not be replaced. That is why it must be improved.

Performance Assessment

An aside in the book, but a challenging concept for me, was that IQ (general mental ability, or GMA) is probably something to test for more explicitly. I’ve never used tests as part of interviewing, and they have a bad reputation in my industry, but:

The conclusion is clear. GMA contributes significantly to the quality of performance in occupations that require judgment, even within a pool of high-ability individuals. The notion that there is a threshold beyond which GMA ceases to make a difference is not supported by the evidence.

That said, the best predictor still appears to be work samples. I’m curious about the unique predictive performance of each, i.e., with a good work sample test, does a GMA test provide any further signal?

Research has shown that work sample tests are among the best predictors of on-the-job performance.

Also found more supporting evidence that performance reviews are a waste of time.

Performance reviews continue to be one of the most dreaded rituals of organizations, hated almost as much by those who have to perform them as by those who receive them. One study found that a staggering 90% of managers, employees, and HR heads believe that their performance management processes fail to deliver the results they expected.

For example, relative ratings might make sense when, regardless of people’s absolute performance, only a fixed percentage of them can be promoted—think of colonels being evaluated for promotion to general. But forcing a relative ranking on what purports to measure an absolute level of performance, as many companies do, is illogical.

Mediating Assessment Protocol

I appreciated the protocol provided at the end of the book for applying decision hygiene to “one-off” decisions, like deciding whether to acquire a company (this list is a quote).

  1. At the beginning of the process, structure the decision into mediating assessments. (For recurring judgments, this is done only once.)
  2. Ensure that whenever possible, mediating assessments use an outside view. (For recurring judgments: use relative judgments, with a case scale if possible.)
  3. In the analytical phase, keep the assessments as independent of one another as possible.
  4. In the decision meeting, review each assessment separately.
  5. On each assessment, ensure that participants make their judgments individually; then use the estimate-talk-estimate method.
  6. To make the final decision, delay intuition, but don’t ban it.