Monday, July 16, 2012

How complex systems fail

Many thanks to Tim O'Reilly for sending me this thoughtful piece by Richard I. Cook, MD, from the Cognitive Technologies Laboratory at the University of Chicago.  It is a pithy article with a long title:  "How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)."

Some excerpts follow.  If you are not careful, you can quickly fall into a terribly pessimistic view of the potential to improve quality and safety in clinical settings.  But if you think about it a bit more, you can see that having a "learning organization," one that constantly strives to be very good at getting better, is the best answer we have.

Complex systems contain changing mixtures of failures latent within them.
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.

Complex systems run in degraded mode.
A corollary to the preceding point is that complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws. After accident reviews nearly always note that the system has a history of prior ‘proto-accidents’ that nearly generated catastrophe. Arguments that these degraded conditions should have been recognized before the overt accident are usually predicated on naïve notions of system performance. System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
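
The "individually insufficient, jointly sufficient" idea can be made concrete with a toy model. This is only an illustrative sketch and not something from Cook's paper: the fault names and the rule that an accident needs every fault to be present at once are hypothetical assumptions chosen to keep the example small.

```python
from itertools import combinations

# Hypothetical latent faults, invented purely for illustration.
LATENT_FAULTS = frozenset({"alarm_muted", "staff_shortage", "look_alike_labels"})


def accident_occurs(active_faults):
    """Toy rule (an assumption): overt failure needs every latent fault at once."""
    return LATENT_FAULTS <= set(active_faults)


def sufficient_subsets(faults):
    """List every subset of faults that, on its own, produces an accident."""
    ordered = sorted(faults)
    return [
        set(combo)
        for size in range(len(ordered) + 1)
        for combo in combinations(ordered, size)
        if accident_occurs(combo)
    ]


if __name__ == "__main__":
    # No single fault is sufficient by itself.
    for fault in sorted(LATENT_FAULTS):
        print(fault, "alone causes accident:", accident_occurs({fault}))
    # Only the full combination is sufficient.
    print("sufficient combinations:", sufficient_subsets(LATENT_FAULTS))
```

In this sketch, removing any one fault prevents the accident, so each fault could equally be labeled the "root cause" after the fact. That is the sense in which singling one out says more about the need to assign blame than about how the failure arose.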

Hindsight biases post-accident assessments of human performance.
Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. This means that ex post facto accident analysis of human performance is inaccurate. The outcome knowledge poisons the ability of after-accident observers to recreate the view of practitioners before the accident of those same factors. It seems that practitioners “should have known” that the factors would “inevitably” lead to an accident. Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

Failure free operations require experience with failure.
Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”. This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be readily recovered. In intrinsically hazardous systems, operators are expected to encounter and appreciate hazards in ways that lead to overall performance that is desirable. Improved safety depends on providing operators with calibrated views of the hazards. It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope.


Unknown said...

Thanks for the great and insightful post.

Anonymous said...

This is similar to Leveson's work with STAMP and CAST. (See

Ian E. Gorman said...

Nevil Shute, in his autobiography "Slide Rule", gives an interesting history of the development of the rigid airship R-100 from 1926 to 1930 and the subsequent crash on its first flight (from England to India) in 1931. This history, along with his history of the development of R-101 in the same period, clearly illustrates the points of this article.

e-Patient Dave said...

I read this early today on my phone and only now have a chance to comment.

First, as you probably know, I fully agree about complexity, so the fundamental point of this post is solid, IMO.

But discouragement? HELL no; complexity is why it's usually erroneous to blame an *individual*, and complexity is why it's vital for everyone to share in the awareness that something could go wrong at any time, so "Let's all work together to be vigilant."

And that in turn is why it's a problem for anyone (especially a surgeon) to get hostile about improvements or mistakes that are pointed out.

Re: "Post-accident attribution accident to a ‘root cause’ is fundamentally wrong" - this point seems off base. As I understand it, the purpose of root cause analysis is to understand what happened; but do people who do it think there IS just one root cause? What do rigorous trainers say about that?

All in all, this doesn't make me pessimistic because it doesn't mean there's no point in trying, and certainly doesn't mean there's no point in working together to do everything we can for safety and quality.

Anonymous said...

There is a thread that runs through these observations that might be considered a failure itself: the failure to see systems ecologically, and to analyze information in a more comprehensive, strategic and predictive manner.

Our analogies continue to be of systems as machines, to be broken apart into components, and examined so. We assume independence of failures, except for very localized patterns in regard to place, provider, equipment, scheduling. 'Mixtures of failures', as if they are not related in predictable ways to each other. Different department, different problem. Which some are. Our analyses will not be adequate until our map matches the living organisms in play.

I suggest an analogy that is closer to the evidence. Medical systems are ecologies of relationships between humans, tools, pathogens, and competing goals. To 'run in degraded mode' suggests conflicting optimizations given realized tradeoffs. These optimizations and tradeoffs, rather than just worn out caregivers and widgets, should also be the subject of measure and analysis. There is no ideal state – just as there is no ideal human body, or ideal physician, or ideal pharmaceutical. It is not a normative state that variation should be measured against, but an optimal state given conditions. This does not mean that harm should happen: it should not. But it will. And it will especially if we do not understand intersecting conditions in more sophisticated ways. It is a Buckminster Fuller argument: we are not matter, we are energy.

For example, 'root causes' are inadequate not only for the reasons described, but also because they are never used to generate larger analytical recognition of patterns to clarify the relationships that led to their attention in the first place. And 'hindsight' and 'failure free' challenges for patient safety fall squarely in the 'do you know the animal you work with?' category. Humans never evolved to be perfect performers, even if medical school cultivated many to think it is possible. Meanwhile, our brains are well equipped to justify the difference, and move on.

Improved 'Complex Systems Analyses' will require us to know more about ourselves, and variations in our interactions with others, tools, and environments, than simply deviation from idealized functions and parts. But it would allow us to look for opportunities to maximize cooperation in service of better work and safety environments, and to reorient interactions that corrode improvement.

It also requires us to step up our intellectual and analytical game, as the true failure may be that we are not asking all the right questions yet.