Excellence in problem management: in search of the lost root cause

The traditional search for a single root cause doesn’t work with incidents affecting complex IT systems: A critical look at Root Cause Analysis (RCA)

The Outage

On October 4, 2021, Facebook experienced a worldwide disruption of its services. The event attracted a lot of attention because of its massive implications: the outage took down most of Facebook’s services for almost 6 hours ¹. In a short article posted that same day, Santosh Janardhan, Facebook’s VP of infrastructure, explained the background and reasons for the outage, saying that “its root cause was a faulty configuration change on our end”.

If you are familiar with incident investigation, you will know the term “root cause” as the underlying cause of an outage. If you’re not, you may wonder what “causes” have to do with “roots”. The idea is that, just as in medicine, problems reveal themselves as symptoms fuelled by underlying or original causes. The very term “root cause” brings to mind a kind of carrot (the root) from which spring branches and leaves (the symptoms). In this image, one original cause produces the host of branches and leaves that characterise the incident.

In IT, the concept of root cause has deep roots itself. Both ITIL V3 and the BABOK V3 employ it widely. For example, the BABOK V3 ² defines root cause analysis as “a systematic examination of a problem or situation that focuses on the problem’s origin as the proper point of correction rather than dealing only with its effects.” In ITIL 2011, Problem Management is said to include ³: “the activities required to diagnose the root cause of Incidents and to determine the resolution to those problems.” The resolution is defined as ⁴: “Action taken to repair the Root Cause of an Incident or Problem, or to implement a Workaround.”

Woods and his colleagues call this way of thinking ⁵ “the sequence of event model”, where “events preceding the accident happen linearly, in a fixed order, and the accident itself is the last event in the sequence”. They further link it explicitly to the concept of root cause ⁶: “Consistent with the idea of a linear chain of events is the notion of a root cause – a trigger at the beginning of the chain that sets everything in motion (the first domino that falls and then, one by one, the rest).”

In a second article that he published the day after the event, Janardhan gives more information about the Facebook outage:

“This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.”

Although this adds detail, it follows the same thought process of one event leading to another, as explained by Woods and his colleagues ⁷: “The sequence-of-events idea is pervasive, even if multiple parallel or converging sequences are sometimes depicted to try to capture some of the greater complexity of the precursors to an accident.”

So, although Janardhan’s second explanation is more detailed, it is still in line with the sequence of event model: it identifies a “root cause” or a “source”, i.e., the first event in the chain which, if prevented from happening, would prevent the whole sequence.

Woods and his colleagues note that searching for a single root cause or source is counter-productive when studying incidents in “high consequence, complex systems.” Instead of this simple model of one event leading to another in a linear sequence, they describe a model where incidents are the result of a conjunction of events that undermine the defences of the system sufficiently to produce an unwelcome outcome. Each event occurring in isolation from others is insufficient to weaken the system. Only the conjunction of several events will produce the incident. They note that in these kinds of systems ⁸: “there is no single cause for a failure but a dynamic interplay of multiple contributors. The search for a single or root cause retards our ability to understand the interplay of multiple contributors.”
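The difference between the two models can be made concrete with a small toy sketch. It is only an illustration under stated assumptions: the defence names (`operator_review`, `audit_tool`) and the event labels are hypothetical, loosely inspired by Janardhan's description, and do not represent Facebook's actual systems. In this sketch the incident occurs only when every defence has been defeated, so no single event is sufficient on its own.

```python
from itertools import combinations

# Hypothetical toy model: each defence is defeated by one contributing event.
# The names are illustrative assumptions, not taken from any real system.
DEFENCES = {
    "operator_review": "faulty_command_issued",
    "audit_tool": "audit_tool_bug",
}


def incident_occurs(events: set[str]) -> bool:
    """The incident happens only when every defence has been defeated."""
    return all(defeating_event in events for defeating_event in DEFENCES.values())


def sufficient_event_sets(all_events: list[str]) -> list[set[str]]:
    """Return the smallest combinations of events that produce the incident."""
    for size in range(1, len(all_events) + 1):
        hits = [
            set(combo)
            for combo in combinations(all_events, size)
            if incident_occurs(set(combo))
        ]
        if hits:
            return hits
    return []


events = ["faulty_command_issued", "audit_tool_bug"]

# Each event in isolation leaves one defence standing.
print(incident_occurs({"faulty_command_issued"}))
print(incident_occurs({"audit_tool_bug"}))
# Only the conjunction of both events produces the incident.
print(incident_occurs(set(events)))
print(sufficient_event_sets(events))
```

In this toy model, asking "which event is the root cause?" has no good answer: removing either event prevents the outage, and neither one explains it alone.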

Very often, Woods and his colleagues say, this search for a single or root cause stops when a person or group of people can be identified as having acted in a way that led to the adverse outcome. Therefore, the search for a root cause is often closely associated with human error. This is despite the observation that, most of the time, it is human action that prevents the adverse outcomes that are latent in these complex systems ⁹.

Where have all the root causes gone?

Despite its prevalence in IT, there is some movement away from the root cause way of thinking. Recall that it took centre stage in ITIL 2011 Problem Management. More recently, problem management has been redefined in ITIL 4 as ¹⁰: “The practice of reducing the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors.”

ITIL 4 further distances itself from the idea of a single root cause by stating that ¹¹: “It is important to remember that the concept of a single root cause has very limited applicability in complex evolving environments. Quite often, incidents are caused by improbable combinations of improbable factors. Consequentially, the investigation of problems (especially in reactive problem management) should not be limited to the identification of the first possible cause of incidents.” This echoes the small warning in the BABOK about the limitations of root cause analysis, noting that it ¹²: “May be difficult with complex problems; the potential exists to lead to a false trail and/or dead-end conclusion.” But neither ITIL 4 nor the BABOK V3 gives detailed advice on how to proceed with problem management in complex systems. Both still rely on techniques that subscribe to the sequence of event model, e.g., Five Whys, Fishbone diagram, fault tree analysis.

Woods and his colleagues ¹³ propose a rich set of techniques that could be useful for problem managers. These include treating the identification of causes as the beginning of the investigation, not its conclusion. They differentiate between first and second stories. With first stories, problem managers are focused on root causes and human error. Second stories, however, recognise that adverse outcomes occur when people attempt to do their job as well as possible while maintaining safety in situations with multiple disparate objectives. The situation is often not of their choosing. It has been designed by other people who are usually not present when a failure occurs. Therefore, Woods and his colleagues propose separating what they call the sharp end, where people operate the system, from the blunt end, where the system is designed and where rules and regulations are defined.

In the case of Facebook, is the source of the outage to be found in the issuing of the wrong command, or is it the bug in the auditing system? Or maybe the source is what led to both of these weaknesses? The design or documentation of the command that made it possible to get it wrong? The design or usability of the maintenance procedure? The training of the operators? The quality assurance of the audit system? If it was known that these kinds of maintenance procedures are hazardous, was there an intervention team ready to step in if anything adverse happened so as to contain the failure and reduce the outage time?
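One way a problem manager might capture these questions is to record a list of contributors instead of a single "root cause" field. The following is a minimal sketch, not an established tool or an ITIL artefact: the class names (`End`, `Contributor`, `IncidentReview`) are hypothetical, and the assignment of each contributor to the sharp or blunt end is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class End(Enum):
    SHARP = "sharp"  # where people operate the system
    BLUNT = "blunt"  # where the system, rules and procedures are designed


@dataclass
class Contributor:
    description: str
    end: End


@dataclass
class IncidentReview:
    """Hypothetical review record holding many contributors, not one root cause."""

    title: str
    contributors: list[Contributor] = field(default_factory=list)

    def by_end(self, end: End) -> list[str]:
        """List the descriptions of contributors located at the given end."""
        return [c.description for c in self.contributors if c.end is end]


# Illustrative entries only; the sharp/blunt classification is an assumption.
review = IncidentReview("Backbone disconnection (illustrative)")
review.contributors += [
    Contributor("Command issued during routine maintenance", End.SHARP),
    Contributor("Bug in the audit tool", End.BLUNT),
    Contributor("Design and documentation of the command", End.BLUNT),
]

print(review.by_end(End.SHARP))
print(review.by_end(End.BLUNT))
```

Keeping the contributors in a list makes it harder for a review to stop at the first plausible cause, and grouping them by end draws attention to design decisions made far from the moment of failure.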

Replacing the simple notion of the root cause or root cause analysis with more modern and more sophisticated incident analysis methods may help IT departments to improve their resilience, as Janardhan promises at the end of his second article.

References:

¹ https://en.wikipedia.org/wiki/2021_Facebook_outage
² IIBA, A Guide to the Business Analysis Body of Knowledge, v3, 2015, p. 335
³ Cabinet Office, ITIL® Service Operation, 2011 Edition, TSO, p. 97
⁴ Cabinet Office, ITIL® Service Operation, 2011 Edition, TSO, p. 338
⁵ Woods, David D.; Sidney Dekker; Richard Cook; Leila Johannesen. Behind Human Error (p. 41). Ashgate Publishing Ltd. Kindle Edition. 2010
⁶ Ibid
⁷ Ibid
⁸ Woods, David D.; Sidney Dekker; Richard Cook; Leila Johannesen. Behind Human Error (p. 8). Ashgate Publishing Ltd. Kindle Edition. 2010
⁹ Woods, David D.; Sidney Dekker; Richard Cook; Leila Johannesen. Behind Human Error (p. 272). Ashgate Publishing Ltd. Kindle Edition. 2010
¹⁰ AXELOS, ITIL® Foundation: ITIL 4 Edition, 2019, p. 192
¹¹ AXELOS, Problem Management ITIL®4 Practice Guide, p. 20
¹² IIBA, A Guide to the Business Analysis Body of Knowledge, v3, 2015, p. 337
¹³ Woods, David D.; Sidney Dekker; Richard Cook; Leila Johannesen. Behind Human Error. Ashgate Publishing Ltd. Kindle Edition. 2010

