Thoughts on the Recent FAA Software Outage

20 Jan 2023

Can you imagine being that engineer and, in future interviews, being asked “What was the biggest mistake you made in your career?” and saying “I moved the wrong file, caused the first FAA ground stop since 2001, and cost airlines millions”?

Quick Background

The system that went down provides “Notices to Air Missions” (NOTAMs). These are simple text notices about conditions at airports, such as construction, closed runways, landing obstructions, and more. NOTAMs are used during the route planning of a flight. Once a flight is airborne, the NOTAM system is not needed to complete the flight.

Swiss Cheese

There is a big difference between “who caused it?” and “who’s at fault?”. The individual’s actions caused the issue, but they are not necessarily at fault, because there is normally a chain of events that leads to an issue like this, just as with aviation disasters. There might be environmental factors, cultural factors, improper procedures, wrong assumptions, or external factors at fault that led to the incident occurring. Dr. Reason’s Swiss Cheese Model illustrates this concept.

Swiss cheese model by James Reason published in 2000. Source: https://openi.nlm.nih.gov/detailedresult.php?img=PMC1298298_1472-6963-5-71-1&req=4, open-access, CC Attribution 2.0 Generic

Here are the questions I have to better understand and identify where the holes in the Swiss cheese lined up for this incident.

NOTAM’s Approximate Uptime

The NOTAM system was first introduced in 1993, and I can find only one instance of a system-wide outage: the most recent one. After reading a number of articles, I found many quotes like the one below indicating that NOTAM outages are rare.

“I don’t ever remember the NOTAM system going down like this. I’ve been flying 53 years,” said John Cox, a former airline pilot and now an aviation-safety consultant. – www.cbsnews.com - Irina Ivanova

While the software outage started at 8:30 PM ET, the FAA’s backup phone system continued to allow departures until 7:30 AM ET the next day. At that point the phone system was overwhelmed by the volume, and the FAA ordered a ground stop that lasted until the software system was restored at 9:30 AM ET.

There are 262,968 hours in 30 years, and if we count the backup process as part of the overall NOTAM system, there was about 2.5 hours of downtime.

uptime = (262,968 - 2.5) / 262,968 * 100 = 99.999049314%

Look at that: five nines.
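For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. The exact start and end dates are my assumption (the window is simply taken as January 1, 1993 to January 1, 2023), and the 2.5 hours is the downtime estimate used above, not an official figure.

```python
from datetime import datetime
import math

# Assumed 30-year window: only the years (1993 and 2023) come from the
# discussion above; the specific calendar dates are illustrative.
start = datetime(1993, 1, 1)
end = datetime(2023, 1, 1)
total_hours = (end - start).total_seconds() / 3600

# Estimated downtime in hours, counting the backup phone process as uptime.
downtime_hours = 2.5

unavailability = downtime_hours / total_hours
uptime_percent = (1 - unavailability) * 100

# Count the "nines": 99.9% is three nines, 99.999% is five nines, and so on.
nines = math.floor(-math.log10(unavailability))

print(f"{total_hours:,.0f} hours, {uptime_percent:.9f}% uptime, {nines} nines")
```

Using real dates accounts for the seven leap days in that window, which is where 262,968 hours (10,957 days) comes from rather than a flat 30 * 365 * 24.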

That is not to say we should brush off the outage. What happened was unacceptable, and improvements need to be made to help prevent something like this from happening again. However, the reality is that it will happen again. It is like plane accidents: we can investigate, find the root cause, and put fixes and mitigations in place, but another accident will happen. That is the reality of human systems; they are flawed.

A Humbling Reminder of Centralized Systems

This event is a humbling reminder of how dependent on software today’s modern age is. A single person’s actions, heck, a single keystroke, can affect millions. It reminds me that I am in a position where I could affect clients using the software systems I develop and support.
