Dan Bricklin's Web Site: www.bricklin.com
Learning From Accidents and a Terrorist Attack
Principles helpful for software development may be gleaned from Normal Accident Theory and the 9/11 Commission Report.
Introduction
In my essay "Software That Lasts 200 Years" I list some needs for Societal Infrastructure Software. I point out generally that we can learn from other areas of engineering. I want to be more explicit and to list some principles to follow and examples from which to learn. To that end, I have been looking at fields other than software development to find material that may give us some guidance.
Part of my research has taken me to the study of major accidents and catastrophic situations involving societal infrastructure. I think these areas are fertile for learning about dealing with foreseen and unforeseen situations that stress a system. We can see what helps and what doesn't. In particular, I want to address the type of situations covered in Charles Perrow's "Normal Accidents" (such as at the Three Mile Island nuclear power plant as well as airline safety and nuclear defense) and "The 9/11 Commission Report" (with regard to activities during the hijackings and rescue efforts).
Normal Accident Theory
Charles Perrow's book "Normal Accidents" was originally published in 1984, with an afterword added in 1999. It grew out of an examination of reports about accidents at nuclear power plants, initially driven by the famous major one that occurred on March 28, 1979, at the Three Mile Island nuclear plant in Pennsylvania. Perrow describes many different systems and accidents, some in great detail, including petrochemical plants, aircraft and air traffic control, marine transportation, dams, mines, spacecraft, weapons systems, and DNA research.
Perrow starts the book with a detailed description of the Three Mile Island accident, taking up 15 pages to cover it step by step. You see the reality of component failure, systems that interact in unexpected ways, and the confusion of the operators.
To help you get the flavor of what goes on during the accidents he covers, here is a summary of the Three Mile Island accident as I understand it:
Apparently a common failure of a seal let moisture into the instrument air system, changing the pressure in that system. The pressure change triggered an automatic safety system on some valves to incorrectly conclude that some pumps should be shut down, stopping the flow of water to a steam generator. That caused some turbines to shut down automatically. With the turbines stopped, an emergency pump had to come on, but it pumped water into pipes whose valves had accidentally been left closed during maintenance. Those valves had two indicators; one was obscured by a repair tag hanging on the switch above it, and the operators didn't check the other, assuming all was well. When things started acting strangely several minutes later, they did check, but by then the steam generator had boiled dry, so heat was no longer being removed from the reactor core. That caused the control rods to drop in, stopping the reactor, but the reactor still generated enough heat to keep raising the pressure in the vessel. The automatic safety device there, a relief valve, failed to reseat itself after relieving the pressure, letting core coolant be pushed out into a drain tank. The indicator for the relief valve also failed, showing the valve closed when it was not, so the draining continued for a long time without the operators knowing it was happening. Turning on some other pumps to fix a drop in pressure seemed to work for only a while, so they turned on another emergency source of water for the core, but only for a short time, to avoid complications it could cause if overused. Because they did not know about the continued draining, the drop in reactor pressure didn't seem to match the rise on another gauge, and they had to choose which one was giving a correct indication of what was going on. They were trying to figure out the level of coolant in the reactor (too little would lead to a meltdown), but there was no direct measure of coolant level in this type of reactor, only indirect ones, and the indicators that could indirectly help them figure out what was going on weren't behaving as they had been trained to expect. Some pumps started thumping and shaking and were shut down. The computer printing out status messages fell far behind before they found out about the unseated valve. The alarms were ringing, but the noise couldn't be shut off without also shutting down other indicators.
The story goes on and on -- this is just the beginning of that accident.
In addition to describing the many accidents and near accidents where good luck (or lack of bad luck) kept things safe, he also tries to figure out what makes some systems less prone to major accidents than others. It is, he believes, in the overall design.
Here are some quotes (emphasis added):
The main point of the book is to see...human constructions as systems, not as collections of individuals or representatives of ideologies. ...[T]he theme has been that it is the way the parts fit together, interact, that is important. The dangerous accidents lie in the system, not in the components. [Page 351]
...[Here is t]he major thesis of this book: systems that transform potentially explosive or toxic raw materials or that exist in hostile environments appear to require designs that entail a great many interactions which are not visible and [not] in expected production sequence. Since nothing is perfect -- neither designs, equipment, operating procedures, operators, materials, and supplies, nor the environment -- there will be failures. If the complex interactions defeat designed-in safety devices or go around them, there will be failures that are unexpected and incomprehensible. If the system is also tightly coupled, leaving little time for recovery from failure, little slack in resources or fortuitous safety devices, then the failure cannot be limited to parts or units, but will bring down subsystems or systems. These accidents then are caused initially by component failures, but become accidents rather than incidents because of the nature of the system itself; they are system accidents, and are inevitable, or "normal" for these systems. [Page 330]
This theory of two or more failures coming together in unexpected ways and defeating safety devices, cascading through the coupling of sub-systems into a system failure, is "Normal Accident Theory" [pages 356-357]. The role of the design of a system comes up over and over again. The more tightly sub-systems are coupled, the more accident-prone they will be. The most problematic couplings are those that are not obvious to the original designers, such as physical proximity that couples sub-systems: during a failure of one system (for example, a leak), a different system (the one it drips onto) is affected, leading to an accident. (In computer systems this is very common, such as memory overruns in one area causing errors elsewhere.)
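To make the software analogy concrete, here is a minimal, purely illustrative C sketch (my own, not from Perrow or the original essay): two "subsystems" that have nothing to do with each other logically happen to sit next to each other in memory, and a missing bounds check in one silently changes the state of the other. All of the names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Two logically unrelated "subsystems" share one region of memory,
   each owning an adjacent slice -- a software analogue of the
   physical-proximity coupling Perrow describes. */
static unsigned char plant_state[32];

#define LOG_BUFFER      (plant_state + 0)   /* bytes 0-7: logging subsystem   */
#define LOG_BUFFER_SIZE 8
#define VALVE_FLAG      (plant_state + 8)   /* byte 8: valve controller state */

int main(void) {
    *VALVE_FLAG = 0;  /* relief valve reported closed */

    /* A failure in the logging subsystem: the message is longer than the
       8-byte slice it owns and nothing checks the length, so the extra
       bytes land in the valve controller's state. */
    const char *msg = "pressure nominal";            /* 17 bytes with NUL */
    memcpy(LOG_BUFFER, msg, strlen(msg) + 1);        /* overruns the slice */

    /* The valve controller never ran, yet its view of the world changed. */
    printf("relief valve flag is now %d (expected 0)\n", *VALVE_FLAG);
    return 0;
}
```

The point is not the specific bug but the hidden coupling: nothing in the valve controller's code hints that the logger can affect it, just as nothing in a steam generator's design hints that a repair tag can hide an indicator.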
Another key point I found in the book is that, to keep failures from growing into accidents, the more an operator knows about what is happening in the system, the better. Another is that independent redundancy can be very helpful. However, to come back to coupling, redundancy and components that are interconnected in unexpected ways can lead to mysterious behavior, or to behavior that is incorrectly perceived as correct.
More examples from "Normal Accidents"
One example he gives of independent redundant systems providing operators with much information is the early warning system for incoming missiles in North America at the time (the early 1980s). He describes the false alarms, several every day, most of which are dismissed quickly. When an alarm comes in to a command center, a telephone conference is started with duty officers at other command centers. If it looks serious (as it does every few days), higher-level officials are added to the conference call. If it still looks real, a third-level conference is started, including the President of the U.S. (which had not yet happened at the time). The false alarms usually come from weather or birds that look to satellites or other sensors like a missile launch. By checking with other sensors that use independent technology or inputs, such as radar, the operators can see the lack of confirmation. They also look to intelligence about what the Soviets are doing (though the Soviets may be reacting to similar false alarms themselves, or to their surveillance of the U.S.).
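As a rough illustration of why independent inputs help (this is my own simplification, not the actual procedure Perrow describes), here is a small C sketch in which an alarm is escalated only when sources built on independent technology agree, so that a single failure mode such as weather, birds, or a bad tape is unlikely to fool them all at once. The structure and the two-source threshold are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical readings from warning sources built on independent
   technology, so one failure mode is unlikely to affect all of them. */
struct warning_inputs {
    bool satellite_launch_detected;
    bool radar_track_confirmed;
    bool intelligence_shows_preparation;
};

/* Escalate only when at least two independent sources agree. */
static bool escalate_alert(const struct warning_inputs *in) {
    int confirmations = (in->satellite_launch_detected ? 1 : 0)
                      + (in->radar_track_confirmed ? 1 : 0)
                      + (in->intelligence_shows_preparation ? 1 : 0);
    return confirmations >= 2;
}

int main(void) {
    /* A flock of birds fools the satellites but not the radar. */
    struct warning_inputs false_alarm = { true, false, false };
    printf("escalate: %s\n", escalate_alert(&false_alarm) ? "yes" : "no");
    return 0;
}
```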
In one false alarm, in November of 1979, many of the monitors reported what looked exactly like a massive Soviet attack. While they were checking it out, ten tactical fighters were sent aloft and U.S. missiles were put on low-level alert. It turned out that a training tape on an auxiliary system had found its way into the real system. The alarm was suspected of being false within two minutes, but was certified false only after six (preparing a counterstrike takes ten minutes in the case of a submarine-launched attack). In another false alarm, test messages had a bit stuck in the 1 position due to a hardware failure, indicating two missiles instead of zero. There was no loopback to help detect the error.
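The stuck-bit story suggests a simple technique worth sketching: a loopback check, in which the sender reads back what the link actually carried and compares it with what was intended before anyone acts on the message. The following C sketch is my own hedged illustration; the fault model (bit 1 stuck high) and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical transmit path with a hardware fault: bit 1 is stuck
   high, so a test message meaning "zero missiles" arrives as "two". */
static uint8_t transmit(uint8_t value) {
    return value | 0x02;                 /* the stuck bit */
}

/* Loopback check: read back what the link carried and compare it with
   what we meant to send.  Returns 0 on success, -1 on a detected fault. */
static int send_with_loopback(uint8_t intended, uint8_t *echoed) {
    *echoed = transmit(intended);
    return (*echoed == intended) ? 0 : -1;
}

int main(void) {
    uint8_t echoed;
    if (send_with_loopback(0, &echoed) != 0)
        printf("link fault detected: sent 0, read back %u\n", (unsigned)echoed);
    return 0;
}
```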
The examples relating to marine accidents are some of the most surprising and instructive. It seems that ships inexplicably turn suddenly and crash into each other far more often than you would think. He sees this as relating to an organizational problem, along with the tendency of people to settle on a model of what is going on and then interpret later information in that light. On ships at the time, the captain had absolute authority and the rest of the crew usually just followed orders. (This is different from airplanes, where the co-pilot is free to question the pilot and air traffic control is watching and in two-way radio contact.)
In one case Perrow relates, a ship captain saw a different number of lights on another ship than the first mate did. They didn't compare notes about the number of lights, especially after the captain indicated he had seen a ship. The captain thought the other ship was traveling in the same direction as they were (two lights), while the first mate correctly thought it was coming at them (three). Misinterpreting what he was seeing, the captain thought it was getting closer because it was a slow, small fishing vessel, not because it was big and heading toward him. Since passing is routine, the other ship never contacted them. When he got close, he steered as if he were passing, and turned into the path of the oncoming vessel, killing eleven on his own boat. In another case, apparent radar equipment errors led a ship's crew to think an oncoming ship was to its left when it was really to its right. Fog came in, and mid-course maneuvers by the two ships fed back on each other until they collided.
Another book about failures
To get a feeling for how common and varied failures are, and to see how some people have attempted to classify them for learning, there are books such as Trevor Kletz's "What Went Wrong? Case Histories of Process Plant Disasters" which chronicles hundreds of them. Examining many failures and classifying them for learning is very important. You don't want to just prevent an exact duplicate of a failure in the future, but rather the entire class it represents. Failures of parts and procedures are the common, normal situation. Everything working as planned is not. Safety systems are no panacea. "Better" training or people, while often helpful, won't stop it all.
Like Perrow, Kletz believes that design is critical for preventing (or minimizing the effect of) accidents. Here are some guidelines he discusses that relate to process plants:
It is crucial that reports of encountered problems be made available to others for learning, especially those problems that result in accidents. There are many reasons for this. Kletz lists a few (on page 396), including our moral responsibility to prevent accidents if we can, and the fact that accident reports seem to have a greater impact on those reading them than just reciting principles.
The 9/11 Commission Report -- a story of reaction to a forced change in a system
After reading "Normal Accidents", and with its lessons in mind, I read the sections of "The 9/11 Commission Report" that relate to the events during the hijacking of the planes and at the World Trade Center up until the second tower collapsed. I was looking to learn how the "system" responded once a failure (the hijackings and the buildings being struck) started. Because of my interest in those areas, I looked especially for descriptions of communications, decision making, and the role of the general populace. I looked for uses of redundancy, communications, and real-time information by the "operators" (those closest to what was happening). I looked for unplanned activities. I mainly dealt with the descriptions of what happened, not with the recommendations in the report.
Why look at terrorism? It is different from normal failures.
Terrorism is an extreme form of "failure" and accident. The perpetrators and planners look for weak components in a system and try to cause unexpected failures with maximum destruction and impact. Many traditional books on engineering failure, such as Perrow's "Normal Accidents", explicitly do not tackle it.
I see terrorism (for our purposes here) as a form of change to a working system, often with purposeful, forced tight coupling, that has bad effects. We can learn from it about dealing with changes to a system that must be handled and that were not foreseen by the original designers. It is like a change in environment or a change in the system configuration.
The entire report is available online for free and in printed form for a nominal fee. I have put together excerpts that I found in the Commission Report that I think are instructive. They are on a separate page, with anchors on each excerpt so that they can be referred to. The page is "Some Excerpts From the 9/11 Commission Report".
I think it is worth reading the actual, complete chapters, but in lieu of that, anybody interested in communications or dealing with disastrous situations like this should at least read the excerpts. I found it fascinating, horrifying, sad, and very real. As an engineer, I saw information from which to learn and then build systems that will better serve the needs of society. Such systems would be helpful in many trying situations, natural and man-made, foreseen and unforeseen, and could save lives and suffering. Let us learn from a bad situation to help society in the future. As Kletz points out, as engineers and designers, it is our duty.
Some key quotes from the 9/11 Commission Report:
...[T]he passengers and flight crew [of Flight 93] began a series of calls from GTE airphones and cellular phones. These calls between family, friends, and colleagues took place until the end of the flight and provided those on the ground with firsthand accounts. They enabled the passengers to gain critical information, including the news that two aircraft had slammed into the World Trade Center. [page 12]
The defense of U.S. airspace on 9/11 was not conducted in accord with preexisting training and protocols. It was improvised by civilians who had never handled a hijacked aircraft that attempted to disappear, and by a military unprepared for the transformation of commercial aircraft into weapons of mass destruction. [page 31]
General David Wherley -- the commander of the 113th Wing [of the District of Columbia Air National Guard at Andrews Air Force Base in Maryland] -- reached out to the Secret Service after hearing secondhand reports that it wanted fighters airborne. A Secret Service agent had a phone in each ear, one connected to Wherley and the other to a fellow agent at the White House, relaying instructions that the White House agent said he was getting from the Vice President. [page 44]
We are sure that the nation owes a debt to the passengers of United 93. Their actions saved the lives of countless others, and may have saved either the Capitol or the White House from destruction. [page 45]
According to another chief present, "People watching on TV certainly had more knowledge of what was happening a hundred floors above us than we did in the lobby.... [W]ithout critical information coming in . . . it's very difficult to make informed, critical decisions[.]" [page 298]
[Quoting a report about the Pentagon disaster:] "Almost all aspects of communications continue to be problematic, from initial notification to tactical operations. Cellular telephones were of little value.... Radio channels were initially oversaturated.... Pagers seemed to be the most reliable means of notification when available and used, but most firefighters are not issued pagers." [page 315]
The "first" first responders on 9/11, as in most catastrophes, were private-sector civilians. Because 85 percent of our nation's critical infrastructure is controlled not by government but by the private sector, private-sector civilians are likely to be the first responders in any future catastrophes. [page 317]
The NYPD's 911 operators and FDNY dispatch were not adequately integrated into the emergency response... In planning for future disasters, it is important to integrate those taking 911 calls into the emergency response team and to involve them in providing up-to-date information and assistance to the public. [page 318]
The Report strongly suggests that the billions of dollars spent on military infrastructure failed to stop any of the hijacked planes from hitting their targets. It was civilians, using everyday airphones and the unreliable cellular system, together with our civilian news gathering and disseminating system, and intuition and improvisation, that probably stopped one. Courage and bravery were shown by all, civilians and official personnel.
I thought the Report, in its analysis, paid too little attention to the important role of civilians and of professionals acting outside their prepared roles. There is too little attention to societal communications, including TV, radio, the Internet, and cellular (voice, GPS, cell cameras, etc.), and too much attention to the channels specific to officials. TV news was a crucial source for everyone, including the highest levels of government. While the phone network bogged down, it did provide crucial help, and civilian non-PSTN systems, such as Nextel Direct Connect, the Internet, and message-based systems, worked well. Even the President suffered from a version of what we all experience when traveling with wireless: "[H]e was frustrated with the poor communications that morning. He could not reach key officials, including Secretary Rumsfeld, for a period of time. The line to the White House shelter conference room -- and the Vice President -- kept cutting off." [page 40] The Vice President learned of the first crash from an assistant who told him to turn on his television, on which he then saw the second crash. [page 35]
There are other examples of the general populace being an important component of what is usually thought of as the province of "law enforcement". The AMBER Alert system is apparently working, as is the "America's Most Wanted" TV show; both use the general populace as a means of information gathering in response to detailed descriptions and requests. In Israel, the general populace has been instrumental in detecting suspicious behavior and even in taking action to thwart or minimize terrorist attacks. According to the 9/11 Commission Report, a civilian passenger with years of experience in Israel apparently tried unsuccessfully to stop the hijackers on AA Flight 11. The fourth hijacked plane, UA 93, was stopped by civilians. An almost-catastrophe on an airplane was thwarted by flight attendants and passengers on AA Flight 63 when they restrained "shoe bomber" Richard Reid, by then knowing that suicide terrorism on airplanes was a possibility.
What do we learn here with respect to reaction to disasters?
The Secret Service, an organization whose mission involves working against the unexpected, shows up in places you wouldn't expect, such as air defense and even providing information at the World Trade Center. This shouldn't be surprising. In addition to planning and post-event analysis, they specialize in improvisation, and are trained to be "...prepared to respond to any eventuality".
Examination of the 9/11 Commission Report yields some of the same lessons as Perrow, namely the need for the people nearest to events to have access to detailed, real-time information about what is happening in many parts of the system, including parts whose relevance may not have been foreseen.
Here are some additional things that we learn about cases like this:
What we can do
Here are some of my thoughts about reacting to catastrophes:
Summary and Next Steps
This essay covers a wide range of topics. It introduces "Normal Accident Theory", looks at some of the aspects of a major terrorist attack, and proposes some areas for design that are suggested by the results of that attack. The original goal, though, was to come up with some principles that could be applied to making software that fits with the long-term needs of society. Here are some of those principles:
The next step will be to put these together with other principles gleaned from other areas.
-Dan Bricklin, 7 September 2004
© Copyright 1999-2018 by Daniel Bricklin
All Rights Reserved.