How would you like to compete in a bowling match where a curtain quickly drops down just in front of the foul line after you deliver your ball down the alley? You never know whether you scored a strike with the first ball, or whether you missed the pins completely. If you don’t find out how you score, what’s the sense of starting the ball rolling? That’s not the kind of game most of us would like to play. We much prefer a game in which we can follow our scores with each roll.
Compelling evidence suggests that, in too many instances, system safety — as currently practiced — has one attribute comparable to bowling behind such a curtain. System safety analyses are prepared, quantified and analyzed with increasing frequency. But, how often is a long-term feedback loop established to convey the “system safety score” achieved by the analyst?
THE OPEN LOOP FLAW
The notion of what constitutes a system safety “closed loop” illuminates this problem area. In system theory, a feedback loop is conceived as the information flow from the system output back to the system. This “feedback” is used to change the operation of the system, to correct undesired output.
A completed feedback loop is often termed a “closed loop” by system safety specialists. Two widely held views of what constitutes a “closed loop” can be observed in the system safety field. One is the notion that when a safety recommendation to control a hazard is reported to have been implemented, the recommendation file is closed and the safety loop is closed. A second view is that if the design performs as expected during prototype, preproduction or component tests, this information confirms the analysis and closes the loop for the safety analyst.
Both views reflect inadequate feedback loops.
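To make the distinction concrete, the minimal sketch below (in Python, with entirely hypothetical field names and closure rules, not drawn from any standard or from the authors' own tracking practice) shows one way a hazard record could be tracked so that neither "recommendation implemented" nor "passed preproduction tests" by itself closes the loop; feedback from operational experience and accident investigations is also required.

```python
# Illustrative sketch only -- field names and the closure rule are hypothetical.
from dataclasses import dataclass, field

@dataclass
class HazardRecord:
    hazard: str
    recommendation_implemented: bool = False   # "view one": the file is closed
    passed_preproduction_tests: bool = False   # "view two": the design tested as expected
    operational_feedback: list = field(default_factory=list)  # mishap and field reports
    prediction_validated: bool = False         # analysis compared with real-world performance

    def loop_closed(self) -> bool:
        # Neither view alone closes the loop; the prediction must also be checked
        # against information generated after the system is operational.
        return (self.recommendation_implemented
                and self.passed_preproduction_tests
                and bool(self.operational_feedback)
                and self.prediction_validated)

record = HazardRecord("compressor disc fatigue")
record.recommendation_implemented = True
record.passed_preproduction_tests = True
print(record.loop_closed())  # False -- no operational feedback has been folded back in
```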
If we consider the bowling pins as hazards, how many hazard pins did our system safety analysis ball really knock off our workplace alley? What contribution, in the real world, did our system safety analysis efforts make to our team’s score and season record at the alley? Evidence suggests that many system safety practitioners, after providing their safety analyses and opinions, don’t know how the game or the season ended. Furthermore, they may not be looking to find out. Unfortunately, few managers seem to be asking them to do so.
What is the evidence supporting these conclusions? Among other things, the evidence includes:
- the common system safety notion of what constitutes a “closed loop,”
- the absence of required links between analysis and accidents in documents like MIL-STD-882A [1],
- publications in the system safety field,
- emphasis on predictive (usually fault tree) analyses, and
- personal observations over the years by the authors in accident investigations.
Among recent system safety textbooks, Brown’s 1976 book [2] is silent on the matter. Hammer’s widely used 1980 book [3] uses the term “closing the loop” in the sense of ensuring that corrective action is taken after hazard analyses find hazards that are not properly controlled. He devotes three lines in a 320-page textbook to using accident investigations to update hazard analyses. The 1981 NUREG-0492 Fault Tree Handbook [4] does not address this issue. Malasky’s 1982 book [5] does not mention the hazard analysis review function in connection with accident investigations. Clemens’ 1982 Hazard Prevention article on system safety methods [6] mentions no accident investigation methods or linkages for closing the loop. These popular publications are representative of the system safety literature. Of the system safety publications we have surveyed, only Roland and Moriarty’s 1983 system safety engineering book [7] speaks directly to this issue.
From the accident investigation perspective, a review of 39 accident investigation manuals disclosed that NONE of the manuals required or even suggested that predictive system safety analyses be compared with what occurred. No system safety plan we found required that analyses be tested during accident investigations, or updated by new information from an accident. When this has been done during investigations, however, the results have been significant, as will be discussed shortly.
Department of Defense MIL-STD-882A clearly illustrates this missing linkage. When one compares 882A and accident investigation manuals used in the Department, one finds that no common model of the accident phenomenon is used. Further, the definitions in the investigation manuals are used in a different context from those in 882A. The bottom line is that accident data and 882A outputs are in different terms, so the two systems do not routinely help each other.
The system safety loop must not be considered closed until the predicted safety risks and safety control system performance levels, as determined by system safety analyses, have been validated or upgraded with ongoing
system performance information. Did these predictive analyses accurately predict operating disruptions, mishap scenarios and losses resulting from the hazards identified by accident investigators? Was the assessment of safety risk presented to appropriate decision makers during early safety reviews accurate, and is the system now performing as it was originally advertised? Although the vast majority of the information used to validate system safety will be generated after a system is operational, few such direct comparisons are reported in the literature.
Prototype and preproduction tests do expose their fair share of dings, bangs and glitches, but what happens after that phase? By “closing” the loop too soon in a system’s life cycle, system safety practitioners cannot determine if their analytical “models” successfully predicted real world safety and accident performance, or how their models might be upgraded promptly and efficiently to do so.
For each mishap after a system is operational, investigators and system safety analysts should raise questions such as:
- Was the system subjected to a system safety analysis?
- If so, did investigators use the analyses?
- If so, were the hazards which precipitated this mishap identified by the final system safety analysis?
- If so, were the hardware, people, procedures and management interactions functioning now as described by the analysts then?
- If so, did the analyses adequately deal with the consequences of the mishap?
If you answer these questions “no” in an investigation, your system safety program is probably afflicted with the “open loop flaw.” One way such a post-mishap review might be recorded is sketched below.
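The sketch below is hypothetical: the question keys and the simple "any answer is no" rule are assumptions made for illustration, not a prescribed investigation procedure.

```python
# Illustrative sketch only -- question keys and decision rule are assumptions.
REVIEW_QUESTIONS = [
    "system_subjected_to_system_safety_analysis",
    "investigators_used_the_analyses",
    "precipitating_hazards_identified_by_final_analysis",
    "system_functioning_as_described_by_analysts",
    "analyses_addressed_mishap_consequences",
]

def open_loop_flaw_indicated(answers: dict) -> bool:
    """Return True if any review question was answered 'no' (False) or left unanswered."""
    return any(not answers.get(q, False) for q in REVIEW_QUESTIONS)

answers = {
    "system_subjected_to_system_safety_analysis": True,
    "investigators_used_the_analyses": False,  # analyses existed but were never consulted
}
print(open_loop_flaw_indicated(answers))  # True -- the loop was never closed for this mishap
```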
In the accident at hand, proper methods often disclose potential SYSTEM flaws. But did the investigation discover the SYSTEM SAFETY program’s “open loop” flaw? Has the risk inherent in this “open loop” flaw remained overlooked, unevaluated and uncorrected, and if so, how often have accident problems been repeated because the system safety program flaw persists? This open loop has cascading effects on a system’s performance as risks continue to be overlooked, escape evaluation and produce serious consequences.
HOW CAN YOU TELL IF YOU HAVE A PROBLEM?
The following set of questions can help safety managers and their organization’s senior executives determine if their safety programs contain the "open loop flaw."
- Are documented system safety analyses used routinely in mishap or accident investigations (AI)? At what stage of the AI?
- Who (name) compares system safety analysis documents with accident scenarios?
- When was the last time
A. a system safety analysis was upgraded as a result of:
- a report of significant observations;
- a job safety analysis;
- a failure report;
- a maintenance report;
- quality control reports;
- a change analysis; or
- an energy exchange trace.
B. an analyst and investigator talked about changes in a system safety analysis document?
- What analytical methods do you use in both system safety analyses and accident investigations?
IS THIS FLAW WORTH FIXING?
The answer is a resounding YES! Closing the loop has been very worthwhile when it has been done. Several examples illustrate the benefits.
Several years ago, the United States Air Force noted that a new aircraft was crashing at a rate which exceeded that predicted in system safety analyses. A special independent review team consisting of both Air Force (operational, design/acquisition and logistic) members and airframe and engine contractor members was formed. Its purpose was to ensure that all reasonable corrective actions were being accomplished. One of the areas examined by the review team was the airframe/engine system safety program. The team considered it both active and aggressive. It had all the basic elements necessary for effective identification, evaluation and elimination or control of hazards. But, the team’s report also noted the system safety analyses had not identified the causes for four engine-related crashes. Although the four accidents were admittedly difficult to predict, comparing the original system safety analyses with the real world accidents showed that the system safety program needed a mid-course correction. The first hazard analysis ball had not knocked down all the hazard pins and another ball was necessary.
One of the engine failures involved a small flaw in an engine compressor disc which grew through fatigue until the disc failed catastrophically. Although the fault tree had identified this failure mode, the predicted failure rate was so low that, if it were valid, the failure was not expected. Another engine quit when a feedback/control cable subsystem experienced a fatigue failure induced by faults during assembly/maintenance. Without the information provided by the cable, the compressor variable vanes were improperly positioned and the engine stagnated. The fault tree model did not address the assembly/maintenance fault modes which led to the failure. In addition, although similar engines in other types of aircraft had experienced this failure, responsible management organizations were not alerted.
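As a purely illustrative check (the numbers below are invented, not taken from the Air Force review), a predicted failure rate can be confronted with fleet experience. Assuming a constant rate $\lambda$ and accumulated exposure $T$,

\[
\Pr(\text{at least one failure in exposure } T) \;=\; 1 - e^{-\lambda T} \;\approx\; \lambda T \quad (\lambda T \ll 1).
\]

With an assumed predicted rate of $\lambda = 10^{-8}$ failures per flight hour and an assumed fleet exposure of $T = 2 \times 10^{5}$ flight hours, $\lambda T = 2 \times 10^{-3}$; observing even one such catastrophic failure therefore suggests that the predicted rate, or the model behind it, needs revision.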
The third loss occurred after a cylindrical pin was left out during the manufacture of an engine component which was unique to the new aircraft. With the pin missing, a valve component was free to rotate and cause the engine to stagnate. The fault tree had not been carried down to a level which could have detected this fault. It had been assumed (erroneously) that normal and back-up systems were totally redundant below a certain level. In addition, the team noted that Failure Mode and Effects Analyses did not address missing parts.
The fourth loss involved an internal engine bolt backing out of its hole and starting a reaction which resulted in catastrophic engine failure. When the team reviewed the fault tree analysis, it discovered the analysis did not go to the piece part level for components such as bolts.
Note that this was considered a good safety program, well funded, staffed and directed. But, like good bowlers, it sometimes failed to knock down all the pins with the first ball. Armed with the information from the real world, however, the program was able to zero in on the pins still standing. In all, six new system safety program initiatives were implemented.
- Information flow between the program office and field operations of similar type engines was streamlined.
- Unique parts with little operational experience to validate analyses and tests received a special review for criticality and control. Thirty-six new safety critical areas were identified and controls implemented as a result of this experience.
- Assembly and quality control procedures during manufacture, overhaul and repair of components with single point loss of thrust potential underwent a special review for people-induced faults.
- A Maintenance Hazard Analysis reviewed critical tech order procedures, identifying potential in-the-field problems.
- The problems of interfacing the fault trees of the engine and airframe contractors, two independent companies, were examined closely. Problems were discovered not only in document integration, but also in the understanding of airframe failure effects on engine operation and vice versa. More consistency between airframe and engine fault tree logic provided increased visibility to the faults which could have serious implications across the interface.
We wish we could say these actions stopped accidents for that aircraft. We can’t. Additional losses have occurred. But with each loss, there has been feedback to the system safety analyses and the percentage of unforecasted losses is decreasing.
The logistic managers of another older aircraft also took advantage of real world performance feedback to improve their system safety program. The aircraft involved was designed in the late ’50s and produced in the ’60s. The original development program did not include a system safety program. Although the aircraft’s safety record was relatively good, program managers felt that many of the causes of accidents which continued to haunt the program should be the subject of system safety analyses. Fault trees were developed for the main problem area, the flight control system, and for a secondary problem area, the landing gear. As a result, maintenance, repair and overhaul procedures and critical parts redesign changes were implemented. These fault trees have also proven useful as an accident investigation tool.
In a different field, a new design of certain types of railroad cars was subjected to special safety analyses and safety tests during design and prototype development. It passed all tests. After the equipment was put into service, indications of new types of problems arose in accidents. The earlier tests, however, had satisfied the designers. The new information from accident investigations about the adverse safety performance of the cars did not produce a reconsideration of the original safety analyses and tests. Continuing accidents and outside pressures on the industry eventually forced implementation of a safety retrofitting program at a cost estimated to be over $200,000,000.
A related issue concerned emergency response actions in the accidents involving these cars. The guidelines for these actions, developed by thoughtful individuals with extensive experience and implemented through the resultant training, were followed faithfully by response officials. The results were disastrous. Numerous responders were fatally or seriously injured. As the toll mounted, the guidelines were reviewed and, with the benefit of re-analysis, were substantially improved. The result: no known fatalities or disabling injuries since the old guidelines were changed to incorporate the feedback from good accident investigations, and the improved guidelines were implemented.
All the above cases have one common factor: the system safety plan lacked routine updates of the original safety analyses based on accident investigations. In some cases, lives and resources were unnecessarily lost in subsequent accidents.
One of the problems that surfaces when this is attempted is the incompatibility between predictive safety analysis methods and the methods needed to investigate and analyze accidents. Fault tree methods dominate the former activity, while events-sequencing methods dominate the latter. Fault tree analysis has been found to have narrow applications in accident investigations, and advanced investigative analysis methods have not been widely tried for predictive analyses. Compatible methods will be needed, in the interests of efficiency if for no other reason.
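As one illustrative direction only (not a method the authors prescribe), a shared event vocabulary can make the two activities easier to cross-reference. The Python sketch below assumes hypothetical record structures, not drawn from MIL-STD-882A, fault tree practice, or any investigation manual, in which the same event description can appear both as a fault tree basic event and as a link in an investigated accident sequence.

```python
# Illustrative sketch only -- record structures and event identifiers are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    actor: str      # who or what acted (person, component, condition)
    action: str     # what happened
    event_id: str   # shared key usable by both analysts and investigators

# Predictive side: a fault tree basic event referencing the shared description.
fault_tree_basic_events = {
    "E-101": Event("control cable", "fails in fatigue", "E-101"),
}

# Investigation side: the sequence of events reconstructed after a mishap.
accident_sequence = [
    Event("mechanic", "misroutes control cable at assembly", "E-200"),
    Event("control cable", "fails in fatigue", "E-101"),
    Event("compressor vanes", "move to improper position", "E-300"),
]

# Closing the loop: which investigated events were never in the predictive model?
predicted_ids = set(fault_tree_basic_events)
unforeseen = [e for e in accident_sequence if e.event_id not in predicted_ids]
print([e.action for e in unforeseen])  # events the original analysis never modeled
```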
WHAT SHOULD BE DONE?
The examples suggest several specific actions by the system safety community. These actions involve changes to safety program requirements and to system safety analysis and accident investigation practices.
The first step that must be taken is to acknowledge that the “open loop flaw” exists and merits action.
Other steps that should then be taken include:
- amendment of system safety program plans to require a linking of predictive analyses and accident investigations;
- development of compatible technical methods for both predictive safety analyses and accident investigation and analyses; and
- exchanges of findings where this has been done, to arrive at the best ways to do it (using the System Safety Society’s HAZARD PREVENTION journal, for example).
Other changes will undoubtedly be suggested as this “open loop flaw” is faced squarely by the dedicated persons presently managing or staffing system safety functions in a variety of industries and functional areas. The cumulative benefits appear to be enormous.
ABOUT THE AUTHORS:
Ludwig Benner Jr. is an adjunct faculty member and Field Instructor for the University of Southern California’s Institute of Safety and Systems Management. He was with the National Transportation Safety Board, as Chief of its Hazardous Materials Division. He directed numerous accident investigations, studies, evaluations of safeguards and procedures in hazardous materials transportation safety. He received his Chemical Engineering degree from Carnegie Institute of Technology. He is a registered Professional Safety Engineer. He has testified on safety matters before the U.S. Congress, and served on two Virginia Legislative Study Commissions, National Academy of Sciences committees and panels, and on several Federal agency safety projects and advisory groups. He is a Fellow of the System Safety Society.
Robert W. Sweginnis is a Field Instructor in Aviation Safety for the Institute of Safety and Systems Management, Extension and In-Service Programs Office, University of Southern California. His previous experience includes Technical Assistance Coordinator, Directorate of Aerospace Safety, Air Force Inspection and Safety Center; Chief, Developing Systems Branch, Systems Safety Division, Air Force Inspection and Safety Center; and Senior System Safety Engineer in the A-10 program; plus 8 years as a full-time USAF safety officer, including service on 10 mishap investigation boards. He received his B.S. in Aeronautical Engineering at New York University and was awarded an M.S. in Engineering Administration from Southern Methodist University. He has engaged in additional studies in System Safety Engineering and Aircraft Crash Survival. He is chapter secretary for the American Society of Safety Engineers, and a member of the System Safety Society and the International Society of Air Safety Investigators. Mr. Sweginnis is a registered Professional Safety Engineer in the State of California, and a Certified Safety Professional (Engineering).
[1] U.S. Department of Defense, “Military Standard System Safety Program Requirements,” MIL-STD-882A, June 28, 1977.