[SystemSafety] "Reliability" culture versus "safety" culture

Peter Bernard Ladkin ladkin at rvs.uni-bielefeld.de
Mon Jul 29 14:37:32 CEST 2013


As a few of you know, I have recently been involved in what appears to be a technical-culture clash
between "reliability" and "safety" engineers, which has led, and continues to lead, to organisational
problems, for example over the scope of technical standards. Some suspect that such a culture clash is
fairly entrenched. I would like to figure out as many specific technical differences as I can. It is
moderately important to me that the expression of such differences attains universal assent (that is,
from both cultures as well as any others....)

Here are some I know about already.

1. Root Cause Analysis. Reliability people set store by methods such as Five Whys and Fishbone
Diagrams, which people analysing accidents or serious incidents consider hopelessly inadequate (in
Nancy's word, "silly").

2. Root Cause Analysis. Reliability people often look to identify "the" root cause of a quality 
problem, and many methods are geared to identifying "the" root cause. Accident analysts are 
(usually) adamant that there is hardly ever (in the words of many, "never") just *one* cause which 
can be called root.

3. FMEA. With today's complex systems there are considerable questions about how to calculate
maintenance cycles. Even a military road vehicle nowadays can be considered a "system of systems",
in that the system-subsystem hierarchy is quite deep. Calculating maintenance cycles requires
obtaining some idea of the MTBFs of components. Components may be simple, or line-replaceable units,
or units that require shop maintenance. Physical components may or may not correspond to functional
blocks (there is a notation, Functional Block Diagrams or FBDs, which is widely used). There are
ways of calculating MTBFs and maintenance procedures for components hierarchically arranged in FBDs.
These may well control the complexity well enough to determine the requirements for regular
maintenance.
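
To make that kind of calculation concrete, here is a minimal sketch (in Python), assuming constant
failure rates and a purely series arrangement of blocks; the component names and MTBF figures are
invented for illustration.

# Minimal sketch: series-system MTBF from component MTBFs.
# Assumes constant (exponential) failure rates and that every block
# in the functional hierarchy must work, i.e. a pure series model.
# Component names and figures are invented for illustration.

component_mtbf_hours = {
    "power_supply": 50_000.0,
    "controller_board": 120_000.0,   # a line-replaceable unit
    "sensor_assembly": 30_000.0,     # requires shop maintenance
}

# Under the constant-failure-rate assumption, failure rates add in series.
system_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbf_hours.values())
system_mtbf = 1.0 / system_failure_rate

print(f"System failure rate: {system_failure_rate:.2e} per hour")
print(f"System MTBF:         {system_mtbf:,.0f} hours")

Figures of this kind may well suffice for planning maintenance intervals; the question below is
whether they can bear the weight of hazard-likelihood claims.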

However, if functional failures contribute to hazards, these methods, which are approximate, do not 
appear to work well for assessing the likelihoods of hazards arising. (This is true even for those 
hazards which arise exclusively as a result of failures.)

4. FMEA. People who work with FMEA for reliability goals are not so concerned with completeness.
Indeed, I have had reliability-FMEA experts dismiss the subject when I brought it up, claiming
completeness to be "impossible". However, people who use FMEA for the analysis of failures of
safety-relevant systems and their hazards must be very concerned, as a matter of due diligence, that
their analyses (their listings of failure modes) leave nothing out as far as possible (in other
words, that they are as complete as possible).

5. Testing. Safety people generally know (or can be presumed to know) of the work which tells them
that assessing software-based systems for high reliability through testing cannot practically be
accomplished if the desired reliability is better than about one failure in 10,000 to 100,000
operational hours (e.g., Littlewood/Strigini, Butler/Finelli, both 1993).
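
To give a sense of the arithmetic behind that limit, here is a minimal sketch (in Python), assuming
a constant failure rate and failure-free testing; the confidence level and target rates are
illustrative and are not taken from the cited papers.

import math

# Minimal sketch of the statistical-testing limit: how many failure-free
# operational test hours are needed before one can claim, at a given
# confidence, that the failure rate lies below a target bound?
# Assumes a constant (exponential) failure rate and zero observed failures;
# the figures are illustrative, not taken from the cited papers.

def required_test_hours(target_rate_per_hour, confidence):
    # If the true rate were exactly the target, the chance of observing no
    # failure in T hours is exp(-rate * T).  Requiring that chance to be at
    # most (1 - confidence) gives T >= -ln(1 - confidence) / rate.
    return -math.log(1.0 - confidence) / target_rate_per_hour

for rate in (1e-4, 1e-5):
    hours = required_test_hours(rate, confidence=0.99)
    print(f"rate < {rate:.0e}/h at 99% confidence: "
          f"~{hours:,.0f} failure-free test hours")

Even at the modest end of that range, the required failure-free test time is of the same order as
the target MTBF itself, which is what makes demonstration by testing impractical for more ambitious
targets.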

Reliability people, by contrast, believe that statistical analysis of testing is practical and
worthwhile. For example, from a paper in the 2000 IEEE R&M Symposium:
 > Abstract: When large hardware-software systems are run-in or an acceptance testing is made, a
 > problem is when to stop the test and deliver/accept the system. The same problem exists when a
 > large software program is tested with simulated operations data. Based on two theses from the
 > Technical University of Denmark the paper describes and evaluates 7 possible algorithms. Of these
 > algorithms the three most promising are tested with simulated data. 27 different systems are
 > simulated, and 50 Monte Carlo simulations made on each system. The stop times generated by the
 > algorithm is compared with the known perfect stop time. Of the three algorithms two is selected
 > as good. These two algorithms are then tested on 10 sets of real data. The algorithms are tested 
 > with three different levels of confidence. The number of correct and wrong stop decisions are
 > counted. The conclusion is that the Weibull algorithm with 90% confidence level takes the right
 > decision in every one of the 10 cases.

6. ... and onwards. I would like to collect as many examples as possible of such differences. Do
some of you have other contrasts to contribute? I would like to share them with colleagues, and I
intend to attribute each to its contributor if that is OK. (Examples whose contributors wish to
remain anonymous will, as desired, be kept anonymous.)

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany
Tel+msg +49 (0)521 880 7319  www.rvs.uni-bielefeld.de