[SystemSafety] Difference between software reliability and astrology

Prof. Dr. Peter Bernard Ladkin ladkin at causalis.com
Wed Aug 21 13:08:40 CEST 2024


On 2024-08-21 11:59, Paul Sherwood wrote:
> My point was that, accepting the likely infeasibility of using "Bernoulli/Poisson mathematics", 
> we  look for an alternative. It seems to me that we may be able to apply a Bayesian approach in 
> that search.
>
It is still a category mistake. It seems to me that you're mucking around with the terminology in a 
way that I don't think will help your goal. Let me explain in a bit more detail.

First, you are talking about using an operating system. An operating system is a 
continuously-running system, not a discrete on-demand function which returns an output value. So its 
failure behaviour is not a Bernoulli process. You can drop the "Bernoulli" bit.
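(For concreteness, here is a minimal Python sketch of the two failure models. The per-demand failure 
probability, demand count, failure rate and operating horizon are invented illustration values, 
nothing more.)

    import random

    random.seed(1)

    # On-demand (Bernoulli) model: each discrete demand independently
    # fails with per-demand probability p; no notion of operating time.
    p = 1e-3                    # invented per-demand failure probability
    n_demands = 100_000
    bernoulli_failures = sum(random.random() < p for _ in range(n_demands))

    # Continuous-operation (Poisson) model: failures occur in time at
    # rate lam per hour; inter-failure times are exponential.
    lam = 1e-4                  # invented failures per operating hour
    horizon_hours = 100_000.0
    t, poisson_failures = 0.0, 0
    while True:
        t += random.expovariate(lam)    # draw next inter-failure time
        if t > horizon_hours:
            break
        poisson_failures += 1

    print(bernoulli_failures, poisson_failures)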

Is the failure behaviour a Poisson process? It could be if you clear the memory and reboot the OS 
after each individual failure. But that would render use of it largely impractical if your client is 
planning for what happens when it fails. (Recall the asymmetric-lift accident with the Osprey 
tilt-rotor near New River, North Carolina on 2000-12-11. When asymmetric hydraulics arose from a 
hydraulic-line failure, the rotor-control SW was continuously rebooting and so inhibited any 
recovery. Recall also the 2018 accident in Tempe with the Uber self-driving vehicle which killed 
Elaine Herzberg. The sensor-fusion SW detected an object but couldn't classify it, and kept 
resetting its tracking until a fraction of a second before impact, by which time avoidance was 
impossible. Continuous reset is the bane of any continuously-operating critical software.)

But keep in mind that you cannot afford to let it fail. For SIL 4 safety functions, it has to be 
running, on average, more than 100 million operating hours between failures. That is the constraint 
from 61508-1 Table 3, and it holds independently of any means of describing the failure behaviour.
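(For a sense of scale, a back-of-the-envelope conversion of those targets into calendar years, 
taking 8,760 operating hours per year:)

    # 61508-1 Table 3 targets, read as mean operating hours between
    # failures as in the text above, converted to continuous years.
    HOURS_PER_YEAR = 8760
    for sil, mean_hours in [(2, 1e6), (3, 1e7), (4, 1e8)]:
        years = mean_hours / HOURS_PER_YEAR
        print(f"SIL {sil}: >{mean_hours:.0e} h  (~{years:,.0f} years)")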

If you are looking for a stochastic process that will better describe Linux kernel failure behaviour 
than the Poisson process, it is not clear to me what taking a "Bayesian approach" might mean. 
Matching the conditions of a mathematically-described stochastic process to software failure 
behaviour has nothing to do with "Bayesian" or "classical/frequentist" statistical inference. The 
distributions come from mathematics that is independent of the use of such processes in statistical 
inference. When I was a kid, that mathematics was called "probability theory". It was a different 
subject from "mathematical statistics", which nevertheless made essential use of it. When I was in 
university, you couldn't study mathematics without a compulsory amount of probability theory, but 
you could completely avoid statistics if you wanted to.

Even if you find such a process, it is not going to do you any good unless you can have your kernel 
running a 
million (SIL 2) or ten million (SIL 3) or 100 million (SIL 4) operating hours on average between 
failures. And if you can have your kernel demonstrably doing that, why would you need any 
named/described stochastic process to describe its failure behaviour?
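(To see why "demonstrably" is the sticking point: under a simple Poisson model, t failure-free 
operating hours bound the failure rate above by -ln(alpha)/t at confidence 1 - alpha. A minimal 
sketch; the 99% confidence level is my illustrative choice, not anything prescribed by 61508:)

    import math

    # Zero failures observed in t hours under a Poisson model gives the
    # one-sided upper (1 - alpha) confidence bound lambda <= -ln(alpha)/t,
    # so demonstrating a rate below 1/MTBF_target needs at least
    # t = -ln(alpha) * MTBF_target failure-free operating hours.
    alpha = 0.01                            # illustrative 99% confidence
    for sil, mtbf_target in [(2, 1e6), (3, 1e7), (4, 1e8)]:
        t_needed = -math.log(alpha) * mtbf_target
        print(f"SIL {sil}: ~{t_needed:.1e} failure-free hours needed")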

The point here is that you can't "look and see" that your kernel has that desired failure behaviour: 
that is simply too much operating time to accumulate under any given operating profile. So what most 
safety-critical OS builders do 
is define a class of essential kernel functions, modularise them all as much as possible so that 
they can be observed, analysed and assessed largely separately from each other, then put them 
together and (if you like) invoke a limit theorem to infer the failure behaviour of the whole from 
the assessed failure behaviours of the parts.
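(A minimal sketch of that last composition step, under the strong assumption that the modules fail 
independently, each as a Poisson process with its assessed rate; the superposition of independent 
Poisson processes is again a Poisson process with the summed rate. The module names and rates below 
are invented for illustration:)

    # Superposition: independent Poisson processes with rates lam_i
    # merge into one Poisson process with rate sum(lam_i).
    assessed_rates = {        # invented per-module failure rates, per hour
        "scheduler":   2e-10,
        "memory_mgmt": 3e-10,
        "ipc":         1e-10,
        "drivers":     4e-10,
    }
    lam_total = sum(assessed_rates.values())
    print(f"whole-kernel rate {lam_total:.1e}/h, "
          f"mean {1 / lam_total:.1e} h between failures")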

PBL

Prof. Dr. Peter Bernard Ladkin
Causalis Limited/Causalis IngenieurGmbH, Bielefeld, Germany
Tel: +49 (0)521 3 29 31 00


