[SystemSafety] Difference between software reliability and astrology
Paul Sherwood
paul.sherwood at codethink.co.uk
Wed Aug 21 15:48:49 CEST 2024
On 2024-08-21 12:08, Prof. Dr. Peter Bernard Ladkin wrote:
> On 2024-08-21 11:59 , Paul Sherwood wrote:
>> My point was that, accepting the likely infeasibility of using
>> "Bernoulli/Poisson mathematics", we look for an alternative. It seems
>> to me that we may be able to apply a Bayesian approach in that search.
>>
> It is still a category mistake. It seems to me that you're mucking
> around with the terminology in a way that I don't think will help your
> goal. Let me explain in a bit more detail.
Please note I'm not "mucking around" intentionally, and I appreciate
your offer of help.
> First, you are talking about using an operating system. An operating
> system is a continuously-running system, not a discrete on-demand
> function which returns an output value.
Hmmm. Let's break that apart...
- "operating system" is not a particularly well-defined term - I've had
arguments with people in the safety community who insisted that a
hypervisor is not an operating system
- the use of "system" in the phrase "operating system" may be just as
confusing as the use of the letters "memory" in the concept "memoryless"
- the concept indicated by the use of the word "system" in the phrase
"operating system", is very different from the concept indicated by the
use of the word "system" in the phrase "continuously running system"
From here on I'll use "OS", both for brevity and to disambiguate the
"system" trap I've mentioned.
As I understand it, an OS (e.g. an RTOS such as FreeRTOS, or Linux)
is not a "system" in the sense of the "systems" that contribute to
safety. An OS is just software: inert bits on a storage medium. It
may be deployed and executed on hardware to form, for example, part
of a "continuously-running system" which includes both the hardware
and the software.
In practice the behaviour of such a system will be affected by the
behaviour of the hardware (and its firmware) as well as by the OS -
and in fact it will also be affected by the software tools
(compilers, linkers, etc.) used to construct the OS, and by any
applications running on it.
> So its failure behaviour is not a Bernoulli process. You can drop the
> "Bernoulli" bit.
From a physical perspective, the behaviour of such a constructed system
appears continuous, but considering what the OS itself is actually
doing, every action is discrete. So, for example, if we consider the
OS providing a "real-time" scheduling capability, it seems to me that
each individual scheduling interval could, in principle, be treated
as a discrete demand.
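To give a feel for the scale, here's a back-of-envelope sketch (the
1 ms tick is an assumption of mine, and the 10^-8/hour figure is the
SIL 4 bound you quote below):

    # Per-demand failure probability implied by treating each
    # scheduling interval as a discrete demand. The 1 ms tick and the
    # 1e-8/hour target are illustrative assumptions only.
    tick_s = 0.001                     # assumed scheduler interval (1 ms)
    demands_per_hour = 3600 / tick_s   # 3.6 million demands per hour
    target_rate_per_hour = 1e-8        # SIL 4 upper bound, 61508-1 Table 3
    p_per_demand = target_rate_per_hour / demands_per_hour
    print(p_per_demand)                # roughly 2.8e-15 per demand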
That is not to say that we could apply the Bernoulli mathematics - as
you said nearly a decade ago, that approach "looks close to infeasible".
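To put a number on "close to infeasible" (a sketch under the simplest
possible assumptions - a Poisson failure process and zero observed
failures - purely to show the orders of magnitude involved):

    import math
    # With zero failures in T hours under a Poisson model, the
    # (1 - alpha) upper confidence bound on the rate is -ln(alpha)/T.
    # Invert that to get the failure-free hours needed for a claim.
    alpha = 0.05                 # 95% confidence (assumed)
    target_rate = 1e-8           # dangerous failures per hour (SIL 4)
    hours_needed = -math.log(alpha) / target_rate
    print(hours_needed)          # ~3.0e8 hours, i.e. ~34,000 years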
> Is the failure behaviour a Poisson process? It could be if you clear
> the memory and reboot the OS after each individual failure. But that
> would render use of it largely impractical if your client is planning
> for what happens when it fails.
Agreed... so no, it's not practical to apply the Poisson mathematics
either.
> (Recall the asymmetric-lift accident with the Osprey tilt-rotor at
> Patuxent River, Maryland on 2000-12-11. When the asymmetric hydraulics
> occurred due to a hydraulic-line failure, the rotor-control SW was
> continuously rebooting and so inhibited any recovery. Recall also the
> accident in Tempe in 2018 with the Uber self-driving vehicle which
> killed Elaine Herzberg. The sensor-fusion SW recognised an object but
> couldn't classify it and was continuously resetting until a fraction of
> a second before impact, when it was clearly too late to avoid impact.
> Continuous reset is the bane of any continuously-operating critical
> software.)
Also agreed.
> But keep in mind you can't be letting it fail. For SIL 4 safety
> functions, it has to be running more than 100 million operating hours
> between failures on average. That is the constraint from 61508-1 Table
> 3, which is independent of any means of describing the failure
> behaviour.
Understood, but I wonder a bit about the numbers in the table. Can you
(or anyone on the list) help me understand how the committee arrived at
10^-5, 10^-6, 10^-7 and 10^-8 failures per hour as the targets?
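Just to check I'm reading them the same way you are, here's the
conversion I have in mind (nothing more than taking reciprocals of
the per-hour upper bounds):

    # Upper bounds of the 61508-1 Table 3 bands (dangerous failures
    # per hour), converted to mean operating hours between failures.
    bounds = {"SIL 1": 1e-5, "SIL 2": 1e-6, "SIL 3": 1e-7, "SIL 4": 1e-8}
    for sil, rate in bounds.items():
        print(sil, 1 / rate)     # SIL 4 -> 1e8 hours (~11,400 years)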
> If you are looking for a stochastic process that will better describe
> Linux kernel failure behaviour than the Poisson process, it is not
> clear to me what taking a "Bayesian approach" might mean. Matching the
> conditions of a mathematically-described stochastic process to software
> failure behaviour has nothing to do with "Bayesian" or
> "classical/frequentist" statistical inference. The distributions come
> from mathematics that is independent of the use of such processes in
> statistical inference. When I was a kid, it was called "probability
> theory". And it was a different subject from "mathematical statistics",
> which however used probability theory essentially. When I was in
> university, you couldn't study mathematics without a compulsory amount
> of probability theory, but you could completely avoid statistics if you
> wanted to.
I studied in the early 80s, and had the same experience - so I did the
probability theory, and skipped the stats :)
> Even if you find one, it is not going to do you any good unless you can
> have your kernel running a million (SIL 2) or ten million (SIL 3) or
> 100 million (SIL 4) operating hours on average between failures.
Back to the OS/system distinction: a kernel is not the same thing as
an OS. A Linux-based OS contains (a version of) the Linux kernel
integrated with lots of other software.
> And if you can have your kernel demonstrably doing that, why would you
> need any named/described stochastic process to describe its failure
> behaviour?
Great question.
In practice we would need to demonstrate the failure rate of the system
running the OS (containing the kernel). And given that we have
established (as previously discussed) that there's a lot of
variation in microprocessors etc., we would have to figure out how to
distinguish between software and hardware effects.
> The point here is that you can't "look and see" that your kernel has
> that desired failure behaviour. It's just too much time for a given
> operating profile. So what most safety-critical OS builders do is
> define a class of essential kernel functions, modularise them all as
> much as possible so that they can be observed, analysed and assessed
> largely separately from each other, then put them together and (if you
> like) invoke a limit theorem to infer the failure behaviour of the
> whole from the assessed failure behaviours of the parts.
Agreed. I'm not confident that approach works for a Linux-based OS, and
I take it you aren't either.
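To spell out my doubt with a sketch: if I take the parts-to-whole
inference in its simplest form - independent parts, each with a
constant failure rate, which is exactly the assumption I'm unsure
holds for a Linux-based OS - the composed rate is just the sum of
the parts' rates:

    # Composing a whole-system failure rate from independent parts.
    # The per-part rates and the part count are made up purely for
    # illustration.
    part_rates = [1e-10] * 200     # 200 hypothetical parts at 1e-10/hour
    system_rate = sum(part_rates)  # only valid if parts fail independently
    print(system_rate)             # 2e-8/hour - outside the SIL 4 band

With the number of components in even a minimal Linux-based OS, and
no obvious argument for independence between them, I don't see how
the per-part claims can be made to stick.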
br
Paul