[SystemSafety] Difference between software reliability and astrology

Wed Aug 21 18:26:21 CEST 2024

On 2024-08-21 16:28, Prof. Dr. Peter Bernard Ladkin wrote:
>>> First, you are talking about using an operating system. An operating 
>>> system is a continuously-running system, not a discrete on-demand 
>>> function which returns an output value.
>> 
>> Hmmm. Let's break that apart...
> 
> Let's not. For the purposes of assessing reliability, it's not that 
> relevant.

It's relevant for us - we are trying to distinguish between software 
reliability and system reliability.

>>> So its failure behaviour is not a Bernoulli process. You can drop the 
>>> "Bernoulli" bit.
>> 
>> From a physical perspective, the behaviour of such a constructed 
>> system appears continuous, but considering what the OS itself is 
>> actually doing, every action is discrete.
> 
> So what? Suppose you have a sensor sampling at 400 Hz (typical for 
> aircraft-dynamics sensors, for example). The piece of SW dealing with 
> those readings (aka control system) is going to want to ascertain rates 
> of change and other stuff, so it needs to keep a history of readings 
> (over a short period of time). If you have history variables then you 
> aren't memoryless. If you're not memoryless then you aren't a Bernoulli 
> process, discrete or not.

I already agreed with you - I don't believe complex software behaviour 
can be considered a memoryless process.
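
To make the point concrete, here is a minimal sketch of the kind of 
history-keeping code you describe. The class and function names are 
hypothetical and not taken from any real control system; only the 400 Hz 
figure comes from your example. It just shows that the same current 
reading can produce different outputs depending on what came before.

```python
from collections import deque

SAMPLE_RATE_HZ = 400  # sampling rate from the example in the thread


class RateEstimator:
    """Hypothetical rate-of-change estimator with a short reading history."""

    def __init__(self, window=4):
        # The deque is the "history variable": state carried between calls.
        self.history = deque(maxlen=window)

    def update(self, reading):
        """Store the reading and return the estimated rate in units/second."""
        self.history.append(reading)
        if len(self.history) < 2:
            return 0.0
        span_seconds = (len(self.history) - 1) / SAMPLE_RATE_HZ
        return (self.history[-1] - self.history[0]) / span_seconds


def final_output(readings):
    """Feed a sequence of readings and return the output for the last one."""
    estimator = RateEstimator()
    output = 0.0
    for reading in readings:
        output = estimator.update(reading)
    return output


# Both sequences end on the same reading (3.0), but the histories differ.
rising = final_output([0.0, 1.0, 2.0, 3.0])
flat = final_output([3.0, 3.0, 3.0, 3.0])
print(rising, flat)
```

The last input is identical in both runs, yet the outputs differ, so the 
behaviour on any one sample depends on earlier samples.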

>>> But keep in mind you can't be letting [the OS] fail. For SIL 4 safety 
>>> functions, it has to be running more than 100 million operating hours 
>>> between failures on average. That is the constraint from IEC 61508-1 
>>> Table 3, which is independent of any means of describing the failure 
>>> behaviour.
>> 
>> Understood, but I wonder a bit about the numbers in the table. Can you 
>> (or anyone on the list) help me understand how the committee arrived 
>> at 10^-5, 10^-6, 10^-7, 10^-8 as targets?
> 
> (1) There is no theoretical reason why powers of 10 are chosen.
> 
> (2) They come from the aerospace regulations, and the "accepted means 
> of compliance". The regs contain certain powers of ten for "hazardous 
> condition" and "catastrophic condition" and sometimes other hazard 
> classes ("minor" and "major") and the AMC nowadays interprets phrases 
> such as "not expected to occur within the lifetime of the aircraft 
> [fleet]" into probabilities expressed in powers of ten. The reason is 
> likely that civil air transport was having continual and improving 
> success with what in effect turns out to be its risk matrix, for half a 
> century before 61508 came along.

Super, Peter - thanks again
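
For anyone else puzzling over the same numbers, a rough sketch of the 
arithmetic. It reads each Table 3 bound as a long-run average of 
failures per operating hour, so the reciprocal is the minimum mean 
number of operating hours between failures. The assignment of the four 
figures to SIL 1 to 4 is my reading of the table, and the names below 
are made up for illustration.

```python
# Exponent n means an upper bound of 10^-n failures per operating hour.
# Assumption: the four figures quoted in the thread map to SIL 1..4 in order.
BOUND_EXPONENT = {1: 5, 2: 6, 3: 7, 4: 8}

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def min_mean_hours_between_failures(exponent):
    """Reciprocal of a 10^-exponent per-hour rate, kept as an exact integer."""
    return 10 ** exponent


min_hours = {
    sil: min_mean_hours_between_failures(exp)
    for sil, exp in BOUND_EXPONENT.items()
}

for sil, hours in min_hours.items():
    years = hours / HOURS_PER_YEAR
    print(f"SIL {sil}: more than {hours:,} h between failures "
          f"(about {years:,.0f} years of continuous operation)")
```

The SIL 4 row reproduces the 100 million operating hours mentioned above.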

br
Paul