[SystemSafety] State of the art for "safe Linux"
Paul Sherwood
paul.sherwood at codethink.co.uk
Mon Aug 5 13:33:19 CEST 2024
I've been continuing to wrestle with the dragon of 'safe Linux' with
colleagues
and we are now preparing for renewed discussion with certification
authorities.
The following is my attempt at a review of the current literature and I
would
very much appreciate feedback, and references to other relevant
materials I may
have missed.
br
Paul
# State-Of-The-Art
One of the sensible principles expressed by the safety engineering
community
is that practitioners should keep up with (and learn from) the
'state-of-the-art'.
In fact this is stated directly in ISO 26262: "The achievement of an
objective
of the ISO 26262 series of standards is judged considering the
corresponding
requirements of these standards, the state-of-the-art regarding
technical
solutions and the applicable engineering domain knowledge at the time of
the
development."
This principle appears not only in the standards, but also in some legal
frameworks.
For example, quoting from the German Federal Portal [0]:
"'State of the art' is a common legal term. Technical development is
quicker
than legislation. For this reason, it has proven successful for many
years in
many areas of law to refer to the 'state of the art' in laws instead of
trying
to specify concrete technical requirements in the law. What is the state
of
the art at a certain point in time can be determined, for example, on
the
basis of existing national or international standards such as DIN, ISO,
DKE or
ISO/IEC or based on role models in the respective areas that have been
successfully tested in practice. Since the necessary technical measures
may
differ depending on the specific case, it is not possible to describe
the
'state of the art' in generally applicable and conclusive terms."
Here we consider the 'state-of-the-art' in published research regarding
the use of complex software in modern safety-critical systems, and in
particular the use of Linux-based software.
# Research
We mainly consider the most recent research we can find, even though
some
software knowledge is arguably timeless e.g. Brooks (1975)[1], Glass
(2002)[2].
It appears that the first serious consideration of the use of Linux for
safety-related systems was by Faulkner (2002)[3]. Faulkner concluded
that
"'vanilla' Linux would be broadly acceptable for use in safety related
applications of SIL 1 and SIL 2 integrity" and that "it may also be
feasible
to certify Linux for use in SIL 3 applications by the provision of some
further evidence from testing and analysis."
Faulkner recommended funding a project to achieve SIL 3 certification.
As far
as we can tell, no such project was actually undertaken.
A decade later Bulwahn and Ochs (2013)[4] stated that use of Linux could
reduce BMW's development and qualification costs while improving quality
and confidence.
In 2014, OSADL established the SIL2LinuxMP project to seek SIL 2
certification
of Linux. Although this initiative was active for three years, and led
by motivated experts with significant industry sponsors including BMW
and
Intel, Platschek et al. (2018)[5] ultimately reported that certification
was
"in reach" but not achieved.
Allende et al. (2021)[6] argued that companies and governments already
rely heavily on Linux for critical applications, and there is
"remarkable
interest" in certification of Linux for use in safety-related systems,
even
though it was not designed with functional safety in mind, and in spite
of the
size and complexity of the Linux kernel itself.
Allende and his colleagues noted that:
- Traditional safety techniques and measures have not been defined for
safety-related systems involving modern features such as artificial
intelligence, high-performance computing devices, general purpose
operating
systems and security requirements.
- The test coverage in IEC 61508 and ISO 26262 is "hardly achievable (if
feasible)" for systems involving these modern features, and therefore
"novel complementary methods and approaches" are required.
Building on prior art, including L. Cucu-Grosjean et al. (2012)[7],
S. Draskovic et al. (2021)[8], Mc Guire and Allende (2020)[9], and in
anticipation of Allende's 2022 PhD thesis [10], the authors applied
statistical
techniques, including Maximum Likelihood Estimation and Simple
Good-Turing, on
a practical case study involving a Linux-based Autonomous Emergency
Braking
system. They calculated a probability of software-related failure for
their
example (1.42e−4), but noted that current safety standards "do not
provide any
reference value to determine whether this probability result is within a
tolerable risk".
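To make the path-coverage idea concrete: Simple Good-Turing builds on the basic Turing estimate, in which the probability mass of never-observed events is estimated from the proportion of events seen exactly once. The following is our own minimal sketch (not the authors' actual pipeline, which also applies smoothing and Maximum Likelihood Estimation), treating each test run as one observation of an execution path:

```python
from collections import Counter

def unseen_path_mass(observed_paths):
    """Basic Turing estimate of the total probability mass held by
    execution paths that were never observed during testing.

    observed_paths: list of path identifiers, one per test run.
    Returns N1/N, where N1 is the number of distinct paths seen
    exactly once and N is the total number of runs.
    """
    counts = Counter(observed_paths)
    n = len(observed_paths)                          # total runs
    n1 = sum(1 for c in counts.values() if c == 1)   # singleton paths
    return n1 / n

# Toy example: 10 runs; paths A and B dominate, C and D appear once each,
# suggesting roughly 20% of future runs may take a path we never tested.
runs = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
print(unseen_path_mass(runs))  # 0.2
```

The intuition is that rarely-seen paths are evidence for the existence of paths not yet seen at all; a test campaign where many paths appear only once should leave us less confident in its coverage than one where every observed path recurs.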
The reason for this gap seems to be that the original standards authors
assumed (perhaps based on their own experience with simpler single-core
microcontroller systems) that appropriately-written software behaves
deterministically: for a given input it will either always pass or always
fail. Thus the standards consider probability of failure for hardware, but
not for software.
In light of this fundamental flaw it would be tempting to call foul on
the
standards, at least with respect to the implementation of modern
software
running on multicore systems, but there is more to learn.
Some of the cited authors, e.g. Lukas Bulwahn, Nicholas Mc Guire and
Jens
Petersohn, are expert practitioners who have dedicated a significant
portion
of their careers to the work of deploying Linux in critical production
systems. In spite of the demonstrable track-record of Linux-based
solutions in
critical systems, the sustained efforts by these and many other experts
to
advance the state-of-the-art by way of research initiatives (e.g.
SIL2LinuxMP,
ELISA), and the clear commercial opportunity for use of Linux in
safety-related systems, it seems that so far no-one has been able to
establish
a generally viable method for certification of Linux-based systems which
would
be acceptable to Certification Authorities.
Note that there is still resistance from some in the safety engineering
community to the use of open source in general, and Linux in particular,
more than two decades after Faulkner [3] originally concluded that Linux
would be suitable for SIL 1 and SIL 2, and recommended funding a project to
certify Linux to SIL 3.
This resistance remains in spite of the concerted efforts of the several
thousand software experts who have contributed regularly to improving Linux
in the two decades since Faulkner originally recommended it, and in spite of
the broad adoption of Linux for government initiatives, critical
infrastructure, telecoms, mobile devices, scientific instruments and space
exploration.
More than a decade after BMW's original research [4] identified the
opportunity,
there is still no certified Linux solution for them to adopt.
Allende et al. [6] drew the following conclusion:
"We also consider the execution probability estimation of untested paths
as a
step forwards in the field of test coverage of complex safety-related
systems.
Contrary to the techniques that have been employed traditionally, we
take
into account the uncertainty that these systems possess... This method
also
contributes in providing adequate explanation when full coverage is not
achievable, as stated by the IEC 61508 standard (IEC 61508-3 Ed2 Table
B.2).
Nonetheless, there is a need for a reference value, equivalent to PFH
ranges
for hardware failures, to comprehend the risk associated with untested
paths.
Consequently, we believe this needs to be discussed with Certification
Authorities (CAs)."
So far it appears that Allende's PhD work [10] has been cited only once,
by
Chen et al. (2023)[11]. Chen and colleagues explored the variability in
Linux path execution under various conditions, and they "demonstrated
that
both system load and file system influence the path variability of the
Linux
kernel."
In other words, and more generally, software on a multicore device will
behave
differently based on
- how much load is placed on the system
- how it is configured
Note that this must also be true for traditional software, including
proprietary programs that have been safety-certified: the hardware itself
is non-deterministic, so any software running on it is subject to variation
in its input states, which in turn creates opportunities for variation in
its outputs.
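As a toy illustration of the kind of measurement Chen et al. performed (this sketch is ours, not their tooling, and it uses an artificial configuration flag in place of real load or file-system effects), one can fingerprint an execution path by recording the sequence of executed lines and hashing it, then compare fingerprints across runs and configurations:

```python
import sys
import hashlib

def path_fingerprint(fn, *args):
    """Run fn(*args) under a line tracer and return a short hash of the
    sequence of executed lines -- a crude 'execution path' identifier."""
    lines = []
    def tracer(frame, event, arg):
        if event == "line":
            lines.append((frame.f_code.co_name, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return hashlib.sha256(repr(lines).encode()).hexdigest()[:12]

def workload(high_load):
    # A configuration-dependent branch, standing in for the real
    # load and file-system effects studied by Chen et al.
    if high_load:
        return sum(range(10))
    return 0

# Different configurations exercise different paths:
print(path_fingerprint(workload, False) != path_fingerprint(workload, True))  # True
```

For real kernel measurements the tracing would of course use kernel instrumentation (e.g. ftrace-style event streams) rather than a Python line tracer, but the principle is the same: hash the observed event sequence per run, then count distinct fingerprints across many runs and configurations.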
We have found no research citations yet for the work by Chen et al.
[11],
presumably because it is recent.
# Conclusion
While the conclusion of Chen et al. [11] may be 'state-of-the-art' in
research
terms, it's been widely understood by software practitioners for some
decades.
Arguably it should have been obvious to the authors of the standards.
For the avoidance of doubt, summarising the above 'state-of-the-art',
let's
spell this out:
- No amount of test coverage will ever be enough to represent the full
range
of behaviours of modern software running on a multicore
microprocessor.
- Some of the current practices (e.g. for test coverage) embodied in the
applicable safety standards (e.g. IEC 61508, ISO 26262) are
inappropriate
for modern systems, because:
- They are based on a false premise, i.e. they assume that software
should
be deterministic.
- They lead to significant waste of time and resources in pursuit of
irrelevant goals.
- They discourage use of open source software, leading directly to
increased
costs (e.g. spend on inferior proprietary solutions) and increased
risks
(e.g. creation of new software from scratch).
Mc Guire, Bulwahn et al. have demonstrated in multiple research papers
what is
already obvious to the expert software community, i.e. modern multicore
systems
running multi-threaded software exhibit stochastic behaviours. When
considering rare events such as failure rates, we must apply statistical
techniques to measure confidence in modern software in general (and
Linux in
particular), just as we do for mechanical, electrical and electronic
systems.
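As one example of such a statistical technique (our own illustration, not drawn from the cited papers): even after N failure-free tests we cannot claim a zero failure probability; the exact binomial (Clopper-Pearson) bound, which reduces to the well-known 'rule of three' at 95% confidence, gives the strongest upper bound the evidence supports:

```python
def failure_rate_upper_bound(n_tests, confidence=0.95):
    """Upper confidence bound on per-demand failure probability after
    n_tests independent tests with zero observed failures.

    Exact binomial (Clopper-Pearson) bound: p = 1 - (1 - C)^(1/n).
    For C = 0.95 this is approximately 3/n (the 'rule of three').
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)

# After 100,000 failure-free tests we can only claim, with 95%
# confidence, that the failure probability is below about 3e-5.
print(failure_rate_upper_bound(100_000))
```

This is exactly the kind of quantified, hardware-style claim (compare the PFH ranges mentioned by Allende et al.) that the standards currently do not ask of software, and it makes plain why exhaustive testing alone cannot demonstrate the very low failure rates required at higher integrity levels.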
# Where next
We plan to move on to the core topic: how to establish confidence (in the
statistical sense, as well as the everyday sense of the word) in each
particular release of complex software running on a modern multicore
processor, where it is intended to perform critical functions that may
cause harm if something goes wrong.
## References
[0] https://www.bsi.bund.de/EN/Themen/KRITIS-und-regulierte-Unternehmen/Kritische-Infrastrukturen/Allgemeine-Infos-zu-KRITIS/Stand-der-Technik-umsetzen/stand-der-technik-umsetzen_node.html
[1] Fred Brooks, "The Mythical Man-Month: Essays on Software
Engineering"
[2] Robert Glass, "Facts and Fallacies of Software Engineering"
[3] A. Faulkner, "Preliminary assessment of Linux for safety related
systems"
[4] L. Bulwahn, T. Ochs, and D. Wagner, "Research on an Open-Source
Software
Platform for Autonomous Driving Systems."
[5] Andreas Platschek, Nicholas Mc Guire, Lukas Bulwahn, "Certifying
Linux:
Lessons Learned in Three Years of SIL2LinuxMP"
[6] Imanol Allende, Nicholas Mc Guire, Jon Perez-Cerrolaza, Lisandro G.
Monsalve,
Jens Petersohn and Roman Obermaisser, "Statistical Test Coverage for
Linux-Based Next-Generation Autonomous Safety-Related Systems"
[7] L. Cucu-Grosjean, L. Santinelli, M. Houston, C. Lo, T. Vardanega,
L. Kosmidis, J. Abella, E. Mezzetti, E. Quinones, and F. J. Cazorla,
"Measurement-based probabilistic timing analysis for multi-path
programs"
[8] S. Draskovic, R. Ahmed, P. Huang, and L. Thiele, "Schedulability of
probabilistic mixed-criticality systems"
[9] Mc Guire, N., & Allende, I. (2020). "Approaching certification of
complex
systems."
[10] Imanol Allende, "Statistical Path Coverage for Non-Deterministic
Complex
Safety-Related Software Testing", (2022)
[11] Yucong Chen, Xianzhi Tang, Shuaixin Xu, Fangfang Zhu, Qingguo Zhou
&
Tien-Hsiung Weng "Analyzing execution path non-determinism of the Linux
kernel
in different scenarios"