[SystemSafety] State of the art for "safe Linux"
Paul Sherwood
paul.sherwood at codethink.co.uk
Mon Aug 5 13:33:19 CEST 2024
I've been continuing to wrestle with the dragon of 'safe Linux' with
colleagues
and we are now preparing for renewed discussion with certification
authorities.
The following is my attempt at a review of the current literature and I
would
very much appreciate feedback, and references to other relevant
materials I may
have missed.
br
Paul
# State-Of-The-Art
One of the sensible principles expressed by the safety engineering
community
is that practitioners should keep up with (and learn from) the
'state-of-the-art'.
In fact this is stated directly in ISO 26262: "The achievement of an
objective
of the ISO 26262 series of standards is judged considering the
corresponding
requirements of these standards, the state-of-the-art regarding
technical
solutions and the applicable engineering domain knowledge at the time of
the
development."
This principle appears not only in the standards, but also in some legal
frameworks.
For example, quoting from the German Federal Portal [0]:
"'State of the art' is a common legal term. Technical development is
quicker
than legislation. For this reason, it has proven successful for many
years in
many areas of law to refer to the 'state of the art' in laws instead of
trying
to specify concrete technical requirements in the law. What is the state
of
the art at a certain point in time can be determined, for example, on
the
basis of existing national or international standards such as DIN, ISO,
DKE or
ISO/IEC or based on role models in the respective areas that have been
successfully tested in practice. Since the necessary technical measures
may
differ depending on the specific case, it is not possible to describe
the
'state of the art' in generally applicable and conclusive terms."
Here we consider the 'state-of-the-art' in published research regarding
the use of complex software in modern safety-critical systems, and in
particular the use of Linux-based software.
# Research
We mainly consider the most recent research we can find, even though
some
software knowledge is arguably timeless e.g. Brooks (1975)[1], Glass
(2002)[2].
It appears that the first serious consideration of the use of Linux for
safety-related systems was by Faulkner (2002)[3]. Faulkner concluded
that
"'vanilla' Linux would be broadly acceptable for use in safety related
applications of SIL 1 and SIL 2 integrity" and that "it may also be
feasible
to certify Linux for use in SIL 3 applications by the provision of some
further evidence from testing and analysis."
Faulkner recommended funding a project to achieve SIL 3 certification.
As far
as we can tell, no such project was actually undertaken.
A decade later Bulwahn and Ochs (2013)[4] stated that use of Linux could
reduce BMW's development and qualification costs while improving quality
and confidence.
In 2014, OSADL established the SIL2LinuxMP project to seek SIL 2
certification
of Linux. Although this initiative was active for three years, and led
by motivated experts with significant industry sponsors including BMW
and
Intel, Platschek et al. (2018)[5] ultimately reported that certification
was
"in reach" but not achieved.
Allende et al. (2021)[6] argued that companies and governments already
rely heavily on Linux for critical applications, and there is
"remarkable
interest" in certification of Linux for use in safety-related systems,
even
though it was not designed with functional safety in mind, and in spite
of the
size and complexity of the Linux kernel itself.
Allende and his colleagues noted that:
- Traditional safety techniques and measures have not been defined for
safety-related systems involving modern features such as artificial
intelligence, high-performance computing devices, general purpose
operating
systems and security requirements.
- The test coverage in IEC 61508 and ISO 26262 is "hardly achievable (if
feasible)" for systems involving these modern features, and therefore
"novel complementary methods and approaches" are required.
Building on prior art, including L. Cucu-Grosjean et al. (2012)[7],
S. Draskovic et al. (2021)[8], Mc Guire and Allende (2020)[9], and in
anticipation of Allende's 2022 PhD thesis [10], the authors applied
statistical
techniques, including Maximum Likelihood Estimation and Simple
Good-Turing, on
a practical case study involving a Linux-based Autonomous Emergency
Braking
system. They calculated a probability of software-related failure for
their
example (1.42e−4), but noted that current safety standards "do not
provide any
reference value to determine whether this probability result is within a
tolerable risk".
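To make the path-coverage idea concrete: Simple Good-Turing builds on the basic Turing estimate, in which the probability mass of never-observed events is estimated from the proportion of events seen exactly once. The following is our own minimal sketch (not the authors' actual pipeline, which also applies smoothing and Maximum Likelihood Estimation), treating each test run as one observation of an execution path:

```python
from collections import Counter

def unseen_path_mass(observed_paths):
    """Basic Turing estimate of the total probability mass held by
    execution paths that were never observed during testing.

    observed_paths: list of path identifiers, one per test run.
    Returns N1/N, where N1 is the number of distinct paths seen
    exactly once and N is the total number of runs.
    """
    counts = Counter(observed_paths)
    n = len(observed_paths)                          # total runs
    n1 = sum(1 for c in counts.values() if c == 1)   # singleton paths
    return n1 / n

# Toy example: 10 runs; paths A and B dominate, C and D appear once each,
# suggesting roughly 20% of future runs may take a path we never tested.
runs = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
print(unseen_path_mass(runs))  # 0.2
```

The intuition is that rarely-seen paths are evidence for the existence of paths not yet seen at all; a test campaign where many paths appear only once should leave us less confident in its coverage than one where every observed path recurs.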
The reason for this gap seems to be that the original standards authors
assumed (perhaps based on their own experience with simpler single-core
microcontroller systems) that appropriately-written software behaves
deterministically: for a given input it will either always pass or always
fail. Thus the standards consider probability of failure for hardware, but
not for software.
In light of this fundamental flaw it would be tempting to call foul on
the
standards, at least with respect to the implementation of modern
software
running on multicore systems, but there is more to learn.
Some of the cited authors, e.g. Lukas Bulwahn, Nicholas Mc Guire and
Jens
Petersohn, are expert practitioners who have dedicated a significant
portion
of their careers to the work of deploying Linux in critical production
systems. In spite of the demonstrable track-record of Linux-based
solutions in
critical systems, the sustained efforts by these and many other experts
to
advance the state-of-the-art by way of research initiatives (e.g.
SIL2LinuxMP,
ELISA), and the clear commercial opportunity for use of Linux in
safety-related systems, it seems that so far no-one has been able to
establish
a generally viable method for certification of Linux-based systems which
would
be acceptable to Certification Authorities.
Note that there is still resistance from some in the safety engineering
community to the use of open source in general, and Linux in particular,
more than two decades after Faulkner [3] originally concluded that Linux
would be suitable for SIL 1 and SIL 2, and recommended funding a project to
certify Linux to SIL 3.
This resistance remains in spite of the concerted efforts of the several
thousand software experts who have contributed regularly to improving Linux
in the two decades since Faulkner originally recommended it, and in spite of
the broad adoption of Linux for government initiatives, critical
infrastructure, telecoms, mobile devices, scientific instruments and space
exploration.
More than a decade after BMW's original research [4] identified the
opportunity,
there is still no certified Linux solution for them to adopt.
Allende et al. [6] drew the following conclusion:
"We also consider the execution probability estimation of untested paths
as a
step forwards in the field of test coverage of complex safety-related
systems.
Contrary to the techniques that have been employed traditionally, we
take
into account the uncertainty that these systems possess... This method
also
contributes in providing adequate explanation when full coverage is not
achievable, as stated by the IEC 61508 standard (IEC 61508-3 Ed2 Table
B.2).
Nonetheless, there is a need for a reference value, equivalent to PFH
ranges
for hardware failures, to comprehend the risk associated with untested
paths.
Consequently, we believe this needs to be discussed with Certification
Authorities (CAs)."
So far it appears that Allende's PhD work [10] has been cited only once,
by
Chen et al. (2023)[11]. Chen and colleagues explored the variability in
Linux path execution under various conditions, and they "demonstrated
that
both system load and file system influence the path variability of the
Linux
kernel."
In other words, and more generally, software on a multicore device will
behave
differently based on
- how much load is placed on the system
- how it is configured
Note that this must also be true for traditional software, including
proprietary programs that have been safety-certified: the hardware itself
is non-deterministic, so any software running on it is subject to variation
in its input states, which in turn creates opportunities for variation in
its outputs.
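As a toy illustration of the kind of measurement Chen et al. performed (this sketch is ours, not their tooling, and it uses an artificial configuration flag in place of real load or file-system effects), one can fingerprint an execution path by recording the sequence of executed lines and hashing it, then compare fingerprints across runs and configurations:

```python
import sys
import hashlib

def path_fingerprint(fn, *args):
    """Run fn(*args) under a line tracer and return a short hash of the
    sequence of executed lines -- a crude 'execution path' identifier."""
    lines = []
    def tracer(frame, event, arg):
        if event == "line":
            lines.append((frame.f_code.co_name, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return hashlib.sha256(repr(lines).encode()).hexdigest()[:12]

def workload(high_load):
    # A configuration-dependent branch, standing in for the real
    # load and file-system effects studied by Chen et al.
    if high_load:
        return sum(range(10))
    return 0

# Different configurations exercise different paths:
print(path_fingerprint(workload, False) != path_fingerprint(workload, True))  # True
```

For real kernel measurements the tracing would of course use kernel instrumentation (e.g. ftrace-style event streams) rather than a Python line tracer, but the principle is the same: hash the observed event sequence per run, then count distinct fingerprints across many runs and configurations.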
We have found no research citations yet for the work by Chen et al.
[11],
presumably because it is recent.
# Conclusion
While the conclusion of Chen et al. [11] may be 'state-of-the-art' in
research
terms, it's been widely understood by software practitioners for some
decades.
Arguably it should have been obvious to the authors of the standards.
For the avoidance of doubt, summarising the above 'state-of-the-art',
let's
spell this out:
- No amount of test coverage will ever be enough to represent the full
range
of behaviours of modern software running on a multicore
microprocessor.
- Some of the current practices (e.g. for test coverage) embodied in the
applicable safety standards (e.g. IEC 61508, ISO 26262) are
inappropriate
for modern systems, because:
- They are based on a false premise, i.e. they assume that software
should
be deterministic.
- They lead to significant waste of time and resources in pursuit of
irrelevant goals.
- They discourage use of open source software, leading directly to
increased
costs (e.g. spend on inferior proprietary solutions) and increased
risks
(e.g. creation of new software from scratch).
Mc Guire, Bulwahn et al. have demonstrated in multiple research papers
what is
already obvious to the expert software community, i.e. modern multicore
systems
running multi-threaded software exhibit stochastic behaviours. When
considering rare events such as failure rates, we must apply statistical
techniques to measure confidence in modern software in general (and
Linux in
particular), just as we do for mechanical, electrical and electronic
systems.
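As one example of such a statistical technique (our own illustration, not drawn from the cited papers): even after N failure-free tests we cannot claim a zero failure probability; the exact binomial (Clopper-Pearson) bound, which reduces to the well-known 'rule of three' at 95% confidence, gives the strongest upper bound the evidence supports:

```python
def failure_rate_upper_bound(n_tests, confidence=0.95):
    """Upper confidence bound on per-demand failure probability after
    n_tests independent tests with zero observed failures.

    Exact binomial (Clopper-Pearson) bound: p = 1 - (1 - C)^(1/n).
    For C = 0.95 this is approximately 3/n (the 'rule of three').
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)

# After 100,000 failure-free tests we can only claim, with 95%
# confidence, that the failure probability is below about 3e-5.
print(failure_rate_upper_bound(100_000))
```

This is exactly the kind of quantified, hardware-style claim (compare the PFH ranges mentioned by Allende et al.) that the standards currently do not ask of software, and it makes plain why exhaustive testing alone cannot demonstrate the very low failure rates required at higher integrity levels.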
# Where next
We plan to move on to the core topic: how to establish confidence (in the
statistical sense, as well as the everyday sense of the word) in each
particular release of complex software running on a modern multicore
processor, where it is intended to perform critical functions that may
cause harm if something goes wrong.
## References
[0] https://www.bsi.bund.de/EN/Themen/KRITIS-und-regulierte-Unternehmen/Kritische-Infrastrukturen/Allgemeine-Infos-zu-KRITIS/Stand-der-Technik-umsetzen/stand-der-technik-umsetzen_node.html
[1] Fred Brooks, "The Mythical Man-Month: Essays on Software
Engineering"
[2] Robert Glass, "Facts and Fallacies of Software Engineering"
[3] A. Faulkner, "Preliminary assessment of Linux for safety related
systems"
[4] L. Bulwahn, T. Ochs, and D. Wagner, "Research on an Open-Source
Software
Platform for Autonomous Driving Systems."
[5] Andreas Platschek, Nicholas Mc Guire, Lukas Bulwahn, "Certifying
Linux:
Lessons Learned in Three Years of SIL2LinuxMP"
[6] Imanol Allende, Nicholas Mc Guire, Jon Perez-Cerrolaza, Lisandro G.
Monsalve,
Jens Petersohn and Roman Obermaisser, "Statistical Test Coverage for
Linux-Based Next-Generation Autonomous Safety-Related Systems"
[7] L. Cucu-Grosjean, L. Santinelli, M. Houston, C. Lo, T. Vardanega,
L. Kosmidis, J. Abella, E. Mezzetti, E. Quinones, and F. J. Cazorla,
"Measurement-based probabilistic timing analysis for multi-path
programs"
[8] S. Draskovic, R. Ahmed, P. Huang, and L. Thiele, "Schedulability of
probabilistic mixed-criticality systems"
[9] Mc Guire, N., & Allende, I. (2020). "Approaching certification of
complex
systems."
[10] Imanol Allende, "Statistical Path Coverage for Non-Deterministic
Complex
Safety-Related Software Testing", (2022)
[11] Yucong Chen, Xianzhi Tang, Shuaixin Xu, Fangfang Zhu, Qingguo Zhou
&
Tien-Hsiung Weng "Analyzing execution path non-determinism of the Linux
kernel
in different scenarios"