[SystemSafety] State of the art for "safe Linux"

Mon Aug 5 14:07:30 CEST 2024

Hi Paul

I'd start with an easier question... what do you mean by Linux?

It's a Kernel, plus a whole array of other features; but what would the Software BoM for Linux actually show?  What is part of Linux, and what are add-ons or apps?

Andrew

-----Original Message-----
From: systemsafety <systemsafety-bounces at lists.techfak.uni-bielefeld.de> On Behalf Of Paul Sherwood
Sent: Monday, August 5, 2024 12:33 PM
To: trustable-software at lists.trustable.io; The System Safety List <systemsafety at lists.techfak.uni-bielefeld.de>
Subject: [SystemSafety] State of the art for "safe Linux"

I've been continuing to wrestle with the dragon of 'safe Linux' with colleagues and we are now preparing for renewed discussion with certification authorities.

The following is my attempt at a review of the current literature and I would very much appreciate feedback, and references to other relevant materials I may have missed.

br
Paul

# State-Of-The-Art

One of the sensible principles expressed by the safety engineering community is that practitioners should keep up with (and learn from) the 'state-of-the-art'.

In fact this is stated directly in ISO 26262: "The achievement of an objective of the ISO 26262 series of standards is judged considering the corresponding requirements of these standards, the state-of-the-art regarding technical solutions and the applicable engineering domain knowledge at the time of the development."

This is true not only for the standards, but also in some legal frameworks.
For example, quoting from the German Federal Portal [0]:

"'State of the art' is a common legal term. Technical development is quicker than legislation. For this reason, it has proven successful for many years in many areas of law to refer to the 'state of the art' in laws instead of trying to specify concrete technical requirements in the law. What is the state of the art at a certain point in time can be determined, for example, on the basis of existing national or international standards such as DIN, ISO, DKE or ISO/IEC or based on role models in the respective areas that have been successfully tested in practice. Since the necessary technical measures may differ depending on the specific case, it is not possible to describe the 'state of the art' in generally applicable and conclusive terms."

Here we consider what is the 'state-of-the-art' in terms of published research regarding the use of complex software in modern safety-critical systems, and in particular the use of Linux-based software.

# Research

We mainly consider the most recent research we can find, even though some software knowledge is arguably timeless e.g. Brooks (1975)[1], Glass (2002)[2].

It appears that the first serious consideration of the use of Linux for safety-related systems was by Faulkner (2002)[3]. Faulkner concluded that "'vanilla' Linux would be broadly acceptable for use in safety related applications of SIL 1 and SIL 2 integrity" and that "it may also be feasible to certify Linux for use in SIL 3 applications by the provision of some further evidence from testing and analysis."

Faulkner recommended funding a project to achieve SIL 3 certification. As far as we can tell, no such project was actually undertaken.

A decade later Bulwahn and Ochs (2013)[4] stated that use of Linux could reduce BMW's development and qualification costs while improving quality and confidence.

In 2014, OSADL established the SIL2LinuxMP project to seek SIL2 certification of Linux. Although this initiative was active for three years, and led by motivated experts with significant industry sponsors including BMW and Intel, Platschek et al. (2018)[5] ultimately reported that certification was "in reach" but not achieved.

Allende et al. (2021)[6] argued that companies and governments already rely heavily on Linux for critical applications, and there is "remarkable interest" in certification of Linux for use in safety-related systems, even though it was not designed with functional safety in mind, and in spite of the size and complexity of the Linux kernel itself.

Allende and his colleagues noted that:

- Traditional safety techniques and measures have not been defined for
   safety-related systems involving modern features such as artificial
   intelligence, high-performance computing devices, general purpose operating
   systems and security requirements.

- The test coverage in IEC 61508 and ISO 26262 is "hardly achievable (if
   feasible)" for systems involving these modern features, and therefore
   "novel complementary methods and approaches" are required.

Building on prior art, including Cucu-Grosjean et al. (2012)[7], Draskovic et al. (2021)[8], Mc Guire and Allende (2020)[9], and in anticipation of Allende's 2022 PhD thesis [10], the authors applied statistical techniques, including Maximum Likelihood Estimation and Simple Good-Turing, to a practical case study involving a Linux-based Autonomous Emergency Braking system. They calculated a probability of software-related failure for their example (1.42e−4), but noted that current safety standards "do not provide any reference value to determine whether this probability result is within a tolerable risk".
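To make the idea concrete, here is a minimal sketch of the two kinds of estimate mentioned above. It is not the authors' implementation: it uses the basic Good-Turing missing-mass estimate (N1/N) rather than the full Simple Good-Turing smoothing, and the function name, path labels and sample data are hypothetical.

```python
from collections import Counter


def path_probability_estimates(observed_paths):
    """Return (mle, unseen_mass) for a list of observed path identifiers.

    Hypothetical helper for illustration only.
    """
    counts = Counter(observed_paths)
    total = sum(counts.values())

    # Maximum Likelihood Estimate: relative frequency of each observed path.
    mle = {path: c / total for path, c in counts.items()}

    # Basic Good-Turing estimate of the probability mass belonging to paths
    # never observed in testing: N1 / N, where N1 is the number of distinct
    # paths seen exactly once and N is the total number of observations.
    singletons = sum(1 for c in counts.values() if c == 1)
    unseen_mass = singletons / total
    return mle, unseen_mass


# Made-up sample: each entry is the execution path taken by one test run.
runs = ["A", "A", "A", "B", "B", "C", "A", "D", "A", "B"]
mle, unseen = path_probability_estimates(runs)
print(mle)
print(unseen)
```

The point of the second number is that MLE alone assigns zero probability to anything not yet observed, whereas the Good-Turing estimate reserves some probability for untested paths based on how many paths were seen only once.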

The reason for this gap seems to be that the original standards authors assumed (perhaps based on their own experience with simpler single-core microcontroller systems) that appropriately-written software behaves such that it will always either pass or fail, with certainty. Thus the standards consider probability of failure for hardware, but not for software.

In light of this fundamental flaw it would be tempting to call foul on the standards, at least with respect to the implementation of modern software running on multicore systems, but there is more to learn.

Some of the cited authors, e.g. Lukas Bulwahn, Nicholas Mc Guire and Jens Petersohn, are expert practitioners who have dedicated a significant portion of their careers to the work of deploying Linux in critical production systems. In spite of the demonstrable track-record of Linux-based solutions in critical systems, the sustained efforts by these and many other experts to advance the state-of-the-art by way of research initiatives (e.g. SIL2LinuxMP, ELISA), and the clear commercial opportunity for use of Linux in safety-related systems, it seems that so far no-one has been able to establish a generally viable method for certification of Linux-based systems which would be acceptable to Certification Authorities.

Note that there is still resistance by some in the safety engineering community towards the use of open source in general, and Linux in particular, more than two decades after Faulkner [3] originally concluded that Linux would be suitable for SIL 1 and SIL 2, and recommended funding a project to certify Linux to SIL 3.

This resistance remains in spite of the concerted efforts of several thousand software experts who have regularly contributed improvements to Linux over the twenty years since Faulkner originally recommended it, and in spite of the broad adoption of Linux for government initiatives, critical infrastructure, telecoms, mobile devices, scientific instruments and space exploration.

More than a decade after BMW's original research [4] identified the opportunity, there is still no certified Linux solution for them to adopt.

Allende et al. [6] drew the following conclusion:

"We also consider the execution probability estimation of untested paths as a step forwards in the field of test coverage of complex safety-related systems.
Contrary to the techniques that have been employed traditionally, we take into account the uncertainty that these systems possess... This method also contributes in providing adequate explanation when full coverage is not achievable, as stated by the IEC 61508 standard (IEC 61508-3 Ed2 Table B.2).
Nonetheless, there is a need for a reference value, equivalent to PFH ranges for hardware failures, to comprehend the risk associated with untested paths.
Consequently, we believe this needs to be discussed with Certification Authorities (CAs)."

So far it appears that Allende's PhD work [10] has been cited only once, by Chen et al. (2023)[11]. Chen and colleagues explored the variability in Linux path execution under various conditions, and they "demonstrated that both system load and file system influence the path variability of the Linux kernel."

In other words, and more generally, software on a multicore device will behave differently based on

- how much load is placed on the system
- how it is configured

Note that this must be true for traditional software (including proprietary programs that have been safety-certified) too, not least because the hardware itself is non-deterministic, so any software running on it is bound to be subject to variations in its input states, creating conditions and opportunities for variation in its outputs.

We have found no research citations yet for the work by Chen et al. [11], presumably because it is recent.

# Conclusion

While the conclusion of Chen et al. [11] may be 'state-of-the-art' in research terms, it's been widely understood by software practitioners for some decades.
Arguably it should have been obvious to the authors of the standards.

For the avoidance of doubt, summarising the above 'state-of-the-art', let's spell this out:

- No amount of test coverage will ever be enough to represent the full range
   of behaviours of modern software running on a multicore microprocessor.

- Some of the current practices (e.g. for test coverage) embodied in the
   applicable safety standards (e.g. IEC 61508, ISO 26262) are inappropriate
   for modern systems, because:

   - They are based on a false premise, i.e. they assume that software should
     be deterministic.

   - They lead to significant waste of time and resources in pursuit of
     irrelevant goals.

   - They discourage use of open source software, leading directly to increased
     costs (e.g. spend on inferior proprietary solutions) and increased risks
     (e.g. creation of new software from scratch).

Mc Guire, Bulwahn et al. have demonstrated in multiple research papers what is already obvious to the expert software community, i.e. modern multicore systems running multi-threaded software exhibit stochastic behaviours. When considering rare events such as failure rates, we must apply statistical techniques to measure confidence in modern software in general (and Linux in particular), just as we do for mechanical, electrical and electronic systems.
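As a simple illustration of what "statistical confidence" can mean here, the sketch below computes a one-sided upper bound on the per-demand failure probability after a number of failure-free test runs. This is a standard textbook calculation, not a method taken from the cited papers, and it assumes the runs are independent and representative of real operation, which is a strong assumption for software. The function names are hypothetical.

```python
import math


def failure_probability_upper_bound(n_trials, confidence=0.95):
    """Upper bound on failure probability after n failure-free trials.

    Assumes independent, identically distributed trials (an assumption,
    and a strong one for real systems). Solves (1 - p)^n = 1 - confidence.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def trials_needed(target_p, confidence=0.95):
    """Failure-free trials needed to claim p <= target_p at given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_p))


# After 3000 failure-free runs the 95% bound is close to 3/3000,
# the familiar "rule of three" approximation.
print(failure_probability_upper_bound(3000))
print(trials_needed(1e-4))
```

The calculation shows why a reference value matters: without an agreed tolerable probability, there is no way to say how many failure-free runs are "enough".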

# Where next

We plan to move on to the core topic, which is how to establish confidence (in the statistical sense, as well as the normal interpretation of the word) about each particular release of complex software running on a modern multicore processor, where it is intended to perform critical functions that may cause harm if something goes wrong.

## References

[0] https://www.bsi.bund.de/EN/Themen/KRITIS-und-regulierte-Unternehmen/Kritische-Infrastrukturen/Allgemeine-Infos-zu-KRITIS/Stand-der-Technik-umsetzen/stand-der-technik-umsetzen_node.html
[1] Fred Brooks, "The Mythical Man-Month: Essays on Software Engineering"
[2] Robert Glass, "Facts and Fallacies of Software Engineering"
[3] A. Faulkner, "Preliminary assessment of Linux for safety related systems"
[4] L. Bulwahn, T. Ochs, and D. Wagner, "Research on an Open-Source Software Platform for Autonomous Driving Systems."
[5] Andreas Platschek, Nicholas Mc Guire, Lukas Bulwahn, "Certifying Linux: Lessons Learned in Three Years of SIL2LinuxMP"
[6] Imanol Allende, Nicholas Mc Guire, Jon Perez-Cerrolaza, Lisandro G. Monsalve, Jens Petersohn and Roman Obermaisser, "Statistical Test Coverage for Linux-Based Next-Generation Autonomous Safety-Related Systems"
[7] L. Cucu-Grosjean, L. Santinelli, M. Houston, C. Lo, T. Vardanega,
  L. Kosmidis, J. Abella, E. Mezzetti, E. Quinones, and F. J. Cazorla,
  "Measurement-based probabilistic timing analysis for multi-path programs"
[8] S. Draskovic, R. Ahmed, P. Huang, and L. Thiele, "Schedulability of probabilistic mixed-criticality systems"
[9] Mc Guire, N., & Allende, I. (2020). "Approaching certification of complex systems."
[10] Imanol Allende, "Statistical Path Coverage for Non-Deterministic Complex Safety-Related Software Testing", (2022)
[11] Yucong Chen, Xianzhi Tang, Shuaixin Xu, Fangfang Zhu, Qingguo Zhou & Tien-Hsiung Weng, "Analyzing execution path non-determinism of the Linux kernel in different scenarios"