[SystemSafety] State of the art for "safe Linux"

Robert P Schaefer rps at mit.edu
Mon Aug 5 14:01:33 CEST 2024


your "#where to next" sounds like “turtles all the way down”

why not break the cycle of attempting to square the circle and 

model deterministic (non-software or at least non-complex software) designs for monitoring non-deterministic systems?

> On Aug 5, 2024, at 7:33 AM, Paul Sherwood <paul.sherwood at codethink.co.uk> wrote:
> 
> I've been continuing to wrestle with the dragon of 'safe Linux' with colleagues
> and we are now preparing for renewed discussion with certification authorities.
> 
> The following is my attempt at a review of the current literature and I would
> very much appreciate feedback, and references to other relevant materials I may
> have missed.
> 
> br
> Paul
> 
> # State-Of-The-Art
> 
> One of the sensible principles expressed by the safety engineering community
> is that practitioners should keep up with (and learn from) the
> 'state-of-the-art'.
> 
> In fact this is stated directly in ISO 26262: "The achievement of an objective
> of the ISO 26262 series of standards is judged considering the corresponding
> requirements of these standards, the state-of-the-art regarding technical
> solutions and the applicable engineering domain knowledge at the time of the
> development."
> 
> This is true not only for the standards, but also in some legal frameworks.
> For example, quoting from the German Federal Portal [0]:
> 
> "'State of the art' is a common legal term. Technical development is quicker
> than legislation. For this reason, it has proven successful for many years in
> many areas of law to refer to the 'state of the art' in laws instead of trying
> to specify concrete technical requirements in the law. What is the state of
> the art at a certain point in time can be determined, for example, on the
> basis of existing national or international standards such as DIN, ISO, DKE or
> ISO/IEC or based on role models in the respective areas that have been
> successfully tested in practice. Since the necessary technical measures may
> differ depending on the specific case, it is not possible to describe the
> 'state of the art' in generally applicable and conclusive terms."
> 
> Here we consider what is the 'state-of-the-art' in terms of published research
> regarding the use of complex software in modern safety-critical systems, and
> in particular the use of Linux-based software.
> 
> # Research
> 
> We mainly consider the most recent research we can find, even though some
> software knowledge is arguably timeless e.g. Brooks (1975)[1], Glass (2002)[2].
> 
> It appears that the first serious consideration of the use of Linux for
> safety-related systems was by Faulkner (2002)[3]. Faulkner concluded that
> "'vanilla' Linux would be broadly acceptable for use in safety related
> applications of SIL 1 and SIL 2 integrity" and that "it may also be feasible
> to certify Linux for use in SIL 3 applications by the provision of some
> further evidence from testing and analysis."
> 
> Faulkner recommended funding a project to achieve SIL 3 certification. As far
> as we can tell, no such project was actually undertaken.
> 
> A decade later Bulwahn and Ochs (2013)[4] stated that use of Linux could
> reduce BMW's development and qualification costs while improving quality
> and confidence.
> 
> In 2014, OSADL established the SIL2LinuxMP project to seek SIL2 certification
> of Linux. Although this initiative was active for three years, and led
> by motivated experts with significant industry sponsors including BMW and
> Intel, Platschek et al. (2018)[5] ultimately reported that certification was
> "in reach" but not achieved.
> 
> Allende et al. (2021)[6] argued that companies and governments already
> rely heavily on Linux for critical applications, and there is "remarkable
> interest" in certification of Linux for use in safety-related systems, even
> though it was not designed with functional safety in mind, and in spite of the
> size and complexity of the Linux kernel itself.
> 
> Allende and his colleagues noted that:
> 
> - Traditional safety techniques and measures have not been defined for
>  safety-related systems involving modern features such as artificial
>  intelligence, high-performance computing devices, general purpose operating
>  systems and security requirements.
> 
> - The test coverage in IEC 61508 and ISO 26262 is "hardly achievable (if
>  feasible)" for systems involving these modern features, and therefore
>  "novel complementary methods and approaches" are required.
> 
> Building on prior art, including L. Cucu-Grosjean et al. (2012)[7],
> S. Draskovic et al. (2021)[8], Mc Guire and Allende (2020)[9], and in
> anticipation of Allende's 2022 PhD thesis [10], the authors applied statistical
> techniques, including Maximum Likelihood Estimation and Simple Good-Turing, on
> a practical case study involving a Linux-based Autonomous Emergency Braking
> system. They calculated a probability of software-related failure for their
> example (1.42e−4), but noted that current safety standards "do not provide any
> reference value to determine whether this probability result is within a
> tolerable risk".
> 
> The reason for this gap seems to be that the original standards authors
> assumed (perhaps based on their own experience with simpler single-core
> microcontroller systems) that appropriately-written software behaves such that
> it will always either pass or fail, with certainty. Thus the standards consider
> probability of failure for hardware, but not for software.
> 
> In light of this fundamental flaw, it would be tempting to call foul on the
> standards, at least with respect to the implementation of modern software
> running on multicore systems, but there is more to learn.
> 
> Some of the cited authors, e.g. Lukas Bulwahn, Nicholas Mc Guire and Jens
> Petersohn, are expert practitioners who have dedicated a significant portion
> of their careers to the work of deploying Linux in critical production
> systems. In spite of the demonstrable track-record of Linux-based solutions in
> critical systems, the sustained efforts by these and many other experts to
> advance the state-of-the-art by way of research initiatives (e.g. SIL2LinuxMP,
> ELISA), and the clear commercial opportunity for use of Linux in
> safety-related systems, it seems that so far no-one has been able to establish
> a generally viable method for certification of Linux-based systems which would
> be acceptable to Certification Authorities.
> 
> Note that there is still resistance by some in the safety engineering
> community towards the use of open source in general, and Linux in particular,
> more than two decades after Faulkner [3] originally concluded that Linux would
> be suitable for SIL 1 and SIL 2, and recommended funding a project to certify
> Linux to SIL 3.
> 
> This resistance remains in spite of the concerted efforts of several thousand
> software experts who have regularly contributed over the last twenty years to
> improve Linux since Faulkner originally recommended it, and in spite of the
> broad adoption of Linux for government initiatives, critical infrastructure,
> telecoms, mobile devices, scientific instruments and space exploration.
> 
> More than a decade after BMW's original research [4] identified the opportunity,
> there is still no certified Linux solution for them to adopt.
> 
> Allende et al. [6] drew the following conclusion:
> 
> "We also consider the execution probability estimation of untested paths as a
> step forwards in the field of test coverage of complex safety-related systems.
> Contrary to the techniques that have been employed traditionally, we take
> into account the uncertainty that these systems possess... This method also
> contributes in providing adequate explanation when full coverage is not
> achievable, as stated by the IEC 61508 standard (IEC 61508-3 Ed2 Table B.2).
> Nonetheless, there is a need for a reference value, equivalent to PFH ranges
> for hardware failures, to comprehend the risk associated with untested paths.
> Consequently, we believe this needs to be discussed with Certification
> Authorities (CAs)."
> 
> So far it appears that Allende's PhD work [10] has been cited only once, by
> Chen et al. (2023)[11]. Chen and colleagues explored the variability in
> Linux path execution under various conditions, and they "demonstrated that
> both system load and file system influence the path variability of the Linux
> kernel."
> 
> In other words, and more generally, software on a multicore device will behave
> differently based on
> 
> - how much load is placed on the system
> - how it is configured
> 
> Note that this must be true for traditional software (including proprietary
> programs that have been safety-certified) too, not least because the hardware
> itself is non-deterministic, so any software running on it is bound to be
> subject to variations in its input states, creating conditions and
> opportunities for variation in its outputs.
> 
> We have found no research citations yet for the work by Chen et al. [11],
> presumably because it is recent.
> 
> # Conclusion
> 
> While the conclusion of Chen et al. [11] may be 'state-of-the-art' in research
> terms, it's been widely understood by software practitioners for some decades.
> Arguably it should have been obvious to the authors of the standards.
> 
> For the avoidance of doubt, summarising the above 'state-of-the-art', let's
> spell this out:
> 
> - No amount of test coverage will ever be enough to represent the full range
>  of behaviours of modern software running on a multicore microprocessor.
> 
> - Some of the current practices (e.g. for test coverage) embodied in the
>  applicable safety standards (e.g. IEC 61508, ISO 26262) are inappropriate
>  for modern systems, because:
> 
>  - They are based on a false premise, i.e. they assume that software should
>    be deterministic.
> 
>  - They lead to significant waste of time and resources in pursuit of
>    irrelevant goals.
> 
>  - They discourage use of open source software, leading directly to increased
>    costs (e.g. spend on inferior proprietary solutions) and increased risks
>    (e.g. creation of new software from scratch).
> 
> Mc Guire, Bulwahn et al. have demonstrated in multiple research papers what is
> already obvious to the expert software community, i.e. modern multicore systems
> running multi-threaded software exhibit stochastic behaviours. When
> considering rare events such as failure rates, we must apply statistical
> techniques to measure confidence in modern software in general (and Linux in
> particular), just as we do for mechanical, electrical and electronic systems.
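A minimal example of such a statistical technique, under the (strong) assumption of independent and representative test runs, is the classical upper confidence bound on a per-run failure probability when no failures have been observed. The function below is our own illustrative sketch, not a method from the cited papers:

```python
def failure_rate_upper_bound(trials, confidence=0.95):
    """Upper bound on the per-trial failure probability when zero
    failures were observed in 'trials' independent runs: solve
    (1 - p)**trials = 1 - confidence for p. For large 'trials' this
    approaches -ln(1 - confidence) / trials, the well-known
    'rule of three' (~3/trials at 95% confidence)."""
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)

# Example: 10,000 fault-free test runs bound the per-run failure
# probability at roughly 3e-4 with 95% confidence
print(failure_rate_upper_bound(10_000))
```

A bound like this only becomes actionable once there is a reference value to compare it against, which is exactly the gap Allende et al. ask Certification Authorities to address.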
> 
> # Where next
> 
> We plan to move on to the core topic, which is how to establish confidence
> (in the statistical sense, as well as the normal interpretation of the word)
> about each particular release of complex software running on a modern multicore
> processor, where it is intended to perform critical functions that may cause
> harm if something goes wrong.
> 
> ## References
> 
> [0] (https://www.bsi.bund.de/EN/Themen/KRITIS-und-regulierte-Unternehmen/Kritische-Infrastrukturen/Allgemeine-Infos-zu-KRITIS/Stand-der-Technik-umsetzen/stand-der-technik-umsetzen_node.html)
> [1] Fred Brooks, "The Mythical Man-Month: Essays on Software Engineering"
> [2] Robert Glass, "Facts and Fallacies of Software Engineering"
> [3] A. Faulkner, "Preliminary assessment of Linux for safety related systems"
> [4] L. Bulwahn, T. Ochs, and D. Wagner, "Research on an Open-Source Software
> Platform for Autonomous Driving Systems."
> [5] Andreas Platschek, Nicholas Mc Guire, Lukas Bulwahn, "Certifying Linux:
> Lessons Learned in Three Years of SIL2LinuxMP"
> [6] Imanol Allende, Nicholas Mc Guire, Jon Perez-Cerrolaza, Lisandro G. Monsalve,
> Jens Petersohn and Roman Obermaisser, "Statistical Test Coverage for
> Linux-Based Next-Generation Autonomous Safety-Related Systems"
> [7] L. Cucu-Grosjean, L. Santinelli, M. Houston, C. Lo, T. Vardanega,
> L. Kosmidis, J. Abella, E. Mezzetti, E. Quinones, and F. J. Cazorla,
> "Measurement-based probabilistic timing analysis for multi-path programs"
> [8] S. Draskovic, R. Ahmed, P. Huang, and L. Thiele, "Schedulability of
> probabilistic mixed-criticality systems"
> [9] Mc Guire, N., & Allende, I. (2020). "Approaching certification of complex
> systems."
> [10] Imanol Allende, "Statistical Path Coverage for Non-Deterministic Complex
> Safety-Related Software Testing", (2022)
> [11] Yucong Chen, Xianzhi Tang, Shuaixin Xu, Fangfang Zhu, Qingguo Zhou, and
> Tien-Hsiung Weng, "Analyzing execution path non-determinism of the Linux kernel
> in different scenarios"
> _______________________________________________
> The System Safety Mailing List
> systemsafety at TechFak.Uni-Bielefeld.DE
> Manage your subscription: https://lists.techfak.uni-bielefeld.de/mailman/listinfo/systemsafety


