[SystemSafety] State of the art for "safe Linux"

David Crocker dcrocker at eschertech.com
Fri Aug 9 20:53:00 CEST 2024


Paul,

I don't think you've said anything about the application that you are 
considering using Linux for. So IMO the first question to ask is: is 
Linux an appropriate OS for the application? If it's a complex, primarily 
non-real-time application, you need to interface to a lot of peripherals 
via USB or other standard commercial interfaces for which Linux device 
drivers are available, and the alternative would most likely be MS 
Windows, then the answer may be yes. If not, Linux may well be overkill.

A company that I spend time working with needs to develop a new user 
interface device based on a touch screen. The choice is between:

- a SoC device with 64MB of dynamic RAM, running a custom version of Linux 
and using Qt as the GUI framework; and
- a much simpler MCU with about 600kB of SRAM, running FreeRTOS and LVGL 
as the GUI framework.

This app isn't safety-critical; but if it were, I would definitely 
recommend the second option, to simplify the certification process and 
to avoid having to maintain the Linux port.

Regards David

On 2024-08-05 12:33, Paul Sherwood wrote:
> I've been continuing to wrestle with the dragon of 'safe Linux' with 
> colleagues
> and we are now preparing for renewed discussion with certification 
> authorities.
>
> The following is my attempt at a review of the current literature and 
> I would
> very much appreciate feedback, and references to other relevant 
> materials I may
> have missed.
>
> br
> Paul
>
> # State-Of-The-Art
>
> One of the sensible principles expressed by the safety engineering 
> community
> is that practitioners should keep up with (and learn from) the
> 'state-of-the-art'.
>
> In fact this is stated directly in ISO 26262: "The achievement of an 
> objective
> of the ISO 26262 series of standards is judged considering the 
> corresponding
> requirements of these standards, the state-of-the-art regarding technical
> solutions and the applicable engineering domain knowledge at the time 
> of the
> development."
>
> This is true not only in the standards, but also in some legal
> frameworks.
> For example, quoting from the German Federal Portal [0]:
>
> "'State of the art' is a common legal term. Technical development is 
> quicker
> than legislation. For this reason, it has proven successful for many 
> years in
> many areas of law to refer to the 'state of the art' in laws instead 
> of trying
> to specify concrete technical requirements in the law. What is the 
> state of
> the art at a certain point in time can be determined, for example, on the
> basis of existing national or international standards such as DIN, 
> ISO, DKE or
> ISO/IEC or based on role models in the respective areas that have been
> successfully tested in practice. Since the necessary technical 
> measures may
> differ depending on the specific case, it is not possible to describe the
> 'state of the art' in generally applicable and conclusive terms."
>
> Here we consider the 'state-of-the-art' in published research on the use
> of complex software in modern safety-critical systems, and in particular
> the use of Linux-based software.
>
> # Research
>
> We mainly consider the most recent research we can find, even though some
> software knowledge is arguably timeless, e.g. Brooks (1975)[1] and Glass
> (2002)[2].
>
> It appears that the first serious consideration of the use of Linux for
> safety-related systems was by Faulkner (2002)[3]. Faulkner concluded that
> "'vanilla' Linux would be broadly acceptable for use in safety related
> applications of SIL 1 and SIL 2 integrity" and that "it may also be 
> feasible
> to certify Linux for use in SIL 3 applications by the provision of some
> further evidence from testing and analysis."
>
> Faulkner recommended funding a project to achieve SIL 3 certification. As
> far as we can tell, no such project was ever undertaken.
>
> A decade later Bulwahn and Ochs (2013)[4] stated that use of Linux could
> reduce BMW's development and qualification costs while improving quality
> and confidence.
>
> In 2014, OSADL established the SIL2LinuxMP project to seek SIL2 
> certification
> of Linux. Although this initiative was active for three years, and led
> by motivated experts with significant industry sponsors including BMW and
> Intel, Platschek et al. (2018)[5] ultimately reported that 
> certification was
> "in reach" but not achieved.
>
> Allende et al. (2021)[6] argued that companies and governments already
> rely heavily on Linux for critical applications, and there is "remarkable
> interest" in certification of Linux for use in safety-related systems, 
> even
> though it was not designed with functional safety in mind, and in 
> spite of the
> size and complexity of the Linux kernel itself.
>
> Allende and his colleagues noted that:
>
> - Traditional safety techniques and measures have not been defined for
>   safety-related systems involving modern features such as artificial
>   intelligence, high-performance computing devices, general purpose 
> operating
>   systems and security requirements.
>
> - The test coverage in IEC 61508 and ISO 26262 is "hardly achievable (if
>   feasible)" for systems involving these modern features, and therefore
>   "novel complementary methods and approaches" are required.
>
> Building on prior art, including L. Cucu-Grosjean et al. (2012)[7],
> S. Draskovic et al. (2021)[8] and Mc Guire and Allende (2020)[9], and in
> anticipation of Allende's 2022 PhD thesis [10], the authors applied
> statistical techniques, including Maximum Likelihood Estimation and Simple
> Good-Turing, to a practical case study involving a Linux-based Autonomous
> Emergency Braking system. They calculated a probability of software-related
> failure for their example (1.42e-4), but noted that current safety standards
> "do not provide any reference value to determine whether this probability
> result is within a tolerable risk".
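>
> To illustrate the flavour of these techniques (a minimal Python sketch,
> not the authors' actual tooling): the basic Good-Turing estimator puts the
> total probability mass of never-observed execution paths at N1/N, where N1
> is the number of paths seen exactly once across N observations. The full
> Simple Good-Turing method used in [6] additionally smooths the
> frequency-of-frequency counts, which this sketch omits; the path data
> below are hypothetical.
>
> ```python
> from collections import Counter
>
> def good_turing_unseen_mass(observed_paths):
>     """Basic Good-Turing estimate of the total probability mass of
>     execution paths never observed during testing: P0 = N1 / N, where
>     N1 is the number of paths seen exactly once and N is the total
>     number of observations."""
>     counts = Counter(observed_paths)
>     n1 = sum(1 for c in counts.values() if c == 1)
>     n = len(observed_paths)
>     return n1 / n if n else 1.0  # no data: all mass is unseen
>
> # Hypothetical data: path identifiers (e.g. hashes of the basic-block
> # sequence executed) collected from 10,000 test runs.
> runs = ["path_a"] * 8990 + ["path_b"] * 1000 + [f"rare_{i}" for i in range(10)]
> print(f"Estimated probability of an untested path: "
>       f"{good_turing_unseen_mass(runs):.1e}")
> ```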
>
> The reason for this gap seems to be that the original standards authors
> assumed (perhaps based on their own experience with simpler single-core
> microcontroller systems) that appropriately-written software is
> deterministic: for a given input it will always either pass or fail, with
> certainty. Thus the standards consider probability of failure for
> hardware, but not for software.
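>
> (For comparison, IEC 61508 does provide such reference bands for random
> hardware failures: for example, a SIL 2 function operating in high-demand
> or continuous mode must have an average frequency of dangerous failure
> (PFH) between 10^-7 and 10^-6 per hour. There is no analogous band against
> which a software figure such as the 1.42e-4 above can be judged.)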
>
> In light of this fundamental flaw it would be tempting to call foul on 
> the
> standards, at least with respect to the implementation of modern software
> running on multicore systems, but there is more to learn.
>
> Some of the cited authors, e.g. Lukas Bulwahn, Nicholas Mc Guire and Jens
> Petersohn, are expert practitioners who have dedicated a significant 
> portion
> of their careers to the work of deploying Linux in critical production
> systems. In spite of the demonstrable track-record of Linux-based 
> solutions in
> critical systems, the sustained efforts by these and many other 
> experts to
> advance the state-of-the-art by way of research initiatives (e.g. 
> SIL2LinuxMP,
> ELISA), and the clear commercial opportunity for use of Linux in
> safety-related systems, it seems that so far no-one has been able to 
> establish
> a generally viable method for certification of Linux-based systems 
> which would
> be acceptable to Certification Authorities.
>
> Note that there is still resistance by some in the safety engineering
> community towards the use of open source in general, and Linux in 
> particular,
> more than two decades after Faulkner [3] originally concluded that 
> Linux would
> be suitable for SIL 1 and SIL 2, and recommended funding a project to 
> certify
> Linux to SIL 3.
>
> This resistance remains in spite of the concerted efforts of several 
> thousand
> software experts who have regularly contributed over the last twenty 
> years to
> improve Linux since Faulkner originally recommended it, and in spite 
> of the
> broad adoption of Linux for government initiatives, critical 
> infrastructure,
> telecoms, mobile devices, scientific instruments and space exploration.
>
> More than a decade after BMW's original research [4] identified the 
> opportunity,
> there is still no certified Linux solution for them to adopt.
>
> Allende et al. [6] drew the following conclusion:
>
> "We also consider the execution probability estimation of untested 
> paths as a
> step forwards in the field of test coverage of complex safety-related 
> systems.
> Contrary to the techniques that have been employed traditionally, we take
> into account the uncertainty that these systems possess... This method 
> also
> contributes in providing adequate explanation when full coverage is not
> achievable, as stated by the IEC 61508 standard (IEC 61508-3 Ed2 Table 
> B.2).
> Nonetheless, there is a need for a reference value, equivalent to PFH 
> ranges
> for hardware failures, to comprehend the risk associated with untested 
> paths.
> Consequently, we believe this needs to be discussed with Certification
> Authorities (CAs)."
>
> So far it appears that Allende's PhD work [10] has been cited only 
> once, by
> Chen et al. (2023)[11]. Chen and colleagues explored the variability in
> Linux path execution under various conditions, and they "demonstrated 
> that
> both system load and file system influence the path variability of the 
> Linux
> kernel."
>
> In other words, and more generally, software on a multicore device will
> behave differently depending on:
>
> - how much load is placed on the system
> - how it is configured
>
> Note that this must also be true for traditional software (including
> proprietary programs that have been safety-certified), not least because
> the hardware itself is non-deterministic: any software running on it is
> therefore subject to variations in its input states, creating conditions
> and opportunities for variation in its outputs.
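>
> A toy demonstration of this point (a self-contained Python sketch, not the
> kernel-tracing setup used by Chen et al. [11]): run an identical
> multi-threaded program repeatedly and hash the interleaving of its
> threads. Scheduling alone yields many distinct 'paths' for bit-identical
> code, and the distribution shifts with system load.
>
> ```python
> import hashlib
> import threading
> from collections import Counter
>
> def run_once(n_threads=4, n_appends=1000):
>     """Run identical threads that append their IDs to a shared list,
>     then hash the resulting interleaving -- a crude stand-in for the
>     run's 'execution path'."""
>     order = []
>     def worker(tid):
>         for _ in range(n_appends):
>             order.append(tid)
>     threads = [threading.Thread(target=worker, args=(t,))
>                for t in range(n_threads)]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
>     return hashlib.sha256(bytes(order)).hexdigest()[:12]
>
> # Fifty runs of byte-identical code: the scheduler typically produces
> # a different interleaving almost every time.
> paths = Counter(run_once() for _ in range(50))
> print(f"{len(paths)} distinct interleavings observed in 50 runs")
> ```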
>
> We have found no research citations yet for the work by Chen et al. [11],
> presumably because it is recent.
>
> # Conclusion
>
> While the conclusion of Chen et al. [11] may be 'state-of-the-art' in 
> research
> terms, it's been widely understood by software practitioners for some 
> decades.
> Arguably it should have been obvious to the authors of the standards.
>
> For the avoidance of doubt, summarising the above 'state-of-the-art', 
> let's
> spell this out:
>
> - No amount of test coverage will ever be enough to represent the full 
> range
>   of behaviours of modern software running on a multicore microprocessor.
>
> - Some of the current practices (e.g. for test coverage) embodied in the
>   applicable safety standards (e.g. IEC 61508, ISO 26262) are inappropriate
>   for modern systems, because:
>
>   - They are based on a false premise, i.e. they assume that software
>     behaves deterministically.
>
>   - They lead to significant waste of time and resources in pursuit of
>     irrelevant goals.
>
>   - They discourage use of open source software, leading directly to 
> increased
>     costs (e.g. spend on inferior proprietary solutions) and increased 
> risks
>     (e.g. creation of new software from scratch).
>
> Mc Guire, Bulwahn et al. have demonstrated in multiple research papers what
> is already obvious to the expert software community, i.e. that modern
> multicore systems running multi-threaded software exhibit stochastic
> behaviours. When considering rare events such as failure rates, we must
> apply statistical techniques to measure confidence in modern software in
> general (and Linux in particular), just as we do for mechanical, electrical
> and electronic systems.
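>
> To make the statistical framing concrete (a standard textbook bound,
> sketched in Python; it is not taken from any of the papers cited above):
> after n independent, failure-free test runs, an exact one-sided upper
> confidence bound on the per-run failure probability p follows from
> solving (1 - p)^n = 1 - confidence; for a 95% bound and large n this
> reduces to the familiar 'rule of three', p <= 3/n (approximately).
>
> ```python
> def upper_bound_failure_prob(n_runs, confidence=0.95):
>     """Exact one-sided upper confidence bound on the per-run failure
>     probability after n_runs independent failure-free runs:
>     solve (1 - p)**n = 1 - confidence, i.e.
>     p = 1 - (1 - confidence)**(1 / n)."""
>     alpha = 1.0 - confidence
>     return 1.0 - alpha ** (1.0 / n_runs)
>
> for n in (1_000, 1_000_000, 1_000_000_000):
>     p = upper_bound_failure_prob(n)
>     print(f"{n:>13,} failure-free runs -> p <= {p:.1e} "
>           f"(rule of three: {3 / n:.1e})")
> ```
>
> The corollary, long noted in the software-reliability literature, is the
> sheer number of failure-free observations needed before testing alone can
> support very small failure probabilities.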
>
> # Where next
>
> We plan to move on to the core topic, which is how to establish confidence
> (in the statistical sense, as well as in the everyday sense of the word) in
> each particular release of complex software running on a modern multicore
> processor, where it is intended to perform critical functions that may
> cause harm if something goes wrong.
>
> ## References
>
> [0] https://www.bsi.bund.de/EN/Themen/KRITIS-und-regulierte-Unternehmen/Kritische-Infrastrukturen/Allgemeine-Infos-zu-KRITIS/Stand-der-Technik-umsetzen/stand-der-technik-umsetzen_node.html
> [1] F. Brooks, "The Mythical Man-Month: Essays on Software Engineering"
> (1975)
> [2] R. Glass, "Facts and Fallacies of Software Engineering" (2002)
> [3] A. Faulkner, "Preliminary assessment of Linux for safety related
> systems" (2002)
> [4] L. Bulwahn, T. Ochs and D. Wagner, "Research on an Open-Source Software
> Platform for Autonomous Driving Systems" (2013)
> [5] A. Platschek, N. Mc Guire and L. Bulwahn, "Certifying Linux: Lessons
> Learned in Three Years of SIL2LinuxMP" (2018)
> [6] I. Allende, N. Mc Guire, J. Perez-Cerrolaza, L. G. Monsalve,
> J. Petersohn and R. Obermaisser, "Statistical Test Coverage for Linux-Based
> Next-Generation Autonomous Safety-Related Systems" (2021)
> [7] L. Cucu-Grosjean, L. Santinelli, M. Houston, C. Lo, T. Vardanega,
> L. Kosmidis, J. Abella, E. Mezzetti, E. Quinones and F. J. Cazorla,
> "Measurement-based probabilistic timing analysis for multi-path programs"
> (2012)
> [8] S. Draskovic, R. Ahmed, P. Huang and L. Thiele, "Schedulability of
> probabilistic mixed-criticality systems" (2021)
> [9] N. Mc Guire and I. Allende, "Approaching certification of complex
> systems" (2020)
> [10] I. Allende, "Statistical Path Coverage for Non-Deterministic Complex
> Safety-Related Software Testing" (2022)
> [11] Y. Chen, X. Tang, S. Xu, F. Zhu, Q. Zhou and T.-H. Weng, "Analyzing
> execution path non-determinism of the Linux kernel in different scenarios"
> (2023)

-- 
David Crocker
+44 7977 211486


