[SystemSafety] Deterministic or not?

Paul Sherwood paul.sherwood at codethink.co.uk
Wed Aug 7 14:33:15 CEST 2024


Peter,
Thank you for this, it is very helpful. Comments inline...

On 2024-08-07 11:37, Prof. Dr. Peter Bernard Ladkin wrote:
> On 2024-08-07 10:58 , Paul Sherwood wrote:
>> Accepting Dewi's statement that "Prof. Peter Ladkin is one of the 
>> World's leading experts on practical statistical evaluation of 
>> critical software", I hope you don't mind me asking you this question 
>> directly.
>> 
>> Do you consider that the behaviour of critical software running on a 
>> multicore microprocessor can and should be deterministic?
> 
> Pass.
> 
> I think it certainly "can" be deterministic if the processor is so 
> configured and the software is simple enough and was built to 
> principles such as those of SPARK.  But I don't know that it is 
> reasonable to expect behaviour to be deterministic in general 
> safety-related circumstances. Twenty years ago, Kevin & co related a 
> Byzantine failure in critical civil-aviation control software; Dewi has 
> related a more recent case at Airbus. Certainly you don't want those 
> kinds of indeterminism occurring, which is why it is an airworthiness 
> issue. But to say "should" likely goes beyond the state of the 
> practice.
> 
> I attach a note Harold Thimbleby, Martyn and I wrote just over a month 
> ago. It contains a bit more detail, in particular a list (from Martyn) 
> of all the co-occurring things which can cause even single-threaded 
> perfect-semantics SW behaviour to be non-deterministic.

Given how recent it is, and the credentials of the authors, I believe I 
can consider the paper you provided to be a contribution to the 
state of the art, so I will add it to my list of references.

For what it's worth, I recently asked a colleague to document some of 
the factors which can affect determinism in microprocessors. His 
document is reproduced below, in the hope that you and the community 
here may find it interesting.

br
Paul

# Application Processor non-determinism

Modern application-class processors (those designed for running rich 
operating systems, as opposed to simple real-time or bare-metal code) 
tend to employ various hardware and software techniques to improve the 
average throughput of the system; that is, they improve performance for 
the average case.

Sadly, many of these techniques reduce the determinism of the system: 
the amount of time taken to perform a given calculation or task is no 
longer constant, and the observed jitter can be significantly increased, 
which in itself then has a knock-on effect on the real-time scheduling 
of other calculations and tasks.

Below is a non-exhaustive list of some of these techniques and a very
high-level (and in some cases simplified) description of how they work,
why they're there, and how they can cause required runtime to vary.

## Performance "hacks"

### Multi-level shared caches

A typical application processor has five levels of memory of decreasing 
speed,
and normally only the two extremes are visible in the programmer's model:

   - Register file (fastest, directly addressable from code)
   - Level 1 cache (order of tens of kB)
   - Level 2 cache (1-4MB)
   - Level 3 cache (4MB+)
   - Physical RAM (slowest, directly addressable from code)

There is another layer in some systems, called 'swap', which is even
slower than physical memory and is likewise invisible to the programmer.

When code needs to fetch data into a register (assuming a load/store
instruction set architecture), it reads through all the levels of cache.

The problem arises because newly read data evicts older data from the 
cache, and because only the register file and L1 cache are private to a 
core.  A CPU topology might resemble:

          | 16 GB DRAM physical memory ----------------------> |
          | 16 MB L3 ----------------------------------------> |
          | 4 MB L2 --------------> | 4MB L2 ----------------> |
          | 128kB L1   | 128kB L1   | 128kB L1   | 128 kB L1   |
          | CPU Core   | CPU Core   | CPU Core   | CPU Core    |

The important observation here is that in this example the L2 and L3 
caches, which do a good job of improving the general performance of the 
system, are shared between multiple cores, meaning that workloads 
running on a neighbouring core can replace the contents of those caches, 
worsening your performance in ways that cannot easily be modelled.

Additionally, in some systems the physical DRAM is not uniform: if a 
core requires data that is stored or cached in another cluster's 
higher-level caches or DRAM, it will need to communicate over some 
internal bus (or an external one, in the case of multiple physical CPU 
sockets) to access that data.
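
To make the shared-cache effect concrete, here is a minimal C sketch
(assuming a Linux/POSIX toolchain; the buffer sizes and the 64-byte line
size are illustrative assumptions, not properties of any particular part).
A buffer that fits in L1 has a small, stable per-access cost, while a
buffer larger than the shared caches is slower and far more sensitive to
what neighbouring cores happen to be doing:

    /* Minimal sketch: time one read per cache line over a small buffer
     * (fits in a typical L1) and over a large buffer (exceeds a typical
     * shared L3).  The large buffer's cost depends on the shared caches
     * and DRAM, so it varies with neighbouring workloads.  Sizes and the
     * 64-byte line are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double ns_per_line(volatile char *buf, size_t len, int passes)
    {
        struct timespec t0, t1;
        unsigned long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < len; i += 64)   /* one read per line */
                sink += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / ((double)passes * (len / 64));
    }

    int main(void)
    {
        size_t small = 16 * 1024;            /* well inside a typical L1  */
        size_t large = 64UL * 1024 * 1024;   /* well outside a typical L3 */
        char *a = calloc(small, 1), *b = calloc(large, 1);
        if (!a || !b) return 1;

        printf("small buffer: %.2f ns per line\n", ns_per_line(a, small, 4096));
        printf("large buffer: %.2f ns per line\n", ns_per_line(b, large, 1));
        free(a); free(b);
        return 0;
    }

Run once on an idle machine and again alongside a memory-hungry neighbour
and the large-buffer figure will typically move; the small-buffer figure
barely changes.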

### Memory management TLB

Access to data stored in RAM is inconsistent in speed, even when only a 
single thread is executing, because the OS kernel may be invoked to 
perform calculations and MMU reconfiguration on any given memory access.

Because application processors tend to have very large quantities of 
physical
memory (compared to a microcontroller) and need to run multiple
independently-developed processes at the same time, it is typical for 
the
operating system to provide a virtual view onto memory to each process 
so they
can all share the same working model - essentially existing and running 
from
the same virtual address space.  Each process's virtual address space 
maps onto
distinct physical pages (typically 4kB) of memory.  The mapping is 
controlled
by the OS's kernel by configuring the memory management unit (MMU).

Traditionally this reconfiguration would happen on each task or context 
switch, but as higher performance is sought, and as the number of pages 
of RAM in a typical system has grown to the point where storing the 
entire mapping in the MMU is infeasible, OSes and CPUs have moved to an 
on-demand model using a small cache for the mapping.

The cache, called the Translation Lookaside Buffer (TLB), contains a 
small
number of recently-used mappings.  If a process requests data from a 
virtual address space location for which there is no mapping in the TLB, 
the CPU will trigger a fault; the OS kernel picks this up, looks through 
its data structures in DRAM to work out which physical address the 
requested data resides at, updates the TLB to contain this mapping, and 
resumes the process.
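
To see the cost of that kernel involvement, here is a minimal C sketch
(assuming Linux; the region size is an illustrative assumption).  The
first pass over a freshly mmap()'d region takes a fault and a kernel
mapping update on every page; the second pass over the same pages finds
warm mappings and is much quicker and less variable:

    /* Minimal sketch: map an anonymous region and write one byte per page,
     * twice.  The first pass triggers page faults and kernel page-table /
     * TLB updates; the second pass mostly finds warm mappings.  The region
     * size is an illustrative assumption. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    static double ns_per_page(volatile char *base, size_t pages, size_t pagesz)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < pages; i++)
            base[i * pagesz] = 1;              /* one write per page */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / pages;
    }

    int main(void)
    {
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        size_t pages  = 65536;                 /* 256 MB at 4 kB pages */
        char *base = mmap(NULL, pages * pagesz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return 1;

        printf("first touch : %.1f ns/page\n", ns_per_page(base, pages, pagesz));
        printf("second touch: %.1f ns/page\n", ns_per_page(base, pages, pagesz));
        munmap(base, pages * pagesz);
        return 0;
    }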

### Superscalar and out-of-order execution

A modern application processor will have many instances of a given 
resource,
such as the arithmetic logic unit (ALU).  The reason for this is that 
the
instruction stream may have multiple instructions that are not 
interdependent
(for example, an instruction does not depend on the result of a prior
instruction), and so these can be executed simultaneously if hardware 
resources
allow.

This means that how long an instruction stalls while waiting for the 
results of another can vary, depending on the CPU's implementation and 
on what other processes are running concurrently on the CPU.

For example:

     0x0    MOV r0, #6
     0x4    MOV r1, #7
     0x8    MUL r3, r0, r1
     0xc    MOV r2, #8
     0x10   ADD r4, r2, r2

This code calculates 6 * 7 and 8 + 8.  The MUL instruction is dependent 
on the MOV instructions populating r0 and r1 before it can execute.  The 
ADD instruction is dependent on the MOV instruction populating r2.  
There is no dependency between the two calculations themselves, and 
modern application CPUs can analyse these dependencies and could perform 
the following execution plan:

1. Execute 0x0, 0x4 simultaneously.
2. Execute 0x8, 0xc simultaneously.
3. Execute 0x10.

However, a CPU capable of out-of-order execution can go a step further.  
It can
reorder the execution of instructions such that the dependencies are 
still met
but the code makes better use of the CPU's currently available resources 
and
has fewer stalls due to long-running instructions (such as MUL), so the 
code might end up being executed as:

1. Execute 0x0, 0x4, 0xc simultaneously.
2. Execute 0x8, 0x10 simultaneously.

The issue with determinism here is that the amount of concurrency a 
superscalar CPU achieves, and its decision about the order in which to 
execute instructions, depend on state not available to the programmer.  
This is worsened in situations where Simultaneous Multithreading (SMT, 
sometimes called Hyper-Threading) is employed by the CPU.  This is where 
two virtual CPUs share the resources of a single physical CPU: if one 
virtual CPU is not using a resource, the other *may* be able to continue 
execution.
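
A rough way to observe this from C is below (a minimal sketch; the
iteration count is an assumption, and it should be compiled with ordinary
optimisation, e.g. -O2).  The same total number of multiply-add operations
is run first as one dependent chain and then as four independent chains;
on a superscalar, out-of-order core the independent version usually
finishes several times faster because the hardware can overlap its
operations:

    /* Minimal sketch: the same number of multiply-adds, first as one
     * dependent chain (each step waits on the previous result), then as
     * four independent accumulators the core can execute concurrently.
     * The iteration count is an illustrative assumption. */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static double now_ns(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec * 1e9 + t.tv_nsec;
    }

    int main(void)
    {
        const long n = 200 * 1000 * 1000;
        volatile uint64_t sink = 0;

        double t0 = now_ns();
        uint64_t a = 1;
        for (long i = 0; i < n; i++)           /* one dependent chain */
            a = a * 3 + (uint64_t)i;
        sink += a;
        printf("dependent chain   : %.0f ms\n", (now_ns() - t0) / 1e6);

        t0 = now_ns();
        uint64_t b0 = 1, b1 = 1, b2 = 1, b3 = 1;
        for (long i = 0; i < n; i += 4) {      /* four independent chains */
            b0 = b0 * 3 + (uint64_t)i;
            b1 = b1 * 3 + (uint64_t)(i + 1);
            b2 = b2 * 3 + (uint64_t)(i + 2);
            b3 = b3 * 3 + (uint64_t)(i + 3);
        }
        sink += b0 + b1 + b2 + b3;
        printf("independent chains: %.0f ms\n", (now_ns() - t0) / 1e6);

        (void)sink;
        return 0;
    }

Exactly how large the gap is depends on the particular core's issue width
and multiplier latency, which is the point: that state is not visible to
the programmer.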

### Speculative execution / branch prediction

Speculative execution is the cause of the "Spectre" family of CPU 
security bugs.  Some modern CPUs will see a conditional branch 
instruction in their pipeline and take a guess as to whether the 
condition will be met, long before it is reached.  They use this guess 
to start fetching and processing instructions based on that prediction, 
so that an entire pipeline flush is not required at the branch point if 
they guessed correctly.  Sadly this can change the contents of some CPU 
caches (regardless of whether the branch condition was met or not), 
leading to Spectre (which infers data held by other processes based on 
cache hit/miss timing differences), but it also changes the determinism 
of the system, because the context the CPU uses to make its branch 
prediction is not visible to us: it may change each time the code is 
executed.  For example, the CPU might make its prediction based on what 
does or does not exist in the L1 code cache, meaning it might predict 
wrongly on the initial pass through the code but a second pass shortly 
after might improve its predictions.
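
The timing side of this can be seen from C with the classic
sorted-versus-unsorted demonstration below (a minimal sketch; the array
size and pass count are assumptions, and it should be compiled at around
-O1, since at higher optimisation levels the compiler may replace the
branch with a conditional move and hide the effect).  The program text and
the data are identical in both runs; only the predictor's hidden state
differs:

    /* Minimal sketch: sum all values >= 128 in a byte array, first with
     * the values in random order, then sorted.  Once sorted, the branch
     * predictor learns the pattern and the same loop runs much faster.
     * Compile at around -O1 so the branch is kept as a branch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int cmp_byte(const void *a, const void *b)
    {
        return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
    }

    static double sum_big_ms(const unsigned char *v, size_t n, unsigned long *out)
    {
        struct timespec t0, t1;
        unsigned long total = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int pass = 0; pass < 50; pass++)
            for (size_t i = 0; i < n; i++)
                if (v[i] >= 128)               /* data-dependent branch */
                    total += v[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        *out = total;
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    int main(void)
    {
        size_t n = 1u << 20;                   /* 1 MiB of random bytes */
        unsigned char *v = malloc(n);
        unsigned long total;
        if (!v) return 1;
        for (size_t i = 0; i < n; i++)
            v[i] = (unsigned char)rand();

        printf("unsorted: %.0f ms\n", sum_big_ms(v, n, &total));
        qsort(v, n, 1, cmp_byte);
        printf("sorted  : %.0f ms (total unchanged: %lu)\n",
               sum_big_ms(v, n, &total), total);
        free(v);
        return 0;
    }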

## Architectural factors

### Asymmetrical cores

Many application processors that are designed for mobile use (phones, 
laptops,
cars) are multi-core, but those cores are not all of the same design.  
Arm calls this big.LITTLE; Intel calls its equivalent E/P 
(Efficiency/Performance) cores.

Typically big/performance cores will be superscalar and out-of-order with 
large
dedicated L2 caches, and little/efficiency cores will be in-order and 
share
caches.

In a laptop or mobile phone context, heavy loads and "race to sleep" 
loads can
run on the big/performance cores while consistent background or 
housekeeping
loads are run on the little/efficiency cores.

This means that threads which are migrated around the available cores in 
a system may see differing performance, so careful affinity pinning or 
power management is required.
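
Where that matters, the usual mitigation is to pin threads to particular
cores explicitly.  A minimal Linux sketch follows (core number 2 is an
illustrative assumption; a real system would first discover which core
numbers correspond to the big and little clusters):

    /* Minimal sketch (Linux): pin the calling thread to one core so the
     * scheduler cannot migrate it between big/performance and
     * little/efficiency cores.  Core 2 is an illustrative assumption. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                            /* allow only CPU 2 */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = self */
            perror("sched_setaffinity");
            return 1;
        }
        printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }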

### Clock drift/thermal noise

Minor though this effect is, a system typically has a master oscillator 
which generates a clock pulse at a given frequency +/- a few parts per 
million (varying with the price of the clock).  However, these 
oscillators are often temperature-sensitive and their consistency 
worsens as temperature changes.  Similarly, large application processors 
tend to get hotter than microcontrollers, and thermal noise in the SoC 
die or clock circuitry could conceivably cause measurable differences in 
execution time, with a knock-on effect on how future threads are 
scheduled.

### Serialised RAM/storage access

Typically RAM chips can only be asked to access one address at a time, 
and they
will take an amount of time to find and return the data. It is not 
typical in
embedded systems for each core (or hyperthread) to have its own 
dedicated RAM
bus, and so requests to physical RAM must be serialised and performed 
one at a
time.  This means that one core may stall if a neighbour recently 
fetched (or
wrote) data which was not in any of the levels of cache, altering its
performance characteristics.
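
As a minimal illustration (a C/pthreads sketch; the buffer size is an
assumption, and the size of the effect depends entirely on the particular
memory system), timing the same walk over a large buffer with and without
a neighbouring thread hammering DRAM shows the serialisation cost
directly:

    /* Minimal sketch: time a walk over a large buffer alone, then again
     * while a second thread streams over its own large buffer.  Both
     * buffers exceed the caches, so the two cores queue for the same
     * memory controller and DRAM.  Build with -pthread; the buffer size
     * is an illustrative assumption. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUF (64UL * 1024 * 1024)

    static volatile int stop;

    static void *hog(void *arg)                /* neighbouring workload */
    {
        volatile char *b = arg;
        while (!stop)
            for (size_t i = 0; i < BUF; i += 64)
                b[i]++;
        return NULL;
    }

    static double walk_ms(volatile char *b)
    {
        struct timespec t0, t1;
        unsigned long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < BUF; i += 64)
            sink += b[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    int main(void)
    {
        char *victim = calloc(BUF, 1), *noise = calloc(BUF, 1);
        pthread_t t;
        if (!victim || !noise) return 1;

        printf("alone         : %.1f ms\n", walk_ms(victim));

        pthread_create(&t, NULL, hog, noise);
        printf("with neighbour: %.1f ms\n", walk_ms(victim));

        stop = 1;
        pthread_join(t, NULL);
        free(victim); free(noise);
        return 0;
    }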

### Hypervisors

Hypervisors are just operating systems which multitask processes like 
any other, except with the special consideration of providing an 
emulation of the programmer's model at a higher privilege level so that 
another kernel can run.  As such, the issues described above still 
remain, and in many cases become worse (there are now *two* layers of 
TLB and interrupt management).





