How to Prevent Disasters in Mission Critical Systems

A mission-critical system is, by definition, a system whose failure can have disastrous consequences: death, destruction, the bankruptcy of entire companies, economic crises, …

Examples of such mission-critical systems include:

  • Air traffic control,
  • Industrial processes, e.g. in the chemical industry,
  • Nuclear power plants,
  • Railway interlocking systems,
  • Maritime control,
  • Large-scale stock trading systems.

Of course, such systems are usually designed with redundancy, so that a single mistake should not cause a failure of the whole system.

One of the most critical components of such a system is human-machine interaction.

Why is this component so critical?

  • Humans usually have overriding power over decisions made by the technical system. Therefore, human error can have severe consequences.
  • It is an interaction between two completely different kinds of systems: human psychology on one side, a computer system on the other.
  • Often, the system is primarily controlled by humans, i.e., the technical system cannot make decisions on its own.
  • Usually, the technical system provides information to the human. This information is the basis for the decisions made by that human. Therefore, incorrect information can cause wrong human decisions. Correct information presented in a confusing way can also cause wrong human decisions.
  • Often, it is hard to implement redundancy at this interface. Especially in systems which require quick decision-making by humans, it is impossible to require multiple humans to authorize a decision.
    So, human-machine interaction is likely to be a single point of failure in many systems.
    (A single point of failure means that a single mistake at that point can directly cause a disaster.)

This leads us to the following questions:

How to prevent disastrous mistakes in human-machine interaction?

There are two main categories of disaster prevention:

  • Design the system carefully such that mistakes are less likely or such that mistakes have less severe consequences. “System design” includes technical design, staff training, work procedures, …
  • Learn from mistakes and (near-)disasters, and make suitable adaptations to the system, so that these kinds of mistakes cannot happen anymore.

Obviously you need both.

Here, we assume that the system is already carefully designed.
Therefore we concentrate on the latter, i.e., learning from mistakes and (near-)disasters.

How to learn from mistakes:

You can only learn from mistakes when you know what the mistake is.

In the case of a (near-)disaster, it is often difficult to find the cause. In most cases, the cause is not obvious, so you need to analyze all the data that is available from the system.

If the data is insufficient, you may not be able to correctly identify the real cause. If you fail to identify the real cause, you are in danger of causing the same type of disaster again.

If you fail to learn from a near-disaster, you are destined to have a full-blown disaster later.

So having sufficient relevant data is vital for preventing repeated disasters of the same type.

Here are some kinds of data that are relevant for most systems (a minimal sketch of how such data might be recorded follows the list):

  • Logs of the technical system:
    • logs of sensor data (e.g., the temperature of a reactor at sensor X, GPS positions of aircraft)
    • logs of internal states of the technical system (communication between technical components, state of communication channels, CPU usage, free storage space, …)
  • Interrogation of humans.
  • Recording of the human-machine interface. This can reproduce what was presented by the system to the human, and what was entered into the system by the human.
  • Recording of communication between humans.
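
To make this list concrete, here is a minimal sketch, in Python, of how such data could be captured as time-stamped, structured event records and merged into a single timeline for post-incident analysis. All names and example values (Event, Recorder, the sensor and console identifiers) are hypothetical and not taken from any specific system.

    # Minimal sketch (hypothetical names): time-stamped, structured records for
    # the data categories listed above, so they can later be merged into one
    # timeline for post-incident analysis.
    from dataclasses import dataclass, field
    import time

    @dataclass
    class Event:
        source: str        # e.g. "sensor:reactor-X", "hmi:console-3", "radio:ch-21"
        kind: str          # e.g. "sensor_sample", "hmi_display", "hmi_input"
        payload: dict      # the actual measurement, screen content, input, ...
        timestamp: float = field(default_factory=time.time)

    class Recorder:
        """Append-only store for events from all data categories."""
        def __init__(self):
            self._events: list[Event] = []

        def record(self, source: str, kind: str, payload: dict) -> None:
            self._events.append(Event(source, kind, payload))

        def timeline(self) -> list[Event]:
            # One merged, chronologically ordered view for the analysis.
            return sorted(self._events, key=lambda e: e.timestamp)

    # One example record per data category:
    rec = Recorder()
    rec.record("sensor:reactor-X", "sensor_sample", {"temperature_c": 312.4})
    rec.record("system:link-A", "internal_state", {"cpu_usage": 0.73, "link_up": True})
    rec.record("hmi:console-3", "hmi_display", {"window": "alarms", "visible_alerts": 2})
    rec.record("hmi:console-3", "hmi_input", {"action": "click", "button": "acknowledge"})
    rec.record("radio:ch-21", "voice_comm", {"audio_file": "ch21-1430.wav"})

The design point is simply that all four kinds of data end up on one comparable timeline. Real mission-critical systems will of course use dedicated, certified recording equipment for each data source; the sketch only illustrates the structure of the evidence you want to have.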

The either-or fallacy:

In conversations about this topic, especially with laypeople, I often hear that there is no need for human-machine-interface recording because there are system logs anyway.

This is a dangerous fallacy. Here’s why:

It is true that system logs may give some hints about what was presented to the human or what was entered by the human. But these are just hints, not facts.

The difference between a hint and a fact is:

  • A fact is true.
  • A hint may be true when several assumptions are true in the specific case.

It is dangerous to base a disaster analysis on assumptions. If any of the assumptions is not true, you will come to wrong conclusions. This causes you to miss the real cause, which then remains unfixed. Therefore, the same cause can strike again, potentially causing a full-blown disaster, even if its first occurrence only caused a near-disaster.

So it is important to avoid the need for making assumptions. You have to have as many relevant facts as possible.

In our case, an assumption might be that the software or hardware works correctly, or that the windows on the screen were positioned a certain way, etc.

It is dangerous to assume that software or hardware works correctly when the real cause of the disaster is actually a bug in that software or a hardware malfunction.

Therefore, you need a precise recording of the human-computer interaction to collect the facts about that component of the system.

The reverse is also true: recordings of human-computer interaction cannot replace technical logs either. The same reasoning applies: from human-machine-interaction recordings you may get some hints about what happens inside the technical components, but you will never get facts.

So, you need both.
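
As an illustration of what “facts, not assumptions” means on the HMI side, here is a minimal sketch of how an analyst could query such a recording for what was actually on the operator's screen at a given moment, instead of inferring it from system logs plus assumptions about correct software behavior. The data layout and the function are hypothetical.

    # Minimal sketch (hypothetical data layout): replaying an HMI recording to
    # see exactly what was on screen at a given moment, without assuming that
    # the software behaved correctly.

    def screen_state_at(display_events: list[dict], t: float) -> dict:
        """Return the last recorded screen state at or before time t.

        display_events: chronologically ordered records of what was actually
        rendered, each of the form {"timestamp": ..., "windows": {...}}.
        """
        state: dict = {}
        for event in display_events:
            if event["timestamp"] > t:
                break
            state = event["windows"]   # what was really shown, not what the
                                       # software *should* have shown
        return state

    # Example: what did the operator actually see at t = 105.0?
    events = [
        {"timestamp": 100.0, "windows": {"alarms": ["overpressure"], "map": "sector 4"}},
        {"timestamp": 110.0, "windows": {"alarms": [], "map": "sector 4"}},
    ]
    print(screen_state_at(events, 105.0))  # the overpressure alarm was still visible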

Either-or is a very dangerous fallacy. (If you look more closely at a system, you will find many more examples where either-or thinking is dangerous. Getting this right requires an in-depth system analysis.)

What about interrogation of humans?
Interrogation of humans is very important, but one has to account for the unreliability of human memory and for the possibility of lying, especially when there is a threat of legal prosecution or of being fired.

So it is always good to have facts available from recordings of human-computer interaction and of human-human communication, because that is as close to the human as facts can still be gathered.

When you have recordings of human-computer interaction and of human-human communication, the interrogation does not need to ask for facts about what was visible, what was done, or what was said. It can instead concentrate on the important aspects of the human side of the interaction:

  • Was there confusion at some place?
  • Was information presented in an overwhelming way?
  • Did you overlook something that was presented on the screen? Why?
    (The “Why?” may need scientific research, which can lead to improvements of the human-machine interface.)
  • Were you prepared for such a situation?
  • “I knew what to do, but I did not know how to enter it into the system.”
  • “I accidentally clicked this button because I expected another button here.”
  • “The system is so clumsy that it took me too long to activate an appropriate emergency measure.”

The results of such an interrogation, together with the facts from recordings and logs, will lead to an analysis that is very likely to identify the correct cause.

Once the correct cause is identified, it can lead to one or more of the following measures:

  • Changing the technical system.
  • Training of staff.
  • Improvements to the human-machine interface.
  • Changing work procedures.

As soon as an appropriate measure is implemented, the same cause can no longer occur (or is at least less likely to occur). Therefore, your system will be safer.

How well is this done in your industry?

From my experience, air traffic control providers in developed countries are very, very good at doing this correctly. Recording of human-machine interaction is mandated by regulation. The same applies to capturing data inside the technical systems and to recording communication between humans. These regulations are usually publicly available from organizations like Eurocontrol or from national air traffic regulators.

But what about other industries where disasters can cause as much damage as air traffic accidents, or even more?

What about your industry?

Please comment below or send me an email to christian.linhart@clinhart.com if you want to discuss this with me.
