Reasoning Under Uncertainty — Fault Trees, Bayesian Networks, and Ranked Hypotheses

You are looking at a vehicle with a complaint: "reduced range, feels sluggish in the cold." Before you plug in a scan tool, three questions are already forming in your head. How can this system fail? What does the evidence tell me about which fault is active? Which hypothesis should I chase first?

Those three questions are answered, respectively, by fault tree analysis, Bayesian network inference, and ranked hypotheses with prior probabilities. They are not rival tools — they are layers of the same cake. The fault tree gives you the structure of failure. The Bayesian network lets you push evidence through that structure. The ranked hypothesis list is how you decide what to do next, on Tuesday, with a limited budget of time and parts.

The goal of this monograph is not mathematical rigor — there are textbooks for that, listed at the end. The goal is intuition. You should leave this page able to draw all three on the back of a napkin.

§ 1 Fault trees — how systems fail

A fault tree is a picture of how an undesired event can be caused by more basic events. It is read top-down. At the top sits the thing you do not want — a crash, a fire, a motor that fails to deliver torque. Below it, branching downward through logic gates, sit the combinations of lower-level failures that would produce the top event.

The two gates you actually need

Most useful fault trees need only two gates.

AND gate

The output occurs only if every input occurs. Think of a redundant system: both power supplies must fail. If the inputs are independent and have probabilities p₁, p₂, …, the output probability is their product: p₁·p₂·…

OR gate

The output occurs if any input occurs. If the inputs are independent, the output probability is 1 − (1−p₁)·(1−p₂)·… — the probability that none of them fails, subtracted from one.
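The two gate formulas can be sketched in a few lines of Python. This is a minimal illustration that assumes independent inputs, exactly as the definitions above do; the function names and the 0.01 failure probability are hypothetical.

```python
from math import prod

def and_gate(probs):
    """P(every input occurs), assuming the inputs are independent."""
    return prod(probs)

def or_gate(probs):
    """P(at least one input occurs), assuming the inputs are independent."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical example: two power supplies, each failing with probability 0.01.
p_supply = 0.01
both_fail = and_gate([p_supply, p_supply])    # redundant pair: both must fail
either_fails = or_gate([p_supply, p_supply])  # series pair: one failure is enough

print(f"AND gate: {both_fail:.6f}")
print(f"OR gate:  {either_fails:.6f}")
```

With these assumed numbers the redundant pair is roughly two hundred times less likely to fail than the series pair, which is the whole argument for redundancy in one comparison.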

The fault tree is a deductive tool. You start from the thing you want to prevent and work your way down to the root causes you can actually design against, test for, or monitor in the field. When you propagate the leaf probabilities up through the gates (more rigorously: compute the minimal cut sets and combine their probabilities), you get an estimate of how likely the top event is. That is the kind of number a quantitative safety argument under ISO 26262 asks for, and IEC 61025 describes the method for producing it.

Figure 1 (interactive): EV motor fails to deliver commanded torque. Basic events and their default probabilities: IGBT short-circuit 0.020; gate driver fault 0.030; encoder / resolver fail 0.040; HV contactor opens 0.010; DC-link capacitor 0.030.

Drag any slider to change the probability of a basic event. The intermediate nodes and the top event recompute live. Notice how redundant branches behind an AND gate drive the probability down multiplicatively, while single failure points feeding an OR gate dominate the total. This is the quickest lesson in safety engineering you can receive.
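The recomputation the figure performs can be approximated offline with a small recursive evaluator. The leaf probabilities below are the ones listed for Figure 1; the gate layout is an assumption made purely for illustration, since the figure's actual wiring is not reproduced in the text. The sketch also assumes independent basic events and no leaf shared between branches.

```python
from math import prod

# Basic-event probabilities as listed for Figure 1.
LEAVES = {
    "igbt_short": 0.020,
    "gate_driver_fault": 0.030,
    "encoder_fail": 0.040,
    "hv_contactor_opens": 0.010,
    "dc_link_capacitor": 0.030,
}

# ASSUMED gate layout (hypothetical, not taken from the figure).
TREE = ("OR", [
    ("OR", ["igbt_short", "gate_driver_fault"]),            # power-stage fault
    "encoder_fail",                                         # single point of failure
    ("AND", ["hv_contactor_opens", "dc_link_capacitor"]),   # both must occur
])

def evaluate(node, leaves):
    """Return the probability of a node, recursing through AND/OR gates."""
    if isinstance(node, str):
        return leaves[node]
    gate, children = node
    probs = [evaluate(child, leaves) for child in children]
    if gate == "AND":
        return prod(probs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate: {gate}")

top = evaluate(TREE, LEAVES)
and_branch = evaluate(TREE[1][2], LEAVES)

print(f"P(top event)  = {top:.4f}")
print(f"P(AND branch) = {and_branch:.4f}")
```

Under this assumed layout, the AND branch contributes almost nothing to the total, while the encoder leaf alone accounts for close to half of it. Changing a value in `LEAVES` and re-running plays the role of dragging a slider.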

Where fault trees shine — and where they don't

Fault trees are strongest when the system is engineered, the failure modes are catalogued, and you need a defensible number, such as "the probability of dangerous failure per hour is less than 10⁻⁸." They are the backbone of ISO 26262 ASIL decomposition, ARP4761 avionics safety assessments, and IEC 61508 industrial safety cases.

They are weaker when you need to diagnose what has already gone wrong. A fault tree tells you P(top event) given component failure rates. It does not tell you, once the top event has occurred and you've observed three symptoms, which branch is the culprit. For that, you want to read the tree in the opposite direction — which is exactly what the next section is about.

Safety certification: quantifying the probability of violating a safety goal for ASIL / SIL arguments.

Design reviews: identifying single points of failure and whether a redundant channel actually buys you anything.

FMEA companion: translating a flat failure-mode spreadsheet into a structured probability of the top-level hazard.

Accident investigation: after a loss-of-propulsion event, which combinations of root causes are consistent with the flight/drive data?

§ 2 Bayesian networks — updating beliefs from evidence

A Bayesian network is a graph of variables in which each arrow says "this thing influences that thing." Attached to each node is a small table — the conditional probability table, or CPT — which answers the question: "given the state of my parents, how likely is each of my possible states?"

That is the entire structure. Its power lies not in the structure itself but in what you can do with it. Once the graph and the CPTs are specified, you can observe any subset of the variables and ask for the posterior probability of any other subset. The math that makes this go is Bayes' rule, applied repeatedly — but the conceptual move is the one that matters.

Two directions of inference

Predictive (top-down)

You know the state of upstream causes and want to predict downstream symptoms. "If the bearing is worn, how likely am I to see vibration?"

Diagnostic (bottom-up)

You observe symptoms and want to infer the hidden cause. "I see vibration and overheating — how likely is it that the bearing is worn?" This is the motion that a fault tree cannot perform, and it is the motion you perform every time you diagnose a car.

The prior probabilities — the unconditional probabilities of the causes, written P(cause) — encode what you believe before looking. The likelihood terms — P(symptom | cause) — come from physics, from field data, or from expert elicitation. Bayes combines them into the posterior: P(cause | symptom) ∝ P(symptom | cause) · P(cause).
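As a worked instance of the proportionality above, the sketch below normalizes it for a single binary cause and one symptom. All three numbers are assumptions chosen for illustration, not field data.

```python
def posterior(prior, p_symptom_given_cause, p_symptom_given_no_cause):
    """P(cause | symptom) for a binary cause and one observed symptom."""
    with_cause = p_symptom_given_cause * prior
    without_cause = p_symptom_given_no_cause * (1.0 - prior)
    return with_cause / (with_cause + without_cause)

# Assumed numbers, for illustration only.
p_worn = 0.10                # prior: P(bearing worn)
p_vib_given_worn = 0.80      # likelihood: P(vibration | worn)
p_vib_given_healthy = 0.05   # false-alarm rate: P(vibration | not worn)

p_worn_given_vib = posterior(p_worn, p_vib_given_worn, p_vib_given_healthy)
print(f"P(bearing worn | vibration) = {p_worn_given_vib:.2f}")
```

The denominator is what the "∝" hides: the total probability of seeing the symptom, whether or not the cause is present. With these assumed values the belief in a worn bearing rises from 0.10 to 0.64.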

Figure 2 (interactive): A two-cause diagnostic network. Evidence inputs: Vibration, Overheating. Posterior beliefs given evidence, as displayed: P(Bearing wear | E) = 0.100; P(Cooling fault | E) = 0.050.

Click the evidence buttons to tell the network what you have observed. The posterior probability of each hidden cause recomputes via exact enumeration over the joint. Notice two things. First: observing high vibration strongly implicates the bearing (its direct parent). When overheating is also observed, the vibration evidence lowers the cooling-fault posterior relative to overheating alone, because the bearing has become a better explanation for the heat. This is explaining away. Second: observing overheating alone raises both hidden causes roughly in proportion to their priors and their likelihood of producing heat.
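Exact enumeration over the joint is short enough to write out. The sketch below assumes a structure consistent with the description above: bearing wear drives vibration, and both bearing wear and a cooling fault can drive overheating. The priors 0.10 and 0.05 are the values displayed for Figure 2, taken here to be the no-evidence priors; every conditional probability is a hypothetical placeholder.

```python
from itertools import product

P_BEARING = 0.10   # prior, as displayed for Figure 2
P_COOLING = 0.05   # prior, as displayed for Figure 2

# ASSUMED conditional probability tables (hypothetical values).
P_VIB = {True: 0.85, False: 0.05}            # P(vibration | bearing)
P_HEAT = {                                   # P(overheating | bearing, cooling)
    (False, False): 0.02,
    (True, False): 0.40,
    (False, True): 0.80,
    (True, True): 0.90,
}

def bernoulli(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p

def posteriors(vibration=None, overheating=None):
    """Return (P(bearing | E), P(cooling | E)); None means 'not observed'."""
    total = bearing_mass = cooling_mass = 0.0
    for bearing, cooling in product([False, True], repeat=2):
        weight = bernoulli(P_BEARING, bearing) * bernoulli(P_COOLING, cooling)
        if vibration is not None:
            weight *= bernoulli(P_VIB[bearing], vibration)
        if overheating is not None:
            weight *= bernoulli(P_HEAT[(bearing, cooling)], overheating)
        total += weight
        bearing_mass += weight if bearing else 0.0
        cooling_mass += weight if cooling else 0.0
    return bearing_mass / total, cooling_mass / total

no_evidence = posteriors()
heat_only = posteriors(overheating=True)
heat_and_vibration = posteriors(vibration=True, overheating=True)
vibration_only = posteriors(vibration=True)

for label, (b, c) in [("none", no_evidence), ("heat", heat_only),
                      ("heat+vib", heat_and_vibration), ("vib", vibration_only)]:
    print(f"{label:9s} P(bearing)={b:.3f}  P(cooling)={c:.3f}")
```

In this assumed network, overheating alone lifts both causes, and adding vibration pulls the cooling-fault posterior back down while the bearing posterior climbs. Vibration on its own leaves the cooling fault at its prior, because here the two causes only interact through the shared overheating symptom.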

Explaining away — the one sentence you will be quoted on

When two independent causes can each produce the same effect, observing that effect raises the probability of both. But confirming one cause lowers the probability of the other, because the effect now has a sufficient explanation. A fault tree cannot do this. A Bayesian network does it for free.

Condition-based maintenance: fusing vibration, temperature, and current signatures into a single posterior on component health.

DTC interpretation: translating a constellation of diagnostic trouble codes into ranked root causes rather than a flat list.

Sensor validation: deciding whether a surprising sensor reading reflects reality or sensor fault, using context from other sensors.

Warranty analytics: inferring failure mechanisms from field returns when ground-truth teardown is expensive.

§ 3 Ranked hypotheses with priors — acting on what's likely

The Bayesian network answered: "given what I have observed, how likely is each cause?" In practice you rarely care about all causes equally — you care about the order. Which suspect do I investigate first? Which fix do I attempt before I start pulling things apart? This is the job of the ranked hypothesis list.

The one equation you must internalize

P(H_i | E) ∝ P(E | H_i) · P(H_i)

Read it right to left. P(H_i) is your prior — how common this hypothesis is before you look, in this fleet, in this climate, at this mileage. P(E | H_i) is the likelihood — if this hypothesis were true, how well would it explain the evidence you see? Multiply them, normalize across all hypotheses so they sum to one, and sort descending. That is the ranking.

The prior matters enormously. If you forget it, you will diagnose a rare, textbook-beautiful failure mode when the actual answer is "the tire pressure is low." The humbling thing about priors is they are often the best signal you have — base rates from your warranty database are gold, and no clever model will recover what an ignored prior costs you.
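The multiply, normalize, and sort recipe fits in one small function. The hypotheses and numbers below are invented for illustration; they are chosen so that the rarest hypothesis explains the evidence best and still ranks last.

```python
def rank(priors, likelihoods):
    """Return [(hypothesis, posterior), ...] sorted from most to least likely."""
    scores = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(scores.values())
    posterior = {h: score / total for h, score in scores.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)

# Hypothetical priors P(H) and likelihoods P(E | H) for one observed symptom.
priors = {
    "low_tire_pressure": 0.30,
    "battery_degradation": 0.08,
    "rare_cell_fault": 0.005,
}
likelihoods = {
    "low_tire_pressure": 0.60,
    "battery_degradation": 0.90,
    "rare_cell_fault": 0.99,
}

ranking = rank(priors, likelihoods)
for hypothesis, p in ranking:
    print(f"{hypothesis:20s} {p:.3f}")
```

Dropping the prior from the score would reverse this ordering completely, which is the failure mode described above.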

Figure 3 (interactive): Diagnostic ranker for "EV shows reduced range." Controls: observed symptoms (click to toggle on/off) and adjustable priors (base rates for your fleet).

Each hypothesis is scored by prior × product of likelihoods over observed symptoms (naive-Bayes, assuming conditional independence of symptoms given the cause). Toggle symptoms on and off to see how the ranking reorders. Then adjust the priors — try dropping "driver behavior" to near zero and watch a rarer, more mechanical hypothesis climb. This is precisely what calibrating a diagnostic decision-support tool feels like in practice.
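A naive-Bayes ranker of the kind described here can be sketched as follows. Only "driver behavior" comes from the text; the other hypothesis names, the symptom names, and every number are hypothetical. Symptoms that are toggled off are treated as not observed rather than as observed absent, which is one reasonable reading of "product of likelihoods over observed symptoms."

```python
from math import prod

# Hypothetical base rates for a fleet (they sum to 1).
PRIORS = {
    "driver_behavior": 0.40,
    "low_tire_pressure": 0.25,
    "cabin_heating_load": 0.20,
    "battery_degradation": 0.10,
    "brake_drag": 0.05,
}

# Hypothetical P(symptom present | hypothesis).
LIKELIHOODS = {
    "driver_behavior":     {"worse_in_cold": 0.30, "sluggish": 0.10, "reduced_regen": 0.20},
    "low_tire_pressure":   {"worse_in_cold": 0.60, "sluggish": 0.30, "reduced_regen": 0.10},
    "cabin_heating_load":  {"worse_in_cold": 0.95, "sluggish": 0.10, "reduced_regen": 0.10},
    "battery_degradation": {"worse_in_cold": 0.80, "sluggish": 0.70, "reduced_regen": 0.60},
    "brake_drag":          {"worse_in_cold": 0.30, "sluggish": 0.60, "reduced_regen": 0.30},
}

def rank_hypotheses(observed, priors=PRIORS):
    """Score = prior x product of likelihoods of the observed symptoms, then normalize."""
    scores = {
        h: prior * prod(LIKELIHOODS[h][s] for s in observed)
        for h, prior in priors.items()
    }
    total = sum(scores.values())
    ranked = [(h, score / total) for h, score in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

before = rank_hypotheses([])
after = rank_hypotheses(["worse_in_cold", "sluggish"])

print("no symptoms:  ", before[0])
print("two symptoms: ", after[0])
```

With no symptoms toggled on, the ranking is just the priors. With the two assumed symptoms observed, a less common mechanical hypothesis overtakes the most common one. Passing a modified `priors` dictionary mimics the prior sliders.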

Why ranking is often better than diagnosis

A ranked list is honest about uncertainty. It does not pretend to a single answer when the evidence supports three. A mechanic given the top three hypotheses with their posterior weights makes better decisions than one given a single "most likely" with no uncertainty attached. Vehicle Health Management systems in modern fleets increasingly produce ranked posteriors rather than point decisions, precisely because they must compose cleanly with human judgment at the service bay.

Decision support at service: presenting a technician with the top-3 suspect components, weighted by posterior, instead of a single diagnostic code.

Fleet triage: ranking which vehicles to recall or inspect first, given telematics evidence and known population priors.

Active diagnosis: choosing the next test to run as the one that most sharply separates the top hypotheses, "value of information" in Bayesian terms.

Spare-parts logistics: stocking distribution centers according to the expected demand implied by ranked posteriors across the fleet.
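The active-diagnosis idea in the list above can be made concrete with expected information gain, one common way of scoring value of information (it ignores test cost, which a real system would weigh in). The sketch assumes binary tests; the belief, the test names, and the test characteristics are hypothetical.

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a {hypothesis: probability} mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def expected_information_gain(belief, p_positive_given_h):
    """Expected drop in entropy from running one binary test."""
    expected_posterior_entropy = 0.0
    for positive in (True, False):
        scores = {
            h: b * (p_positive_given_h[h] if positive else 1.0 - p_positive_given_h[h])
            for h, b in belief.items()
        }
        p_outcome = sum(scores.values())
        if p_outcome > 0:
            posterior = {h: s / p_outcome for h, s in scores.items()}
            expected_posterior_entropy += p_outcome * entropy(posterior)
    return entropy(belief) - expected_posterior_entropy

# Hypothetical current posterior over three hypotheses.
belief = {"battery_degradation": 0.45, "low_tire_pressure": 0.35, "cabin_heating_load": 0.20}

# Hypothetical tests: P(test positive | hypothesis).
tests = {
    "tire_pressure_check": {"battery_degradation": 0.05, "low_tire_pressure": 0.95, "cabin_heating_load": 0.05},
    "battery_capacity_test": {"battery_degradation": 0.90, "low_tire_pressure": 0.10, "cabin_heating_load": 0.10},
    "generic_scan": {"battery_degradation": 0.50, "low_tire_pressure": 0.50, "cabin_heating_load": 0.50},
}

gains = {name: expected_information_gain(belief, spec) for name, spec in tests.items()}
best_test = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {gain:.3f} bits")
```

A test whose result is equally likely under every hypothesis gains nothing, however sophisticated it looks. In this assumed setup the sharper test wins even though it targets the second-ranked hypothesis rather than the first.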

§ 4 How they fit together

The three tools are rarely used in isolation in a serious VHM program.

The fault tree is built during design, as part of the safety case. It defines the failure structure — which components, in which combinations, can produce which hazards. The probabilities on the leaves come from reliability handbooks, test data, and field returns.

That same structure can be re-read as a Bayesian network for diagnostic purposes. The topology is nearly identical; the CPTs replace the simple gate logic, and you add observable symptom nodes. Now the model can ingest evidence. Modern tools will even compile a fault tree directly into a Bayesian network, so your diagnostic model inherits the rigor of your safety analysis.
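Re-reading a fault tree diagnostically amounts to treating each gate as a deterministic CPT and conditioning on the top event. The brute-force sketch below does this by enumerating all leaf states. It uses the leaf probabilities listed for Figure 1 and an assumed gate layout; it is an illustration of the idea, not how a production tool compiles the model.

```python
from itertools import product

# Basic-event probabilities as listed for Figure 1.
LEAVES = {
    "igbt_short": 0.020,
    "gate_driver_fault": 0.030,
    "encoder_fail": 0.040,
    "hv_contactor_opens": 0.010,
    "dc_link_capacitor": 0.030,
}

def top_event(state):
    """ASSUMED gate logic (hypothetical): deterministic function of the leaves."""
    power_stage = state["igbt_short"] or state["gate_driver_fault"]
    hv_path = state["hv_contactor_opens"] and state["dc_link_capacitor"]
    return power_stage or state["encoder_fail"] or hv_path

def diagnose(leaves):
    """Return P(top) and P(leaf failed | top occurred) for every leaf."""
    names = list(leaves)
    p_top = 0.0
    mass = dict.fromkeys(names, 0.0)
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= leaves[name] if state[name] else 1.0 - leaves[name]
        if top_event(state):
            p_top += weight
            for name in names:
                if state[name]:
                    mass[name] += weight
    return p_top, {name: m / p_top for name, m in mass.items()}

p_top, suspects = diagnose(LEAVES)
prime_suspect = max(suspects, key=suspects.get)

print(f"P(top) = {p_top:.4f}")
for name, p in sorted(suspects.items(), key=lambda item: item[1], reverse=True):
    print(f"P({name} | top) = {p:.3f}")
```

The forward number P(top) matches what gate-by-gate propagation gives for the same layout, and the same model now answers the backward question. Enumeration scales as 2ⁿ in the number of leaves, so it is only practical for small trees.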

The ranked hypothesis list is the output that the service technician, the OTA update system, or the fleet operations dashboard actually consumes. It is what the Bayesian network's posterior looks like when it hits the real world — sorted, truncated, and often accompanied by suggested next actions.

In one sentence

Fault trees tell you what can break, Bayesian networks tell you what probably did break, and ranked hypothesis lists tell you what to do about it on Tuesday morning.

§ 5 Practical notes & pitfalls

On fault trees

The classical fault tree assumes independence of the basic events. Common-cause failures — a shared power supply, a shared temperature regime, a shared software fault — break that assumption spectacularly. Explicitly model them as additional nodes, or use extensions like dynamic fault trees. Also: your tree is only as good as your failure-mode catalog. An event you didn't think of has probability zero in your model, and non-zero in reality.
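One simple way to see the damage is the beta-factor model, a standard common-cause approximation that the text above does not name; it is swapped in here purely as an illustration. It assumes a fraction β of each channel's failure probability comes from a shared cause that takes out both channels at once. The values of p and β are hypothetical.

```python
def redundant_pair_failure(p, beta):
    """P(both channels of a redundant pair fail) under a beta-factor split.

    p    : failure probability of one channel
    beta : assumed fraction of p attributable to a shared (common) cause
    """
    p_common = beta * p                  # one event that fails both channels
    p_independent = (1.0 - beta) * p     # per-channel independent part
    both_independent = p_independent ** 2
    # The pair fails if the common cause occurs OR both independent parts occur.
    return 1.0 - (1.0 - p_common) * (1.0 - both_independent)

p_channel = 0.01
ideal = redundant_pair_failure(p_channel, beta=0.0)        # textbook AND gate
realistic = redundant_pair_failure(p_channel, beta=0.10)   # 10% common cause (assumed)

print(f"independent channels: {ideal:.6f}")
print(f"with common cause:    {realistic:.6f}")
print(f"ratio:                {realistic / ideal:.1f}x")
```

With these assumed values, a common-cause share of only ten percent makes the redundant pair about ten times more likely to fail than the independent AND gate predicts. The shared cause, not the product term, sets the floor.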

On Bayesian networks

The size of a conditional probability table grows exponentially in the number of parents: 2ⁿ rows for n binary parents. A node with six binary parents needs 64 conditional probabilities, more than you can elicit reliably from a single expert. Mitigations: noisy-OR and noisy-AND parameterizations and other canonical models, hierarchical structure, and learning parameters from data when you have it. Exact inference is NP-hard in general. Message passing is exact and efficient on tree-structured networks, and variable elimination stays practical when the network's treewidth is small; for the rest, use approximate methods (loopy belief propagation, MCMC, variational).
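The noisy-OR mitigation is easy to demonstrate: six elicited link probabilities plus an optional leak term expand into the full 64-row table. The link strengths and the leak value below are hypothetical.

```python
from itertools import product
from math import prod

def noisy_or_cpt(link_probs, leak=0.0):
    """Expand noisy-OR parameters into a full CPT.

    link_probs[i] : P(effect | only parent i is active), ignoring the leak
    leak          : P(effect | no parent is active)
    Returns {parent_state_tuple: P(effect = True | parents)}.
    """
    cpt = {}
    for states in product([False, True], repeat=len(link_probs)):
        # The effect is absent only if the leak and every active link all fail to fire.
        p_no_effect = (1.0 - leak) * prod(
            1.0 - p for p, active in zip(link_probs, states) if active
        )
        cpt[states] = 1.0 - p_no_effect
    return cpt

# Six hypothetical causes of one symptom: six numbers to elicit instead of 64.
links = [0.30, 0.50, 0.10, 0.80, 0.20, 0.60]
cpt = noisy_or_cpt(links, leak=0.01)

none_active = (False,) * 6
first_only = (True,) + (False,) * 5

print(f"rows in CPT:            {len(cpt)}")
print(f"P(effect | none)      = {cpt[none_active]:.3f}")
print(f"P(effect | first only)= {cpt[first_only]:.3f}")
```

The price of the compression is an assumption: each cause acts on the effect independently of the others. When causes interact (one only matters in the presence of another), noisy-OR will misstate the corresponding rows.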

On priors

A prior is an answer to the question "what do I already know before looking?" That answer is almost never "nothing." Sensible priors come from: your warranty database, field returns, reliability-centered-maintenance analyses, physics of degradation, and structured expert elicitation. Uniform priors are a statement, not a default — they say you believe each hypothesis is equally plausible, which is usually false. A bad prior can dominate the posterior when evidence is weak, so it is worth the investment to get them right and to test sensitivity.
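A quick sensitivity test uses the odds form of Bayes' rule: posterior odds equal prior odds times the likelihood ratio. The sketch below compares how far the posterior moves when the prior is mis-specified, under weak and under strong evidence. The priors and likelihood ratios are hypothetical.

```python
def posterior_from_prior(prior, likelihood_ratio):
    """Posterior for a binary hypothesis via the odds form of Bayes' rule."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

def prior_sensitivity(prior_low, prior_high, likelihood_ratio):
    """How much the posterior changes across a plausible range of priors."""
    return (posterior_from_prior(prior_high, likelihood_ratio)
            - posterior_from_prior(prior_low, likelihood_ratio))

# Suppose the true base rate could plausibly be anywhere from 5% to 30%.
weak_spread = prior_sensitivity(0.05, 0.30, likelihood_ratio=2.0)
strong_spread = prior_sensitivity(0.05, 0.30, likelihood_ratio=500.0)

print(f"posterior spread, weak evidence (LR=2):     {weak_spread:.3f}")
print(f"posterior spread, strong evidence (LR=500): {strong_spread:.3f}")
```

Under weak evidence the uncertainty in the prior passes almost straight through to the posterior; under strong evidence it is nearly washed out. If the spread is large enough to change the ranking, the prior deserves more data before the model is trusted.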

On the naive-Bayes assumption

The ranker in Figure 3 assumes symptoms are conditionally independent given the hypothesis. This is almost never exactly true — but it is often close enough to produce the correct ranking, which is what you actually care about. If accuracy of the posterior values matters (not just order), move to a full Bayesian network.

On model maintenance

A diagnostic model is a living thing. Fleets age. Operating environments shift. New failure modes emerge as components are redesigned. Post-deployment monitoring must include calibration drift (are my predicted probabilities matching observed frequencies?) and structural drift (are there hypotheses I should be adding?). A model that was good two years ago can be silently misleading today.
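Calibration drift can be monitored with a simple binned comparison of predicted probability against observed frequency, summarized here as an expected calibration error. The bin count and the synthetic data are assumptions for illustration; a real monitor would run on confirmed repair outcomes.

```python
def calibration_table(predicted, outcomes, n_bins=5):
    """Return rows of (bin index, count, mean predicted, observed frequency)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, outcomes):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for index, items in enumerate(bins):
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            frequency = sum(y for _, y in items) / len(items)
            rows.append((index, len(items), mean_p, frequency))
    return rows

def expected_calibration_error(predicted, outcomes, n_bins=5):
    """Count-weighted average gap between predicted and observed rates."""
    n = len(predicted)
    return sum(
        count / n * abs(mean_p - frequency)
        for _, count, mean_p, frequency in calibration_table(predicted, outcomes, n_bins)
    )

# Synthetic example: the model says 10% for ten vehicles and 90% for ten others.
predicted = [0.1] * 10 + [0.9] * 10
calibrated_outcomes = [1] + [0] * 9 + [1] * 9 + [0]      # 1/10 and 9/10 confirmed
drifted_outcomes = [1] * 5 + [0] * 5 + [1] * 9 + [0]     # 5/10 and 9/10 confirmed

ece_calibrated = expected_calibration_error(predicted, calibrated_outcomes)
ece_drifted = expected_calibration_error(predicted, drifted_outcomes)

print(f"ECE, calibrated: {ece_calibrated:.3f}")
print(f"ECE, drifted:    {ece_drifted:.3f}")
```

In the drifted case the "10%" predictions are coming true half the time, and the error rises accordingly. Tracking this number per model release is a cheap alarm for the first kind of drift; structural drift still needs a human reviewing unexplained cases.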

§ 6 References

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. The foundational monograph; chapter 2 on evidence propagation remains the cleanest exposition in print.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. The modern graduate-level treatment.
F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision Graphs. Springer, 2nd ed., 2007. Particularly readable on modeling decisions alongside probabilities.
IEC 61025:2006. Fault Tree Analysis (FTA). International Electrotechnical Commission. The reference standard for fault-tree notation and semantics.
ISO 26262:2018. Road vehicles — Functional safety. Parts 1–12. Requires quantitative safety arguments; fault trees and FMEDAs are central.
W. E. Vesely et al. Fault Tree Handbook with Aerospace Applications. NASA Office of Safety and Mission Assurance, 2002. Free, authoritative, full of worked examples.
SAE JA1012. A Guide to the Reliability-Centered Maintenance (RCM) Standard. SAE International. On how diagnostic reasoning feeds maintenance decisions.
A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009. Strong on inference algorithms and sensitivity analysis.
M. J. Druzdzel and L. C. van der Gaag. Building probabilistic networks: Where do the numbers come from? IEEE Transactions on Knowledge and Data Engineering, 2000. A classic on eliciting priors and CPTs from experts.
M. G. Pecht. Prognostics and Health Management of Electronics. Wiley, 2008. Domain-specific application to the kinds of systems Vehicle Health Management is built on.