Fault Detection and Isolation — An Interactive Primer

1

A system that watches itself.

Every engineered machine that matters — an aircraft, a pacemaker, an electric-vehicle traction motor — does two things at once. It performs its job, and it quietly checks that it is still capable of performing its job. That second task is called Fault Detection and Isolation, or FDI.

The goal is simple to state: decide, from the signals already flowing through the machine, whether something is wrong, and if so, what. Detection is the yes-or-no; isolation is the pointing-finger. A good FDI scheme answers both in time to matter.

The trick is that faults hide. They look like disturbances. They look like noise. They look, sometimes, like perfectly normal behavior at a slightly different operating point. Everything in FDI is an attempt to build a signal — a residual — that stays small when the system is healthy and grows loud when it is not.

Figure 1 · Interactive. The classical FDI architecture. The diagnostic block runs a copy of what the plant should be doing and compares it to reality. Click “inject fault” to see the residual trace rise above threshold.

2

Model-based methods — the classical backbone.

If you have a mathematical model of the machine, you can use it as a ruler against which reality is measured. Everything in this family follows the same recipe: predict, compare, decide.

2.1 Observer-based approaches

An observer is a model of the plant that runs in parallel with the real thing. It takes the same inputs u, produces an estimated output ŷ, and corrects itself using the difference between measured and predicted output. The residual is that difference.

ẋ̂ = A x̂ + B u + L (y − ŷ), ŷ = C x̂, r = y − ŷ

When the model is right and nothing is broken, r stays near zero (noise aside). When something breaks — a sensor drifts, an actuator saturates, a parameter changes — r grows. The observer gain L is chosen to make the estimate converge quickly without amplifying noise too much.

Figure 2 · Interactive. Legend: true state y, estimate ŷ, residual, threshold. The estimate chases the true state. Press “inject fault” to see the residual break through threshold, the moment of detection.
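
The observer equations above can be sketched in a few lines of simulation. This is a minimal illustration, not a design: the plant matrices, the gain L, the fault size, and the 0.1 threshold are all hypothetical values picked so the behavior is easy to see, and the continuous-time equations are integrated with a plain Euler step.

```python
import numpy as np

def simulate_observer(fault_time=5.0, fault_bias=0.5, dt=0.01, t_end=10.0):
    """Euler-simulate a plant and a Luenberger observer; return (t, r)."""
    # Hypothetical second-order plant and observer gain (illustration only).
    A = np.array([[0.0, 1.0], [-2.0, -3.0]])
    B = np.array([0.0, 1.0])
    C = np.array([1.0, 0.0])
    L = np.array([7.0, 10.0])   # puts the poles of A - L C at about -5 +/- 2.83j

    n = int(round(t_end / dt))
    fault_step = int(round(fault_time / dt))
    x = np.array([1.0, 0.0])    # true state, unknown to the observer
    x_hat = np.zeros(2)         # observer starts from a wrong guess
    t = np.arange(n) * dt
    r = np.zeros(n)
    for k in range(n):
        u = 1.0                                         # constant input
        bias = fault_bias if k >= fault_step else 0.0   # additive sensor fault
        y = C @ x + bias                                # measured output
        y_hat = C @ x_hat                               # predicted output
        r[k] = y - y_hat                                # residual
        x = x + dt * (A @ x + B * u)
        x_hat = x_hat + dt * (A @ x_hat + B * u + L * r[k])
    return t, r

t, r = simulate_observer()
threshold = 0.1                                         # hypothetical alarm level
# Ignore the first 2 s, where the residual only reflects observer start-up.
alarms = np.flatnonzero((np.abs(r) > threshold) & (t > 2.0))
detection_time = t[alarms[0]]
print(f"fault detected at t = {detection_time:.2f} s")
```

In this particular setup the residual jumps at the fault onset and then shrinks again as the observer partly absorbs the biased measurement into its state estimate, which is one reason the decision logic applied to r deserves as much care as the observer itself.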

Flavors worth knowing

Luenberger: The original deterministic observer. Gain L placed by pole assignment. Clean, cheap, assumes the model is right.
Unknown Input Observer (UIO): Designed to make the residual blind to a known class of disturbances (measurement offsets, load torque) while still sensitive to the faults you care about. Pays in observer order.
Sliding Mode Observer (SMO): Uses discontinuous correction to force the state error onto a surface. Robust to matched uncertainty; popular in sensorless PMSM control for back-EMF reconstruction.
High-gain observer: Drives the estimate fast by making L large. Aggressive, noise-sensitive, used where fast detection matters more than a clean residual.
Extended / Unscented Kalman Filter: The stochastic cousins. The same idea with a noise model; the residual, called the innovation, is the basis for statistical tests.

2.2 Parity space methods

An observer is dynamic — it has state. Parity space methods are static: they stack measurements over a short window and look for algebraic relations that must hold if everything is working.

If the plant is y = Cx + Du + noise, then for any vector v chosen so that vᵀC = 0, the quantity vᵀ(y − Du) must equal vᵀ · noise. That number is the parity residual. It never needs to know the state; it only registers that the two sides of an equation that should balance do not.
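
As a rough sketch of that construction, the vectors v can be taken from the left null space of C, here via a singular value decomposition. The matrices below are hypothetical: two states, three sensors, with the third sensor measuring the sum of the two states, and D set to zero for simplicity.

```python
import numpy as np

def parity_vectors(C, tol=1e-10):
    """Rows v with v @ C = 0: an orthonormal basis of the left null space of C."""
    U, s, _ = np.linalg.svd(C)          # full U is (n_sensors x n_sensors)
    rank = int(np.sum(s > tol))
    return U[:, rank:].T

# Hypothetical static measurement model y = C x + D u.
C = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
D = np.zeros((3, 1))
V = parity_vectors(C)

x = np.array([2.0, -1.0])               # state, never used by the residual itself
u = np.array([0.0])
y_healthy = C @ x + D @ u
y_faulty = y_healthy + np.array([0.0, 0.0, 0.3])   # bias on the third sensor

r_healthy = V @ (y_healthy - D @ u)
r_faulty = V @ (y_faulty - D @ u)
print("healthy residual:", r_healthy)
print("faulty residual: ", r_faulty)
```

With three sensors and two states there is only one independent parity vector, so this example can detect the bias but cannot say which sensor carries it.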

The art is picking v so that each residual reacts to a specific fault and not others. Line up several such residuals and you get a structured residual set: a table where each column of ones and zeros is a fingerprint.

	fault f₁	fault f₂	fault f₃	fault f₄
r₁	1	0	1	1
r₂	0	1	1	0
r₃	1	1	0	1
r₄	0	1	1	1

Each column is a signature. Click a fault to see which residuals it lights up. Because every column is unique, any single fault can be distinguished from the others.

Figure 3 · Interactive. A structured residual set. If you observe the pattern “r₁=1, r₂=0, r₃=1, r₄=0”, only fault f₁ matches. This is how isolation actually happens in practice.
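
The lookup itself is small enough to write out. The sketch below encodes the columns of the table above as tuples (r₁, r₂, r₃, r₄) and returns every fault whose signature matches the observed pattern; the function and variable names are illustrative only.

```python
# Columns of the signature table: which residuals each fault fires.
SIGNATURES = {
    "f1": (1, 0, 1, 0),
    "f2": (0, 1, 1, 1),
    "f3": (1, 1, 0, 1),
    "f4": (1, 0, 1, 1),
}

def isolate(fired):
    """Return the faults whose signature equals the observed residual pattern."""
    pattern = tuple(fired)
    return [fault for fault, signature in SIGNATURES.items() if signature == pattern]

print(isolate((1, 0, 1, 0)))   # matches f1 only
print(isolate((0, 0, 0, 0)))   # nothing fired, no fault matches
```

Note that f₁ and f₄ differ only in r₄, so in this table a single missed or spurious r₄ decision would swap one diagnosis for the other.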

2.3 Parameter estimation

Some faults do not show up as a step in the output. They show up as drift in a physical parameter: stator resistance climbing because a winding is heating, flux linkage falling because a magnet is demagnetizing, a capacitor’s ESR rising over years of service. Watching the output waits for the consequence. Watching the parameter catches the cause.

The standard tool is Recursive Least Squares (RLS) or the Extended Kalman Filter, run online on the system equation. The estimated parameter is tracked like a state; deviations from its nominal value are the fault indicator.

Figure 4 · Interactive. Legend: true θ (drifting), RLS estimate θ̂, healthy band. The RLS estimate (rust) tracks the true parameter (ink) as it drifts out of the healthy band. The fault is declared once the estimate leaves the green region with confidence.
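
A scalar version of RLS with exponential forgetting shows the idea. The setup is an assumption made for illustration: a resistance R is estimated from v = R·i with noise-free synthetic data, R starts drifting at sample 200, and the healthy band is taken as nominal plus 10 %.

```python
import math

def rls_track(regressors, outputs, theta0=0.0, p0=100.0, forgetting=0.98):
    """Scalar recursive least squares with exponential forgetting."""
    theta, p = theta0, p0
    estimates = []
    for phi, y in zip(regressors, outputs):
        gain = p * phi / (forgetting + phi * p * phi)
        theta = theta + gain * (y - phi * theta)   # correct with prediction error
        p = (p - gain * phi * p) / forgetting      # update the covariance
        estimates.append(theta)
    return estimates

# Synthetic data: current varies so the parameter stays identifiable.
N = 400
current = [1.0 + 0.5 * math.sin(0.1 * k) for k in range(N)]
true_R = [1.0 if k < 200 else 1.0 + 0.0015 * (k - 200) for k in range(N)]
voltage = [R * i for R, i in zip(true_R, current)]

R_hat = rls_track(current, voltage)
upper_limit = 1.1   # hypothetical edge of the healthy band
alarm_index = next((k for k, est in enumerate(R_hat) if est > upper_limit), None)
print("true R leaves the band near sample 267; alarm raised at sample", alarm_index)
```

The forgetting factor sets the trade: closer to 1 gives a smoother estimate that lags a drifting parameter more, while smaller values track faster but pass more measurement noise into the estimate.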

2.4 Analytical redundancy: the unifying idea

Step back from the three families above. What do they share? All of them generate residuals by comparing measured signals against a prediction produced by a model. This is analytical redundancy: using the system’s own equations in place of a duplicated sensor.

Figure 5. All model-based FDI schemes (and, as we’ll see, many data-driven ones) are variations on this single pattern.

3

Data-driven methods — pattern over physics.

Sometimes a first-principles model is out of reach. The machine is too complex, the faults too many, the parameters too uncertain. Data-driven FDI trades the model for examples of what normal looks like, and flags whatever falls outside.

3.1 Multivariate statistics

Collect a matrix of historical measurements from healthy operation. Most of the variability in that data lives in a small number of directions; the rest is noise. Principal Component Analysis (PCA) finds those directions. A new measurement is normal if it projects into roughly the same region.

Two statistics do most of the work:

Hotelling’s T²: Measures how far the projection onto the principal subspace is from the center. Detects unusual operating points within the normal directions.
SPE / Q statistic: Measures the residual energy orthogonal to the principal subspace. Detects novel behavior the model has never seen.

Figure 6 · Interactive. Legend: healthy observation, T² fault (unusual point on the axis), SPE fault (off the axis). Healthy data clusters along a principal axis. A T² fault is still “on-axis” but far from center: an unusual operating point. An SPE fault is off-axis: behavior that has never been observed. The two statistics catch different failure modes.
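
Both statistics fall out of one SVD of the centered healthy data. The sketch below uses synthetic two-variable data scattered along a single axis; the data, the choice of one retained component, and the two test points are assumptions made for illustration, and the control limits that would normally accompany T² and SPE are left out.

```python
import numpy as np

def fit_pca_monitor(X, n_components):
    """Fit mean, loadings and per-component variances from healthy data."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = Vt[:n_components].T                        # (n_vars x k)
    variances = s[:n_components] ** 2 / (X.shape[0] - 1)  # variance per component
    return mean, loadings, variances

def t2_and_spe(x, mean, loadings, variances):
    """Hotelling's T^2 inside the model subspace, SPE (Q) outside it."""
    centered = x - mean
    scores = loadings.T @ centered
    t2 = float(np.sum(scores ** 2 / variances))
    off_model = centered - loadings @ scores
    spe = float(off_model @ off_model)
    return t2, spe

rng = np.random.default_rng(0)
axis = np.array([1.0, 1.0]) / np.sqrt(2.0)
healthy = np.outer(rng.normal(size=500), axis) + 0.05 * rng.normal(size=(500, 2))
model = fit_pca_monitor(healthy, n_components=1)

# Far along the principal axis: large T^2, small SPE.
t2_on, spe_on = t2_and_spe(6.0 * axis, *model)
# Perpendicular to the principal axis: small T^2, large SPE.
t2_off, spe_off = t2_and_spe(np.array([1.0, -1.0]) / np.sqrt(2.0), *model)
print(f"on-axis:  T2={t2_on:.1f}  SPE={spe_on:.4f}")
print(f"off-axis: T2={t2_off:.3f} SPE={spe_off:.3f}")
```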

3.2 Signal processing

Some faults announce themselves in the frequency content of a signal. A cracked rotor bar in an induction motor imprints sidebands around the supply frequency. A bearing defect produces tone families tied to the geometry of the race. A loose mount modulates the carrier with the shaft speed.

The toolkit — FFT, wavelet transform, Hilbert-Huang, envelope analysis, cepstrum — is about getting the fault signature out of whatever it’s hiding inside.

Figure 7 · Interactive. Motor current spectrum. A bearing outer-race defect lifts tones at the BPFO frequency and its harmonics. A broken rotor bar produces sidebands at (1−2s)f_s and (1+2s)f_s. Eccentricity shows up at f_s±f_r. Each signature tells a different story.
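
The rotor-bar sideband formula lends itself to a quick numerical check. The sketch below builds a synthetic stator current with a 50 Hz supply and a slip of 0.03 (both hypothetical), adds small sidebands at (1±2s)f_s, and reads their amplitude from an FFT. The record length is chosen so every tone lands exactly on a frequency bin; measured data would normally need windowing and averaging.

```python
import numpy as np

def rotor_bar_sidebands(f_supply, slip):
    """Broken-rotor-bar sideband frequencies (1 -/+ 2s) * f_s."""
    return (1.0 - 2.0 * slip) * f_supply, (1.0 + 2.0 * slip) * f_supply

def amplitude_at(signal, sample_rate, freq):
    """Single-sided FFT amplitude at the bin nearest to freq."""
    spectrum = np.abs(np.fft.rfft(signal)) * 2.0 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return spectrum[np.argmin(np.abs(freqs - freq))]

sample_rate = 1000.0                       # Hz, hypothetical
t = np.arange(10_000) / sample_rate        # 10 s record, 0.1 Hz resolution
f_supply, slip = 50.0, 0.03                # hypothetical operating point
lower, upper = rotor_bar_sidebands(f_supply, slip)

healthy_current = np.sin(2 * np.pi * f_supply * t)
faulty_current = (healthy_current
                  + 0.02 * np.sin(2 * np.pi * lower * t)
                  + 0.02 * np.sin(2 * np.pi * upper * t))

healthy_amp = amplitude_at(healthy_current, sample_rate, lower)
faulty_amp = amplitude_at(faulty_current, sample_rate, lower)
print(f"sidebands at {lower:.1f} Hz and {upper:.1f} Hz")
print(f"lower sideband amplitude: healthy {healthy_amp:.2e}, faulty {faulty_amp:.3f}")
```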

3.3 Machine learning

When the signature is too messy or too multidimensional for a spectral rule of thumb, classifiers take over. SVMs, random forests, gradient-boosted trees, and one-class methods learn a decision boundary between healthy and faulty — or, more realistically, between healthy and everything that is not healthy.

Figure 8. A classifier learns where to draw the line. The features (RMS, kurtosis, spectral energy bands, wavelet coefficients) are the engineer’s real contribution; the algorithm just finds the boundary.

One-class vs. multi-class

In practice you often have abundant healthy data and very few labeled faults. One-class SVM and Isolation Forest learn only what normal looks like and call everything else anomalous. This matches reality better than training on rare, possibly unrepresentative fault examples.

3.4 Deep learning

Modern diagnostics reach for deep networks when the input is high-dimensional and the signature is not something you want to handcraft. Three patterns dominate:

CNNs on vibration: Treat the raw vibration signal, or a time-frequency image of it, as the input to an image-style classification problem. Learned filters discover bearing-defect patterns that would take a PhD in tribology to specify by hand.
LSTM / Transformer on time series: Capture long-range temporal structure, such as the way a drift builds up over hours of operation. Useful for wear-out and slowly developing electrical faults.
Autoencoders for unsupervised anomaly detection: Train a reconstruction network on healthy data only. Whatever the network reconstructs poorly is flagged. This is the neural-network face of PCA’s SPE statistic.

Figure 9. The autoencoder learns to compress and reconstruct healthy data. Faulty inputs leave the learned manifold, so the reconstruction error (a neural residual) grows.

4

Hybrid methods — the current frontier.

Pure model-based schemes are brittle in the face of unmodeled dynamics. Pure data-driven schemes are brittle outside their training distribution. The increasingly common answer in industry is to combine them.

Figure 10. The physics model provides structure and extrapolation; the data-driven layer captures what physics missed. Together they produce a residual that is both generalizable and accurate.

Three dominant patterns

Residual learning: Run a physics model; train a network only on its residual. The network absorbs unmodeled dynamics without destroying the physical core.
Physics-informed neural networks (PINNs): The loss function includes a penalty for violating known physical laws, which discourages the network from wandering into physically nonsensical regions.
Digital twins: A calibrated high-fidelity simulator of the specific machine, updated online from fleet data. Residuals against the twin catch faults that fleet-average models would miss.
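
The residual-learning pattern can be shown with a deliberately small stand-in. In the sketch below a quadratic least-squares fit takes the place of the neural network, the physics model is a hypothetical linear gain, and the “real” plant has an extra quadratic term the physics model does not know about. Everything here is synthetic and illustrative.

```python
import numpy as np

def physics_model(u):
    """Hypothetical first-principles model: a pure linear gain."""
    return 2.0 * u

def true_plant(u, fault_offset=0.0):
    """Synthetic 'reality': the linear gain plus an unmodeled quadratic term."""
    return 2.0 * u + 0.3 * u ** 2 + fault_offset

# Learn only what the physics model misses, using healthy data.
u_train = np.linspace(0.0, 1.0, 50)
mismatch = true_plant(u_train) - physics_model(u_train)
coeffs = np.polyfit(u_train, mismatch, deg=2)   # stand-in for the network

def hybrid_model(u):
    return physics_model(u) + np.polyval(coeffs, u)

u_test = 0.8
physics_residual = true_plant(u_test) - physics_model(u_test)
hybrid_residual = true_plant(u_test) - hybrid_model(u_test)
faulty_residual = true_plant(u_test, fault_offset=0.1) - hybrid_model(u_test)
print(f"healthy, physics only: {physics_residual:.3f}")
print(f"healthy, hybrid:       {hybrid_residual:.3e}")
print(f"faulty,  hybrid:       {faulty_residual:.3f}")
```

With the mismatch absorbed by the learned layer, the healthy residual sits near zero and the injected offset shows up at close to its full size, which is the property the hybrid scheme is after.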

5

The design problems that decide everything.

Picking a residual generator is the glamorous part. Making it work in the field is about seven quieter problems.

5.1 Residual generation vs. residual evaluation

Every FDI scheme splits into two independent questions. Generation: what signal do I build? Evaluation: what decision logic do I apply to it?

You can bolt a CUSUM detector onto a Luenberger observer or a PCA score. You can pair the same observer with a fixed threshold, a GLR test, or a Bayesian decision rule. Keeping the halves separate makes the design problem tractable and the options clear.

5.2 Robustness vs. sensitivity: the fundamental trade

A residual should be loud when there is a fault and silent when there is not. But disturbances, noise, and model mismatch also excite it. You cannot perfectly separate the two; you can only trade them off.

maximize sensitivity to faults minus sensitivity to disturbances

The formal statement of this trade lives in the H_∞ / H₋ framework. You shape two transfer functions: the one from fault to residual (keep it large in the band where faults live) and the one from disturbance to residual (keep it small). The difficulty, and the subject of continuing research, is that these bands often overlap.
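
One common way to write that objective is sketched below. The symbols are notation assumed here rather than taken from the text: G_rf and G_rd are the transfer matrices from fault and from disturbance to the residual, Ω is the frequency band where faults are expected, β is a designer-chosen weight, and L is the observer gain being tuned. A ratio of the two norms is also used in place of the weighted difference.

```latex
\begin{aligned}
\max_{L}\; J(L) &= \lVert G_{rf} \rVert_{-} \;-\; \beta \,\lVert G_{rd} \rVert_{\infty},
\qquad \beta > 0, \\
\lVert G_{rf} \rVert_{-} &= \inf_{\omega \in \Omega}\, \underline{\sigma}\bigl(G_{rf}(j\omega)\bigr),
\qquad
\lVert G_{rd} \rVert_{\infty} = \sup_{\omega}\, \overline{\sigma}\bigl(G_{rd}(j\omega)\bigr).
\end{aligned}
```

The H₋ index is the smallest gain from fault to residual over the band of interest, so maximizing it guards the worst-case fault sensitivity, while the H_∞ norm bounds the worst-case disturbance gain.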

Figure 11. A good residual pushes the ROC curve into the top-left corner: high detection, low false alarms. A disturbance-decoupled observer (teal) strictly dominates a naive one (rust) at every operating point.

5.3 Threshold design

Once the residual exists, someone has to set a number above which an alarm is raised. The options differ in how they model the world.

Fixed threshold: Simplest. Tune once, live with it. Breaks whenever operating conditions drift.
Adaptive threshold: Moves with the operating point (speed, load, temperature). Standard in automotive.
CUSUM: Accumulates small deviations. Catches slow drifts a fixed threshold would miss.
GLR (Generalized Likelihood Ratio): A likelihood-ratio test for when the shape of the fault signature is known but its onset time and magnitude are not. Outputs a time-of-change and magnitude estimate along with the alarm.
Bayesian: Posterior probability of fault, updated every sample. Gives a principled way to trade off false alarms against missed detections given their cost.
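
The Bayesian entry is the least obvious of the five to picture, so here is a minimal sketch of a per-sample posterior update. It assumes a Gaussian residual with known standard deviation, a known mean shift under fault, and a small per-sample prior probability (hazard) that a fault appears; all three numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fault_posterior(residuals, fault_mean, sigma, hazard=0.01):
    """Recursive posterior probability that a fault is present."""
    p = 0.0
    history = []
    for r in residuals:
        # A fault may have appeared since the last sample.
        prior = p + (1.0 - p) * hazard
        like_fault = gaussian_pdf(r, fault_mean, sigma)
        like_healthy = gaussian_pdf(r, 0.0, sigma)
        # Bayes' rule for the two hypotheses.
        p = prior * like_fault / (prior * like_fault + (1.0 - prior) * like_healthy)
        history.append(p)
    return history

residuals = [0.0] * 50 + [1.0] * 20          # synthetic: fault begins at sample 50
posterior = fault_posterior(residuals, fault_mean=1.0, sigma=0.5)
alarm_index = next(k for k, p in enumerate(posterior) if p > 0.99)
print("posterior before fault:", round(posterior[49], 4))
print("alarm raised at sample:", alarm_index)
```

The 0.99 alarm level is arbitrary. With a cost C_fa for a false alarm and C_miss for a missed detection, the cost-minimizing rule is to alarm once the posterior exceeds C_fa / (C_fa + C_miss).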

Figure 12 · Interactive. Readouts: threshold (0.45 as shown), true positives, false negatives, false alarms, true negatives. Drag the threshold. Too low and the left half (healthy) starts triggering nuisance alarms. Too high and the right half (faulty) slips through as missed detections. There is no free lunch.
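
The four readouts are just a confusion count at a given threshold. The sketch below sweeps three thresholds over a tiny hand-made set of residual magnitudes; the scores and labels are invented for illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when an alarm is raised for every score above threshold."""
    counts = {"true_pos": 0, "false_neg": 0, "false_alarm": 0, "true_neg": 0}
    for score, is_faulty in zip(scores, labels):
        alarm = score > threshold
        if is_faulty:
            counts["true_pos" if alarm else "false_neg"] += 1
        else:
            counts["false_alarm" if alarm else "true_neg"] += 1
    return counts

# Invented residual magnitudes: five healthy samples, then five faulty ones.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [False] * 5 + [True] * 5

for threshold in (0.05, 0.45, 0.95):
    print(threshold, confusion_counts(scores, labels, threshold))
```

Because the healthy and faulty score ranges overlap, no threshold in this example reaches zero false alarms and zero missed detections at the same time.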

5.4 Fault isolation

Detection says something is wrong. Isolation says what. The three workhorses:

Structured residuals: The parity-space fingerprint idea of §2.2, in which each fault produces a unique binary signature across a bank of residuals.
Directional residuals: Each fault produces a residual vector pointing in a known direction in signal space. The angle of the observed residual picks out the fault.
Dedicated observer scheme (DOS): One observer per fault, each engineered to be sensitive to exactly one fault and blind to the others. Diagnostic by construction; expensive by count.
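
For the directional variant, isolation reduces to finding the known fault direction with the smallest angle to the observed residual. The sketch below uses the absolute cosine so that positive and negative fault magnitudes map to the same direction; the three directions and the observed residual are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def isolate_by_direction(residual, directions):
    """Pick the fault whose direction is best aligned with the residual."""
    scores = {name: abs(cosine(residual, d)) for name, d in directions.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical fault directions in a three-dimensional residual space.
directions = {
    "f1": (1.0, 0.0, 0.0),
    "f2": (0.0, 1.0, 1.0),
    "f3": (1.0, 1.0, 0.0),
}
observed = (0.1, 0.9, 1.1)   # noisy residual, roughly along f2

best, scores = isolate_by_direction(observed, directions)
print("isolated fault:", best)
print({name: round(value, 3) for name, value in scores.items()})
```

The alignment score depends only on the angle, not on residual magnitude, so a separate detection step is still needed to decide whether the residual is large enough to act on.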

5.5 Fault identification and estimation

Beyond “which”, you often want to know how much and when it started. Identification recovers the fault signal — magnitude, time profile, severity. Tools include augmented-state observers that treat the fault as an unknown input to estimate, adaptive observers that parameterize the fault and adapt, and GLR tests that directly output the estimated fault onset and amplitude.

In automotive diagnostics, identification feeds directly into derating: a small fault triggers torque limiting; a large one triggers shutdown. The quantitative estimate determines the response.

5.6 Detectability and isolability

Before you design anything, there is a prior question: are the faults even distinguishable from the outputs you have? This is a structural property of the system, not a matter of algorithmic cleverness.

A fault is detectable if it leaves a trace in the measured output that no disturbance and no healthy operating-point change can mimic. It is isolable from a second fault if the two leave different traces. Both properties have algebraic characterizations — rank conditions on the system’s transfer matrix extended with the fault and disturbance directions.

If the check fails, no observer, no classifier, and no network can fix it. The remedy has to be structural, which usually means an additional sensor.
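
For the static measurement model of §2.2 the rank test is short enough to write down. The sketch below is a simplified special case, not the general transfer-matrix condition: it assumes y = Cx + E·d + F·f with constant matrices, calls a fault detectable when its direction adds rank to [C E], and calls two faults isolable when together they add two. The sensor layout is hypothetical.

```python
import numpy as np

def _rank(*blocks):
    return np.linalg.matrix_rank(np.hstack(blocks))

def is_detectable(C, E, f):
    """Fault direction f is not explained by the state or the disturbance."""
    return _rank(C, E, f.reshape(-1, 1)) > _rank(C, E)

def is_isolable(C, E, f1, f2):
    """Both fault directions stay independent once state and disturbance are removed."""
    return _rank(C, E, f1.reshape(-1, 1), f2.reshape(-1, 1)) == _rank(C, E) + 2

def unit(n, i):
    e = np.zeros(n)
    e[i] = 1.0
    return e

# Three sensors measure the same scalar state; a disturbance hits sensor 1.
C3, E3 = np.ones((3, 1)), unit(3, 0).reshape(-1, 1)
print("sensor-1 fault detectable:", is_detectable(C3, E3, unit(3, 0)))
print("sensor-2 fault detectable:", is_detectable(C3, E3, unit(3, 1)))
print("sensor 2 vs 3 isolable:   ", is_isolable(C3, E3, unit(3, 1), unit(3, 2)))

# Adding a fourth sensor of the same state changes the answer.
C4, E4 = np.ones((4, 1)), unit(4, 0).reshape(-1, 1)
print("with a 4th sensor:        ", is_isolable(C4, E4, unit(4, 1), unit(4, 2)))
```

In the three-sensor layout a fault on sensor 1 looks exactly like the disturbance, and faults on sensors 2 and 3 can be detected but not told apart; the fourth sensor is what makes them separable.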

5.7 Incipient vs. abrupt faults

Not all faults behave the same way. Abrupt faults hit in a single sample — an IGBT opens, a wire breaks, a sensor saturates. Incipient faults develop slowly — a bearing wears, insulation degrades, a magnet loses flux. The signals tell different stories and call for different detectors.

Figure 13. Two fault shapes, two detection strategies. A CUSUM detector tuned for slow drifts reacts to an abrupt fault later than a simple threshold does; a fixed threshold is almost blind to an incipient fault (it may not trigger until the damage is catastrophic). Running the two in parallel is the robust production approach.
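
The two detectors can be compared on synthetic residuals. In the sketch below the abrupt fault is a step, the incipient fault is a slow ramp, and the tuning values (fixed threshold 0.5, CUSUM drift 0.05 and threshold 5.0) are hypothetical; the residuals are noise-free to keep the comparison readable.

```python
def fixed_threshold_alarm(residuals, threshold):
    """Index of the first sample whose magnitude exceeds the threshold."""
    for k, r in enumerate(residuals):
        if abs(r) > threshold:
            return k
    return None

def cusum_alarm(residuals, drift, threshold):
    """One-sided CUSUM: accumulate residual minus drift, floor at zero."""
    g = 0.0
    for k, r in enumerate(residuals):
        g = max(0.0, g + r - drift)
        if g > threshold:
            return k
    return None

def first_alarm(*alarms):
    """Parallel combination: whichever detector fires first."""
    fired = [a for a in alarms if a is not None]
    return min(fired) if fired else None

onset = 100
abrupt = [0.0] * onset + [1.0] * 400
incipient = [0.0] * onset + [0.002 * m for m in range(400)]

results = {}
for name, signal in (("abrupt", abrupt), ("incipient", incipient)):
    fixed = fixed_threshold_alarm(signal, threshold=0.5)
    cusum = cusum_alarm(signal, drift=0.05, threshold=5.0)
    results[name] = (fixed, cusum, first_alarm(fixed, cusum))
    print(f"{name}: fixed={fixed}, cusum={cusum}, parallel={results[name][2]}")
```

With these settings the fixed threshold wins on the step and the CUSUM wins by a wide margin on the ramp, so taking the earlier of the two alarms gets the better response in both cases.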

6

References & further reading

A short curated list. The foundational textbooks come first; the rest point into specific topics raised above.

Foundational texts

Isermann, R. — Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance. Springer, 2006. The engineer’s reference. Covers observers, parity, parameter estimation, and applications in one coherent volume.
Chen, J. and Patton, R. J. — Robust Model-Based Fault Diagnosis for Dynamic Systems. Kluwer, 1999. The book on UIOs and disturbance-decoupled residual generation.
Ding, S. X. — Model-Based Fault Diagnosis Techniques. 2nd ed., Springer, 2013. The most complete modern treatment of the model-based side, including H_∞/H₋ formulations.
Gertler, J. — Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, 1998. The classical parity-space and structured-residual reference.
Blanke, M. et al. — Diagnosis and Fault-Tolerant Control. 3rd ed., Springer, 2016. Bridges detection and reconfiguration; strong on structural analysis.

Data-driven and statistical

Qin, S. J. — “Survey on data-driven industrial process monitoring and diagnosis.” Annual Reviews in Control, 2012. The reference survey on PCA/PLS process monitoring.
MacGregor, J. F. and Kourti, T. — “Statistical process control of multivariate processes.” Control Engineering Practice, 1995. T² and SPE as deployed in real plants.
Basseville, M. and Nikiforov, I. — Detection of Abrupt Changes: Theory and Application. Prentice Hall, 1993. The canonical book on CUSUM, GLR, and change detection.

Deep learning & hybrid

Lei, Y. et al. — “Applications of machine learning to machine fault diagnosis: A review and roadmap.” Mechanical Systems and Signal Processing, 2020. Balanced survey including CNN/LSTM approaches and open challenges.
Raissi, M., Perdikaris, P., Karniadakis, G. E. — “Physics-informed neural networks.” J. Computational Physics, 2019. The foundational PINN paper.
Willard, J. et al. — “Integrating scientific knowledge with machine learning for engineering and environmental systems.” ACM Computing Surveys, 2022. The taxonomy of hybrid physics-ML methods.

Motor and inverter diagnostics

Nandi, S., Toliyat, H. A., Li, X. — “Condition monitoring and fault diagnosis of electrical motors — a review.” IEEE Trans. Energy Conversion, 2005. Still the canonical reference for MCSA and motor fault signatures.
Choi, U.-M., Blaabjerg, F., Lee, K.-B. — “Study and handling methods of power IGBT module failures in power electronic converter systems.” IEEE Trans. Power Electronics, 2015.
Riera-Guasp, M., Antonino-Daviu, J. A., Capolino, G.-A. — “Advances in electrical machines, power electronics, and drives for condition monitoring and fault detection.” IEEE Trans. Industrial Electronics, 2015.