Find the widest possible gap between two groups. Use a trick to make the gap curved. Then enclose a single cluster with the same machinery — and you have an anomaly detector.
You have two groups of points in a plane and you'd like to draw a line between them. How should you choose which line? There are infinitely many that separate the groups cleanly — steeper, shallower, tilted left, tilted right — and from the perspective of training accuracy, they're all equally good. They all classify the training data perfectly. But some of them feel right and some of them feel wrong, and the intuition that says so turns out to be mathematically precise.
Vladimir Vapnik and his colleagues formalized that intuition in the 1960s, and the 1995 paper by Corinna Cortes and Vapnik gave it the form in use today. Of all the lines that separate the two classes, pick the one that leaves the widest possible gap between itself and the nearest points of either class. That gap, the distance from the line to the closest training point, is the margin. Maximizing the margin gives a unique answer, and that answer tends to generalize better than an arbitrary separator. Points far from the decision boundary don't matter. Only the few points that sit right at the edge of the gap, the support vectors, determine the solution.
Among all lines that separate the two classes, the best one is the line that sits as far as possible from the nearest point on either side. The margin is everything.
Below is a 2-D scatter of two linearly separable classes with the maximum-margin separator drawn on top. The solid line is the decision boundary. The dashed lines are the two margins — the frontier of the gap. Points sitting exactly on the margins (there will always be at least one per class in a perfectly fit solution) are the support vectors, marked with rings. Click anywhere in the plot to add a new point; the solver refits and the boundary updates. Toggle between green and red to choose which class your new point belongs to.
The mathematics that produces this picture is compact enough to state in three lines. Write the separator as w·x + b = 0. Points with class label yi = +1 should sit on one side and yi = −1 on the other. The SVM finds w and b by solving:
minimize ½ ‖w‖² subject to yi(w·xi + b) ≥ 1 for all i
The gap between the two margin lines has width 2 / ‖w‖ (each one sits 1 / ‖w‖ from the boundary), so minimizing ‖w‖ maximizes the margin. The constraint says every training point must sit on or outside the margin on the correct side. That's it. Everything else in support vector theory (kernels, soft margins, one-class variants) is an elaboration of this core.
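To see the 2 / ‖w‖ relationship on actual numbers, here is a minimal sketch using scikit-learn's `SVC`. The four-point dataset is made up for illustration. scikit-learn has no pure hard-margin mode, so the sketch assumes that a very large penalty parameter `C` is close enough: it makes margin violations effectively forbidden.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (hypothetical): two points per class, separated along the first axis.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# Assumption: a very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]

# Width of the gap between the two margin lines.
margin_width = 2.0 / np.linalg.norm(w)

# The constraint y_i (w . x_i + b) >= 1, evaluated for every training point.
functional_margins = y * (X @ w + b)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("y_i (w . x_i + b) =", functional_margins)
```

By construction the widest gap here is the strip between x₁ = 0 and x₁ = 2, so the solver should land near w = (1, 0), b = −1 and a margin width of about 2, up to its numerical tolerance.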
The formulation above assumes the classes are perfectly separable with a straight line, which real data almost never is. Classes overlap, labels are noisy, one or two points sit on the wrong side of where they "should" be. With the hard-margin constraint yi(w·xi + b) ≥ 1, a single mislabeled point sitting deep inside the other class makes the optimization infeasible: no w and b can satisfy every constraint, so no solution exists.
The fix, due to Cortes and Vapnik in that same 1995 paper: introduce slack variables ξi ≥ 0 that measure how badly each point violates its margin. A point on or outside its margin has zero slack. A point inside the margin (but on the right side of the boundary) has slack between 0 and 1. A point on the wrong side of the boundary entirely has slack greater than 1. Then add a penalty to the objective for total slack:
minimize ½ ‖w‖² + C · Σ ξi subject to yi(w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
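At the optimum each slack takes the smallest value the constraints allow, which is ξi = max(0, 1 − yi(w·xi + b)). The sketch below evaluates that expression for a fixed, hand-picked separator (not a fitted one) so the three slack regimes are visible. The points, `w`, `b`, and the helper name `slack` are all illustrative assumptions.

```python
import numpy as np

def slack(w, b, X, y):
    """Smallest feasible slack per point: max(0, 1 - y_i (w . x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Hypothetical separator x1 = 0, with margin lines at x1 = -1 and x1 = +1.
w = np.array([1.0, 0.0])
b = 0.0

# Four points, all labeled +1, at different positions relative to the margin.
X = np.array([
    [2.0, 0.0],    # outside the margin, correct side
    [1.0, 0.0],    # exactly on the margin
    [0.5, 0.0],    # inside the margin, still on the correct side
    [-0.5, 0.0],   # wrong side of the boundary
])
y = np.array([1, 1, 1, 1])

xi = slack(w, b, X, y)
print("slacks:", xi)            # 0, 0, 0.5, 1.5

# Soft-margin objective for this (w, b) with C = 1.
C = 1.0
objective = 0.5 * (w @ w) + C * xi.sum()
print("objective:", objective)  # 0.5 + 2.0 = 2.5
```

The first two points cost nothing, the third pays 0.5 for intruding into the margin, and the misclassified fourth point pays 1.5.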
The hyperparameter C sets the exchange rate between margin and violations. Large C means "slack is expensive" — the solver tries hard to classify every point correctly, producing a narrow margin that hugs the data closely. Small C means "slack is cheap" — the solver prefers a wide, smooth margin even if several points violate it. C is the single most important knob on an SVM, and it should be tuned by cross-validation, not by vibes.
To feel this trade-off, here is the same dataset but with enough overlap between classes that no clean separator exists. Slide C from tight to loose and watch the decision boundary, the margin width, the number of support vectors, and the count of misclassified points all shift together.
Notice that as C shrinks, the number of support vectors grows. That's not a bug: at small C the solver is happy to pay slack and keep the margin wide, which means more points sit on or inside the margin and hence qualify as support vectors. The count of support vectors is a handy diagnostic: if nearly every training point is a support vector, your C is probably too small (or the classes overlap so heavily that no margin helps); if only a handful are, your C may be too large for a noisy problem.
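The same trade-off can be reproduced offline. The following is a rough sketch with scikit-learn's `SVC` on made-up overlapping data (two Gaussian blobs; the centers, sample size, and the three values of C are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical overlapping classes: unit-variance Gaussian blobs, centers 2 apart.
n = 100
X = np.vstack([
    rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n, 2)),
    rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(n, 2)),
])
y = np.array([-1] * n + [1] * n)

results = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = int(clf.n_support_.sum())               # support vectors across both classes
    n_wrong = int((clf.predict(X) != y).sum())     # training points on the wrong side
    results[C] = (margin_width, n_sv, n_wrong)
    print(f"C={C:<6} margin={margin_width:.2f}  support vectors={n_sv}  misclassified={n_wrong}")
```

On data like this, the smallest C should report the widest margin and the most support vectors, and both should shrink as C grows.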
Soft margins let us tolerate noise around an essentially linear separator, but plenty of real problems aren't even approximately linear. Classes sit inside each other in rings; they twist around each other in spirals; two features interact in an XOR pattern that no hyperplane can cut cleanly. For these, we need a curved boundary. SVMs get curved boundaries through a move of surprising elegance.
Start with the observation that if we had the data in a higher-dimensional space where it became linearly separable, the original SVM machinery would just work. For concentric circles in 2-D, the feature r² = x₁² + x₂² splits inner from outer at a single threshold — and in (x₁, x₂, r²) space, a plane does the job. Lift the data, run linear SVM, project the boundary back, and you get a circle.
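Here is the lift written out with NumPy, as a sketch on synthetic rings (the radii 1 and 3 and the noise level are assumptions made for the example). The separating plane is chosen by hand at r² = 4 rather than fitted, so it is a valid separator but not necessarily the maximum-margin one.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n, noise=0.05):
    """Sample n points near a circle of the given radius (hypothetical toy data)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = radius + rng.normal(scale=noise, size=n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

inner = ring(1.0, 100)
outer = ring(3.0, 100)
X = np.vstack([inner, outer])
y = np.array([-1] * 100 + [1] * 100)

# Lift: append r^2 = x1^2 + x2^2 as a third coordinate.
r2 = (X ** 2).sum(axis=1)
X_lifted = np.column_stack([X, r2])

# In (x1, x2, r^2) space the plane r^2 = 4 splits the classes:
# w = (0, 0, 1), b = -4. Back in the input plane this is the circle of radius 2.
w = np.array([0.0, 0.0, 1.0])
b = -4.0
pred = np.sign(X_lifted @ w + b)
accuracy = float((pred == y).mean())
print("accuracy =", accuracy)
```

No line in the original two coordinates can do this, but a flat plane in three coordinates classifies every point correctly.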
The problem is that for complex patterns, we don't know what features to add, and even if we did, computing them explicitly can be ruinously expensive (sometimes infinite-dimensional). The kernel trick sidesteps both problems. If you look carefully at the SVM's dual form, the only way feature vectors ever appear is through inner products xi·xj. A kernel function K(x, y) computes the inner product of two lifted vectors without ever computing the lifts themselves. Replace every inner product with a kernel evaluation and you get a nonlinear SVM — the same algorithm, running in a space you never visit.
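A small kernel makes the claim checkable by hand. The sketch below uses the homogeneous quadratic kernel K(x, y) = (x·y)², chosen here only because its feature map is short enough to write out; it is not the kernel used in the rest of this article. The function names `lift` and `poly_kernel` are illustrative.

```python
import numpy as np

def lift(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, y):
    """Quadratic kernel K(x, y) = (x . y)^2, computed entirely in input space."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = float(np.dot(lift(x), lift(y)))  # inner product after lifting to 3-D
implicit = poly_kernel(x, y)                # same number, no lift computed

print(explicit, implicit)
```

Both routes give 1 for this pair (x·y = 3 − 2 = 1, squared), but the kernel never builds the three-dimensional vectors. For kernels whose feature space is huge or infinite, that shortcut is the only practical route.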
K(x, y) = exp( − γ ‖x − y‖² )
The RBF (Radial Basis Function) kernel above is the workhorse choice: infinite-dimensional implicit feature space, smooth decision boundaries, one hyperparameter γ controlling locality. Small γ = wide Gaussians = smooth, almost-linear boundaries. Large γ = narrow Gaussians = tightly wiggling boundaries that can memorize the training data.
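To get a feel for what γ does to similarity, here is a short NumPy sketch that evaluates the kernel on three made-up points at distances 1 and 3 from each other. The helper name `rbf_kernel_matrix` is an assumption of this example.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Three hypothetical points on a line: 0, 1, and 3 units from the origin.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

for gamma in [0.1, 1.0, 10.0]:
    K = rbf_kernel_matrix(X, X, gamma)
    print(f"gamma={gamma}")
    print(np.round(K, 3))

K_small = rbf_kernel_matrix(X, X, 0.1)
K_large = rbf_kernel_matrix(X, X, 10.0)
```

The diagonal is always 1, since K(x, x) = exp(0). With small γ even the far-apart points still look similar; with large γ every point is similar only to itself, which is what lets the boundary memorize individual training points.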
Here are two concentric rings of points — a pattern that no straight line can separate. The linear SVM does its honest best and produces a meaningless boundary that cuts through both rings. Flip to the RBF kernel and the boundary wraps the inner ring cleanly. Adjust γ to see the transition between smooth-and-almost-linear and jagged-and-overfit.
Choosing γ is a balance just like choosing C. Default practice: do a joint grid search over log₁₀ C ∈ [−3, 3] and log₁₀ γ ∈ [−3, 3], and pick the pair with the best cross-validation score. For RBF kernels, γ multiplies squared distances, so its meaning depends on the scale of your features, which is why you should standardize inputs before fitting any kernel SVM.
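One way that search might look in scikit-learn is sketched below. The dataset (`make_circles` with arbitrary noise and ring spacing), the seven grid points per axis, and the 5-fold split are all assumptions made for the example, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: two noisy concentric rings.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

# Scaling lives inside the pipeline so each CV fold is standardized
# using only its own training portion.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Seven values per axis, from 10^-3 to 10^3.
param_grid = {
    "svc__C": np.logspace(-3, 3, 7),
    "svc__gamma": np.logspace(-3, 3, 7),
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("best parameters:", best_params)
print("best CV accuracy:", round(best_score, 3))
```

Putting the scaler in the pipeline, rather than scaling once up front, keeps validation folds from leaking into the statistics used for standardization.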
Now turn the problem inside out. Suppose you don't have two classes. You have one — the normal data — and you'd like to recognize when a new point doesn't belong. No anomaly labels available, because you don't know what the anomalies look like (if you did, you'd just train a classifier). This is the setting of novelty detection, and it is everywhere: machine condition monitoring, fraud, intrusion, quality screening, rare medical events.
Schölkopf and colleagues proposed, in 1999, a beautiful adaptation of the SVM for exactly this problem. Work in the lifted RBF feature space, where every data point maps to a vector of unit norm (because K(x, x) = 1). In that space, find the hyperplane that separates the data from the origin with maximum margin. Project the decision surface back to input space, and it becomes a closed curve that encloses the bulk of the training data: a tight wrapping of the "normal" region. Anything outside the wrapping is flagged as anomalous.
f(x) = Σ αi K(x, xi) − ρ
Besides the kernel's γ, the formulation has one intuitive hyperparameter of its own: ν (nu), which is an upper bound on the fraction of training points allowed to fall outside the wrapping and a lower bound on the fraction of support vectors. Choosing ν = 0.05 says "I expect about 5% of my training data is contaminated with outliers, and I want the boundary loose enough that those sit outside it." Choosing ν = 0.5 says "up to half of my data may sit outside," which produces a very tight boundary around the densest core.
Two clusters of "normal" training data, no labels involved, and a One-Class SVM drawing the boundary of the normal region with an RBF kernel. The green-to-pale gradient is the decision function: green regions are safely normal, pale regions are safely anomalous, and the solid line is the threshold between them. Click anywhere on the plot to probe a test point — its score will appear, and you'll see whether the model calls it an inlier or an outlier.
Two behaviors are worth studying. First, slide ν up and watch the boundary contract toward the dense cores of each cluster — points that used to be safely inside will pop out as the model becomes more selective. Second, slide γ up and watch the boundary become jagged, eventually fragmenting into tight little pockets around individual points. Both extremes are failure modes: overly broad (small ν, small γ) and you catch no anomalies; overly tight (large ν, large γ) and you flag your own normal data. The right settings come from cross-validated scoring on a held-out set of known anomalies when available, and from judgement when not.
SVMs do well on datasets with a few thousand to tens of thousands of labeled examples, where they often match or beat more elaborate models. This holds especially in high-dimensional spaces like text, where the curse of dimensionality hurts many competitors more than it hurts margin-based methods.
In text classification, a linear SVM on TF-IDF features was a dominant method for roughly two decades and remains a strong baseline. The features are high-dimensional and sparse, with relatively clean linear structure: everything SVMs are designed for.
Computational biology and chemistry combine small labeled sample sizes, engineered high-dimensional features (gene expression profiles, molecular descriptors), and a need for strong regularization. Kernel SVMs with domain-specific kernels (string, graph) were a cornerstone of these fields.
OCSVM's home turf: machine condition monitoring, sensor fault detection in automotive and aerospace, early-warning signals in industrial processes. Train on healthy operating points, flag deviations. Particularly useful when faults are rare or never labeled.
OCSVM on normal network traffic patterns, flagging anomalous sessions or packet sequences. Same logic as fault detection, different domain vocabulary.
Before deep learning, SVMs on hand-crafted image features (HOG, SIFT) were state-of-the-art. Today they're still a reasonable baseline and useful in transfer-learning pipelines where you train a classifier on top of frozen pretrained embeddings.
When you have tens to hundreds of examples per class, not thousands, SVM's strong regularization gives better generalization than most alternatives. A good default when deep learning's appetite for data can't be satisfied.
Production-line pass/fail screening where labeled failures are scarce. OCSVM trained on "known-good" measurements flags any unit whose multivariate signature drifts, without needing an enumerated list of what can go wrong.