← Back to Autonomy
Monograph № 4 · On Margin-Based Classification

Support Vector Machines
& their one-class cousin

Find the widest possible gap between two groups. Use a trick to make the gap curved. Then enclose a single cluster with the same machinery — and you have an anomaly detector.

By Majid Mazouchi

Pick the widest gap

You have two groups of points in a plane and you'd like to draw a line between them. How should you choose which line? There are infinitely many that separate the groups cleanly — steeper, shallower, tilted left, tilted right — and from the perspective of training accuracy, they're all equally good. They all classify the training data perfectly. But some of them feel right and some of them feel wrong, and the intuition that says so turns out to be mathematically precise.

Vladimir Vapnik and Corinna Cortes formalized that intuition in 1995. Of all the lines that separate the two classes, pick the one that leaves the widest possible gap between itself and the nearest points of either class. That gap — the distance from the line to the closest training point — is the margin. Maximizing the margin gives a unique answer, and the unique answer turns out to generalize far better than an arbitrary separator. Points far from the decision boundary don't matter. Only the few points that sit right at the edge of the gap — the support vectors — determine everything.

Key idea

Among all lines that separate the two classes, the best one is the line that sits as far as possible from the nearest point on either side. The margin is everything.

The maximum-margin hyperplane

Below is a 2-D scatter of two linearly separable classes with the maximum-margin separator drawn on top. The solid line is the decision boundary. The dashed lines are the two margins — the frontier of the gap. Points sitting exactly on the margins (there will always be at least one per class in a perfectly fit solution) are the support vectors, marked with rings. Click anywhere in the plot to add a new point; the solver refits and the boundary updates. Toggle between green and red to choose which class your new point belongs to.

Class +1 Class −1 Support vector
The SVM solves a constrained optimization: maximize the margin 2 / ‖w‖ subject to yi(w·xi + b) ≥ 1 for every point. The result is determined by only the support vectors — delete any non-support-vector point and the solution doesn't move. Click to add points and watch which ones become support vectors.

The mathematics that produces this picture is compact enough to state in three lines. Write the separator as w·x + b = 0. Points with class label yi = +1 should sit on one side and yi = −1 on the other. The SVM finds w and b by solving:

minimize ½ ‖w‖²   subject to yi(w·xi + b) ≥ 1 for all i

The margin width turns out to be 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin. The constraint says every training point must sit outside the margin on the correct side. That's it. Everything else in support vector theory — kernels, soft margins, one-class variants — is an elaboration of this core.

Life is not linearly separable

The formulation above assumes the classes are perfectly separable with a straight line, which real data almost never is. Classes overlap, labels are noisy, one or two points sit on the wrong side of where they "should" be. With the hard-margin constraint yi(w·xi + b) ≥ 1, a single mislabeled point makes the optimization infeasible — no solution exists.

The fix, due to Cortes and Vapnik in that same 1995 paper: introduce slack variables ξi ≥ 0 that measure how badly each point violates its margin. A point inside the margin (but on the right side of the boundary) has a small slack. A point on the wrong side of the boundary entirely has slack greater than 1. Then add a penalty to the objective for total slack:

minimize ½ ‖w‖² + C · Σ ξi   subject to yi(w·xi + b) ≥ 1 − ξi

The hyperparameter C sets the exchange rate between margin and violations. Large C means "slack is expensive" — the solver tries hard to classify every point correctly, producing a narrow margin that hugs the data closely. Small C means "slack is cheap" — the solver prefers a wide, smooth margin even if several points violate it. C is the single most important knob on an SVM, and it should be tuned by cross-validation, not by vibes.

Playing with C

To feel this trade-off, here is the same dataset but with enough overlap between classes that no clean separator exists. Slide C from tight to loose and watch the decision boundary, the margin width, the number of support vectors, and the count of misclassified points all shift together.

Class +1 Class −1 Support vector
Large C produces a narrow margin that fits the training data tightly — and risks overfitting. Small C allows wide, smooth margins with many support vectors — and risks underfitting. The "right" C almost never coincides with the best training accuracy; it's the one that gives the best validation accuracy, and it's typically orders of magnitude away from the endpoints of whatever range you first considered.

Notice that as C shrinks, the number of support vectors grows. That's not a bug — at small C the solver is happy to pay slack and keep the margin wide, which means more points sit on or inside the margin and hence qualify as support vectors. The count of support vectors is a handy diagnostic: if nearly every training point is a support vector, your C is too small; if only a handful are, your C may be too large for a noisy problem.

The kernel trick

Soft margins let us tolerate noise around an essentially linear separator, but plenty of real problems aren't even approximately linear. Classes sit inside each other in rings; they twist around each other in spirals; two features interact in an XOR pattern that no hyperplane can cut cleanly. For these, we need a curved boundary. SVMs get curved boundaries through a move of surprising elegance.

Start with the observation that if we had the data in a higher-dimensional space where it became linearly separable, the original SVM machinery would just work. For concentric circles in 2-D, the feature r² = x₁² + x₂² splits inner from outer at a single threshold — and in (x₁, x₂, r²) space, a plane does the job. Lift the data, run linear SVM, project the boundary back, and you get a circle.

The problem is that for complex patterns, we don't know what features to add, and even if we did, computing them explicitly can be ruinously expensive (sometimes infinite-dimensional). The kernel trick sidesteps both problems. If you look carefully at the SVM's dual form, the only way feature vectors ever appear is through inner products xi·xj. A kernel function K(x, y) computes the inner product of two lifted vectors without ever computing the lifts themselves. Replace every inner product with a kernel evaluation and you get a nonlinear SVM — the same algorithm, running in a space you never visit.

K(x, y) = exp( − γ ‖x − y‖² )

The RBF (Radial Basis Function) kernel above is the workhorse choice: infinite-dimensional implicit feature space, smooth decision boundaries, one hyperparameter γ controlling locality. Small γ = wide Gaussians = smooth, almost-linear boundaries. Large γ = narrow Gaussians = tightly wiggling boundaries that can memorize the training data.

Kernels in action

Here are two concentric rings of points — a pattern that no straight line can separate. The linear SVM does its honest best and produces a meaningless boundary that cuts through both rings. Flip to the RBF kernel and the boundary wraps the inner ring cleanly. Adjust γ to see the transition between smooth-and-almost-linear and jagged-and-overfit.

Class +1 (inner) Class −1 (outer)
The background shading is the decision function f(x) = Σ αi yi K(x, xi) + b, evaluated on a grid. The solid curve is f(x) = 0, the boundary. Linear SVM produces a straight line that cannot work here. RBF with moderate γ produces a smooth closed curve that separates the rings; with γ too large, the boundary degenerates into lumpy islands around individual points.

Choosing γ is a balance just like choosing C. Default practice: do a joint grid search over log C ∈ [-3, 3] and log γ ∈ [-3, 3], pick the pair with the best cross-validation score. For RBF kernels, γ also depends on the scale of your features — which is why standardizing inputs is mandatory before fitting any kernel SVM.

One class, no labels

Now turn the problem inside out. Suppose you don't have two classes. You have one — the normal data — and you'd like to recognize when a new point doesn't belong. No anomaly labels available, because you don't know what the anomalies look like (if you did, you'd just train a classifier). This is the setting of novelty detection, and it is everywhere: machine condition monitoring, fraud, intrusion, quality screening, rare medical events.

Schölkopf and colleagues proposed, in 1999, a beautiful adaptation of the SVM for exactly this problem. Work in the lifted RBF feature space, where every data point maps to a vector of unit norm. In that space, find the hyperplane that separates the data from the origin with maximum margin. Project the decision surface back to input space, and it becomes a closed curve that encloses the bulk of the training data — a tight wrapping of the "normal" region. Anything outside the wrapping is flagged as anomalous.

f(x) = Σ αi K(x, xi) − ρ

The formulation has a single intuitive hyperparameter: ν (nu), which bounds both the fraction of training points allowed to fall outside the wrapping and the fraction of support vectors. Choosing ν = 0.05 says "I expect about 5% of my training data is contaminated with outliers, and I want the boundary loose enough that those sit outside it." Choosing ν = 0.5 says "half of my data should be outside" — which produces a very tight boundary around the densest core.

Enclosing the normal

Two clusters of "normal" training data, no labels involved, and a One-Class SVM drawing the boundary of the normal region with an RBF kernel. The green-to-pale gradient is the decision function: green regions are safely normal, pale regions are safely anomalous, and the solid line is the threshold between them. Click anywhere on the plot to probe a test point — its score will appear, and you'll see whether the model calls it an inlier or an outlier.

Training data Boundary
Click anywhere on the plot to test a point. Points inside the dark contour are flagged as normal; points outside are flagged as anomalous.
The decision surface with an RBF kernel looks a lot like a kernel density estimate thresholded at a quantile — and that's no coincidence. One-Class SVM with RBF is, to within a monotone transformation of the score, doing exactly that. As ν rises, the boundary tightens around the densest regions, shedding peripheral points. As γ rises, the boundary becomes more jagged, tracking the local shape of each cluster; too large and it fragments into islands.

Two behaviors are worth studying. First, slide ν up and watch the boundary contract toward the dense cores of each cluster — points that used to be safely inside will pop out as the model becomes more selective. Second, slide γ up and watch the boundary become jagged, eventually fragmenting into tight little pockets around individual points. Both extremes are failure modes: overly broad (small ν, small γ) and you catch no anomalies; overly tight (large ν, large γ) and you flag your own normal data. The right settings come from cross-validated scoring on a held-out set of known anomalies when available, and from judgement when not.

Where these methods earn their keep

Small-to-medium classification

Datasets with a few thousand to tens of thousands of labeled examples where SVMs often match or beat more elaborate models — especially in high-dimensional spaces like text, where the curse of dimensionality hurts many competitors but doesn't bother margin-based methods.

Text & document classification

Linear SVM on TF-IDF features was the dominant method for two decades and remains a strong baseline. High-dimensional, sparse features; relatively clean linear structure; everything SVMs are designed for.

Bioinformatics & chemometrics

Small labeled sample sizes, engineered high-dimensional features (gene expression profiles, molecular descriptors), strong regularization need. Kernel SVMs with domain-specific kernels (string, graph) were a cornerstone of the field.

Novelty & fault detection

OCSVM's home turf: machine condition monitoring, sensor fault detection in automotive and aerospace, early-warning signals in industrial processes. Train on healthy operating points, flag deviations. Particularly useful when faults are rare or never labeled.

Intrusion detection

OCSVM on normal network traffic patterns, flagging anomalous sessions or packet sequences. Same logic as fault detection, different domain vocabulary.

Image classification baselines

Before deep learning, SVMs on hand-crafted image features (HOG, SIFT) were state-of-the-art. Today they're still a reasonable baseline and useful in transfer-learning pipelines where you train a classifier on top of frozen pretrained embeddings.

Calibrated decisions with few examples

When you have tens to hundreds of examples per class, not thousands, SVM's strong regularization gives better generalization than most alternatives. A good default when deep learning's appetite for data can't be satisfied.

Quality control & screening

Production-line pass/fail screening where labeled failures are scarce. OCSVM trained on "known-good" measurements flags any unit whose multivariate signature drifts, without needing an enumerated list of what can go wrong.

Practical notes from the trenches

  1. Standardize your features, always. SVMs are scale-sensitive because the margin is measured in the feature space. A feature with range 1–10,000 will dominate one ranging 0–1 regardless of which is actually informative. Z-score or min-max normalize every feature before fitting, and apply the same scaling to test data.
  2. Grid search C and γ jointly, in log space. Try C ∈ {10⁻³, 10⁻², ..., 10³} and γ ∈ {10⁻³, 10⁻², ..., 10³}. The best pair is rarely at the default C = 1, γ = 1/n_features. Cross-validate — don't pick by training accuracy.
  3. RBF is almost always the first kernel to try. Polynomial kernels can overfit dramatically at higher degrees. Linear is worth trying when you have many more features than samples. Sigmoid kernels are mostly a historical curiosity. Custom kernels are worth the trouble only for structured data (text, graphs, sequences).
  4. SVM scales badly to large datasets. Training is roughly O(n²) to O(n³) in the number of samples. Beyond ~50,000 points, switch to linear SVM (LibLinear) with stochastic gradient descent, or consider approximate methods like Nyström or random Fourier features for RBF kernels.
  5. Don't interpret raw SVM scores as probabilities. The decision value f(x) = w·x + b is an unnormalized margin-distance, not a calibrated probability. If you need probabilities, use Platt scaling (fit a sigmoid to the scores on held-out data) or isotonic regression.
  6. Class imbalance needs explicit handling. Standard SVM treats every example equally and will happily achieve 99% accuracy by predicting "majority class always." Use the class_weight='balanced' option or set per-class C values inversely proportional to class frequencies.
  7. For OCSVM, ν is interpretable — use it. If your domain knowledge says 5% of production units are defective, set ν = 0.05 and you've encoded that prior directly. This is unusual and valuable; most anomaly detectors don't give you such a clean hyperparameter.
  8. OCSVM is sensitive to γ more than ν. With RBF kernel, ν is relatively robust (it just shifts the boundary). γ controls the fundamental geometry of what's "near" versus "far" in feature space, and wrong γ can give absurd results (everything is an anomaly, or nothing is). Tune γ first, ν second.
  9. Check support vector counts as a diagnostic. If 90%+ of your training points are support vectors, the model is near-memorizing — your C is too small or γ too small (too smooth). If fewer than ~10%, the model is very confident and possibly overfit — your C is too large.
  10. Use Platt's SMO algorithm or its successors for training. Any serious SVM library (libsvm, scikit-learn, thundersvm) implements Sequential Minimal Optimization under the hood. You almost never need to implement SVM from scratch in production — but doing it once as an exercise clarifies a lot about why the method behaves the way it does.

References & further reading

  1. Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273–297. The founding paper; introduces soft-margin SVM.
  2. Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. COLT. The kernel trick, formalized.
  3. Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7), 1443–1471. The one-class SVM paper.
  4. Tax, D. M. J., & Duin, R. P. W. (2004). Support Vector Data Description. Machine Learning, 54, 45–66. An alternative one-class formulation using enclosing spheres rather than hyperplanes.
  5. Platt, J. C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research Technical Report MSR-TR-98-14. The SMO algorithm that every modern solver is built on.
  6. Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121–167. The introductory tutorial that taught a generation.
  7. Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. The comprehensive textbook on kernel methods.
  8. Vapnik, V. N. (1998). Statistical Learning Theory. Wiley. Vapnik's own account, heavy on theory; the source of the margin-generalization bounds.
  9. Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST. The reference implementation used under the hood of scikit-learn and countless others.
  10. Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR. See sklearn.svm.SVC, LinearSVC, and OneClassSVM.