<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://majid-mazouchi.github.io/autonomy/feed.xml" rel="self" type="application/atom+xml" /><link href="https://majid-mazouchi.github.io/autonomy/" rel="alternate" type="text/html" /><updated>2026-04-19T17:55:57-04:00</updated><id>https://majid-mazouchi.github.io/autonomy/feed.xml</id><title type="html">Autonomy</title><subtitle>Personal engineering notes on control, motor control, machine learning, neural networks, and reinforcement learning — the stack under autonomous systems.</subtitle><author><name>Majid Mazouchi</name></author><entry><title type="html">Bayesian Optimization — an Interactive Explainer</title><link href="https://majid-mazouchi.github.io/autonomy/posts/bayesian-optimization-interactive-explainer/" rel="alternate" type="text/html" title="Bayesian Optimization — an Interactive Explainer" /><published>2026-04-19T14:00:00-04:00</published><updated>2026-04-19T14:00:00-04:00</updated><id>https://majid-mazouchi.github.io/autonomy/posts/bayesian-optimization-interactive-explainer</id><content type="html" xml:base="https://majid-mazouchi.github.io/autonomy/posts/bayesian-optimization-interactive-explainer/"><![CDATA[<p>Some functions are cheap to evaluate. Those are the easy ones. The interesting problems live on the other side — training a neural network, running a wind-tunnel experiment, tuning a motor controller on a dyno — where a single evaluation takes hours, costs money, and gives you one noisy number back. Classical optimization assumes you can evaluate $f(x)$ millions of times, or that you have gradients. Neither assumption holds here. You need a method that is <em>sample-efficient</em> — one that extracts as much information as possible from every measurement and uses that information to decide where to sample next. 
That’s what Bayesian optimization does, and the demo below shows it running on a benchmark designed to trip it up.</p>

<h2 id="two-ingredients-a-belief-and-a-strategy">Two ingredients: a belief and a strategy</h2>

<p>Every Bayesian-optimization algorithm is built from two interchangeable pieces: a <strong>surrogate model</strong> that represents our current belief about $f$, and an <strong>acquisition function</strong> that scores each candidate point by how useful evaluating it would be.</p>

<p><strong>01 · Surrogate.</strong> A cheap statistical model fit to whatever observations we have so far. Because data is scarce, the model must express <em>uncertainty</em> — a prediction alone isn’t enough; we also need to know where the model is confident and where it’s guessing. By far the most common choice is a <a href="/autonomy/posts/gaussian-processes-interactive-explainer/">Gaussian Process</a>, which gives a full posterior distribution over functions rather than a point estimate.</p>

<p><strong>02 · Acquisition.</strong> A rule for choosing the next sample. The surrogate tells us, at each point, what we <em>expect</em> to see and how <em>uncertain</em> we are. The acquisition function combines these into a single score we can maximize cheaply. Good acquisition functions balance <strong>exploitation</strong> (sampling where the surrogate predicts good values) and <strong>exploration</strong> (sampling where the surrogate is uncertain).</p>

<blockquote>
  <p>Bayesian optimization treats the choice of <em>where to sample next</em> as itself an optimization problem — one we can solve cheaply, since the acquisition function lives on the surrogate, not on the real objective.</p>
</blockquote>

<h2 id="interactive-demo--see-it-run-step-by-step">Interactive demo — see it run, step by step</h2>

<p>Below is a working Bayesian-optimization loop. The objective is the <strong>Forrester function</strong> $f(x) = (6x-2)^2 \sin(12x - 4)$, a standard BO benchmark with a deceptive local minimum near $x = 0.15$ and a global minimum near $x = 0.76$. Pretend you don’t know its shape.</p>
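<p>For reference, the objective itself is a one-liner. A Python sketch (the demo's own implementation is JavaScript; <code>forrester</code> is just an illustrative name):</p>

```python
import math

def forrester(x: float) -> float:
    """Forrester et al. (2008) benchmark: f(x) = (6x - 2)^2 * sin(12x - 4)."""
    return (6 * x - 2) ** 2 * math.sin(12 * x - 4)

# Shallow local minimum near x = 0.15; global minimum near x = 0.7572
print(forrester(0.7572))   # close to -6.021
```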

<p>The <strong>shaded band</strong> is the GP posterior (solid line = mean, band = ±2σ). <strong>Gold dots</strong> are observations. The <strong>orange marker</strong> shows where the acquisition function is maximized — that’s the next candidate. Click <em>Next iteration</em> to evaluate at the suggested point and update the model. Or click anywhere on the upper plot to sample there manually.</p>

<div class="bo-widget">

  <div class="bo-plot">
    <div class="bo-plot-head">
      <div class="bo-plot-label">Objective &amp; GP posterior</div>
      <div class="bo-legend">
        <span class="bo-leg"><span class="sw line-mean"></span>GP mean</span>
        <span class="bo-leg"><span class="sw fill-band"></span>±2σ</span>
        <span class="bo-leg"><span class="sw dot-obs"></span>observed</span>
        <span class="bo-leg"><span class="sw line-true"></span>true f(x)</span>
        <span class="bo-leg"><span class="sw dot-next"></span>next</span>
      </div>
    </div>
    <div class="bo-canvas-outer">
      <canvas id="objCanvas" class="bo-obj"></canvas>
      <div class="bo-tooltip" id="objTooltip"></div>
    </div>
  </div>

  <div class="bo-plot">
    <div class="bo-plot-head">
      <div class="bo-plot-label">Acquisition function &mdash; <span id="acqName">Expected Improvement</span></div>
      <div class="bo-legend">
        <span class="bo-leg"><span class="sw line-acq"></span>α(x)</span>
        <span class="bo-leg"><span class="sw dot-next"></span>argmax</span>
      </div>
    </div>
    <div class="bo-canvas-outer">
      <canvas id="acqCanvas" class="bo-acq"></canvas>
    </div>
  </div>

  <div class="bo-controls">
    <div class="bo-panel">
      <div class="bo-panel-label">Acquisition</div>
      <div class="bo-btn-group" id="acqGroup">
        <button type="button" data-acq="EI" class="active">EI</button>
        <button type="button" data-acq="UCB">UCB</button>
        <button type="button" data-acq="PI">PI</button>
      </div>
      <div class="bo-slider">
        <div class="bo-slider-label"><span>GP lengthscale ℓ</span><span class="val" id="lsVal">0.10</span></div>
        <input type="range" id="lsSlider" min="0.02" max="0.40" step="0.005" value="0.10" />
      </div>
    </div>

    <div class="bo-panel">
      <div class="bo-panel-label">Parameters</div>
      <div class="bo-slider">
        <div class="bo-slider-label"><span>UCB exploration κ</span><span class="val" id="kappaVal">2.0</span></div>
        <input type="range" id="kappaSlider" min="0.1" max="5.0" step="0.1" value="2.0" />
      </div>
      <div class="bo-slider">
        <div class="bo-slider-label"><span>EI / PI margin ξ</span><span class="val" id="xiVal">0.01</span></div>
        <input type="range" id="xiSlider" min="0" max="0.5" step="0.01" value="0.01" />
      </div>
      <div class="bo-toggle-row">
        <span>Show true f(x)</span>
        <div class="bo-toggle on" id="trueToggle" role="button" aria-pressed="true" tabindex="0"></div>
      </div>
    </div>

    <div class="bo-panel">
      <div class="bo-panel-label">Actions</div>
      <div class="bo-action-row">
        <button type="button" class="bo-btn-primary" id="stepBtn">Next iteration →</button>
      </div>
      <div class="bo-action-row">
        <button type="button" class="bo-btn" id="auto10Btn">Run 10 steps</button>
        <button type="button" class="bo-btn" id="resetBtn">Reset</button>
      </div>
    </div>
  </div>

  <div class="bo-status">
    <div class="bo-stat"><span class="k">Iteration</span><span class="v" id="iterStat">0</span></div>
    <div class="bo-stat"><span class="k">Observations</span><span class="v" id="nObsStat">3</span></div>
    <div class="bo-stat"><span class="k">Best f*</span><span class="v best" id="bestStat">—</span></div>
    <div class="bo-stat"><span class="k">at x*</span><span class="v" id="bestXStat">—</span></div>
    <div class="bo-stat"><span class="k">Next candidate</span><span class="v next" id="nextStat">—</span></div>
    <div class="bo-stat"><span class="k">Global optimum</span><span class="v muted">−6.021 @ 0.7572</span></div>
  </div>

</div>

<h2 id="gaussian-processes-briefly">Gaussian processes, briefly</h2>

<p>A Gaussian Process is a distribution over functions — any finite collection of function values is jointly Gaussian, fully specified by a mean function $m(x)$ (usually zero) and a covariance kernel $k(x, x')$.</p>

<p>The kernel encodes our prior about smoothness. The squared-exponential (RBF) kernel, used in the demo above, says that nearby inputs have highly correlated outputs and that correlation decays with distance on a scale set by the lengthscale $\ell$:</p>

\[k(x, x') = \sigma_f^2 \,\exp\!\left(-\frac{\lVert x - x'\rVert^2}{2\ell^2}\right)\]

<p>Given observations $\mathbf{y} = [y_1, \ldots, y_n]^\top$ at inputs $X = \{x_1, \ldots, x_n\}$ with noise variance $\sigma_n^2$, the posterior at any test point $x_*$ is Gaussian with mean and variance:</p>

\[\mu(x_*) = \mathbf{k}_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}\]

\[\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_*\]

<p>Here $K_{ij} = k(x_i, x_j)$ and $(\mathbf{k}_*)_i = k(x_i, x_*)$. Two properties make this an ideal BO surrogate: the posterior mean <strong>interpolates</strong> the noise-free observations, and the posterior variance <strong>collapses to zero</strong> at observed points and grows smoothly away from them. That’s exactly the signal an acquisition function needs. In practice you solve the linear system via Cholesky decomposition in $\mathcal{O}(n^3)$ — perfectly fine for BO, where $n$ rarely exceeds a few hundred.</p>
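<p>The two formulas translate almost line-for-line into code. Below is a minimal pure-Python sketch of the Cholesky-based posterior; <code>rbf</code>, <code>gp_posterior</code>, and the dense triangular solves are illustrative names, not from any library:</p>

```python
import math

def rbf(x1, x2, ls=0.1, sf2=1.0):
    """Squared-exponential kernel k(x, x') = sf2 * exp(-(x - x')^2 / (2 ls^2))."""
    return sf2 * math.exp(-0.5 * (x1 - x2) ** 2 / ls ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a small dense SPD matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then back substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_star, ls=0.1, sf2=1.0, sn2=1e-6):
    """Posterior (mean, std) at x_star given observations (X, y)."""
    n = len(X)
    K = [[rbf(X[i], X[j], ls, sf2) + (sn2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_chol(L, y)                        # (K + sn2 I)^-1 y
    k_star = [rbf(xi, x_star, ls, sf2) for xi in X]
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    v = [0.0] * n                                   # v = L^-1 k_star
    for i in range(n):
        v[i] = (k_star[i] - sum(L[i][k] * v[k] for k in range(i))) / L[i][i]
    var = max(sf2 - sum(vi * vi for vi in v), 1e-12)
    return mean, math.sqrt(var)
```

<p>At an observed point the mean reproduces the observation and the standard deviation collapses toward the noise floor; far from the data the mean decays back to the prior mean and the standard deviation back to $\sigma_f$.</p>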

<h2 id="three-ways-to-score-a-candidate">Three ways to score a candidate</h2>

<p>Each acquisition function takes the GP posterior $\mathcal{N}(\mu(x), \sigma^2(x))$ and collapses it into a single scalar. What they differ on is <em>how</em> they weigh the two knobs — expected value and uncertainty.</p>

<h3 id="expected-improvement-ei">Expected Improvement (EI)</h3>

<p>For minimization with current best $f^*$, define the improvement at $x$ as $I(x) = \max(0, f^* - f(x) - \xi)$, where $\xi \geq 0$ encourages more exploration. EI is its expected value under the GP posterior:</p>

\[\alpha_{\text{EI}}(x) = (f^* - \mu(x) - \xi)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f^* - \mu(x) - \xi}{\sigma(x)}\]

<p>EI is self-calibrating: when $\sigma \to 0$ it reduces to pure exploitation; where $\sigma$ is large and $\mu$ is not too terrible, it favors exploration. This is why it’s the default choice in most BO packages.</p>
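<p>A direct transcription of the formula in Python, using <code>math.erf</code> for the normal CDF (the helper names are illustrative):</p>

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, f_star, xi=0.01):
    """EI for minimization: E[max(0, f* - f(x) - xi)] under N(mu, sigma^2)."""
    if sigma < 1e-12:
        return 0.0                          # no uncertainty, no expected gain
    z = (f_star - mu - xi) / sigma
    return (f_star - mu - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

<p>Note the self-calibration in numbers: a candidate predicted exactly at $f^*$ (with $\xi=0$) still scores $\sigma\,\phi(0) \approx 0.4\,\sigma$, so uncertain points compete with confidently good ones.</p>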

<h3 id="lower-confidence-bound-ucb">Lower Confidence Bound (UCB)</h3>

<p>The simplest possible acquisition. Pick the point with the most optimistic (lowest) confidence bound on $f$: a linear combination of the posterior mean and a scaled standard deviation. For minimization this is really a <em>lower</em> confidence bound; the UCB name carries over from the maximization convention:</p>

\[\alpha_{\text{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)\]

<p>The exploration weight $\kappa$ is the tuning knob. $\kappa = 0$ is pure greedy exploitation; large $\kappa$ becomes uncertainty-sampling. Srinivas et al. give a theoretical schedule for $\kappa$ that yields no-regret bounds.</p>
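<p>The trade-off is easy to see in a tiny sketch: two hypothetical candidates, one confident and one uncertain, switch rank as $\kappa$ grows.</p>

```python
def lcb(mu, sigma, kappa=2.0):
    """Lower confidence bound for minimization; pick the candidate minimizing this."""
    return mu - kappa * sigma

# Two hypothetical candidates: A is confident and decent, B uncertain but promising.
mu_a, sig_a = -1.0, 0.1
mu_b, sig_b = -0.5, 1.0
print(lcb(mu_a, sig_a, kappa=0.5), lcb(mu_b, sig_b, kappa=0.5))  # -1.05 -1.0: A wins (greedy)
print(lcb(mu_a, sig_a, kappa=2.0), lcb(mu_b, sig_b, kappa=2.0))  # -1.2 -2.5: B wins (explore)
```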

<h3 id="probability-of-improvement-pi">Probability of Improvement (PI)</h3>

<p>Just the probability that a sample at $x$ beats the current best by at least $\xi$:</p>

\[\alpha_{\text{PI}}(x) = \Phi\!\left(\frac{f^* - \mu(x) - \xi}{\sigma(x)}\right)\]

<p>Historically the first acquisition function, but known to under-explore: it happily picks points offering a microscopic improvement as long as that improvement is near-certain, because it scores only the <em>probability</em> of improving, not the amount. The margin $\xi$ partially compensates. In practice EI is almost always preferred.</p>
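<p>That failure mode is easy to reproduce numerically — a near-certain microscopic gain against a risky large one (<code>prob_improvement</code> is an illustrative name):</p>

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def prob_improvement(mu, sigma, f_star, xi=0.0):
    """P[f(x) < f* - xi] under the posterior N(mu, sigma^2)."""
    if sigma < 1e-12:
        return 0.0
    return norm_cdf((f_star - mu - xi) / sigma)

f_star = 0.0
pi_a = prob_improvement(-0.01, 0.001, f_star)  # near-certain gain of ~0.01
pi_b = prob_improvement(0.5, 1.0, f_star)      # 30-ish% chance of a large gain
print(pi_a, pi_b)  # PI strongly prefers A despite the negligible payoff
```

<p>EI ranks these two the other way around, because B's tail reaches far below $f^*$. Setting a margin $\xi = 0.05$ collapses A's score to essentially zero, which is exactly the compensation mentioned above.</p>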

<h2 id="the-whole-algorithm-in-twelve-lines">The whole algorithm, in twelve lines</h2>

<p>Everything above combines into a single loop. The outer loop queries the expensive objective; the inner optimization of the acquisition function is cheap, because it operates on the surrogate, not on $f$ itself.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Given: objective f, domain 𝒳, budget T, acquisition α

initialize D ← { (xᵢ, f(xᵢ)) } for i = 1..n₀     # e.g. Latin hypercube or random

for t = 1, 2, ..., T:
    # 1. fit / update surrogate
    GP ← fit_gp(D)

    # 2. maximize acquisition — cheap, uses only the surrogate
    x_next ← argmax over x ∈ 𝒳  of  α(x | GP, f*_D)

    # 3. query expensive objective at the chosen point
    y_next ← f(x_next)

    # 4. augment dataset
    D ← D ∪ { (x_next, y_next) }

return argmin over (x, y) ∈ D of y
</code></pre></div></div>
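<p>The same loop as a self-contained Python sketch on the Forrester benchmark: a fixed-hyperparameter RBF GP, EI, and a dense-grid argmax. All names are illustrative, and a production run would reach for a library instead:</p>

```python
import math

def forrester(x):
    """Benchmark objective: f(x) = (6x - 2)^2 sin(12x - 4)."""
    return (6 * x - 2) ** 2 * math.sin(12 * x - 4)

def rbf(a, b, ls=0.1, sf2=25.0):
    """Squared-exponential kernel."""
    return sf2 * math.exp(-0.5 * (a - b) ** 2 / ls ** 2)

def cholesky(A):
    """Lower-triangular factor of a small dense SPD matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(A[i][i] - s, 1e-12)) if i == j else (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward then back substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, y, grid, ls=0.1, sf2=25.0, sn2=1e-8):
    """GP posterior (mean, std) at each grid point."""
    n = len(X)
    K = [[rbf(X[i], X[j], ls, sf2) + (sn2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = chol_solve(L, y)
    preds = []
    for xs in grid:
        ks = [rbf(xi, xs, ls, sf2) for xi in X]
        mean = sum(k * a for k, a in zip(ks, alpha))
        v = [0.0] * n                     # v = L^-1 k_star, so var = k** - v^T v
        for i in range(n):
            v[i] = (ks[i] - sum(L[i][k] * v[k] for k in range(i))) / L[i][i]
        preds.append((mean, math.sqrt(max(sf2 - sum(w * w for w in v), 1e-12))))
    return preds

def ei(mu, sigma, f_star, xi=0.01):
    """Expected improvement for minimization."""
    if sigma < 1e-9:
        return 0.0
    z = (f_star - mu - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (f_star - mu - xi) * cdf + sigma * pdf

# The loop: 3 seed points, 15 BO iterations, 201-point candidate grid.
grid = [i / 200 for i in range(201)]
obs = [(x, forrester(x)) for x in (0.08, 0.45, 0.92)]
for t in range(15):
    X = [o[0] for o in obs]
    Y = [o[1] for o in obs]
    preds = gp_predict(X, Y, grid)
    f_star = min(Y)
    # argmax of the acquisition over the grid, skipping already-sampled points
    best_i = max((i for i in range(len(grid)) if grid[i] not in X),
                 key=lambda i: ei(preds[i][0], preds[i][1], f_star))
    obs.append((grid[best_i], forrester(grid[best_i])))   # the one expensive call
best_x, best_y = min(obs, key=lambda o: o[1])
print(best_x, best_y)   # approaches the global minimum near (0.757, -6.02)
```

<p>Note where the expense sits: one <code>forrester</code> call per iteration, versus hundreds of cheap GP and EI evaluations.</p>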

<p>Step 2 is the only subtle one. The acquisition function is cheap to evaluate but can be multi-modal, so practitioners use multi-start L-BFGS, DIRECT, or dense grid search on the surrogate. None of this touches the real objective.</p>
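<p>A common pattern for that inner maximization, sketched on a synthetic multimodal $\alpha$ (the real one would come from the surrogate): a coarse grid scan to find promising basins, then a cheap local polish of the best few starts.</p>

```python
import math

def alpha(x):
    """Synthetic multimodal stand-in for an acquisition surface on [0, 1]."""
    return math.sin(9 * x) * math.exp(-2 * (x - 0.6) ** 2)

# 1. Coarse scan: keep the top few grid points as restart locations.
grid = [i / 100 for i in range(101)]
starts = sorted(grid, key=alpha, reverse=True)[:5]

# 2. Polish each start with a simple derivative-free hill climb.
def polish(x0, step=1e-2, tol=1e-7):
    x = x0
    while step > tol:
        moved = False
        for cand in (x - step, x + step):
            if 0.0 <= cand <= 1.0 and alpha(cand) > alpha(x):
                x, moved = cand, True
        if not moved:
            step *= 0.5         # shrink the step once neither neighbor improves
    return x

x_next = max((polish(s) for s in starts), key=alpha)
```

<p>Every evaluation here is of $\alpha$, never of the real objective, which is why this inner search can afford hundreds of calls.</p>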

<h2 id="where-it-lives">Where it lives</h2>

<p>The common thread across BO applications is a function you can’t see inside: it returns a noisy scalar, and it costs real time or real money to query.</p>

<ul>
  <li><strong>Hyperparameter tuning.</strong> Training a deep network costs hours. BO finds strong configurations in 20–50 trials instead of thousands of random ones.</li>
  <li><strong>Experimental design.</strong> Materials discovery, chemistry, biology — any setting where each data point is a physical experiment.</li>
  <li><strong>Controller calibration.</strong> Tuning PID, MPC, or motor-control parameters against a high-fidelity simulator or dyno where each run is slow.</li>
  <li><strong>Engineering design.</strong> Airfoil shapes, antenna geometries, chip layouts — design variables evaluated by expensive CFD or EM solvers.</li>
  <li><strong>Robotics and policy search.</strong> Tuning gait parameters or policy coefficients on hardware, where every rollout risks wear or damage.</li>
  <li><strong>A/B testing at scale.</strong> Treating each experimental configuration as an expensive sample when user traffic or exposure is the bottleneck.</li>
</ul>

<p>Bayesian optimization is the default tool whenever that query cost dominates.</p>

<hr />

<p>The interactive demo above is implemented with vanilla JavaScript and the Canvas 2D API — the GP fit (RBF kernel, Cholesky solve), the three acquisition functions, and all plotting run in the browser with no external libraries. If you want to see how it’s stitched together, view source on the page.</p>

<!-- =====================================================
     Post-scoped styles for the Bayesian-optimization widget
     Reuses the blog's CSS variables (paper, ink, rule, accent)
     ===================================================== -->
<style>
  .article-body .bo-widget {
    /* Local color variables — JS reads these via getComputedStyle */
    --c-gp-mean:  var(--ink);
    --c-band:     rgba(185, 74, 27, 0.15);
    --c-true:     rgba(74, 70, 64, 0.55);
    --c-obs:      #a86a2a;     /* warm bronze for observation dots */
    --c-next:     var(--accent);
    --c-acq:      #4d6a8f;     /* muted steel blue for acquisition curve */
    --c-acq-fill: rgba(77, 106, 143, 0.14);
    --c-grid:     rgba(74, 70, 64, 0.07);
    --c-axis:     rgba(74, 70, 64, 0.25);
    --c-tick:     var(--ink-mute);

    max-width: var(--read);
    margin: 32px auto 36px;
    background: var(--paper-2);
    border: 1px solid var(--rule);
    border-radius: 4px;
    padding: 20px;
  }

  @media (prefers-color-scheme: dark) {
    .article-body .bo-widget {
      --c-band:     rgba(224, 122, 74, 0.18);
      --c-true:     rgba(184, 176, 160, 0.45);
      --c-obs:      #d4a048;
      --c-acq:      #7aa6c9;
      --c-acq-fill: rgba(122, 166, 201, 0.16);
      --c-grid:     rgba(184, 176, 160, 0.08);
      --c-axis:     rgba(184, 176, 160, 0.25);
    }
  }

  .article-body .bo-plot { margin: 0 0 14px; }
  .article-body .bo-plot-head {
    display: flex;
    justify-content: space-between;
    align-items: baseline;
    gap: 12px;
    flex-wrap: wrap;
    margin: 0 0 8px;
    padding: 0 2px;
  }
  .article-body .bo-plot-label {
    font-family: var(--f-ui);
    font-size: .74rem;
    letter-spacing: .1em;
    text-transform: uppercase;
    color: var(--ink-mute);
  }
  .article-body .bo-legend {
    display: flex;
    gap: 14px;
    flex-wrap: wrap;
    font-family: var(--f-ui);
    font-size: .66rem;
    letter-spacing: .08em;
    text-transform: uppercase;
    color: var(--ink-mute);
  }
  .article-body .bo-leg { display: inline-flex; align-items: center; gap: 5px; }
  .article-body .bo-legend .sw { display: inline-block; }
  .article-body .bo-legend .sw.line-mean { width: 16px; height: 2px; background: var(--c-gp-mean); }
  .article-body .bo-legend .sw.line-acq  { width: 16px; height: 2px; background: var(--c-acq); }
  .article-body .bo-legend .sw.line-true { width: 16px; height: 0; border-top: 1.5px dashed var(--c-true); }
  .article-body .bo-legend .sw.fill-band { width: 16px; height: 8px; background: var(--c-band); border-radius: 1px; }
  .article-body .bo-legend .sw.dot-obs   { width: 7px; height: 7px; border-radius: 50%; background: var(--c-obs); }
  .article-body .bo-legend .sw.dot-next  { width: 7px; height: 7px; border-radius: 50%; background: var(--c-next); }

  .article-body .bo-canvas-outer {
    position: relative;
    background: var(--paper);
    border: 1px solid var(--rule);
    border-radius: 2px;
  }
  .article-body .bo-obj { display: block; width: 100%; height: 340px; cursor: crosshair; }
  .article-body .bo-acq { display: block; width: 100%; height: 150px; }

  .article-body .bo-tooltip {
    position: absolute;
    pointer-events: none;
    background: var(--ink);
    color: var(--paper);
    font-family: var(--f-ui);
    font-size: .7rem;
    letter-spacing: .04em;
    padding: 5px 9px;
    border-radius: 2px;
    opacity: 0;
    transition: opacity .12s ease;
    white-space: nowrap;
    transform: translate(0, -100%);
    z-index: 2;
  }
  .article-body .bo-tooltip.visible { opacity: 0.95; }

  .article-body .bo-controls {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
    gap: 14px;
    margin: 18px 0 0;
  }
  .article-body .bo-panel {
    background: var(--paper);
    border: 1px solid var(--rule);
    border-radius: 2px;
    padding: 12px 14px;
  }
  .article-body .bo-panel-label {
    font-family: var(--f-ui);
    font-size: .68rem;
    letter-spacing: .14em;
    text-transform: uppercase;
    color: var(--ink-mute);
    margin: 0 0 10px;
  }

  .article-body .bo-btn-group {
    display: flex;
    gap: 4px;
    border: 1px solid var(--rule);
    border-radius: 2px;
    padding: 2px;
    margin: 0 0 12px;
  }
  .article-body .bo-btn-group button {
    flex: 1;
    font-family: var(--f-ui);
    font-size: .74rem;
    font-weight: 400;
    letter-spacing: .08em;
    text-transform: uppercase;
    background: transparent;
    color: var(--ink-soft);
    border: none;
    padding: 7px 4px;
    cursor: pointer;
    border-radius: 1px;
    transition: background .15s ease, color .15s ease;
  }
  .article-body .bo-btn-group button:hover { color: var(--ink); background: var(--paper-2); }
  .article-body .bo-btn-group button.active {
    background: var(--accent);
    color: var(--paper);
  }

  .article-body .bo-slider { margin: 10px 0; }
  .article-body .bo-slider-label {
    display: flex;
    justify-content: space-between;
    align-items: baseline;
    font-family: var(--f-ui);
    font-size: .72rem;
    letter-spacing: .03em;
    color: var(--ink-soft);
    margin: 0 0 6px;
  }
  .article-body .bo-slider-label .val {
    color: var(--ink);
    font-variant-numeric: tabular-nums;
    font-weight: 500;
  }
  .article-body .bo-widget input[type="range"] {
    -webkit-appearance: none;
    appearance: none;
    width: 100%;
    height: 3px;
    background: var(--rule);
    border-radius: 2px;
    outline: none;
    margin: 4px 0;
  }
  .article-body .bo-widget input[type="range"]::-webkit-slider-thumb {
    -webkit-appearance: none;
    appearance: none;
    width: 15px; height: 15px;
    background: var(--accent);
    border-radius: 50%;
    cursor: pointer;
    border: 2px solid var(--paper);
    box-shadow: 0 0 0 1px var(--accent);
  }
  .article-body .bo-widget input[type="range"]::-moz-range-thumb {
    width: 13px; height: 13px;
    background: var(--accent);
    border-radius: 50%;
    cursor: pointer;
    border: 2px solid var(--paper);
    box-shadow: 0 0 0 1px var(--accent);
  }

  .article-body .bo-toggle-row {
    display: flex;
    justify-content: space-between;
    align-items: center;
    font-family: var(--f-ui);
    font-size: .74rem;
    color: var(--ink-soft);
    margin: 14px 0 2px;
  }
  .article-body .bo-toggle {
    width: 30px;
    height: 16px;
    background: var(--rule);
    border-radius: 8px;
    position: relative;
    cursor: pointer;
    transition: background .2s ease;
  }
  .article-body .bo-toggle::after {
    content: "";
    position: absolute;
    top: 2px; left: 2px;
    width: 12px; height: 12px;
    background: var(--paper);
    border-radius: 50%;
    transition: transform .2s ease;
    box-shadow: 0 1px 2px rgba(0,0,0,0.2);
  }
  .article-body .bo-toggle.on { background: var(--accent); }
  .article-body .bo-toggle.on::after { transform: translateX(14px); }

  .article-body .bo-action-row {
    display: flex;
    gap: 8px;
    margin-bottom: 8px;
  }
  .article-body .bo-action-row:last-child { margin-bottom: 0; }
  .article-body .bo-btn,
  .article-body .bo-btn-primary {
    flex: 1;
    font-family: var(--f-ui);
    font-size: .74rem;
    letter-spacing: .06em;
    text-transform: uppercase;
    border: 1px solid var(--rule);
    padding: 9px 10px;
    cursor: pointer;
    border-radius: 2px;
    transition: background .15s ease, color .15s ease, border-color .15s ease;
  }
  .article-body .bo-btn {
    background: transparent;
    color: var(--ink);
  }
  .article-body .bo-btn:hover {
    border-color: var(--ink-mute);
    background: var(--paper-2);
  }
  .article-body .bo-btn-primary {
    background: var(--accent);
    color: var(--paper);
    border-color: var(--accent);
    font-weight: 500;
  }
  .article-body .bo-btn-primary:hover { filter: brightness(0.93); }

  .article-body .bo-status {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(110px, 1fr));
    gap: 10px;
    margin: 14px 0 0;
    padding: 14px 16px;
    background: var(--paper);
    border: 1px solid var(--rule);
    border-radius: 2px;
  }
  .article-body .bo-stat {
    display: flex;
    flex-direction: column;
    gap: 3px;
  }
  .article-body .bo-stat .k {
    font-family: var(--f-ui);
    font-size: .62rem;
    letter-spacing: .14em;
    text-transform: uppercase;
    color: var(--ink-mute);
  }
  .article-body .bo-stat .v {
    font-family: var(--f-ui);
    font-size: .95rem;
    font-weight: 500;
    font-variant-numeric: tabular-nums;
    color: var(--ink);
  }
  .article-body .bo-stat .v.best { color: var(--c-gp-mean); }
  .article-body .bo-stat .v.next { color: var(--c-next); }
  .article-body .bo-stat .v.muted { color: var(--ink-mute); font-weight: 400; }
</style>

<script>
(function() {
  var objCanvas = document.getElementById('objCanvas');
  var acqCanvas = document.getElementById('acqCanvas');
  if (!objCanvas || !acqCanvas) return;
  var objCtx = objCanvas.getContext('2d');
  var acqCtx = acqCanvas.getContext('2d');

  // Pull a CSS variable from the widget scope so colors live in the stylesheet
  var widgetEl = objCanvas.closest('.bo-widget');
  function cv(name, fallback) {
    var v = getComputedStyle(widgetEl).getPropertyValue(name).trim();
    return v || fallback;
  }

  // ---------- Objective: Forrester (2008) ----------
  function forrester(x) {
    var a = 6 * x - 2;
    return a * a * Math.sin(12 * x - 4);
  }
  var X_MIN = 0, X_MAX = 1;
  var Y_MIN = -9, Y_MAX = 18;

  // ---------- Linear algebra ----------
  function choleskyDecompose(A) {
    var n = A.length;
    var L = [];
    for (var i = 0; i < n; i++) L.push(new Float64Array(n));
    for (var i2 = 0; i2 < n; i2++) {
      for (var j = 0; j <= i2; j++) {
        var sum = 0;
        for (var k = 0; k < j; k++) sum += L[i2][k] * L[j][k];
        if (i2 === j) L[i2][j] = Math.sqrt(Math.max(A[i2][i2] - sum, 1e-12));
        else L[i2][j] = (A[i2][j] - sum) / L[j][j];
      }
    }
    return L;
  }
  function forwardSub(L, b) {
    var n = L.length;
    var y = new Float64Array(n);
    for (var i = 0; i < n; i++) {
      var s = b[i];
      for (var k = 0; k < i; k++) s -= L[i][k] * y[k];
      y[i] = s / L[i][i];
    }
    return y;
  }
  function backSubLT(L, y) {
    var n = L.length;
    var x = new Float64Array(n);
    for (var i = n - 1; i >= 0; i--) {
      var s = y[i];
      for (var k = i + 1; k < n; k++) s -= L[k][i] * x[k];
      x[i] = s / L[i][i];
    }
    return x;
  }

  function rbfKernel(x1, x2, ls, sf2) {
    var d = x1 - x2;
    return sf2 * Math.exp(-0.5 * d * d / (ls * ls));
  }

  function gpFit(X, y, ls, sf2, sn2) {
    var n = X.length;
    var K = [];
    for (var i = 0; i < n; i++) K.push(new Float64Array(n));
    for (var i2 = 0; i2 < n; i2++) {
      for (var j = 0; j < n; j++) K[i2][j] = rbfKernel(X[i2], X[j], ls, sf2);
      K[i2][i2] += sn2;
    }
    var L = choleskyDecompose(K);
    var alpha = backSubLT(L, forwardSub(L, y));
    return { L: L, alpha: alpha, X: X, ls: ls, sf2: sf2 };
  }

  function gpPredict(model, Xtest) {
    var L = model.L, alpha = model.alpha, X = model.X, ls = model.ls, sf2 = model.sf2;
    var n = X.length;
    var out = [];
    for (var t = 0; t < Xtest.length; t++) {
      var xs = Xtest[t];
      var kstar = new Float64Array(n);
      for (var i = 0; i < n; i++) kstar[i] = rbfKernel(X[i], xs, ls, sf2);
      var mean = 0;
      for (var i2 = 0; i2 < n; i2++) mean += kstar[i2] * alpha[i2];
      var v = forwardSub(L, kstar);
      var vtv = 0;
      for (var i3 = 0; i3 < n; i3++) vtv += v[i3] * v[i3];
      var variance = Math.max(sf2 - vtv, 1e-10);
      out.push({ mean: mean, std: Math.sqrt(variance) });
    }
    return out;
  }

  // Standard normal pdf / cdf (Abramowitz & Stegun 7.1.26)
  function phi(z) { return Math.exp(-0.5 * z * z) / Math.sqrt(2 * Math.PI); }
  function Phi(z) {
    var a1 = 0.254829592, a2 = -0.284496736, a3 = 1.421413741;
    var a4 = -1.453152027, a5 = 1.061405429, p = 0.3275911;
    var sign = z < 0 ? -1 : 1;
    var x = Math.abs(z) / Math.sqrt(2);
    var t = 1.0 / (1.0 + p * x);
    var y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);
    return 0.5 * (1.0 + sign * y);
  }

  function acqEI(mean, std, fStar, xi) {
    if (std < 1e-9) return 0;
    var imp = fStar - mean - xi;
    var z = imp / std;
    return imp * Phi(z) + std * phi(z);
  }
  function acqUCB(mean, std, kappa) {
    // For minimization: LCB = μ - κσ; return -LCB so we can argmax
    return kappa * std - mean;
  }
  function acqPI(mean, std, fStar, xi) {
    if (std < 1e-9) return 0;
    return Phi((fStar - mean - xi) / std);
  }

  // ---------- State ----------
  var STATE = {
    obs: [],
    ls: 0.10,
    sf2: 9.0,
    sn2: 1e-4,
    acq: 'EI',
    kappa: 2.0,
    xi: 0.01,
    showTrue: true,
    iter: 0,
    nextX: null,
    grid: null,
    gpPred: null,
    acqVals: null
  };
  var GRID_N = 300;

  function buildGrid() {
    var g = new Float64Array(GRID_N);
    for (var i = 0; i < GRID_N; i++) g[i] = X_MIN + (X_MAX - X_MIN) * i / (GRID_N - 1);
    STATE.grid = g;
  }
  buildGrid();

  function seedInitial() {
    STATE.obs = [];
    var seeds = [0.08, 0.45, 0.92];
    for (var i = 0; i < seeds.length; i++) {
      var xx = seeds[i];
      STATE.obs.push({ x: xx, y: forrester(xx) });
    }
    STATE.iter = 0;
  }

  function currentBestFstar() {
    var fmin = Infinity, xmin = null;
    for (var i = 0; i < STATE.obs.length; i++) {
      var o = STATE.obs[i];
      if (o.y < fmin) { fmin = o.y; xmin = o.x; }
    }
    return { fmin: fmin, xmin: xmin };
  }

  function refit() {
    var X = STATE.obs.map(function(o) { return o.x; });
    var y = STATE.obs.map(function(o) { return o.y; });
    var model = gpFit(X, y, STATE.ls, STATE.sf2, STATE.sn2);
    var preds = gpPredict(model, STATE.grid);
    STATE.gpPred = preds;

    var fmin = currentBestFstar().fmin;
    var acq = new Float64Array(GRID_N);
    for (var i = 0; i < GRID_N; i++) {
      var m = preds[i].mean, s = preds[i].std;
      var a;
      if (STATE.acq === 'EI') a = acqEI(m, s, fmin, STATE.xi);
      else if (STATE.acq === 'UCB') a = acqUCB(m, s, STATE.kappa);
      else a = acqPI(m, s, fmin, STATE.xi);
      acq[i] = a;
    }
    STATE.acqVals = acq;

    var best = -Infinity, bi = 0;
    for (var j = 0; j < GRID_N; j++) {
      if (acq[j] > best) { best = acq[j]; bi = j; }
    }
    STATE.nextX = STATE.grid[bi];
  }

  // ---------- Canvas + transforms ----------
  var PAD_L = 48, PAD_R = 18, PAD_T = 12, PAD_B = 28;
  function resizeCanvases() {
    var dpr = window.devicePixelRatio || 1;
    [objCanvas, acqCanvas].forEach(function(cv) {
      var rect = cv.parentElement.getBoundingClientRect();
      var w = rect.width;
      var h = cv.clientHeight;
      cv.width = Math.floor(w * dpr);
      cv.height = Math.floor(h * dpr);
      cv.getContext('2d').setTransform(dpr, 0, 0, dpr, 0, 0);
    });
    draw();
  }
  function xToPx(x, w) { return PAD_L + (x - X_MIN) / (X_MAX - X_MIN) * (w - PAD_L - PAD_R); }
  function pxToX(px, w) { return X_MIN + (px - PAD_L) / (w - PAD_L - PAD_R) * (X_MAX - X_MIN); }
  function yToPxObj(y, h) { return PAD_T + (1 - (y - Y_MIN) / (Y_MAX - Y_MIN)) * (h - PAD_T - PAD_B); }

  function drawAxes(ctx, w, h, yMin, yMax) {
    var grid = cv('--c-grid', 'rgba(0,0,0,0.06)');
    var axis = cv('--c-axis', 'rgba(0,0,0,0.2)');
    var tick = cv('--c-tick', '#8a8277');

    ctx.save();
    ctx.strokeStyle = grid;
    ctx.lineWidth = 1;
    for (var t = 0; t <= 5; t++) {
      var xv = X_MIN + t / 5 * (X_MAX - X_MIN);
      var px = xToPx(xv, w);
      ctx.beginPath(); ctx.moveTo(px, PAD_T); ctx.lineTo(px, h - PAD_B); ctx.stroke();
    }
    for (var t2 = 0; t2 <= 5; t2++) {
      var yv = yMin + t2 / 5 * (yMax - yMin);
      var py = PAD_T + (1 - (yv - yMin) / (yMax - yMin)) * (h - PAD_T - PAD_B);
      ctx.beginPath(); ctx.moveTo(PAD_L, py); ctx.lineTo(w - PAD_R, py); ctx.stroke();
    }
    ctx.strokeStyle = axis;
    ctx.beginPath();
    ctx.moveTo(PAD_L, PAD_T);
    ctx.lineTo(PAD_L, h - PAD_B);
    ctx.lineTo(w - PAD_R, h - PAD_B);
    ctx.stroke();

    ctx.fillStyle = tick;
    ctx.font = '10px "DM Mono", ui-monospace, monospace';
    ctx.textAlign = 'center';
    ctx.textBaseline = 'top';
    for (var t3 = 0; t3 <= 5; t3++) {
      var xv2 = X_MIN + t3 / 5 * (X_MAX - X_MIN);
      ctx.fillText(xv2.toFixed(1), xToPx(xv2, w), h - PAD_B + 6);
    }
    ctx.textAlign = 'right';
    ctx.textBaseline = 'middle';
    for (var t4 = 0; t4 <= 5; t4++) {
      var yv2 = yMin + t4 / 5 * (yMax - yMin);
      var py2 = PAD_T + (1 - (yv2 - yMin) / (yMax - yMin)) * (h - PAD_T - PAD_B);
      ctx.fillText(yv2.toFixed(1), PAD_L - 6, py2);
    }
    ctx.restore();
  }

  function drawObjectivePlot() {
    var w = objCanvas.clientWidth;
    var h = objCanvas.clientHeight;
    objCtx.clearRect(0, 0, w, h);
    drawAxes(objCtx, w, h, Y_MIN, Y_MAX);
    if (!STATE.gpPred) return;

    var g = STATE.grid;
    var p = STATE.gpPred;

    // GP ±2σ band
    objCtx.save();
    objCtx.beginPath();
    for (var i = 0; i < g.length; i++) {
      var px = xToPx(g[i], w);
      var upper = p[i].mean + 2 * p[i].std;
      var py = yToPxObj(Math.min(Math.max(upper, Y_MIN), Y_MAX), h);
      if (i === 0) objCtx.moveTo(px, py); else objCtx.lineTo(px, py);
    }
    for (var i2 = g.length - 1; i2 >= 0; i2--) {
      var px2 = xToPx(g[i2], w);
      var lower = p[i2].mean - 2 * p[i2].std;
      var py2 = yToPxObj(Math.min(Math.max(lower, Y_MIN), Y_MAX), h);
      objCtx.lineTo(px2, py2);
    }
    objCtx.closePath();
    objCtx.fillStyle = cv('--c-band', 'rgba(185,74,27,0.15)');
    objCtx.fill();
    objCtx.restore();

    // True f(x) dashed
    if (STATE.showTrue) {
      objCtx.save();
      objCtx.strokeStyle = cv('--c-true', 'rgba(74,70,64,0.5)');
      objCtx.lineWidth = 1.2;
      objCtx.setLineDash([4, 4]);
      objCtx.beginPath();
      var N = 200;
      for (var k = 0; k <= N; k++) {
        var x = X_MIN + k / N * (X_MAX - X_MIN);
        var yv = forrester(x);
        var px3 = xToPx(x, w);
        var py3 = yToPxObj(Math.min(Math.max(yv, Y_MIN), Y_MAX), h);
        if (k === 0) objCtx.moveTo(px3, py3); else objCtx.lineTo(px3, py3);
      }
      objCtx.stroke();
      objCtx.restore();
    }

    // GP posterior mean
    objCtx.save();
    objCtx.strokeStyle = cv('--c-gp-mean', '#1a1814');
    objCtx.lineWidth = 2;
    objCtx.beginPath();
    for (var i3 = 0; i3 < g.length; i3++) {
      var px4 = xToPx(g[i3], w);
      var py4 = yToPxObj(Math.min(Math.max(p[i3].mean, Y_MIN), Y_MAX), h);
      if (i3 === 0) objCtx.moveTo(px4, py4); else objCtx.lineTo(px4, py4);
    }
    objCtx.stroke();
    objCtx.restore();

    // Next-candidate vertical guide + marker
    if (STATE.nextX !== null) {
      objCtx.save();
      objCtx.strokeStyle = cv('--c-next', '#b94a1b');
      objCtx.globalAlpha = 0.38;
      objCtx.lineWidth = 1;
      objCtx.setLineDash([3, 4]);
      var pxn = xToPx(STATE.nextX, w);
      objCtx.beginPath();
      objCtx.moveTo(pxn, PAD_T);
      objCtx.lineTo(pxn, h - PAD_B);
      objCtx.stroke();
      objCtx.globalAlpha = 1;
      objCtx.setLineDash([]);

      var near = 0, bd = Infinity;
      for (var i4 = 0; i4 < g.length; i4++) {
        var d = Math.abs(g[i4] - STATE.nextX);
        if (d < bd) { bd = d; near = i4; }
      }
      var pyn = yToPxObj(Math.min(Math.max(p[near].mean, Y_MIN), Y_MAX), h);
      objCtx.fillStyle = cv('--c-next', '#b94a1b');
      objCtx.beginPath();
      objCtx.arc(pxn, pyn, 5, 0, 2 * Math.PI);
      objCtx.fill();
      objCtx.strokeStyle = cv('--paper', '#f4efe6');
      objCtx.lineWidth = 2;
      objCtx.stroke();
      objCtx.restore();
    }

    // Observations
    var xmin = currentBestFstar().xmin;
    objCtx.save();
    for (var oi = 0; oi < STATE.obs.length; oi++) {
      var o = STATE.obs[oi];
      var pxo = xToPx(o.x, w);
      var pyo = yToPxObj(Math.min(Math.max(o.y, Y_MIN), Y_MAX), h);
      var isBest = (o.x === xmin);
      objCtx.fillStyle = cv('--c-obs', '#a86a2a');
      objCtx.strokeStyle = cv('--paper', '#f4efe6');
      objCtx.lineWidth = 2;
      objCtx.beginPath();
      objCtx.arc(pxo, pyo, isBest ? 6 : 4.5, 0, 2 * Math.PI);
      objCtx.fill();
      objCtx.stroke();
      if (isBest) {
        objCtx.strokeStyle = cv('--c-obs', '#a86a2a');
        objCtx.globalAlpha = 0.45;
        objCtx.lineWidth = 1;
        objCtx.beginPath();
        objCtx.arc(pxo, pyo, 10, 0, 2 * Math.PI);
        objCtx.stroke();
        objCtx.globalAlpha = 1;
      }
    }
    objCtx.restore();
  }

  function drawAcquisitionPlot() {
    var w = acqCanvas.clientWidth;
    var h = acqCanvas.clientHeight;
    acqCtx.clearRect(0, 0, w, h);
    if (!STATE.acqVals) { drawAxes(acqCtx, w, h, 0, 1); return; }

    var minA = Infinity, maxA = -Infinity;
    for (var v = 0; v < STATE.acqVals.length; v++) {
      var val = STATE.acqVals[v];
      if (val < minA) minA = val;
      if (val > maxA) maxA = val;
    }
    if (maxA - minA < 1e-6) maxA = minA + 1e-6;
    var pad = (maxA - minA) * 0.08;
    var yMin = minA - pad, yMax = maxA + pad;
    drawAxes(acqCtx, w, h, yMin, yMax);

    var g = STATE.grid;
    // Filled area
    acqCtx.save();
    acqCtx.beginPath();
    for (var i = 0; i < g.length; i++) {
      var px = xToPx(g[i], w);
      var val2 = STATE.acqVals[i];
      var py = PAD_T + (1 - (val2 - yMin) / (yMax - yMin)) * (h - PAD_T - PAD_B);
      if (i === 0) acqCtx.moveTo(px, py); else acqCtx.lineTo(px, py);
    }
    acqCtx.lineTo(xToPx(g[g.length - 1], w), h - PAD_B);
    acqCtx.lineTo(xToPx(g[0], w), h - PAD_B);
    acqCtx.closePath();
    acqCtx.fillStyle = cv('--c-acq-fill', 'rgba(77,106,143,0.14)');
    acqCtx.fill();
    acqCtx.restore();

    // Curve
    acqCtx.save();
    acqCtx.strokeStyle = cv('--c-acq', '#4d6a8f');
    acqCtx.lineWidth = 2;
    acqCtx.beginPath();
    for (var i2 = 0; i2 < g.length; i2++) {
      var px2 = xToPx(g[i2], w);
      var val3 = STATE.acqVals[i2];
      var py2 = PAD_T + (1 - (val3 - yMin) / (yMax - yMin)) * (h - PAD_T - PAD_B);
      if (i2 === 0) acqCtx.moveTo(px2, py2); else acqCtx.lineTo(px2, py2);
    }
    acqCtx.stroke();
    acqCtx.restore();

    // argmax marker
    if (STATE.nextX !== null) {
      var bi = 0, bd = Infinity;
      for (var j = 0; j < g.length; j++) {
        var d = Math.abs(g[j] - STATE.nextX);
        if (d < bd) { bd = d; bi = j; }
      }
      var pxm = xToPx(g[bi], w);
      var vv = STATE.acqVals[bi];
      var pym = PAD_T + (1 - (vv - yMin) / (yMax - yMin)) * (h - PAD_T - PAD_B);
      acqCtx.save();
      acqCtx.strokeStyle = cv('--c-next', '#b94a1b');
      acqCtx.globalAlpha = 0.38;
      acqCtx.lineWidth = 1;
      acqCtx.setLineDash([3, 4]);
      acqCtx.beginPath();
      acqCtx.moveTo(pxm, PAD_T);
      acqCtx.lineTo(pxm, h - PAD_B);
      acqCtx.stroke();
      acqCtx.globalAlpha = 1;
      acqCtx.setLineDash([]);
      acqCtx.fillStyle = cv('--c-next', '#b94a1b');
      acqCtx.beginPath();
      acqCtx.arc(pxm, pym, 5, 0, 2 * Math.PI);
      acqCtx.fill();
      acqCtx.strokeStyle = cv('--paper', '#f4efe6');
      acqCtx.lineWidth = 2;
      acqCtx.stroke();
      acqCtx.restore();
    }
  }

  function updateStatus() {
    document.getElementById('iterStat').textContent = STATE.iter;
    document.getElementById('nObsStat').textContent = STATE.obs.length;
    var b = currentBestFstar();
    document.getElementById('bestStat').textContent = isFinite(b.fmin) ? b.fmin.toFixed(3) : '—';
    document.getElementById('bestXStat').textContent = b.xmin !== null ? b.xmin.toFixed(3) : '—';
    document.getElementById('nextStat').textContent = STATE.nextX !== null ? STATE.nextX.toFixed(3) : '—';
    var names = { EI: 'Expected Improvement', UCB: 'Lower Confidence Bound', PI: 'Probability of Improvement' };
    document.getElementById('acqName').textContent = names[STATE.acq];
  }

  function draw() { drawObjectivePlot(); drawAcquisitionPlot(); updateStatus(); }

  function stepOnce() {
    if (STATE.nextX === null) return;
    var x = STATE.nextX;
    var dup = STATE.obs.some(function(o) { return Math.abs(o.x - x) < 1e-4; });
    if (dup) return;
    STATE.obs.push({ x: x, y: forrester(x) });
    STATE.iter += 1;
    refit();
    draw();
  }
  function addManualObservation(x) {
    if (x < X_MIN || x > X_MAX) return;
    var dup = STATE.obs.some(function(o) { return Math.abs(o.x - x) < 1e-3; });
    if (dup) return;
    STATE.obs.push({ x: x, y: forrester(x) });
    STATE.iter += 1;
    refit();
    draw();
  }
  function reset() { seedInitial(); refit(); draw(); }

  document.getElementById('stepBtn').addEventListener('click', stepOnce);
  document.getElementById('resetBtn').addEventListener('click', reset);
  document.getElementById('auto10Btn').addEventListener('click', function() {
    var count = 0;
    (function runner() {
      if (count >= 10) return;
      stepOnce();
      count += 1;
      setTimeout(runner, 220);
    })();
  });

  document.querySelectorAll('#acqGroup button').forEach(function(btn) {
    btn.addEventListener('click', function() {
      document.querySelectorAll('#acqGroup button').forEach(function(b) { b.classList.remove('active'); });
      btn.classList.add('active');
      STATE.acq = btn.dataset.acq;
      refit();
      draw();
    });
  });

  function bindSlider(id, valId, fmt, setter) {
    var s = document.getElementById(id);
    var vEl = document.getElementById(valId);
    s.addEventListener('input', function() {
      var v = parseFloat(s.value);
      setter(v);
      vEl.textContent = fmt(v);
      refit();
      draw();
    });
  }
  bindSlider('lsSlider', 'lsVal', function(v) { return v.toFixed(2); }, function(v) { STATE.ls = v; });
  bindSlider('kappaSlider', 'kappaVal', function(v) { return v.toFixed(1); }, function(v) { STATE.kappa = v; });
  bindSlider('xiSlider', 'xiVal', function(v) { return v.toFixed(2); }, function(v) { STATE.xi = v; });

  var trueToggle = document.getElementById('trueToggle');
  function toggleTrue() {
    STATE.showTrue = !STATE.showTrue;
    trueToggle.classList.toggle('on', STATE.showTrue);
    trueToggle.setAttribute('aria-pressed', String(STATE.showTrue));
    draw();
  }
  trueToggle.addEventListener('click', toggleTrue);
  trueToggle.addEventListener('keydown', function(e) {
    if (e.key === ' ' || e.key === 'Enter') { e.preventDefault(); toggleTrue(); }
  });

  // Click on objective canvas → add observation
  objCanvas.addEventListener('click', function(e) {
    var rect = objCanvas.getBoundingClientRect();
    var w = rect.width;
    var px = e.clientX - rect.left;
    var x = pxToX(px, w);
    if (x >= X_MIN && x <= X_MAX) addManualObservation(x);
  });

  // Hover tooltip
  var tooltip = document.getElementById('objTooltip');
  objCanvas.addEventListener('mousemove', function(e) {
    if (!STATE.gpPred) return;
    var rect = objCanvas.getBoundingClientRect();
    var w = rect.width;
    var px = e.clientX - rect.left;
    var x = pxToX(px, w);
    if (x < X_MIN || x > X_MAX) { tooltip.classList.remove('visible'); return; }
    var bi = 0, bd = Infinity;
    for (var i = 0; i < STATE.grid.length; i++) {
      var d = Math.abs(STATE.grid[i] - x);
      if (d < bd) { bd = d; bi = i; }
    }
    var mp = STATE.gpPred[bi];
    tooltip.textContent = 'x=' + x.toFixed(3) + '  μ=' + mp.mean.toFixed(2) + '  σ=' + mp.std.toFixed(2);
    tooltip.style.left = (px + 12) + 'px';
    tooltip.style.top = (e.clientY - rect.top - 6) + 'px';
    tooltip.classList.add('visible');
  });
  objCanvas.addEventListener('mouseleave', function() { tooltip.classList.remove('visible'); });

  // Init
  window.addEventListener('resize', resizeCanvases);
  seedInitial();
  refit();
  resizeCanvases();
})();
</script>]]></content><author><name>Majid Mazouchi</name></author><category term="Machine Learning" /><summary type="html"><![CDATA[A working Bayesian optimization loop you can run in your browser. Watch Expected Improvement, UCB, and PI compete to find the minimum of a deceptive benchmark function — and read the math that makes it work.]]></summary></entry><entry><title type="html">Gaussian Processes — an Interactive Explainer</title><link href="https://majid-mazouchi.github.io/autonomy/posts/gaussian-processes-interactive-explainer/" rel="alternate" type="text/html" title="Gaussian Processes — an Interactive Explainer" /><published>2026-04-19T09:00:00-04:00</published><updated>2026-04-19T09:00:00-04:00</updated><id>https://majid-mazouchi.github.io/autonomy/posts/gaussian-processes-interactive-explainer</id><content type="html" xml:base="https://majid-mazouchi.github.io/autonomy/posts/gaussian-processes-interactive-explainer/"><![CDATA[<p>A Gaussian Process is one of those ideas that looks abstract on paper and clicks the moment you can actually <em>play</em> with one. The object of this post is that moment. Below is an interactive plot where the prior, the posterior, and all the kernel hyperparameters respond in real time. The prose around it tries to match what you’re seeing on the screen to the math underneath — and, in the last section, to the very un-glamorous reality of the $O(n^3)$ cost that determines when GPs earn their keep.</p>

<h2 id="what-is-a-gp-really">What is a GP, really?</h2>

<div class="callout-soft">

  <p><strong>The plain-English version.</strong> Imagine you’re trying to guess an unknown function from just a few measurements. A Gaussian Process is a principled way of saying: “<em>here are all the functions I think are plausible</em>” — and then updating that belief every time you see a new data point.</p>

  <p>Three ideas to hold onto:</p>

  <ol>
    <li>It’s a <em>distribution over functions</em>, not over numbers. Instead of “$x = 3.2 \pm 0.5$”, a GP gives you “the function could be this shape, or this one, or this one” — an infinite family of curves, each with a probability.</li>
    <li>The <em>kernel</em> encodes your assumption about smoothness: points that are close in input should have similar output values. That one assumption is enough to turn a handful of measurements into a full curve with confidence bounds.</li>
    <li>The magic is <em>calibrated uncertainty</em>. Near your data the GP is confident, far from it the GP is humble and the error bars grow. That honesty is what makes GPs useful for control, optimization, and diagnostics — the model tells you when not to trust it.</li>
  </ol>

</div>

<p>A one-line mental model: a GP is <strong>linear regression with infinitely many features</strong>, where the kernel silently handles the infinite sum for you.</p>

<h2 id="the-one-sentence-definition">The one-sentence definition</h2>

<p>A <strong>Gaussian Process</strong> is a distribution over functions such that any finite collection of function values has a joint Gaussian distribution. It is fully specified by a mean function $m(x)$ and a covariance (kernel) function $k(x, x')$:</p>

\[f(x) \sim \mathcal{GP}\!\left( m(x),\; k(x, x') \right)\]

<p>In practice we almost always set $m(x) = 0$ (after centering the data) and do all the modeling work through the kernel.</p>
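<p>That definition is directly computable: pick any finite grid of inputs, evaluate the kernel pairwise, and you have an ordinary multivariate Gaussian $\mathcal{N}(0, K)$ over the function values on that grid. A prior draw is then $f = Lz$ with $K = LL^\top$ and $z \sim \mathcal{N}(0, I)$. A minimal sketch in the same vanilla JavaScript as the widget below — the helper names are mine, and the small jitter keeps $K$ numerically positive-definite:</p>

```javascript
// Finite-dimensional marginal of a zero-mean GP with an RBF kernel, and one
// prior draw f = L·z. z is passed in explicitly so the draw is reproducible.
function rbfK(x, xp, ell, sf2) {
  var d = x - xp;
  return sf2 * Math.exp(-d * d / (2 * ell * ell));
}

// Cholesky factorization A = L·Lᵀ (A symmetric positive-definite).
function cholesky(A) {
  var n = A.length, L = [];
  for (var i = 0; i < n; i++) {
    L.push(new Array(n).fill(0));
    for (var j = 0; j <= i; j++) {
      var s = A[i][j];
      for (var k = 0; k < j; k++) s -= L[i][k] * L[j][k];
      L[i][j] = (i === j) ? Math.sqrt(s) : s / L[j][j];
    }
  }
  return L;
}

function samplePrior(grid, ell, sf2, z) {
  var n = grid.length, K = [];
  for (var i = 0; i < n; i++) {
    K.push([]);
    for (var j = 0; j < n; j++) K[i].push(rbfK(grid[i], grid[j], ell, sf2));
    K[i][i] += 1e-9; // jitter: keeps the Cholesky stable
  }
  var L = cholesky(K);
  var f = [];
  for (var r = 0; r < n; r++) {
    var s = 0;
    for (var c = 0; c < n; c++) s += L[r][c] * z[c];
    f.push(s); // f = L·z ~ N(0, K)
  }
  return f;
}
```

<p>With a random $z$ (e.g. Box–Muller draws) this is exactly how the sampled functions in the demo below are generated.</p>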

<h2 id="interactive-demo">Interactive demo</h2>

<div class="gp-widget">

  <div class="mode-bar">
    <button id="mode-prior" type="button">Prior (no data)</button>
    <button id="mode-post" class="active" type="button">Posterior (with data)</button>
    <button id="clear" type="button">Clear points</button>
    <button id="resample" type="button">Resample</button>
  </div>

  <div class="hint" id="mode-hint">Posterior: click on the plot to add (or remove) observations. The GP conditions on them — the mean threads through the points, variance collapses nearby, and stays high far away.</div>

  <canvas id="gp-canvas" width="780" height="400"></canvas>

  <div class="legend">
    <div class="item"><span class="swatch mean"></span> Posterior mean</div>
    <div class="item"><span class="swatch band"></span> 95% credible band</div>
    <div class="item"><span class="swatch sample"></span> Sampled functions</div>
    <div class="item"><span class="dot-swatch"></span> Observations</div>
  </div>

  <div class="controls">
    <div class="control">
      <label>Length scale ℓ <span class="val" id="ls-val">1.00</span></label>
      <input type="range" id="ls" min="0.1" max="3.0" step="0.05" value="1.0" />
      <p class="caption">Smoothness of sampled functions</p>
    </div>
    <div class="control">
      <label>Signal variance σ<sub>f</sub>² <span class="val" id="sf-val">1.00</span></label>
      <input type="range" id="sf" min="0.1" max="3.0" step="0.05" value="1.0" />
      <p class="caption">Vertical amplitude of functions</p>
    </div>
    <div class="control">
      <label>Noise σ<sub>n</sub> <span class="val" id="sn-val">0.10</span></label>
      <input type="range" id="sn" min="0.001" max="0.5" step="0.005" value="0.1" />
      <p class="caption">Observation noise std</p>
    </div>
    <div class="control">
      <label>Sample paths <span class="val" id="ns-val">5</span></label>
      <input type="range" id="ns" min="0" max="10" step="1" value="5" />
      <p class="caption">Number of drawn functions</p>
    </div>
  </div>

  <div class="kernel-box">
    <div class="k-label">Kernel: squared exponential (RBF)</div>
    <div class="k-eq">k(x, x') = σ<sub>f</sub>² · exp( −(x − x')² / (2ℓ²) )</div>
  </div>

</div>

<h2 id="how-to-read-the-plot">How to read the plot</h2>

<h3 id="the-prior">The prior</h3>

<p>Click <em>Prior</em> and slide <strong>Sample paths</strong> up. Every colored curve is one function drawn from $\mathcal{GP}(0, k)$. The shaded band is the 95% credible region. Before seeing any data, the GP says “the function could be any of these.”</p>

<p>The length scale $\ell$ controls how wiggly these samples are:</p>

<ul>
  <li><strong>Small $\ell$</strong> — nearby inputs are only weakly correlated, so samples look rough.</li>
  <li><strong>Large $\ell$</strong> — strong correlation, so samples look smooth.</li>
</ul>

<p>The signal variance $\sigma_f^2$ scales the vertical amplitude of the prior.</p>

<h3 id="the-posterior">The posterior</h3>

<p>Click <em>Posterior</em> and then click anywhere on the plot to drop observations. Click an existing point to remove it. Two things happen:</p>

<ul>
  <li>The <strong>mean</strong> curve threads through the observations (or very close, limited by the noise $\sigma_n$).</li>
  <li>The <strong>uncertainty band</strong> collapses near data points and re-opens in regions with no data.</li>
</ul>

<blockquote>
  <p><strong>Key takeaway.</strong> This is the main selling point for GPs: <em>calibrated uncertainty that grows where you haven’t looked</em>. That’s exactly why they shine in sparse-data regimes like active learning, Bayesian optimization, and safe control.</p>
</blockquote>

<h2 id="the-math-in-three-lines">The math in three lines</h2>

<p>Given training data $(X, y)$ with i.i.d. Gaussian noise of variance $\sigma_n^2$, the joint distribution of the observed targets $y$ and the function value $f_*$ at a test point $x_*$ is:</p>

\[\begin{bmatrix} y \\ f_* \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\begin{bmatrix}
K(X,X) + \sigma_n^2 I &amp; K(X, x_*) \\
K(x_*, X) &amp; K(x_*, x_*)
\end{bmatrix}
\right)\]

<p>Conditioning on $y$ using the standard Gaussian conditioning identity gives closed-form posterior mean and variance:</p>

\[\mu_* = K(x_*, X) \left[ K(X,X) + \sigma_n^2 I \right]^{-1} y\]

\[\sigma_*^2 = K(x_*, x_*) - K(x_*, X) \left[ K(X,X) + \sigma_n^2 I \right]^{-1} K(X, x_*)\]

<p>The widget implements exactly this. For numerical stability we use a Cholesky decomposition:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>K + σ_n²·I = L · Lᵀ       (Cholesky, O(n³))
α = Lᵀ \ (L \ y)           (triangular solves)
μ*  = k*ᵀ · α              (mean at test point)
v   = L \ k*
σ*² = k(x*, x*) − vᵀv      (variance at test point)
</code></pre></div></div>
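<p>The recipe above translates almost line-for-line into plain JavaScript. This is a self-contained sketch of the same computation, not the widget's actual source — the helper names (<code>rbf</code>, <code>cholesky</code>, <code>gpPosterior</code>) are mine, and <code>sn2</code> is the noise <em>variance</em> $\sigma_n^2$:</p>

```javascript
// Exact GP posterior mean and variance at one test point, via Cholesky.
function rbf(x, xp, ell, sf2) {
  var d = x - xp;
  return sf2 * Math.exp(-d * d / (2 * ell * ell));
}

// Cholesky factorization A = L·Lᵀ (A symmetric positive-definite). O(n³).
function cholesky(A) {
  var n = A.length, L = [];
  for (var i = 0; i < n; i++) {
    L.push(new Array(n).fill(0));
    for (var j = 0; j <= i; j++) {
      var s = A[i][j];
      for (var k = 0; k < j; k++) s -= L[i][k] * L[j][k];
      L[i][j] = (i === j) ? Math.sqrt(s) : s / L[j][j];
    }
  }
  return L;
}

// Forward substitution: solve L·x = b.
function solveLower(L, b) {
  var x = [];
  for (var i = 0; i < b.length; i++) {
    var s = b[i];
    for (var k = 0; k < i; k++) s -= L[i][k] * x[k];
    x.push(s / L[i][i]);
  }
  return x;
}

// Back substitution: solve Lᵀ·x = b.
function solveUpper(L, b) {
  var n = b.length, x = new Array(n);
  for (var i = n - 1; i >= 0; i--) {
    var s = b[i];
    for (var k = i + 1; k < n; k++) s -= L[k][i] * x[k];
    x[i] = s / L[i][i];
  }
  return x;
}

function gpPosterior(xs, ys, xstar, ell, sf2, sn2) {
  var n = xs.length, K = [], kstar = [];
  for (var i = 0; i < n; i++) {
    K.push([]);
    for (var j = 0; j < n; j++) {
      K[i].push(rbf(xs[i], xs[j], ell, sf2) + (i === j ? sn2 : 0));
    }
    kstar.push(rbf(xs[i], xstar, ell, sf2));
  }
  var L = cholesky(K);                                   // K + σn²I = L·Lᵀ
  var alpha = solveUpper(L, solveLower(L, ys));          // α = (K + σn²I)⁻¹ y
  var mean = 0;
  for (var m = 0; m < n; m++) mean += kstar[m] * alpha[m]; // μ* = k*ᵀ α
  var v = solveLower(L, kstar), vv = 0;                  // v = L \ k*
  for (var q = 0; q < n; q++) vv += v[q] * v[q];
  return { mean: mean, variance: rbf(xstar, xstar, ell, sf2) - vv }; // σ*² = k** − vᵀv
}
```

<p>Note the asymmetry the complexity tables below make precise: $\alpha$ can be cached, so the mean is a dot product, while each variance query pays for its own triangular solve.</p>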

<h2 id="common-kernels">Common kernels</h2>

<ul>
  <li><strong>Squared Exponential (RBF).</strong> Infinitely differentiable. Produces very smooth functions. Default choice, used in the demo above.</li>
  <li><strong>Matérn ($\nu = 3/2,\ 5/2$).</strong> Controllable smoothness. Often more realistic than RBF for physical signals that are continuous but not infinitely smooth.</li>
  <li><strong>Periodic.</strong> Encodes periodicity with a fixed period. Good for oscillatory signals like motor ripple or seasonal effects.</li>
  <li><strong>Linear.</strong> Recovers Bayesian linear regression as a special case.</li>
  <li><strong>Sums and products.</strong> Kernels are closed under addition and multiplication, so you can compose them — e.g. <code class="language-plaintext highlighter-rouge">Periodic · RBF</code> for a slowly-decaying oscillation.</li>
</ul>
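<p>Because kernels are just functions of two inputs, composition is pointwise. A sketch of the kernels above as higher-order functions — the hyperparameter names (<code>ell</code>, <code>sf2</code>, <code>period</code>) are mine, and everything is 1-D for clarity:</p>

```javascript
// A few standard kernels as kernel factories, plus sum/product composition.
function rbf(ell, sf2) {
  return function (x, xp) {
    var d = x - xp;
    return sf2 * Math.exp(-d * d / (2 * ell * ell));
  };
}
function matern32(ell, sf2) {          // Matérn ν = 3/2: once-differentiable samples
  return function (x, xp) {
    var r = Math.sqrt(3) * Math.abs(x - xp) / ell;
    return sf2 * (1 + r) * Math.exp(-r);
  };
}
function periodic(ell, sf2, period) {  // exact repetition every `period`
  return function (x, xp) {
    var s = Math.sin(Math.PI * Math.abs(x - xp) / period);
    return sf2 * Math.exp(-2 * s * s / (ell * ell));
  };
}
// Closure under + and × makes composition one-liners:
function kSum(k1, k2)  { return function (x, xp) { return k1(x, xp) + k2(x, xp); }; }
function kProd(k1, k2) { return function (x, xp) { return k1(x, xp) * k2(x, xp); }; }

// e.g. a slowly-decaying oscillation: Periodic · RBF
var oscDecay = kProd(periodic(1.0, 1.0, 2.0), rbf(10.0, 1.0));
```

<p>Swapping <code>oscDecay</code> (or any composition) into the posterior equations changes nothing else — the Cholesky machinery is kernel-agnostic.</p>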

<h2 id="hyperparameter-learning">Hyperparameter learning</h2>

<p>The kernel hyperparameters $\theta = \{\ell, \sigma_f, \sigma_n\}$ are typically learned by maximizing the <strong>log marginal likelihood</strong> of the observed data:</p>

\[\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^{\!\top} \!\left[K + \sigma_n^2 I\right]^{-1} y \;-\; \tfrac{1}{2} \log \!\left| K + \sigma_n^2 I \right| \;-\; \tfrac{n}{2} \log 2\pi\]

<p>The three terms have a clean interpretation: <em>data fit</em>, <em>complexity penalty</em>, and a constant. This is the built-in Occam’s razor that makes GPs elegant — overly wiggly models are penalized by the log-determinant term.</p>
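<p>Conveniently, the same Cholesky factor that fits the GP also evaluates this objective: $\tfrac{1}{2}\log|K + \sigma_n^2 I| = \sum_i \log L_{ii}$. A self-contained sketch (helper names are mine; <code>sn2</code> is the noise variance $\sigma_n^2$):</p>

```javascript
// Log marginal likelihood log p(y | X, θ) for the RBF kernel, via Cholesky:
//   −½ yᵀα − Σᵢ log Lᵢᵢ − (n/2) log 2π
function rbf(x, xp, ell, sf2) {
  var d = x - xp;
  return sf2 * Math.exp(-d * d / (2 * ell * ell));
}
function cholesky(A) {
  var n = A.length, L = [];
  for (var i = 0; i < n; i++) {
    L.push(new Array(n).fill(0));
    for (var j = 0; j <= i; j++) {
      var s = A[i][j];
      for (var k = 0; k < j; k++) s -= L[i][k] * L[j][k];
      L[i][j] = (i === j) ? Math.sqrt(s) : s / L[j][j];
    }
  }
  return L;
}
function solveLower(L, b) {
  var x = [];
  for (var i = 0; i < b.length; i++) {
    var s = b[i];
    for (var k = 0; k < i; k++) s -= L[i][k] * x[k];
    x.push(s / L[i][i]);
  }
  return x;
}
function solveUpper(L, b) {
  var n = b.length, x = new Array(n);
  for (var i = n - 1; i >= 0; i--) {
    var s = b[i];
    for (var k = i + 1; k < n; k++) s -= L[k][i] * x[k];
    x[i] = s / L[i][i];
  }
  return x;
}

function logMarginalLikelihood(xs, ys, ell, sf2, sn2) {
  var n = xs.length, K = [];
  for (var i = 0; i < n; i++) {
    K.push([]);
    for (var j = 0; j < n; j++) K[i].push(rbf(xs[i], xs[j], ell, sf2) + (i === j ? sn2 : 0));
  }
  var L = cholesky(K);
  var alpha = solveUpper(L, solveLower(L, ys)); // α = (K + σn²I)⁻¹ y
  var fit = 0, logdet = 0;
  for (var i2 = 0; i2 < n; i2++) {
    fit += ys[i2] * alpha[i2];        // data-fit term yᵀα
    logdet += Math.log(L[i2][i2]);    // ½ log|K + σn²I| — complexity penalty
  }
  return -0.5 * fit - logdet - 0.5 * n * Math.log(2 * Math.PI);
}
```

<p>A crude grid search over this function is often enough to pick sensible $\ell$ and $\sigma_n$ by hand — which is a useful exercise with the sliders in the demo above.</p>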

<h2 id="computational-complexity">Computational complexity</h2>

<p>GPs are conceptually clean but computationally heavy. The cost is dominated by a single operation: inverting (or Cholesky-factorizing) the $n \times n$ kernel matrix $K + \sigma_n^2 I$, where $n$ is the number of training points.</p>

<h3 id="exact-gp--the-honest-numbers">Exact GP — the honest numbers</h3>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Time</th>
      <th>Memory</th>
      <th>What’s happening</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Training (one-time)</td>
      <td><code class="language-plaintext highlighter-rouge">O(n³)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(n²)</code></td>
      <td>Build <code class="language-plaintext highlighter-rouge">K</code>, do Cholesky <code class="language-plaintext highlighter-rouge">K = LLᵀ</code>, solve for <code class="language-plaintext highlighter-rouge">α = K⁻¹y</code></td>
    </tr>
    <tr>
      <td>Predict mean (per test point)</td>
      <td><code class="language-plaintext highlighter-rouge">O(n)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(n)</code></td>
      <td>Inner product <code class="language-plaintext highlighter-rouge">k*ᵀα</code> — cheap once <code class="language-plaintext highlighter-rouge">α</code> is cached</td>
    </tr>
    <tr>
      <td>Predict variance (per test point)</td>
      <td><code class="language-plaintext highlighter-rouge">O(n²)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(n)</code></td>
      <td>Triangular solve <code class="language-plaintext highlighter-rouge">v = L \ k*</code>, then <code class="language-plaintext highlighter-rouge">σ*² = k** − vᵀv</code></td>
    </tr>
    <tr>
      <td>Hyperparameter learning (per iter.)</td>
      <td><code class="language-plaintext highlighter-rouge">O(n³)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(n²)</code></td>
      <td>New <code class="language-plaintext highlighter-rouge">θ</code> → rebuild <code class="language-plaintext highlighter-rouge">K</code> → redo Cholesky → gradient of log-marginal-likelihood</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Rule of thumb.</strong> Exact GP is comfortable up to a few thousand points on a laptop. At $n \approx 10{,}000$ you’re hitting the wall (memory for $K$ alone is roughly 800 MB in float64, and one Cholesky takes minutes). Beyond that, you need approximations.</p>
</blockquote>

<h3 id="why-its-on3">Why it’s $O(n^3)$</h3>

<p>The $O(n^3)$ cost comes from the Cholesky decomposition of the kernel matrix, which is the dominant numerical step. Every time hyperparameters change during training, the matrix changes, and the whole factorization has to be redone. Mean prediction at a new test point is cheap because it’s just a dot product against a precomputed vector. Variance prediction is more expensive because each test point needs its own triangular solve against $L$.</p>

<h3 id="scaling-beyond-exact-gp">Scaling beyond exact GP</h3>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Training</th>
      <th>Prediction</th>
      <th>Idea</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Exact GP</td>
      <td><code class="language-plaintext highlighter-rouge">O(n³)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(n²)</code></td>
      <td>Full Cholesky — baseline</td>
    </tr>
    <tr>
      <td>Sparse GP / FITC</td>
      <td><code class="language-plaintext highlighter-rouge">O(n·m²)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(m²)</code></td>
      <td><code class="language-plaintext highlighter-rouge">m</code> inducing points summarize <code class="language-plaintext highlighter-rouge">n</code> training points</td>
    </tr>
    <tr>
      <td>SVGP (variational)</td>
      <td><code class="language-plaintext highlighter-rouge">O(m³)</code> per batch</td>
      <td><code class="language-plaintext highlighter-rouge">O(m²)</code></td>
      <td>Mini-batch training; scales to millions of points</td>
    </tr>
    <tr>
      <td>KISS-GP / SKI</td>
      <td><code class="language-plaintext highlighter-rouge">O(n + m log m)</code></td>
      <td><code class="language-plaintext highlighter-rouge">O(1)</code> amortized</td>
      <td>Structured kernel interpolation on a grid</td>
    </tr>
    <tr>
      <td>Local GPs</td>
      <td><code class="language-plaintext highlighter-rouge">O(k³)</code> per local model</td>
      <td><code class="language-plaintext highlighter-rouge">O(k²)</code></td>
      <td>Partition input space, fit a small GP per region</td>
    </tr>
  </tbody>
</table>

<p>Here $m$ is the number of inducing (pseudo-)points, usually $m \ll n$, and $k$ is the local neighborhood size. In practice, $m$ in the range 50–500 is common.</p>

<blockquote>
  <p><strong>Embedded context.</strong> For real-time control (GP-augmented MPC, for instance), even the $O(n)$ or $O(m)$ prediction cost matters. Common tricks: freeze hyperparameters offline, precompute $\alpha$, use a small fixed inducing set, or switch to a parametric approximation once the GP has been learned.</p>
</blockquote>
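<p>The "precompute $\alpha$" trick from the note above fits in a closure: pay the $O(n^3)$ factorization once offline, keep only the training inputs and $\alpha$, and each online mean query is a single $O(n)$ dot product. A sketch under those assumptions (helper names are mine; variance queries are deliberately dropped, since they would need $L$):</p>

```javascript
// Offline: factorize once and cache α. Online: O(n) mean-only prediction.
function rbf(x, xp, ell, sf2) {
  var d = x - xp;
  return sf2 * Math.exp(-d * d / (2 * ell * ell));
}
function cholesky(A) {
  var n = A.length, L = [];
  for (var i = 0; i < n; i++) {
    L.push(new Array(n).fill(0));
    for (var j = 0; j <= i; j++) {
      var s = A[i][j];
      for (var k = 0; k < j; k++) s -= L[i][k] * L[j][k];
      L[i][j] = (i === j) ? Math.sqrt(s) : s / L[j][j];
    }
  }
  return L;
}
function solveLower(L, b) {
  var x = [];
  for (var i = 0; i < b.length; i++) {
    var s = b[i];
    for (var k = 0; k < i; k++) s -= L[i][k] * x[k];
    x.push(s / L[i][i]);
  }
  return x;
}
function solveUpper(L, b) {
  var n = b.length, x = new Array(n);
  for (var i = n - 1; i >= 0; i--) {
    var s = b[i];
    for (var k = i + 1; k < n; k++) s -= L[k][i] * x[k];
    x[i] = s / L[i][i];
  }
  return x;
}

function makeFrozenMeanPredictor(xs, ys, ell, sf2, sn2) {
  var n = xs.length, K = [];
  for (var i = 0; i < n; i++) {
    K.push([]);
    for (var j = 0; j < n; j++) K[i].push(rbf(xs[i], xs[j], ell, sf2) + (i === j ? sn2 : 0));
  }
  var L = cholesky(K);                          // O(n³), done once offline
  var alpha = solveUpper(L, solveLower(L, ys)); // α = (K + σn²I)⁻¹ y, cached
  return function predictMean(xstar) {          // O(n) per call — no matrix work
    var m = 0;
    for (var i2 = 0; i2 < n; i2++) m += rbf(xs[i2], xstar, ell, sf2) * alpha[i2];
    return m;
  };
}
```

<p>In a control loop the closure's captured state is just <code>n</code> kernel evaluations and one dot product per tick — easy to budget for, unlike the full posterior.</p>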

<h2 id="when-to-reach-for-a-gp">When to reach for a GP</h2>

<ul>
  <li><strong>Small-to-medium data</strong> where uncertainty quantification matters more than raw throughput (GP inference is $O(n^3)$, impractical beyond ~10k points without approximations).</li>
  <li><strong>Bayesian optimization</strong> — GP surrogate + acquisition function (EI, UCB) for expensive black-box optimization.</li>
  <li><strong>Active learning</strong> — pick the next query where posterior variance is highest.</li>
  <li><strong>Safe / uncertainty-aware control</strong> — GP-augmented MPC uses the GP posterior mean to correct model mismatch and the variance to gate how aggressively the controller trusts that correction.</li>
  <li><strong>System identification and calibration</strong> — non-parametric regression with confidence bounds that grow in extrapolation regions.</li>
  <li><strong>Diagnostics and prognostics</strong> — detect when the current operating point is out-of-distribution relative to training data.</li>
</ul>

<h2 id="limitations-to-be-honest-about">Limitations to be honest about</h2>

<ul>
  <li><strong>Scaling.</strong> Naïve GP is $O(n^3)$ time and $O(n^2)$ memory. Sparse variational GPs, inducing points, and structured kernels (KISS-GP, SKI) push these limits out by orders of magnitude, at the cost of approximation.</li>
  <li><strong>High input dimensions.</strong> Stationary kernels suffer in high-$D$ without ARD or dimensionality reduction.</li>
  <li><strong>Kernel choice matters.</strong> A misspecified kernel gives confidently wrong uncertainty — the variance is only calibrated <em>given</em> the model.</li>
  <li><strong>Non-Gaussian likelihoods.</strong> Classification and count data need approximations (Laplace, EP, variational).</li>
</ul>

<hr />

<p>The interactive demo above is implemented with vanilla JavaScript and the Canvas 2D API — Cholesky solve, posterior sampling, and kernel evaluation are all in-browser, no external libraries. View source on the page if you want to read it.</p>

<!-- =====================================================
     Post-scoped styles for the interactive GP widget
     Reuses the blog's CSS variables (paper, ink, rule, accent)
     so it harmonizes with the warm editorial theme.
     ===================================================== -->
<style>
  .article-body .callout-soft {
    background: var(--paper-2);
    border-left: 3px solid var(--accent);
    border-radius: 0 4px 4px 0;
    padding: 18px 24px;
    margin: 28px auto;
    max-width: var(--read);
    font-size: 1rem;
  }
  .article-body .callout-soft p { margin: 0 0 12px; }
  .article-body .callout-soft p:last-child { margin-bottom: 0; }
  .article-body .callout-soft ol { padding-left: 20px; margin: 0 0 0; }
  .article-body .callout-soft li { margin: 8px 0; }

  .article-body .gp-widget {
    max-width: var(--read);
    margin: 28px auto 36px;
    background: var(--paper-2);
    border: 1px solid var(--rule);
    border-radius: 4px;
    padding: 22px;
    font-family: var(--f-body);
  }

  .article-body .gp-widget .mode-bar {
    display: flex;
    gap: 8px;
    margin: 0 0 14px;
    flex-wrap: wrap;
  }
  .article-body .gp-widget button {
    font-family: var(--f-ui);
    font-size: .74rem;
    font-weight: 400;
    letter-spacing: .08em;
    text-transform: uppercase;
    background: transparent;
    color: var(--ink);
    border: 1px solid var(--rule);
    border-radius: 2px;
    padding: 8px 12px;
    cursor: pointer;
    transition: background .15s ease, color .15s ease, border-color .15s ease;
  }
  .article-body .gp-widget button:hover {
    background: var(--paper);
    border-color: var(--ink-mute);
  }
  .article-body .gp-widget button.active {
    background: var(--accent);
    color: var(--paper);
    border-color: var(--accent);
  }

  .article-body .gp-widget .hint {
    font-family: var(--f-body);
    font-size: .92rem;
    font-style: italic;
    color: var(--ink-soft);
    margin: 0 0 14px;
    line-height: 1.5;
  }

  .article-body .gp-widget canvas#gp-canvas {
    width: 100%;
    height: auto;
    display: block;
    background: var(--paper);
    border: 1px solid var(--rule);
    border-radius: 2px;
    cursor: crosshair;
    margin: 0 0 14px;
  }

  .article-body .gp-widget .legend {
    display: flex;
    gap: 18px;
    flex-wrap: wrap;
    font-family: var(--f-ui);
    font-size: .72rem;
    letter-spacing: .08em;
    text-transform: uppercase;
    color: var(--ink-mute);
    margin: 0 0 18px;
  }
  .article-body .gp-widget .legend .item {
    display: inline-flex; align-items: center; gap: 6px;
  }
  .article-body .gp-widget .swatch { display: inline-block; width: 18px; height: 2px; }
  .article-body .gp-widget .swatch.mean { background: var(--ink); }
  .article-body .gp-widget .swatch.sample { background: var(--accent); }
  .article-body .gp-widget .swatch.band {
    height: 10px;
    background: rgba(185, 74, 27, 0.18);
    border-radius: 1px;
  }
  .article-body .gp-widget .dot-swatch {
    display: inline-block;
    width: 8px; height: 8px;
    border-radius: 50%;
    background: var(--ink);
  }

  .article-body .gp-widget .controls {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(210px, 1fr));
    gap: 18px;
    margin: 0 0 4px;
  }
  .article-body .gp-widget .control label {
    display: flex;
    justify-content: space-between;
    align-items: baseline;
    font-family: var(--f-ui);
    font-size: .76rem;
    letter-spacing: .04em;
    color: var(--ink-soft);
    margin: 0 0 6px;
  }
  .article-body .gp-widget .control label .val {
    color: var(--ink);
    font-variant-numeric: tabular-nums;
    font-weight: 500;
  }
  .article-body .gp-widget .control .caption {
    font-family: var(--f-body);
    font-size: .82rem;
    font-style: italic;
    color: var(--ink-mute);
    margin: 6px 0 0;
  }

  .article-body .gp-widget input[type="range"] {
    -webkit-appearance: none;
    appearance: none;
    width: 100%;
    height: 3px;
    background: var(--rule);
    border-radius: 2px;
    outline: none;
    margin: 6px 0;
  }
  .article-body .gp-widget input[type="range"]::-webkit-slider-thumb {
    -webkit-appearance: none;
    appearance: none;
    width: 16px; height: 16px;
    background: var(--accent);
    border-radius: 50%;
    cursor: pointer;
    border: 2px solid var(--paper);
    box-shadow: 0 0 0 1px var(--accent);
  }
  .article-body .gp-widget input[type="range"]::-moz-range-thumb {
    width: 14px; height: 14px;
    background: var(--accent);
    border-radius: 50%;
    cursor: pointer;
    border: 2px solid var(--paper);
    box-shadow: 0 0 0 1px var(--accent);
  }

  .article-body .gp-widget .kernel-box {
    background: var(--paper);
    border: 1px solid var(--rule);
    border-radius: 2px;
    padding: 12px 16px;
    margin: 18px 0 0;
    font-size: .9rem;
  }
  .article-body .gp-widget .kernel-box .k-label {
    font-family: var(--f-ui);
    font-size: .72rem;
    letter-spacing: .1em;
    text-transform: uppercase;
    color: var(--ink-mute);
    margin-bottom: 6px;
  }
  .article-body .gp-widget .kernel-box .k-eq {
    font-family: var(--f-ui);
    color: var(--ink);
    font-size: .88rem;
  }

  /* Tables within the article body */
  .article-body table {
    width: 100%;
    max-width: var(--read);
    margin: 24px auto;
    border-collapse: collapse;
    font-size: .94rem;
  }
  .article-body table th,
  .article-body table td {
    text-align: left;
    padding: 10px 12px;
    border-bottom: 1px solid var(--rule);
    vertical-align: top;
  }
  .article-body table th {
    font-family: var(--f-ui);
    font-weight: 500;
    font-size: .74rem;
    letter-spacing: .1em;
    text-transform: uppercase;
    color: var(--ink-mute);
    border-bottom: 1px solid var(--ink);
  }
  .article-body table td:first-child { font-weight: 500; }
  .article-body table code {
    background: var(--tag-bg);
    color: var(--ink);
    font-size: .86em;
  }
</style>

<script>
(function() {
  var canvas = document.getElementById('gp-canvas');
  if (!canvas) return;
  var ctx = canvas.getContext('2d');

  function getCSS(name) {
    return getComputedStyle(document.documentElement).getPropertyValue(name).trim();
  }

  // Warm palette that harmonizes with the blog's burnt-orange accent
  var SAMPLE_COLORS = [
    '#b94a1b', // accent — burnt orange
    '#4d6a8f', // muted steel blue
    '#6b8a3f', // olive green
    '#8a4a6b', // dusty burgundy
    '#a86a2a', // bronze
    '#3f7a6b', // teal
    '#7a4a3a', // brick
    '#5a6a4a', // moss
    '#9a6a4a', // sienna
    '#4a4a7a'  // indigo
  ];

  function resizeCanvas() {
    var dpr = window.devicePixelRatio || 1;
    var rect = canvas.getBoundingClientRect();
    canvas.width = rect.width * dpr;
    canvas.height = 400 * dpr;
    canvas.style.height = '400px';
    ctx.setTransform(dpr, 0, 0, dpr, 0, 0);
    draw();
  }

  var W = function() { return canvas.getBoundingClientRect().width; };
  var H = function() { return 400; };
  var PAD_L = 48, PAD_R = 20, PAD_T = 20, PAD_B = 40;
  var plotW = function() { return W() - PAD_L - PAD_R; };
  var plotH = function() { return H() - PAD_T - PAD_B; };

  var X_MIN = -5, X_MAX = 5;
  var Y_MIN = -3, Y_MAX = 3;

  var N_TEST = 120;
  var xTest = [];
  for (var i = 0; i < N_TEST; i++) xTest.push(X_MIN + (X_MAX - X_MIN) * i / (N_TEST - 1));

  var mode = 'post';
  var points = [{x: -3, y: 0.5}, {x: -1, y: -0.8}, {x: 1.5, y: 1.2}, {x: 3, y: -0.3}];
  var ls = 1.0, sf = 1.0, sn = 0.1, nSamples = 5;
  var sampleSeed = 42;

  function xToPx(x) { return PAD_L + (x - X_MIN) / (X_MAX - X_MIN) * plotW(); }
  function yToPx(y) { return PAD_T + (Y_MAX - y) / (Y_MAX - Y_MIN) * plotH(); }
  function pxToX(px) { return X_MIN + (px - PAD_L) / plotW() * (X_MAX - X_MIN); }
  function pxToY(py) { return Y_MAX - (py - PAD_T) / plotH() * (Y_MAX - Y_MIN); }

  // Squared-exponential (RBF) kernel: k(x, x') = σ_f² · exp(−(x−x')² / (2ℓ²))
  function rbf(x1, x2) {
    var d = x1 - x2;
    return sf * sf * Math.exp(-(d * d) / (2 * ls * ls));
  }

  function buildK(xs) {
    var n = xs.length;
    var K = [];
    for (var i = 0; i < n; i++) {
      K.push([]);
      for (var j = 0; j < n; j++) K[i].push(rbf(xs[i], xs[j]));
    }
    return K;
  }

  // Cholesky factorization A = LLᵀ for symmetric positive-definite A;
  // the 1e-10 floor keeps the sqrt real when A is nearly singular.
  function cholesky(A) {
    var n = A.length;
    var L = [];
    for (var i = 0; i < n; i++) L.push(new Array(n).fill(0));
    for (var i2 = 0; i2 < n; i2++) {
      for (var j = 0; j <= i2; j++) {
        var s = 0;
        for (var k = 0; k < j; k++) s += L[i2][k] * L[j][k];
        if (i2 === j) {
          var v = A[i2][i2] - s;
          L[i2][j] = Math.sqrt(Math.max(v, 1e-10));
        } else {
          L[i2][j] = (A[i2][j] - s) / L[j][j];
        }
      }
    }
    return L;
  }

  // Forward substitution: solve L y = b for lower-triangular L.
  function solveLower(L, b) {
    var n = L.length;
    var y = new Array(n).fill(0);
    for (var i = 0; i < n; i++) {
      var s = 0;
      for (var k = 0; k < i; k++) s += L[i][k] * y[k];
      y[i] = (b[i] - s) / L[i][i];
    }
    return y;
  }

  // Back substitution: solve U x = b for upper-triangular U (here U = Lᵀ).
  function solveUpper(LT, b) {
    var n = LT.length;
    var x = new Array(n).fill(0);
    for (var i = n - 1; i >= 0; i--) {
      var s = 0;
      for (var k = i + 1; k < n; k++) s += LT[i][k] * x[k];
      x[i] = (b[i] - s) / LT[i][i];
    }
    return x;
  }

  function transpose(A) {
    var n = A.length, m = A[0].length;
    var T = [];
    for (var i = 0; i < m; i++) {
      T.push([]);
      for (var j = 0; j < n; j++) T[i].push(A[j][i]);
    }
    return T;
  }

  // Seeded LCG so the sample curves stay stable across slider redraws;
  // the resample button just picks a new seed.
  var rngState = sampleSeed;
  function seed(s) { rngState = s; }
  function rand() {
    rngState = (rngState * 1664525 + 1013904223) % 4294967296;
    return rngState / 4294967296;
  }
  function randn() {
    var u = Math.max(rand(), 1e-10);
    var v = rand();
    return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  }

  // GP regression via the standard Cholesky recipe:
  //   α = (Kxx + σn²I)⁻¹ y            (two triangular solves)
  //   mean(x*) = k(x*, X) α
  //   var(x*)  = k(x*, x*) − ‖L⁻¹ k(X, x*)‖²   (noise-free latent f)
  function computeGP() {
    var n = points.length;
    var mean = new Array(N_TEST).fill(0);
    var variance = new Array(N_TEST).fill(sf * sf);
    var cholPost = null;

    if (mode === 'post' && n > 0) {
      var xs = points.map(function(p) { return p.x; });
      var ys = points.map(function(p) { return p.y; });

      var Kxx = buildK(xs);
      for (var i = 0; i < n; i++) Kxx[i][i] += sn * sn;

      var L = cholesky(Kxx);
      var alpha = solveUpper(transpose(L), solveLower(L, ys));

      var Kxs = [];
      for (var i2 = 0; i2 < n; i2++) {
        Kxs.push([]);
        for (var j = 0; j < N_TEST; j++) Kxs[i2].push(rbf(xs[i2], xTest[j]));
      }

      for (var j2 = 0; j2 < N_TEST; j2++) {
        var m = 0;
        for (var i3 = 0; i3 < n; i3++) m += Kxs[i3][j2] * alpha[i3];
        mean[j2] = m;
      }

      for (var j3 = 0; j3 < N_TEST; j3++) {
        var kxs = Kxs.map(function(row) { return row[j3]; });
        var v2 = solveLower(L, kxs);
        var vv = 0;
        for (var i4 = 0; i4 < n; i4++) vv += v2[i4] * v2[i4];
        variance[j3] = Math.max(sf * sf - vv, 1e-8);
      }

      var Kss = buildK(xTest);
      var KsxT = [];
      for (var i5 = 0; i5 < N_TEST; i5++) {
        KsxT.push([]);
        for (var k = 0; k < n; k++) KsxT[i5].push(rbf(xTest[i5], xs[k]));
      }
      // Posterior covariance K** − Vᵀ V with V = L⁻¹ Kx*: precompute each
      // triangular solve once per test point (O(N²·n)) rather than once
      // per pair of test points (O(N²·n²)).
      var V = [];
      for (var i6 = 0; i6 < N_TEST; i6++) V.push(solveLower(L, KsxT[i6]));
      var Kpost = [];
      for (var i6b = 0; i6b < N_TEST; i6b++) {
        Kpost.push([]);
        for (var j4 = 0; j4 < N_TEST; j4++) {
          var dot = 0;
          for (var k2 = 0; k2 < n; k2++) dot += V[i6b][k2] * V[j4][k2];
          Kpost[i6b].push(Kss[i6b][j4] - dot + (i6b === j4 ? 1e-6 : 0));
        }
      }
      cholPost = cholesky(Kpost);
    } else {
      var Kss2 = buildK(xTest);
      for (var i7 = 0; i7 < N_TEST; i7++) Kss2[i7][i7] += 1e-6;
      cholPost = cholesky(Kss2);
    }

    return { mean: mean, variance: variance, cholPost: cholPost };
  }

  // One draw from N(mean, Σ) with Σ = L Lᵀ: f = mean + L z, z ~ N(0, I).
  function drawSample(chol, mean) {
    var n = chol.length;
    var z = [];
    for (var i = 0; i < n; i++) z.push(randn());
    var sample = new Array(n).fill(0);
    for (var i2 = 0; i2 < n; i2++) {
      var s = 0;
      for (var k = 0; k <= i2; k++) s += chol[i2][k] * z[k];
      sample[i2] = mean[i2] + s;
    }
    return sample;
  }

  function bandFillColor() {
    // Pull the accent from the document and use it for the credible band
    var acc = getCSS('--accent') || '#b94a1b';
    // Derive an rgba from the hex so it works in both light/dark
    if (acc[0] === '#' && (acc.length === 7 || acc.length === 4)) {
      var r, g, b;
      if (acc.length === 7) {
        r = parseInt(acc.slice(1, 3), 16);
        g = parseInt(acc.slice(3, 5), 16);
        b = parseInt(acc.slice(5, 7), 16);
      } else {
        r = parseInt(acc[1] + acc[1], 16);
        g = parseInt(acc[2] + acc[2], 16);
        b = parseInt(acc[3] + acc[3], 16);
      }
      return 'rgba(' + r + ',' + g + ',' + b + ',0.18)';
    }
    return 'rgba(185,74,27,0.18)';
  }

  function draw() {
    var textCol  = getCSS('--ink')      || '#1a1814';
    var textMute = getCSS('--ink-mute') || '#8a8277';
    var rule     = getCSS('--rule')     || '#c8bfae';
    var paper    = getCSS('--paper')    || '#f4efe6';

    var w = W(), h = H();
    ctx.clearRect(0, 0, w, h);
    ctx.fillStyle = paper;
    ctx.fillRect(0, 0, w, h);

    // Grid
    ctx.strokeStyle = rule;
    ctx.lineWidth = 0.5;
    ctx.setLineDash([3, 3]);
    for (var gx = Math.ceil(X_MIN); gx <= X_MAX; gx++) {
      ctx.beginPath();
      ctx.moveTo(xToPx(gx), PAD_T);
      ctx.lineTo(xToPx(gx), PAD_T + plotH());
      ctx.stroke();
    }
    for (var gy = Math.ceil(Y_MIN); gy <= Y_MAX; gy++) {
      ctx.beginPath();
      ctx.moveTo(PAD_L, yToPx(gy));
      ctx.lineTo(PAD_L + plotW(), yToPx(gy));
      ctx.stroke();
    }
    ctx.setLineDash([]);

    // Axes
    ctx.strokeStyle = textMute;
    ctx.lineWidth = 0.5;
    ctx.beginPath();
    ctx.moveTo(PAD_L, PAD_T);
    ctx.lineTo(PAD_L, PAD_T + plotH());
    ctx.lineTo(PAD_L + plotW(), PAD_T + plotH());
    ctx.stroke();

    // Labels
    ctx.fillStyle = textMute;
    ctx.font = '11px ' + getComputedStyle(document.body).fontFamily;
    ctx.textAlign = 'right';
    ctx.textBaseline = 'middle';
    for (var gy2 = Math.ceil(Y_MIN); gy2 <= Y_MAX; gy2++) {
      ctx.fillText(gy2.toFixed(0), PAD_L - 8, yToPx(gy2));
    }
    ctx.textAlign = 'center';
    ctx.textBaseline = 'top';
    for (var gx2 = Math.ceil(X_MIN); gx2 <= X_MAX; gx2++) {
      ctx.fillText(gx2.toFixed(0), xToPx(gx2), PAD_T + plotH() + 8);
    }
    ctx.fillText('x', PAD_L + plotW() / 2, PAD_T + plotH() + 22);
    ctx.save();
    ctx.translate(16, PAD_T + plotH() / 2);
    ctx.rotate(-Math.PI / 2);
    ctx.textAlign = 'center';
    ctx.textBaseline = 'middle';
    ctx.fillText('f(x)', 0, 0);
    ctx.restore();

    var r = computeGP();
    var mean = r.mean, variance = r.variance, cholPost = r.cholPost;

    // Credible band
    ctx.fillStyle = bandFillColor();
    ctx.beginPath();
    for (var i = 0; i < N_TEST; i++) {
      var std = Math.sqrt(variance[i]);
      var upper = mean[i] + 2 * std;
      var px = xToPx(xTest[i]);
      var py = yToPx(Math.min(upper, Y_MAX));
      if (i === 0) ctx.moveTo(px, py); else ctx.lineTo(px, py);
    }
    for (var i2 = N_TEST - 1; i2 >= 0; i2--) {
      var std2 = Math.sqrt(variance[i2]);
      var lower = mean[i2] - 2 * std2;
      ctx.lineTo(xToPx(xTest[i2]), yToPx(Math.max(lower, Y_MIN)));
    }
    ctx.closePath();
    ctx.fill();

    // Sampled functions
    seed(sampleSeed);
    for (var s = 0; s < nSamples; s++) {
      var sample = drawSample(cholPost, mean);
      ctx.strokeStyle = SAMPLE_COLORS[s % SAMPLE_COLORS.length];
      ctx.globalAlpha = 0.72;
      ctx.lineWidth = 1.3;
      ctx.beginPath();
      for (var i3 = 0; i3 < N_TEST; i3++) {
        var px2 = xToPx(xTest[i3]);
        var py2 = yToPx(sample[i3]);
        if (i3 === 0) ctx.moveTo(px2, py2); else ctx.lineTo(px2, py2);
      }
      ctx.stroke();
    }
    ctx.globalAlpha = 1;

    // Posterior mean
    ctx.strokeStyle = textCol;
    ctx.lineWidth = 2;
    ctx.beginPath();
    for (var i4 = 0; i4 < N_TEST; i4++) {
      var px3 = xToPx(xTest[i4]);
      var py3 = yToPx(mean[i4]);
      if (i4 === 0) ctx.moveTo(px3, py3); else ctx.lineTo(px3, py3);
    }
    ctx.stroke();

    // Observations
    if (mode === 'post') {
      for (var pi = 0; pi < points.length; pi++) {
        var p = points[pi];
        ctx.fillStyle = textCol;
        ctx.beginPath();
        ctx.arc(xToPx(p.x), yToPx(p.y), 5, 0, 2 * Math.PI);
        ctx.fill();
        ctx.strokeStyle = paper;
        ctx.lineWidth = 1.5;
        ctx.stroke();
      }
    }

    // Mode label
    ctx.fillStyle = textMute;
    ctx.font = '11px ' + getComputedStyle(document.body).fontFamily;
    ctx.textAlign = 'left';
    ctx.textBaseline = 'top';
    var label = mode === 'prior'
      ? 'Prior GP · no data'
      : 'Posterior GP · ' + points.length + ' observation' + (points.length === 1 ? '' : 's');
    ctx.fillText(label, PAD_L + 8, PAD_T + 6);
  }

  canvas.addEventListener('click', function(e) {
    if (mode !== 'post') return;
    var rect = canvas.getBoundingClientRect();
    var px = e.clientX - rect.left;
    var py = e.clientY - rect.top;
    if (px < PAD_L || px > PAD_L + plotW() || py < PAD_T || py > PAD_T + plotH()) return;
    var x = pxToX(px), y = pxToY(py);

    for (var i = 0; i < points.length; i++) {
      if (Math.abs(points[i].x - x) < 0.25 && Math.abs(points[i].y - y) < 0.25) {
        points.splice(i, 1);
        draw();
        return;
      }
    }
    points.push({x: x, y: y});
    draw();
  });

  function bindSlider(id, out, fmt, setter) {
    var el = document.getElementById(id);
    var o = document.getElementById(out);
    el.addEventListener('input', function() {
      var v = parseFloat(el.value);
      setter(v);
      o.textContent = fmt(v);
      draw();
    });
  }

  bindSlider('ls', 'ls-val', function(v) { return v.toFixed(2); }, function(v) { ls = v; });
  bindSlider('sf', 'sf-val', function(v) { return v.toFixed(2); }, function(v) { sf = v; });
  bindSlider('sn', 'sn-val', function(v) { return v.toFixed(2); }, function(v) { sn = v; });
  bindSlider('ns', 'ns-val', function(v) { return v.toFixed(0); }, function(v) { nSamples = v; });

  var btnPrior = document.getElementById('mode-prior');
  var btnPost  = document.getElementById('mode-post');

  btnPrior.addEventListener('click', function() {
    mode = 'prior';
    btnPrior.classList.add('active');
    btnPost.classList.remove('active');
    document.getElementById('mode-hint').textContent =
      'Prior: no data. Functions are drawn from GP(0, k). Adjust ℓ and σ_f to see how the kernel shapes the prior distribution.';
    draw();
  });

  btnPost.addEventListener('click', function() {
    mode = 'post';
    btnPost.classList.add('active');
    btnPrior.classList.remove('active');
    document.getElementById('mode-hint').textContent =
      'Posterior: click on the plot to add (or remove) observations. The GP conditions on them: the mean threads through the points, and the variance collapses near data while staying high far from it.';
    draw();
  });

  document.getElementById('clear').addEventListener('click', function() {
    points = [];
    draw();
  });

  document.getElementById('resample').addEventListener('click', function() {
    sampleSeed = Math.floor(Math.random() * 1000000);
    draw();
  });

  window.addEventListener('resize', resizeCanvas);
  resizeCanvas();
})();
</script>]]></content><author><name>Majid Mazouchi</name></author><category term="Machine Learning" /><summary type="html"><![CDATA[Click, slide, and watch the posterior update. A working intuition for Gaussian Processes — from the one-sentence definition through the Cholesky math and the honest O(n³) scaling story.]]></summary></entry></feed>