Some functions are cheap to evaluate. Those are the easy ones. The interesting problems live on the other side — training a neural network, running a wind-tunnel experiment, tuning a motor controller on a dyno — where a single evaluation takes hours, costs money, and gives you one noisy number back. Classical optimization assumes you can evaluate $f(x)$ millions of times, or that you have gradients. Neither assumption holds here. You need a method that is sample-efficient — one that extracts as much information as possible from every measurement and uses that information to decide where to sample next. That’s what Bayesian optimization does, and the demo below shows it running on a benchmark designed to trip it up.

Two ingredients: a belief and a strategy

Every Bayesian-optimization algorithm is built from two pieces, each independently swappable: a surrogate model that represents our current belief about $f$, and an acquisition function that scores each candidate point by how useful evaluating it would be.

01 · Surrogate. A cheap statistical model fit to whatever observations we have so far. Because data is scarce, the model must express uncertainty — a prediction alone isn’t enough; we also need to know where the model is confident and where it’s guessing. By far the most common choice is a Gaussian Process, which gives a full posterior distribution over functions rather than a point estimate.

02 · Acquisition. A rule for choosing the next sample. The surrogate tells us, at each point, what we expect to see and how uncertain we are. The acquisition function combines these into a single score we can maximize cheaply. Good acquisition functions balance exploitation (sampling where the surrogate predicts good values) and exploration (sampling where the surrogate is uncertain).

Bayesian optimization treats the choice of where to sample next as itself an optimization problem — one we can solve cheaply, since the acquisition function lives on the surrogate, not on the real objective.

Interactive demo — see it run, step by step

Below is a working Bayesian-optimization loop. The objective is the Forrester function $f(x) = (6x-2)^2 \sin(12x - 4)$, a standard BO benchmark with a deceptive local minimum near $x = 0.15$ and a global minimum near $x = 0.76$. Pretend you don’t know its shape.
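For concreteness, here is the benchmark in a few lines of Python (numpy assumed; the function name is illustrative), with a brute-force grid check of the optimum the demo's stats panel reports:

```python
import numpy as np

def forrester(x):
    """Forrester benchmark: f(x) = (6x - 2)^2 * sin(12x - 4) on [0, 1]."""
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

# Brute-force both minima on a dense grid to see what the optimizer is up against.
xs = np.linspace(0.0, 1.0, 100_001)
ys = forrester(xs)
# The global minimum sits near x = 0.7572 with f = -6.021 (matching the demo).
print(f"global minimum: f = {ys.min():.4f} at x = {xs[np.argmin(ys)]:.4f}")
```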

The shaded band is the GP posterior (solid line = mean, band = ±2σ). Gold dots are observations. The orange marker shows where the acquisition function is maximized — that’s the next candidate. Click Next iteration to evaluate at the suggested point and update the model. Or click anywhere on the upper plot to sample there manually.

[Interactive demo. Upper panel: "Objective & GP posterior" (GP mean, ±2σ band, observed points, true f(x), next candidate). Lower panel: "Acquisition function — Expected Improvement" (α(x) and its argmax). Controls: acquisition choice, GP lengthscale ℓ, UCB exploration κ, EI/PI margin ξ, and a toggle to show the true f(x). Readouts: iteration, observations, best f* and x*, next candidate, and the global optimum (−6.021 at x = 0.7572).]

Gaussian processes, briefly

A Gaussian Process is a distribution over functions — any finite collection of function values is jointly Gaussian, fully specified by a mean function $m(x)$ (usually zero) and a covariance kernel $k(x, x')$.

The kernel encodes our prior about smoothness. The squared-exponential (RBF) kernel, used in the demo above, says that nearby inputs have highly correlated outputs and that correlation decays with distance on a scale set by the lengthscale $\ell$:

\[k(x, x') = \sigma_f^2 \,\exp\!\left(-\frac{\lVert x - x'\rVert^2}{2\ell^2}\right)\]
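A minimal numpy sketch of this kernel (the names `rbf_kernel`, `lengthscale`, and `signal_var` are illustrative, not taken from the demo's source):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.1, signal_var=1.0):
    """Squared-exponential kernel: sigma_f^2 * exp(-||x - x'||^2 / (2 * ell^2)).

    X1: (n,) array, X2: (m,) array of 1-D inputs -> (n, m) covariance matrix.
    """
    sq_dist = (X1[:, None] - X2[None, :]) ** 2
    return signal_var * np.exp(-sq_dist / (2.0 * lengthscale ** 2))

K = rbf_kernel(np.array([0.0, 0.05, 1.0]), np.array([0.0, 0.05, 1.0]))
# K[0,0] = 1 (a point is perfectly correlated with itself);
# K[0,1] is large (half a lengthscale apart); K[0,2] is ~0 (ten lengthscales apart).
```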

Given observations $\mathbf{y} = [y_1, \ldots, y_n]^\top$ at inputs $X = \{x_1, \ldots, x_n\}$ with noise variance $\sigma_n^2$, the posterior at any test point $x_*$ is Gaussian with mean and variance:

\[\mu(x_*) = \mathbf{k}_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}\] \[\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_*\]

Here $K_{ij} = k(x_i, x_j)$ and $(\mathbf{k}_*)_i = k(x_i, x_*)$. Two properties make this an ideal BO surrogate: the posterior mean interpolates the noise-free observations, and the posterior variance collapses to zero at observed points and grows smoothly away from them. That’s exactly the signal an acquisition function needs. In practice you solve the linear system via Cholesky decomposition in $\mathcal{O}(n^3)$ — perfectly fine for BO, where $n$ rarely exceeds a few hundred.
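The two posterior equations translate almost line for line into numpy. A sketch assuming a unit-variance RBF kernel and 1-D inputs (function names are illustrative):

```python
import numpy as np

def rbf(X1, X2, ell=0.1):
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * ell ** 2))

def gp_posterior(X, y, X_star, ell=0.1, noise_var=1e-8):
    """Posterior mean and variance at test points X_star, via one Cholesky solve."""
    K = rbf(X, X, ell) + noise_var * np.eye(len(X))    # K + sigma_n^2 I
    K_s = rbf(X, X_star, ell)                          # (k_*)_i = k(x_i, x_*)
    L = np.linalg.cholesky(K)                          # O(n^3), fine for small n
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s.T @ alpha                                 # k_*^T (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v ** 2, axis=0)                 # k(x_*, x_*) = 1 for this kernel
    return mu, np.maximum(var, 0.0)                    # clamp tiny negative round-off

X = np.array([0.2, 0.5, 0.8]); y = np.array([1.0, -1.0, 0.5])
mu, var = gp_posterior(X, y, np.array([0.5, 0.99]))
# mu[0] interpolates y = -1 and var[0] collapses to ~0 at the observed point;
# var[1] is large far from the data, exactly the behavior described above.
```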

Three ways to score a candidate

Each acquisition function takes the GP posterior $\mathcal{N}(\mu(x), \sigma^2(x))$ and collapses it into a single scalar. What they differ on is how they weigh the two knobs — expected value and uncertainty.

Expected Improvement (EI)

For minimization with current best $f^*$, define the improvement at $x$ as $I(x) = \max(0, f^* - f(x) - \xi)$, where $\xi \geq 0$ encourages more exploration. EI is its expected value under the GP posterior:

\[\alpha_{\text{EI}}(x) = (f^* - \mu(x) - \xi)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f^* - \mu(x) - \xi}{\sigma(x)}\]

EI is self-calibrating: when $\sigma \to 0$ it reduces to pure exploitation; where $\sigma$ is large and $\mu$ is not too terrible, it favors exploration. This is why it’s the default choice in most BO packages.
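A direct transcription of the EI formula, assuming numpy and scipy (the zero-variance guard is a practical necessity at observed points, where $\sigma = 0$ would divide by zero):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for minimization: E[max(0, f_best - f(x) - xi)] under N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)          # guard: sigma -> 0 at observed points
    z = (f_best - mu - xi) / sigma
    return (f_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Same posterior mean, different uncertainty: the uncertain point scores higher,
# which is the exploration behavior described above.
ei = expected_improvement(np.array([0.0, 0.0]), np.array([0.1, 1.0]), f_best=0.0, xi=0.0)
print(ei)
```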

Lower Confidence Bound (LCB)

The simplest possible acquisition. For minimization, the familiar upper-confidence-bound idea (GP-UCB) flips into a lower bound: pick the point with the lowest optimistic estimate of $f$, a linear combination of the posterior mean and a scaled standard deviation:

\[\alpha_{\text{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)\]

The exploration weight $\kappa$ is the tuning knob. $\kappa = 0$ is pure greedy exploitation; large $\kappa$ becomes uncertainty sampling. Srinivas et al. give a theoretical schedule in which $\kappa$ grows slowly with the iteration count, yielding no-regret guarantees.
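A two-point sketch of how $\kappa$ moves the choice between a known-good point and an uncertain one (the numbers are made up for illustration; note that for minimization the LCB score is minimized, not maximized):

```python
import numpy as np

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """LCB score for minimization: smaller is better, so minimize mu - kappa * sigma."""
    return mu - kappa * sigma

# Point 0: good mean, low uncertainty. Point 1: mediocre mean, high uncertainty.
mu = np.array([-1.0, 0.0]); sigma = np.array([0.1, 1.0])
print(np.argmin(lower_confidence_bound(mu, sigma, kappa=0.0)))  # greedy: picks point 0
print(np.argmin(lower_confidence_bound(mu, sigma, kappa=5.0)))  # explorative: picks point 1
```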

Probability of Improvement (PI)

Just the probability that a sample at $x$ beats the current best by at least $\xi$:

\[\alpha_{\text{PI}}(x) = \Phi\!\left(\frac{f^* - \mu(x) - \xi}{\sigma(x)}\right)\]

Historically the first acquisition function, but known to under-explore: it happily picks points offering a microscopic improvement, as long as that improvement is near-certain. The margin $\xi$ partially compensates. In practice EI is almost always preferred.
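A sketch of PI, with a two-point example of that failure mode (the numbers are illustrative): a near-certain hair's-width improvement outscores a coin-flip shot at a large one.

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI: probability a draw from N(mu, sigma^2) beats f_best by at least xi."""
    sigma = np.maximum(sigma, 1e-12)
    return norm.cdf((f_best - mu - xi) / sigma)

# Point 0: tiny but almost-sure improvement. Point 1: 50/50 shot at a big one.
pi = probability_of_improvement(np.array([-0.02, 0.0]), np.array([0.01, 2.0]),
                                f_best=0.0, xi=0.0)
print(pi)   # [0.977..., 0.5] -- PI prefers point 0; EI would strongly prefer point 1
```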

The whole algorithm, in twelve lines

Everything above combines into a single loop. The outer loop queries the expensive objective; the inner optimization of the acquisition function is cheap, because it operates on the surrogate, not on $f$ itself.

# Given: objective f, domain 𝒳, budget T, acquisition α

initialize D ← { (xᵢ, f(xᵢ)) } for i = 1..n₀     # e.g. Latin hypercube or random

for t = 1, 2, ..., T:
    # 1. fit / update surrogate
    GP ← fit_gp(D)

    # 2. maximize acquisition — cheap, uses only the surrogate
    x_next ← argmax over x ∈ 𝒳  of  α(x | GP, f*_D)

    # 3. query expensive objective at the chosen point
    y_next ← f(x_next)

    # 4. augment dataset
    D ← D ∪ { (x_next, y_next) }

return argmin over (x, y) ∈ D of y

Step 2 is the only subtle one. The acquisition function is cheap to evaluate but can be multi-modal, so practitioners use multi-start L-BFGS, DIRECT, or dense grid search on the surrogate. None of this touches the real objective.
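Putting the pieces together, here is a self-contained sketch of the loop on the Forrester benchmark, using a dense grid in place of multi-start L-BFGS and standardizing $y$ to match the zero-mean GP prior. All names and settings are illustrative choices, not the demo's source:

```python
import numpy as np
from scipy.stats import norm

def forrester(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def rbf(a, b, ell=0.1):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

def gp_posterior(X, y, X_star, ell=0.1, noise_var=1e-6):
    K = rbf(X, X, ell) + noise_var * np.eye(len(X))
    K_s = rbf(X, X_star, ell)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    return mu, np.sqrt(np.maximum(1.0 - np.sum(v ** 2, axis=0), 1e-12))

def expected_improvement(mu, sigma, f_best, xi=0.01):
    z = (f_best - mu - xi) / sigma
    return (f_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 2001)      # dense grid stands in for multi-start L-BFGS
X = rng.uniform(0.0, 1.0, size=3)       # n0 = 3 random initial designs
y = forrester(X)

for t in range(25):                     # T = 25 expensive evaluations
    y_std = (y - y.mean()) / (y.std() + 1e-12)   # standardize for the zero-mean prior
    mu, sigma = gp_posterior(X, y_std, grid)     # 1. fit surrogate
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y_std.min()))]  # 2. argmax
    X = np.append(X, x_next)            # 3-4. the only line that touches the real f
    y = np.append(y, forrester(x_next))

# With a modest budget, EI typically locates the global basin near x = 0.757.
print(f"best f = {y.min():.4f} at x = {X[np.argmin(y)]:.4f}")
```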

Where it lives

The common thread across BO applications is a function you can't see inside, one that returns a noisy scalar and costs real time or real money to query.

  • Hyperparameter tuning. Training a deep network costs hours. BO finds strong configurations in 20–50 trials instead of thousands of random ones.
  • Experimental design. Materials discovery, chemistry, biology — any setting where each data point is a physical experiment.
  • Controller calibration. Tuning PID, MPC, or motor-control parameters against a high-fidelity simulator or dyno where each run is slow.
  • Engineering design. Airfoil shapes, antenna geometries, chip layouts — design variables evaluated by expensive CFD or EM solvers.
  • Robotics and policy search. Tuning gait parameters or policy coefficients on hardware, where every rollout risks wear or damage.
  • A/B testing at scale. Treating each experimental configuration as an expensive sample when user traffic or exposure is the bottleneck.

Bayesian optimization is the default tool whenever that query cost dominates.


The interactive demo above is implemented with vanilla JavaScript and the Canvas 2D API — the GP fit (RBF kernel, Cholesky solve), the three acquisition functions, and all plotting run in the browser with no external libraries. If you want to see how it’s stitched together, view source on the page.