Localization, Mapping & SLAM
1 — Three questions, three problems
A robot moving through space — a vacuum, a forklift, a drone, a rover, an AR headset on your face — almost always needs to answer some combination of three deceptively simple questions:
- Where am I? — given a map, figure out my pose in it.
- What's around me? — given that I know where I am, build a map.
- Both at once? — given neither, figure out both simultaneously.
These map onto three classical problems: Localization, Mapping, and SLAM. They look related, and they are, but the third is hilariously harder than the other two combined. This page works through all three with seven interactive demos so you can see how the algorithms behave, fail, and recover.
All three problems are solved roughly the same way: maintain a probability distribution over what you don't know (pose, map, or both), and update it whenever you (a) move or (b) sense something. This is recursive Bayesian estimation. The rest is engineering: which distribution, which approximation, which sensor.
The methods covered here are all implemented in AtsushiSakai/PythonRobotics, a widely-used open-source collection of Python sample code for robotics algorithms. Where we name a specific algorithm, you can find a runnable Python version in that repo. We borrow its taxonomy of methods throughout this primer.
2 — Localization: "Where am I?"
You hand the robot a map and switch it on somewhere on that map. It doesn't know where. As it drives and senses, it has to figure out which point on the map matches what it's seeing.
The intuition
Imagine being teleported, blindfolded, into an apartment whose floor plan you've memorized. You take off the blindfold. You see a kitchen counter to your left and a window to your right. There might be three rooms in the apartment that match that description — so you're not sure yet. You take a step forward and now you see a hallway. Only one room had a hallway exit in that direction. Now you know where you are.
That's exactly how a localization algorithm works. It starts with many hypotheses ("I could be anywhere"), and as observations arrive, the hypotheses that don't match get killed off until only the right ones survive. The four classical methods differ in how they represent those hypotheses:
- Particles — a swarm of weighted samples (Monte Carlo Localization).
- A Gaussian — a single mean and covariance (Kalman filter family).
- A grid — one probability per cell (histogram filter).
- A scan match — no probabilities, just optimization (ICP / NDT-based).
The first three are below as live demos. The fourth shows up in §4 because it's also the workhorse of SLAM front ends.
Method 1 — Monte Carlo Localization (MCL) particle filter
The most flexible representation: maintain a swarm of weighted samples (particles) of where you could be. On every motion, each particle moves with motion noise. On every sensor observation, weight each particle by the likelihood of the observation given that particle's pose. Resample to keep the swarm focused on the high-likelihood regions.
Three things to notice:
- At start, particles are uniform — the robot has no idea where it is.
- After a few steps, particles cluster at multiple spots — anywhere consistent with the sensor history. This is perceptual aliasing: a single measurement cannot disambiguate similar-looking places.
- After more motion, the wrong clusters die off. Hit Kidnap robot to teleport it elsewhere — most filters fail at this; MCL with random particle injection recovers.
Method 2 — Extended Kalman Filter (EKF) localization single Gaussian
If you trust your initial pose and your sensors aren't too crazy, you can summarize the entire belief with a single Gaussian: a mean (best guess) and a covariance matrix (how confident, in which direction). The EKF runs the same predict / update cycle as MCL, but instead of a swarm it carries the mean and covariance through linearized motion and observation models.
update: K = P⁻ Hᵀ (H P⁻ Hᵀ + R)⁻¹, x̂ = x̂⁻ + K(z − h(x̂⁻)), P = (I − K H) P⁻
The covariance ellipse is the visual signature of a Kalman filter: it grows when the robot moves (uncertainty accumulating) and shrinks when a measurement comes in (information gained).
EKF localization is fast and composes naturally with multi-sensor fusion. It's what you find inside most production GPS+IMU+wheel-odometry stacks.
The Unscented Kalman Filter replaces the EKF's Jacobian linearization with a deterministic set of sigma points propagated through the true nonlinear functions. The UKF uses a deterministic sampling approach to capture the mean and covariance of the state distribution, unlike the EKF which linearizes the nonlinear functions using Jacobians. UKF is usually slightly more accurate when motion or measurement models are strongly nonlinear; EKF is simpler to implement when you can write the Jacobians cleanly. Both are in PythonRobotics.
Method 3 — Histogram filter localization grid belief
The most literal Bayesian filter: discretize the world into cells, store one probability per cell, and update them all at every step. No assumptions about Gaussianity, no resampling — just multiplication and renormalization. It's slow for large worlds but conceptually the cleanest.
The histogram filter handles the "I have no idea where I am" case naturally — start uniform and let the math do its work. Note how multimodality is handled: when only one landmark is in range, the belief has a circular ridge of equally-likely positions; a second observation breaks the symmetry.
Methods at a glance
| Method | Belief representation | Per-step cost | Best for… |
|---|---|---|---|
| EKF / UKF | Single Gaussian | O(n²) – O(n³) | Tracking when the prior is good and the world is roughly linear. |
| Ensemble Kalman Filter | Sample-approximated Gaussian | O(N·n) | Very high-dimensional state where full covariance is intractable. |
| Histogram filter | Probability per grid cell | O(cells) | Small worlds, global localization, multimodal beliefs. |
| Particle filter (MCL) | Weighted samples | O(N·sensors) | Arbitrary distributions, kidnapped robot, long-term recovery. |
| Scan matching (ICP, NDT) | No distribution; pure optimization | O(K·iterations) | You already know roughly where you are and you have a lidar scan. |
Use cases
- Warehouse robots (Amazon Kiva-style fleets) localize against a known floor plan and fiducial markers.
- Self-driving cars localize against pre-built HD maps using lidar + GPS + IMU; centimeter accuracy is the bar.
- Robot vacuums localize against a saved house map so they can resume cleaning where they stopped.
- AR/VR headsets localize against a scene model the device built earlier and re-uses across sessions.
- Indoor delivery and service robots use Adaptive MCL (the AMCL package in ROS is the canonical example) over occupancy grids.
Practical notes
Symmetric environments. Long featureless corridors, square rooms, and warehouse aisles produce multiple equally-good hypotheses. The filter will not converge. Add fiducials, use ceiling features, or bias the initial pose if you can.
Linearization breakdown (EKF/UKF). If your covariance grows large, the linear approximation around the mean stops being valid. Either reduce your update period, switch to UKF, or add a global recovery layer (PF on top of EKF, AMCL-style).
Sample impoverishment (PF). Resampling every step kills off particle diversity. Inject a small fraction of random particles each step, or only resample when effective sample size drops below a threshold.
The kidnapped robot problem. Pure EKF cannot recover from being moved without warning. MCL with random injection or a histogram filter that re-sees the world fresh will. Plan for this if your robot can be picked up or slipped.
3 — Mapping: "What's around me?"
Now the opposite: assume you know exactly where you are at every moment (perfect GPS, motion capture, wheel encoders with no slip). Build a map of what you observe.
The intuition
You're a surveyor with a known position at every step. You shine a laser at the world, it bounces back, and you get a distance. Multiply by direction and you have a 3D point. Do this thousands of times per second from a moving platform and you have a point cloud of the world.
Point clouds aren't enough for planning. The robot wants to know not just "there's a wall here" but also "this region is empty and safe to drive through." That's what occupancy grid maps give you: every cell stores a probability that it is occupied.
What you're watching is the textbook occupancy grid update from Moravec & Elfes (1985). Each cell stores a log-odds value:
When a ray endpoint lands on a cell, that cell's log-odds goes up. When a ray passes through a cell, log-odds goes down. Adding log-odds is equivalent to multiplying probabilities, which is what makes this update so fast.
Methods at a glance
| Map type | What it stores | Best for |
|---|---|---|
| Occupancy grid | P(occupied) per cell | 2D navigation, planning, vacuum robots |
| Gaussian grid map | Gaussian per cell aggregating raw range hits | Smoother surfaces; sensor noise modeling |
| NDT map | One Gaussian per voxel summarizing point distribution | Compact registration target for NDT scan matching |
| Voxel / OctoMap | P(occupied) per 3D voxel (sparse octree) | 3D navigation, drones, manipulators |
| Distance map | Distance to nearest obstacle per cell | Path planning costs, obstacle avoidance fields |
| Point cloud | List of 3D points | Visualization, ICP, surveying |
| Feature / landmark map | List of (id, 3D position) | Sparse SLAM, AR anchors |
| Object-shape map | Fitted primitives: lines, circles, rectangles | Compact representation, semantic reasoning |
| Topological map | Graph of places + edges | High-level planning, biology-inspired navigation |
| Signed Distance Field (TSDF) | Distance to nearest surface per voxel | 3D reconstruction (KinectFusion-style) |
| Neural fields (NeRF, Gaussian Splat) | Implicit / volumetric scene representation | Photorealistic reconstruction, novel views |
Once you have raw lidar points, mapping often involves extracting structure rather than dumping points into cells. PythonRobotics shows several approaches: k-means clustering to group lidar returns into objects, circle fitting for round obstacles like trees and pillars, and rectangle fitting for cars and walls. The output is a much smaller, more semantically useful map than a dense grid.
Use cases
- Indoor floor plans. Robot vacuums and warehouse robots produce occupancy grids you can literally screenshot and use as a blueprint.
- Mining and construction surveys. Mobile mapping rigs walk through tunnels and produce point clouds with sub-centimeter accuracy.
- Agriculture. Tractors with RTK-GPS map fields and crop heights for variable-rate fertilizer application.
- 3D reconstruction for film and games. Drone photogrammetry rebuilds entire neighborhoods.
- Autonomous-vehicle HD maps. Pre-recorded lidar maps with lane geometry, sign positions, and traffic-light locations.
Practical notes
Resolution vs memory. A 5 cm grid over a 100 m × 100 m warehouse is 4 million cells. A 1 cm grid is 100 million. 3D voxel grids blow up cubically. Production systems use sparse structures (octrees, hash tables) instead of dense arrays.
Dynamic objects pollute the map. If a person walks in front of the lidar while you're mapping, you'll record them as a wall. Real systems either filter dynamic returns (semantic segmentation), or use a slow forgetting factor so transient occupancy decays away.
Sensor model matters more than you think. Glass walls, mirrors, and dark fabrics confuse lidars. Sunlight saturates structured-light depth cameras. Map quality is bounded by how well you've modeled your sensor's failure modes.
4 — SLAM: "Both at once."
Now the hard one. The robot has neither a map nor a known pose. It has to figure out both, simultaneously, from raw sensor data.
Why it's so much harder
It's a chicken-and-egg problem. To localize, I need a map. To map, I need to know where I am. The only way out is to estimate them jointly: every pose helps build the map, and every piece of map helps refine past poses.
Modern SLAM splits cleanly into two halves:
- The front end turns raw sensor data into observations: feature detection, scan matching, descriptor matching, RANSAC.
- The back end takes those observations and produces a globally consistent estimate: filter-based or graph-based.
We'll look at one canonical front-end algorithm (ICP), then one classical back end (EKF-SLAM), and finally what loop closure does to a long trajectory.
Front end — Iterative Closest Point (ICP) scan matching
ICP is the workhorse of every lidar-based SLAM front end. Given two point clouds — the previous scan and the current scan — find the rigid transform (rotation + translation) that best aligns them. Repeat once per frame, accumulate, and you have lidar odometry.
The algorithm is alternating minimization:
- For each point in the moving cloud, find the nearest point in the fixed cloud (the correspondence).
- Solve a closed-form least-squares problem for the rotation R and translation t that minimize the sum of squared distances under those correspondences.
- Apply (R, t) to the moving cloud. Repeat until convergence.
The optimal rotation between two point sets has a beautiful closed form. Let H be the cross-covariance matrix of the centered correspondences. Then the SVD gives H = U Σ Vᵀ, and the optimal rotation is R = V Uᵀ. In 2D, this collapses to a single arctan:
That's the formula running in the demo. The translation is then the difference between the two centroids after rotation. Variants like point-to-plane ICP, NDT, and generalized ICP use the same loop with smarter cost functions.
Back end — EKF-SLAM filter-based
Now the joint estimation. Stack the robot pose and every landmark position into one giant state vector, with one giant covariance matrix capturing every correlation between them. The EKF runs on that joint state, treating each landmark observation as a constraint between robot and landmark.
The crucial property is that the covariance is not block-diagonal. Robot and landmarks become correlated as soon as the robot observes them. That correlation is what makes loop closure work: when the robot re-observes an old landmark, the constraint propagates through the whole covariance and drags the entire trajectory back into shape.
EKF-SLAM has a famous flaw: the covariance matrix is dense once landmarks become correlated, so each update is O(n²) in the number of landmarks. Above a few hundred landmarks it becomes prohibitively slow. This is why the field moved on.
FastSLAM 1.0 and 2.0 particle filter
FastSLAM exploits a clever factorization. Conditional on the robot's trajectory, the landmarks become independent of one another. So you can use a particle filter for the trajectory (handling its non-Gaussian uncertainty) and a small EKF per landmark per particle for the landmark positions. FastSLAM is a feature-based SLAM example where the blue line is ground truth, the black line is dead reckoning, the red line is the estimated trajectory, the red points are particles, black points are landmarks, and blue crosses are estimated landmark positions. The 2.0 variant improves the proposal distribution to keep particle diversity over long runs.
Graph-based SLAM optimization
The modern formulation. Forget the covariance matrix. Build a graph where nodes are robot poses (and optionally landmarks) and edges are constraints (odometry, observations, loop closures). Find the configuration of the graph that best satisfies all constraints in the least-squares sense:
This is a sparse nonlinear least-squares problem. Sparsity (each pose only connects to a few neighbors) makes it tractable for tens of thousands of poses. Solvers like g²o, GTSAM, and Ceres are the standard back ends. Almost every modern SLAM system — ORB-SLAM, Cartographer, Kimera, VINS — runs a graph back end.
The conceptual loop-closure picture
Before the methods table, here's the shape of the problem. Drift accumulates linearly with distance. Loop closure is the single observation that yanks the trajectory back into place.
Methods at a glance
| Approach | What it represents | Strengths / weaknesses |
|---|---|---|
| EKF-SLAM | One Gaussian over (robot, all landmarks) | Conceptually clean. O(n²) per update kills it above ~10³ landmarks. |
| FastSLAM 1.0 | Particles over robot trajectory + EKF per landmark | Scales to many landmarks. Particle depletion on long runs is real. |
| FastSLAM 2.0 | Same, with smarter proposal distribution | Robust to high motion/observation noise; standard for textbook PF SLAM. |
| Graph SLAM (g²o, GTSAM, Ceres) | Pose graph + sparse least-squares | What everyone uses now. Scales to 10⁵+ poses; modular. |
| Visual SLAM (ORB-SLAM3, DSO, LSD-SLAM) | Bundle adjustment on visual features or pixels | Just a camera; sensitive to motion blur, low texture, illumination. |
| Lidar SLAM (Cartographer, LOAM, KISS-ICP) | Scan-to-scan + scan-to-map matching, pose graph back end | Robust and accurate; lidars are expensive. |
| Visual-Inertial SLAM (VINS, OKVIS, ORB-SLAM3) | Tightly couples camera + IMU | State of the art for drones, AR/VR; metric scale comes for free. |
Use cases
- AR / VR. ARKit, ARCore, Quest, Vision Pro all run visual-inertial SLAM continuously to anchor virtual content.
- Drones in GPS-denied environments. Indoor inspection, under-bridge surveys, search and rescue.
- Self-driving in unmapped or stale-map areas. Fills the gaps between HD-map updates.
- Planetary rovers. No GPS on Mars. Curiosity and Perseverance navigate with stereo visual odometry and SLAM-style optimization.
- Robot vacuums and lawn mowers. Build the map on the first run, localize against it after.
Practical notes
Drift is inevitable; the only fix is loop closure. Plan for closures: drive in loops, reuse known anchors, or fuse GPS when available.
False loop closures kill SLAM dead. Wrongly matching the current scene to a similar-looking part of the map will pull your trajectory into a horribly wrong shape. Robust back-end methods (switchable constraints, robust kernels, max-mixtures) exist exactly to reject these.
ICP fails when initialization is bad. The nearest-point heuristic gets stuck in local minima. Always seed ICP with the best estimate you have (odometry, IMU integration, prior frame). For very large displacements, use a global registration method first (e.g., FPFH features + RANSAC).
Initialization is fragile. Pure visual SLAM cannot recover scale from a single frame. The IMU helps, but you need a few seconds of motion before the system stabilizes.
Dynamic environments are still hard. Most SLAM systems assume the world is static. Crowds, traffic, and weather break that assumption; semantic segmentation to mask out moving objects is now standard in production stacks.
5 — Loop closure: the trick that saves SLAM
It's worth dwelling on loop closure because it's the single most important idea in modern SLAM and the one most outsiders miss.
Imagine walking around a city blindfolded, counting your steps. After two hours you'll have a guess at where you are, but your error in step count and turn angle has compounded. Now suppose halfway through, you take off the blindfold for one second and recognize a fountain you passed earlier. That single recognition doesn't just tell you where you are now — it tells you that your path between the two visits forms a closed loop, which means you can redistribute the error you've accumulated.
That's loop closure. The optimization step that pushes the error backward through the trajectory is called pose-graph optimization. Toggle it off in Demo 7 and you'll see exactly the error compounding I described.
Think of the trajectory as a chain of rubber-band edges between poses. Odometry gives a soft constraint between consecutive poses ("step ~1 m, ~5° turn"). Loop closure adds a single rigid edge between two poses far apart in time. The optimizer relaxes the whole chain to minimize tension. The longer the loop, the more dramatically the chain reshapes.
How systems actually detect loop closures
- Bag-of-Words (DBoW2 / DBoW3) — the workhorse for visual SLAM. Each image is a histogram of visual-word occurrences from a pre-trained vocabulary. New frame is queried against all past keyframes; high similarity flags a candidate closure, which is then geometrically verified.
- Scan-Context — for lidar SLAM. Convert each scan into a polar-coordinate fingerprint robust to viewpoint changes.
- Learned descriptors — NetVLAD, MixVPR, AnyLoc. Increasingly common in production, especially for places with weather or lighting changes.
- GPS / WiFi / known anchors — when available, basically free loop closures.
6 — Side by side
| Localization | Mapping | SLAM | |
|---|---|---|---|
| What's known | Map | Pose | Neither |
| What's estimated | Pose | Map | Both, jointly |
| Difficulty | Easy–medium | Easy if pose is good | Hard |
| Failure mode | Wrong initial pose, perceptual aliasing | Pose drift corrupts the map | Drift + bad data association |
| Canonical algorithm | MCL / EKF / Histogram filter | Occupancy grid | Pose-graph optimization with ICP front end |
| Key trick | Multimodal beliefs | Log-odds update | Loop closure |
| Where you've used it | Vacuum returning to dock | First mapping run of a vacuum | Phone AR session |
7 — What a real SLAM stack looks like
If you peel open a modern visual-inertial SLAM system (say ORB-SLAM3 or a typical AR runtime), you'll find a pipeline that looks roughly like this:
↓
Front end — feature detection (ORB, FAST), descriptor matching, IMU pre-integration
↓ (matched 2D–3D correspondences)
Tracking thread — pose at frame rate via PnP / motion-only BA
↓ (keyframes, every ~10 frames)
Local mapping thread — add new map points, local bundle adjustment
↓
Loop & map merging thread — DBoW place recognition, pose-graph optimization, full BA
↓
[ globally consistent map + 6-DoF pose at 30+ Hz ]
The pattern — fast front-end loop, slow back-end optimization, occasional global correction — is universal. Replace the camera with a lidar (Cartographer, LIO-SAM) and the same architecture applies. Replace it with wheel encoders and beacon tags and you're back at warehouse-robot localization.
8 — References & further reading
Code & tutorials (start here)
- Sakai, A. et al. PythonRobotics — Python sample codes for almost every algorithm in this primer. Covers EKF / Ensemble / Unscented Kalman / Histogram / Particle filter localization; Gaussian-grid, NDT, ray-casting, lidar-to-grid, k-means clustering, circle/rectangle fitting, and Distance Map for mapping; ICP, EKF-SLAM, FastSLAM 1.0/2.0, and Graph-based SLAM. The most accessible reference codebase in the field.
- Sakai, A., Ingram, D., et al. "PythonRobotics: a Python code collection of robotics algorithms." arXiv:1808.10703 (2018). The companion paper.
Books
- Thrun, S., Burgard, W., & Fox, D. Probabilistic Robotics. MIT Press, 2005. — The canonical textbook.
- Barfoot, T. State Estimation for Robotics. Cambridge University Press, 2017. — A more modern, math-forward treatment with strong factor-graph emphasis.
- Siegwart, R., Nourbakhsh, I., & Scaramuzza, D. Introduction to Autonomous Mobile Robots. 2nd ed., MIT Press, 2011.
Foundational papers
- Smith, R., Self, M., & Cheeseman, P. "On the Representation and Estimation of Spatial Uncertainty." IJRR, 1986. — The paper that started it all.
- Moravec, H. & Elfes, A. "High Resolution Maps from Wide Angle Sonar." ICRA, 1985. — The original occupancy grid.
- Besl, P. & McKay, N. "A Method for Registration of 3-D Shapes." IEEE TPAMI, 1992. — The original ICP paper.
- Montemerlo, M., et al. "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem." AAAI, 2002.
- Dellaert, F. & Kaess, M. "Factor Graphs for Robot Perception." Foundations and Trends in Robotics, 2017.
- Cadena, C., et al. "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age." IEEE T-RO, 2016.
- Durrant-Whyte, H. & Bailey, T. "Simultaneous Localization and Mapping: Part I & II." IEEE RAM, 2006.
- Julier, S. & Uhlmann, J. "A New Extension of the Kalman Filter to Nonlinear Systems." SPIE, 1997. — The UKF paper.
Influential systems
- Davison, A. J. "MonoSLAM: Real-Time Single Camera SLAM." IEEE TPAMI, 2007.
- Mur-Artal, R. & Tardós, J. D. "ORB-SLAM2 / ORB-SLAM3." IEEE T-RO, 2017 / 2021.
- Hess, W., et al. "Real-Time Loop Closure in 2D LIDAR SLAM." ICRA, 2016. — Cartographer.
- Zhang, J. & Singh, S. "LOAM: Lidar Odometry and Mapping in Real-time." RSS, 2014.
- Qin, T., Li, P., & Shen, S. "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator." IEEE T-RO, 2018.
- Vizzo, I., et al. "KISS-ICP: In Defense of Point-to-Point ICP." IEEE RA-L, 2023.
Online courses and lectures
- Cyrill Stachniss — SLAM and Mobile Robotics lecture series on YouTube.
- Sebastian Thrun — Udacity "Artificial Intelligence for Robotics" (CS373).
- Daniel Cremers — TUM "Computer Vision II: Multiple View Geometry" lectures.
- Davide Scaramuzza — UZH Robotics & Perception lectures.
Open-source codebases worth reading
- GTSAM (Georgia Tech) — factor-graph back end used by many systems.
- g²o — the original general-purpose graph optimizer.
- Ceres Solver (Google) — generic nonlinear least squares; the optimizer behind a lot of SLAM and bundle adjustment.
- ORB-SLAM3 — clean, well-documented C++ reference for visual-inertial SLAM.
- Cartographer — production-grade 2D/3D lidar SLAM.
- OctoMap — the standard 3D occupancy mapping library.
- Open3D — modern point-cloud processing including ICP variants.