A Primer on CUDA

Being an Introduction to Parallel Computing on the Graphics Processor

Threads · Blocks · Memory · Warps · The Toolkit

By Majid Mazouchi

This monograph introduces CUDA — NVIDIA's platform for general-purpose computation on the graphics processor — to the engineer already comfortable with C and the conceptual idea of parallelism, but new to writing code that runs on a GPU. We progress from the architectural distinction between CPU and GPU, through the thread hierarchy and memory model, to practical concerns of warp execution, memory coalescing, and the toolkit. Twelve interactive demonstrations are provided. The treatment is deliberately operational rather than encyclopaedic: where a choice arose between completeness and intuition, intuition was preferred.

§ I.Why a GPU at All
§ II.The Mental Model
§ III.The Thread Hierarchy
§ IV.A First Kernel
§ V.The Memory Hierarchy
§ VI.Warps & SIMT Execution
§ VII.Memory Coalescing
§ VIII.Shared Memory & Tiling
§ IX.Toolkit & Ecosystem
§ X.From Datacenter to Dashboard
§ XI.Robotics on the GPU
§ XII.The Inference Pipeline
§ XIII.Real-Time & the Automotive Split
§ XIV.Practice & Further Reading

§ I. Why a GPU at All

Before we write a line of CUDA, it pays to be honest about the question CUDA answers: why bother? A modern CPU is an extraordinary device — branch predictors, out-of-order execution, deep caches, megabytes of silicon devoted to making a single thread of instructions run as fast as physics allows. For most code most of the time, that is exactly what you want. CUDA is for when it is not.

The argument is one of arithmetic. A workstation CPU offers, generously, a few dozen cores. A contemporary GPU offers thousands of arithmetic units, each individually slower and far less clever, but available simultaneously. If your problem decomposes into many independent pieces of similar work — pixels in an image, elements of a matrix, samples in a batch, particles in a simulation — the GPU is asking the right question.

An architectural sketch

The CPU is built around a small number of latency-optimized cores. Each core has substantial private cache, complex control logic, and is designed to make any one instruction stream finish as quickly as possible. The GPU inverts this priority. It is built from many small throughput-optimized cores, organized into clusters NVIDIA calls Streaming Multiprocessors (SMs). Each SM holds dozens of arithmetic units that share fetch and decode logic, share an L1 cache and a small high-speed scratchpad, and execute instructions in lockstep over groups of threads.

The trade is explicit. CPU silicon is mostly cache and control; GPU silicon is mostly arithmetic. The CPU asks: how can I run this single thread fast? The GPU asks: how can I keep ten thousand threads busy at once?

workload 256 idle

Fig. 1. Eight latency-optimized CPU cores versus a swarm of GPU lanes processing the same total work. Press run to dispatch; each tile darkens as it completes.

Aside. The watt-per-flop story matters too: a GPU achieves higher arithmetic throughput per unit of power, which is why training a modern neural network on CPUs is not merely slow but uneconomic. The same physics that makes GPUs fast at graphics also makes them fast at any embarrassingly parallel numerical work.

What problems suit the GPU

The GPU is happiest when three conditions hold: (i) the work decomposes into many independent pieces, (ii) those pieces perform similar arithmetic — ideally the same operations on different data, what Flynn called SIMD and what NVIDIA calls SIMT, and (iii) the ratio of computation to memory traffic is high enough that the device is not starved waiting on data. Dense linear algebra, image processing, neural network forward and backward passes, finite-difference solvers, ray tracing, Monte Carlo simulation, and most of computer graphics fit comfortably. Problems with deep sequential dependencies or pervasive branching fit poorly.

§ II. The Mental Model

CUDA programs run on two devices at once. Your host is the CPU; the device is the GPU. The two have separate memory spaces, separate execution streams, and communicate over a bus — historically PCIe, increasingly NVLink on server-class systems.

The host–device dance

A typical CUDA program follows a recognizable rhythm. The host allocates memory on the device. The host copies input data over. The host launches a kernel — a function that will run on the GPU, executed by many threads in parallel. The host waits for, or asynchronously checks on, completion. The host copies results back. The device-side memory is freed.

// classic pattern
cudaMalloc(&d_x, N * sizeof(float));         // 1. allocate on device
cudaMemcpy(d_x, h_x, N * sizeof(float),
           cudaMemcpyHostToDevice);                // 2. ship inputs
my_kernel<<<blocks, threads>>>(d_x, d_y, N);     // 3. launch (asynchronous)
cudaMemcpy(h_y, d_y, N * sizeof(float),
           cudaMemcpyDeviceToHost);                // 4. blocks until done
cudaFree(d_x);                                  // 5. release

The triple-angle bracket syntax <<<blocks, threads>>> is a CUDA extension to C++. It is the device-launch operator. The compiler — NVIDIA's nvcc, a wrapper around the host compiler that splits source into host and device portions — turns it into the appropriate runtime call.

Kernels, in one sentence

A kernel is a function annotated __global__ that runs on the device, is invoked from the host, returns void, and is executed not once but by every thread in the launched grid. Each thread sees the same code; what differs is its identity — its position in the thread hierarchy, which it uses to compute which slice of the data it owns.

__global__ void add_vectors(const float* a, const float* b,
                            float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // my global index
    if (i < n) c[i] = a[i] + b[i];                   // guard the tail
}

Note. Three function-attribute qualifiers matter: __global__ is a kernel (callable from host, runs on device); __device__ is a device-only helper (callable from other device code); __host__ is plain CPU code, the default. A function may be both __host__ __device__, in which case nvcc emits two versions.

Memory spaces

Host memory and device memory are physically distinct. Pointers are not interchangeable: a pointer obtained from malloc cannot be dereferenced inside a kernel, and a pointer from cudaMalloc is meaningless on the host. CUDA does offer unified memory via cudaMallocManaged, where the runtime migrates pages between host and device on demand — convenient for prototyping, sometimes a performance trap. We will return to memory in §V.

§ III. The Thread Hierarchy

CUDA threads are not flat. They are organized into a three-tiered hierarchy — thread, block, grid — and a great deal of CUDA's character flows from this structure.

The three tiers

The smallest unit is a thread. Threads are grouped into blocks. Blocks are grouped into a grid. A kernel launch creates one grid; the grid contains some number of blocks; each block contains some number of threads. Both blocks and grids may be one-, two-, or three-dimensional, an indulgence that mostly serves problems with natural 2D or 3D structure (images, volumes).

Tier	What it is	What it shares	Identifier
Thread	One execution context	Registers (private)	`threadIdx`
Block	A group of co-resident threads	Shared memory, can synchronize	`blockIdx`, `blockDim`
Grid	All blocks in one launch	Global memory only	`gridDim`

Two facts about blocks deserve emphasis. First, threads within a block can cooperate: they can synchronize at a barrier (__syncthreads()) and exchange data through fast on-chip shared memory. Second, threads between blocks cannot — at least not cheaply, not without atomics or a kernel boundary. The block is therefore the natural unit of cooperation; the grid is the natural unit of independent work.

grid X 3 grid Y 3 block X 4 block Y 4 click any thread

Fig. 2. The thread hierarchy made tactile. The grid (outer) holds gridDim.x × gridDim.y blocks; each block holds blockDim.x × blockDim.y threads. Click any thread to see its identifiers and computed global index.

Why two tiers?

One could ask: why not a single flat array of threads? The answer is hardware. Each block is scheduled to a single SM and resides there until completion; threads in a block can therefore share fast on-chip resources. The grid, by contrast, can contain far more blocks than the device has SMs — the runtime schedules blocks onto SMs as they become free, much as an operating system schedules processes. This two-level design is what lets the same source code run on a small mobile GPU and a server GPU without modification: only the number of blocks running concurrently changes.

Convention. A block size of 128, 256, or 512 threads is typical and a good starting point. The hardware constraint is 1024 threads per block on contemporary architectures. Multiples of 32 are mandatory in spirit, because 32 is the warp size — see §VI.

§ IV. A First Kernel

We now write CUDA's "hello world": adding two vectors. The example is small but every essential idea — index computation, bounds guarding, host-device dance — is present.

// vector_add.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;                       // one million elements
    const size_t bytes = N * sizeof(float);

    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2*i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;     // ceiling division
    add<<<blocks, threads>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[42] = %f\n", h_c[42]);          // expected: 126.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Three details deserve unpacking. The expression blockIdx.x * blockDim.x + threadIdx.x is the universal idiom for converting a thread's three-part identity into a flat global index — every CUDA programmer writes it ten thousand times. The bounds check if (i < n) is necessary because the launch configuration almost always over-provisions threads, since blocks × threads is rounded up to cover N. And the kernel call is asynchronous: the next host instruction executes immediately. We get away with omitting an explicit synchronization here only because cudaMemcpy implicitly waits.

N 32 block size 8

Fig. 3. Vector addition unrolled. Each block dispatches its threads to compute one slice of c = a + b. Step through one block at a time, or play the whole grid. Note the tail block, which over-provisions and relies on the bounds check.

Compiling and running

$ nvcc -O3 -arch=sm_80 vector_add.cu -o vector_add
$ ./vector_add
c[42] = 126.000000

The -arch=sm_80 flag targets compute capability 8.0 (Ampere; A100, RTX 30-series). Use sm_75 for Turing (RTX 20-series), sm_86 for consumer Ampere, sm_89 for Ada (RTX 40-series), sm_90 for Hopper (H100), sm_120 for Blackwell (RTX 50-series, B100/B200). Without an arch flag nvcc picks a default that may not match your hardware.

Caveat. CUDA error checking has a deserved reputation for being verbose and easy to forget. The runtime returns error codes from every cuda* call; kernel launches do not return errors directly — you check with cudaGetLastError() after launch and cudaDeviceSynchronize() for asynchronous errors. In any code you intend to keep, wrap calls in a checking macro from day one.

§ V. The Memory Hierarchy

If a single sentence captures GPU performance engineering, it is this: compute is cheap, memory is expensive. Reading from the slowest memory takes hundreds of cycles in which the SM could have been doing arithmetic. Knowing the hierarchy is therefore not optional.

click a tier

Fig. 4. The on-device memory hierarchy. Higher tiers are faster and smaller; lower tiers are slower and larger. Click any tier for its rough latency, scope, and lifetime.

The tiers, from fast to slow

Registers

Per-thread, on-chip, single-cycle latency. A thread's local variables live here when the compiler can place them. Registers are not declared explicitly; you write ordinary C, the compiler allocates. The SM has a fixed register file split among resident threads — typical limits are 65,536 32-bit registers per SM, with up to 255 per thread. A kernel that uses too many registers per thread reduces the number of threads that can co-reside, hurting occupancy (§IX).

Shared memory

Per-block, on-chip, roughly an order of magnitude slower than registers, two orders faster than global memory. Declared with __shared__. The natural place to stage data that all threads in a block will reuse — the canonical example is a tile of a matrix during multiplication (§VIII). Shared memory and L1 cache typically partition the same physical SRAM; the split is configurable on most architectures.

__global__ void use_shared() {
    __shared__ float tile[32][32];   // 4 KB per block
    int tx = threadIdx.x, ty = threadIdx.y;
    tile[ty][tx] = /* load from global */ 0.0f;
    __syncthreads();                // barrier: wait for all threads
    // now every thread can read every element of tile
}

Constant and texture memory

Read-only views into device memory with their own caching paths. Constant memory (__constant__) is small (64 KB) and broadcasts efficiently when all threads in a warp read the same address — useful for kernel parameters and lookup tables. Texture memory adds spatial-locality caching and hardware interpolation, useful in image processing.

Global memory

The big pool — gigabytes of off-chip DRAM. Visible to every thread and persistent across kernel launches within a program. High-bandwidth (terabytes per second on modern hardware) but high-latency (300–600 cycles unloaded). Anything cudaMalloc returned is global memory. Reaching it well — especially via coalesced access patterns — is the single largest performance lever in most kernels.

Local memory (a misleading name)

Despite "local," this lives in global memory. The compiler spills to local memory when registers run out or when arrays are indexed dynamically and cannot be kept in registers. You cannot declare it directly; you avoid it by keeping per-thread state small and statically indexed.

Host memory

System RAM. Reachable from the device only through PCIe / NVLink, with bandwidth one to two orders of magnitude below device DRAM. Pageable host memory must first be staged through pinned buffers; using cudaMallocHost (pinned memory) directly lets transfers overlap with kernels.

Tier	Where	Scope	Approx. latency
Register	SM	Thread	~1 cycle
Shared	SM (SRAM)	Block	~20–30 cycles
L1 cache	SM (SRAM)	SM	~30 cycles
L2 cache	Chip	Device	~200 cycles
Global / DRAM	Off-chip	Device	~400–800 cycles
Host	System	Host	~10,000+ cycles via PCIe

§ VI. Warps & SIMT Execution

A block looks, in source, like an arbitrary collection of threads. In hardware, each block is sliced into groups of 32 called warps, and the warp — not the thread — is the actual unit of execution.

A warp's 32 threads share a single instruction fetch. They execute the same instruction, in lockstep, on different registers. NVIDIA calls this SIMT — Single Instruction, Multiple Threads — to distinguish it from the more rigid SIMD of CPU vector units. The distinction matters because SIMT permits each thread to take its own control-flow path, at a cost.

Branch divergence

When threads in a warp disagree on a branch — some take the if, others take the else — the warp serializes. The hardware executes the if path with the dissenting threads masked off (their results discarded), then the else path with the original dissenters masked off. The two halves cannot run simultaneously. A warp that fully agrees runs at full speed; a warp that splits fifty-fifty pays roughly double.

divergence 40% cost: —

Fig. 5. A warp executing if (cond) A; else B;. Threads taking the true path execute first with the false-takers idle, then the false-takers execute. Adjust how many threads diverge.

What costs warp efficiency

Three patterns are common offenders. Data-dependent branching on per-thread values, especially when neighbours in a warp see different data. Loop bounds that vary per thread — the warp runs at the speed of its slowest member. Predicated arithmetic on rare conditions handled with if; sometimes computing both branches unconditionally and selecting the result is faster than diverging.

Practical note. A small amount of divergence is usually fine — the bounds-check tail in our vector-add example diverges in exactly one warp at the boundary. The pattern to avoid is divergence in the hot loop of a kernel, where every iteration pays.

Warp-level primitives

Modern CUDA exposes warp-shuffle intrinsics — __shfl_sync, __shfl_down_sync, and friends — that exchange data directly between threads of a warp without going through shared memory. They are fast and useful for reductions, scans, and broadcast patterns. The _sync suffix and explicit thread mask are mandatory since CUDA 9; older formless variants are deprecated.

§ VII. Memory Coalescing

When a warp issues a global memory load, the hardware does not service 32 individual requests. It tries to coalesce them into the smallest number of memory transactions — ideally one. Whether it succeeds depends entirely on the access pattern.

The rule, in essence: if the 32 threads of a warp access 32 consecutive 4-byte words — that is, a contiguous 128-byte segment, properly aligned — the hardware satisfies the warp with one 128-byte transaction. If they access scattered locations, the hardware issues a separate transaction for each distinct cache line, and bandwidth collapses proportionally.

pattern transactions: 1

Fig. 6. Sixteen threads of a half-warp accessing global memory under four patterns. Each colored band marks one memory transaction served by hardware; fewer is faster.

What this implies for code

Two practical consequences. First, lay out data so that thread i reads element i: array-of-structures (AoS) is usually worse than structure-of-arrays (SoA), because AoS makes neighbouring threads reach for non-adjacent memory. Second, when iterating a 2D array, the natural pairing is x-thread to x-column, not x-thread to x-row, since C row-major storage places column neighbours adjacent in memory.

// good: thread i reads element i
int i = blockIdx.x * blockDim.x + threadIdx.x;
float v = data[i];                       // coalesced

// bad: stride pulls each thread far from its neighbour
int i = blockIdx.x * blockDim.x + threadIdx.x;
float v = data[i * 8];                   // 8 transactions, not 1

Aside. Modern GPUs cache global accesses through L1 and L2, which softens the worst pathologies — repeated access to a strided pattern often hits cache after the first warp. But the upper bound on bandwidth is still set by the first read, and engineering for coalescing remains the cheapest performance win.

§ VIII. Shared Memory & Tiling

Shared memory is the lever that turns a memory-bound kernel into a compute-bound one. The pattern is universal enough to deserve its own name: tiling.

The idea: partition the output into tiles, each computed by one block. For each tile, cooperatively load the input data the block needs into shared memory. Then perform the actual arithmetic out of shared memory, which is roughly a hundred times faster than global. Each input element is loaded from global memory once but read from shared memory many times — the arithmetic intensity rises by exactly that ratio.

Tiled matrix multiplication

The textbook example is C = A · B for square matrices. The naive kernel reads each element of A once per output column and each element of B once per output row — much more global traffic than necessary. The tiled version is dramatically better:

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B,
                              float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // cooperative load: each thread brings one element of each tile
        As[ty][tx] = A[row * N + (t * TILE + tx)];
        Bs[ty][tx] = B[(t * TILE + ty) * N + col];
        __syncthreads();                      // wait for all loads

        for (int k = 0; k < TILE; ++k)        // dot product over tile
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();                      // before next overwrite
    }
    C[row * N + col] = acc;
}

phase 0 of —

Fig. 7. Tiled matrix multiply. The block's threads cooperatively load one tile from A (left) and one tile from B (top) into shared memory, then accumulate the partial product into C (centre). This repeats across the K dimension.

Two barriers, both essential

The first __syncthreads() ensures every thread has loaded its assigned shared-memory element before any thread reads from the tile. Skip it and you get garbage from threads that race ahead. The second prevents one warp from overwriting As/Bs for the next tile while another warp still reads the previous one. Both barriers must be reached by every thread in the block — putting __syncthreads() inside a divergent branch is undefined behaviour.

Performance. The tiled kernel is roughly an order of magnitude faster than the naive version at N=2048. To go further you reach for register-blocking, double-buffering, vectorized loads, and eventually the Tensor Cores via wmma or — sensibly — the cuBLAS library, which has done all this for you.

Bank conflicts

Shared memory is divided into 32 banks; concurrent accesses by a warp to different banks are serviced in parallel, while accesses by multiple threads to the same bank are serialized. The classical fix for column-wise access patterns is to pad the array width by one (__shared__ float tile[TILE][TILE+1]) — wasted space, but the access stride no longer aliases. Bank conflicts are second-order; do not optimize them before measuring.

§ IX. Toolkit & Ecosystem

CUDA is a platform, not just a language. What you actually use day to day is the toolkit: a compiler, a debugger, a profiler, and a constellation of libraries that already implement the hard kernels well.

The compiler

nvcc is NVIDIA's CUDA C++ compiler driver. It splits source files into host and device portions, hands the host portion to your system compiler (gcc, clang, MSVC), compiles the device portion to PTX (NVIDIA's virtual ISA) and then to SASS (the actual machine code for a target architecture), and links the two into one executable. --ptx emits the intermediate PTX; --cubin emits the binary; -Xptxas -v reports register and shared-memory usage per kernel — invaluable for diagnosing occupancy.

Profiling: the only honest source of truth

You cannot guess at GPU performance. Two tools matter. Nsight Systems profiles the whole timeline — kernel launches, memcpys, host code, CUDA streams — and tells you whether you are bound on launch overhead, transfer time, or kernel time. Nsight Compute drills into a single kernel and reports occupancy, achieved bandwidth, instruction mix, stall reasons, and dozens of other metrics. The phrase compute-bound versus memory-bound, often used loosely, has a precise meaning these tools will give you.

$ nsys profile --stats=true ./my_program
$ ncu --set full -o report ./my_program          # detailed kernel analysis

Occupancy

Occupancy is the ratio of active warps on an SM to the maximum possible. It is not the goal in itself — kernels can be excellent at low occupancy — but low occupancy can starve the SM of work to hide memory latency. Three resources cap it: registers per thread, shared memory per block, and threads per block. Use cudaOccupancyMaxPotentialBlockSize or the Nsight Compute occupancy section.

threads/block 256 regs/thread 32 shared/block (KB) 8 occupancy: —

Fig. 8. Occupancy as a function of per-block resource budgets. Limits assumed: 65,536 registers, 96 KB shared memory, 64 warps (2,048 threads), 32 blocks per SM — broadly Ampere/Hopper-class.

Libraries: don't write what NVIDIA already wrote

Library	What it gives you
`cuBLAS`	Dense linear algebra (BLAS); the right way to do matmul
`cuDNN`	Deep-learning primitives: convolutions, RNNs, attention
`cuFFT`	Fast Fourier transforms
`cuSPARSE` / `cuSOLVER`	Sparse linear algebra and direct solvers
`Thrust`	STL-like parallel algorithms (sort, scan, reduce) on the device
`CUB`	Lower-level building blocks Thrust is built on
`NCCL`	Multi-GPU and multi-node collectives
`TensorRT`	Optimized inference for trained neural networks
`CUTLASS`	Templates for writing your own GEMM-class kernels

Languages other than C++

You do not need to write CUDA C++ to use CUDA. PyTorch, TensorFlow, and JAX are CUDA programs by composition. CuPy is NumPy with a CUDA backend. Numba compiles a subset of Python to CUDA kernels with a decorator. Triton (from OpenAI) offers a Python DSL specifically for writing custom kernels with less ceremony than CUDA C++. For most engineers, the right strategy is: stay high level until profiling shows you cannot, then write CUDA C++ for the hot kernel.

§ X. From Datacenter to Dashboard

Most of CUDA was conceived for a card the size of a brick, drawing several hundred watts in a server rack. The same architecture, scaled down and rearranged, now lives in robots, drones, surgical instruments, and the perception domain controllers of contemporary vehicles. Understanding what changes — and what doesn't — between the two settings is essential for anyone shipping CUDA code to an embedded target.

The Jetson family, in lineage

NVIDIA's edge GPU line is the Jetson series. Each generation pairs a CUDA-capable GPU with an Arm CPU complex on a single SoC, with unified memory shared between them. The progression from the original Nano to the current Thor spans roughly four orders of magnitude in compute capacity and three in price, but the programming model is unchanged: you write the same CUDA C++ that runs on an H100, and nvcc targets the appropriate compute capability.

Module	Approx. perf	Power envelope	Memory	Typical use
Jetson Nano	~0.5 TFLOPS FP16	5–10 W	4 GB LPDDR4	hobby, light vision
Orin Nano (8 GB)	~40 TOPS	7–15 W	8 GB LPDDR5	compact robotics
Orin NX (16 GB)	~100 TOPS	10–25 W	16 GB LPDDR5	industrial, drones
AGX Orin (64 GB)	~275 TOPS	15–60 W	64 GB LPDDR5	autonomy dev kits
AGX Thor	~2,000 TOPS (FP4)	~130 W	128 GB LPDDR5X	next-gen autonomy, humanoids
DRIVE Orin / Thor	same silicon, automotive	same envelope	same	production vehicles

The DRIVE platform is, broadly, the same silicon as Jetson hardened and certified for automotive use — with ASIL-rated lockstep cores, automotive-grade quality processes, and a software stack (DriveOS, DriveWorks) targeting production vehicle programs rather than research kits. The CUDA portions of your code port between Jetson and DRIVE without modification.

click any chip

Fig. 9. The CUDA hardware spectrum on log–log axes. Edge modules in the lower-left corner; datacenter accelerators in the upper-right. Click any point for specifications. Numbers are approximate; consult the current datasheet for production decisions.

What changes at the edge

Three architectural facts dominate. First, the CPU and GPU share physical memory. There is no PCIe, no cudaMemcpy over a slow bus — instead, both processors address the same DRAM. CUDA still distinguishes host and device pointers conceptually, but with zero-copy or unified-memory allocation the same buffer can be touched by both without a transfer. This changes which optimizations matter: you stop worrying about transfer overhead and start worrying about memory contention between CPU and GPU.

Second, the power envelope is the binding constraint, not silicon area. The Orin Nano can be configured to run at 7 W or 15 W; the same chip delivers very different sustained throughput at each. Profiling on the edge means profiling under the production power mode, not the developer-kit default. nvpmodel on Jetson sets the operating point; tegrastats shows live power, frequency, and utilization.

Third, thermal throttling is real. A kernel that runs in 8 ms on a cool device may run in 14 ms after a minute of sustained load. Production validation must include thermal-soak testing. The corollary is that average-case benchmarks lie; characterize worst-case latency under enclosure-level thermal conditions.

Note. The shared-memory architecture also explains why Jetson modules typically ship with relatively modest DRAM — what the GPU has, the CPU has, and 8 GB shared between perception, planning, control, the OS, and ROS nodes is tighter than 8 GB of pure GPU memory on a discrete card. Memory pressure is the most common cause of mysterious slowdowns on these devices.

What stays the same

The kernel you wrote in §IV still launches with the same syntax, computes the same result, and uses the same warps, shared memory, and coalescing rules. The libraries — cuBLAS, cuDNN, TensorRT — are present, sometimes in slightly trimmed editions. The toolchain, debugger, and Nsight profilers all run on Jetson. From the source code's point of view, you are writing CUDA; the device's badge is a footnote.

§ XI. Robotics on the GPU

Robotics is a stack: sensors, perception, state estimation, planning, control, actuation. CUDA touches each layer differently. Some are obvious GPU work; others are surprising.

Perception

The clearest fit. Cameras produce dense pixel arrays at high frame rate; LiDAR produces hundreds of thousands of points per scan; radar produces range-Doppler tensors. All of these are SIMD problems by construction. Image preprocessing (debayer, undistort, resize, normalize), neural inference (object detection, segmentation, depth estimation), point cloud operations (voxelization, downsampling, ICP) — every step maps cleanly onto the architecture in §I.

The libraries you will actually call: cuDNN for the convolution primitives, TensorRT for whole-network inference, NPP (NVIDIA Performance Primitives) for classical image operations, VPI (Vision Programming Interface) for hardware-accelerated pipelines that mix GPU, ISP, and dedicated vision accelerators on Jetson, and the Isaac ROS GEMs for ROS 2 nodes that wrap CUDA underneath.

Planning

Less obvious but compelling. Sampling-based planners — RRT, RRT*, PRM — are embarrassingly parallel in their candidate generation phase. Modern motion planners (cuRobo from NVIDIA, the Isaac Manipulator stack) batch thousands of candidate trajectories, evaluate them against signed-distance fields stored in GPU memory, and return a feasible plan in single-digit milliseconds. The same strategy underlies MPPI (Model Predictive Path Integral) — sample N control sequences in parallel, simulate them through a dynamics model in parallel, weight by cost, and combine. With N = 1024 or 2048, GPU MPPI runs orders of magnitude faster than a sequential CPU implementation.

Control

Traditionally CPU territory: control loops are sequential by nature, and the data per step is tiny. But two trends bring CUDA in. First, MPC variants with batched rollouts (your active-suspension work, the GP-MPC literature) genuinely benefit from GPU parallelism — the rollout is the cost. Second, neural controllers and learned residuals run as inference workloads even when their outputs feed a classical control loop. The hybrid pattern — classical safety-critical low-level control on a deterministic CPU, with a GPU-resident learned policy contributing setpoints or residuals — is increasingly standard.

precision target rate total: —

Fig. 10. A perception-to-decision pipeline budgeted across stages, with the deadline marked. Swap the precision used for inference and watch the bars resize. Lower precision shrinks the inference stage but the fixed I/O stages remain.

Simulation

Worth a separate mention. Isaac Sim and the underlying PhysX physics engine are GPU-accelerated, allowing thousands of robot instances to be simulated in parallel — the foundation of modern reinforcement-learning pipelines for manipulation and locomotion. The same CUDA card that runs your inference at deployment runs your training simulator.

Aside. The split between Isaac (research-facing) and DRIVE (production-vehicle) is partly organizational and partly real: Isaac stresses iteration speed and breadth; DRIVE stresses certifiability and longevity. The CUDA layer beneath both is the same.

§ XII. The Inference Pipeline

Training a neural network and deploying one are different engineering problems. Training maximizes throughput on huge batches with backpropagation; deployment minimizes latency on a single sample, with no gradients in sight. The deployment side is what most edge engineers actually ship, and it has its own toolchain.

The road from PyTorch to Orin

The standard path has four steps. Train in PyTorch, JAX, or TensorFlow. Export to ONNX, an open intermediate representation. Compile the ONNX graph with TensorRT targeting the deployment GPU's compute capability, producing a serialized engine file. Load the engine at runtime and feed it samples.

# 1. train, then export
$ python export.py --model resnet50.pth --out resnet50.onnx

# 2. build engine for the target device (run ON the device, or with cross-compile)
$ trtexec --onnx=resnet50.onnx --saveEngine=resnet50.trt --fp16

# 3. INT8 with calibration data
$ trtexec --onnx=resnet50.onnx --saveEngine=resnet50_int8.trt \
          --int8 --calib=calib_cache.bin

What TensorRT actually does

The phrase "compiles the graph" understates it. TensorRT performs layer fusion — merging consecutive operations into single kernels (Conv + BN + ReLU becomes one kernel call) — which eliminates the global-memory round trip between layers. It performs kernel auto-tuning: for each layer, it benchmarks several candidate CUDA implementations on the actual target hardware and picks the fastest. It performs precision calibration: given representative input data, it determines per-tensor scaling factors that let INT8 arithmetic approximate the FP32 reference within an acceptable accuracy bound. The output is a binary specific to one GPU architecture; an engine built for Orin will not load on Thor and vice versa.

Numerical formats

Format	Bits	Use	Tensor Core support
FP32	32	Training, reference inference	Limited (Ampere+)
TF32	19 effective	Training (Ampere innovation)	Native
BF16	16	Training, robust inference	Native (Ampere+)
FP16	16	Mixed-precision training, inference	Native (Volta+)
INT8	8	Inference (calibrated)	Native (Turing+)
FP8 (E4M3 / E5M2)	8	Training and inference (Hopper+)	Native (Hopper+)
FP4 / INT4	4	Inference (Blackwell+)	Native (Blackwell)

The trade is straightforward: fewer bits means smaller weights, lower memory bandwidth pressure, and higher throughput on Tensor Cores that natively support the format — at some cost in numerical accuracy. The art is finding the precision where accuracy is acceptable and throughput is maximized. For most CNN inference, INT8 is the sweet spot; for transformer inference, FP16 or BF16 remains common, with INT8 and FP8 increasingly used.

precision FP32 model size: 1.0× · throughput: 1.0× · accuracy: ref

Fig. 11. Weight distribution under quantization. The continuous FP32 distribution is bucketed into progressively coarser representations. Below: relative model size, throughput, and accuracy delta — typical for a calibrated CNN on Tensor Core hardware.

PTQ versus QAT

Two paths to quantization. Post-Training Quantization (PTQ) takes a trained FP32 model, shows it a few hundred representative inputs, computes activation ranges, and converts to INT8 — minutes of work, accuracy drop usually under 1 % for standard CNNs, occasionally larger for sensitive architectures. Quantization-Aware Training (QAT) inserts fake-quantization nodes during training so the network learns to be robust to the eventual precision loss — more expensive, smaller accuracy drop, often essential for transformers and quantized-from-scratch deployments. Start with PTQ; reach for QAT when accuracy budget demands it.

Caveat. The INT8 quantization story on the GPU is closely related but not identical to the INT8 quantized-C export path you have on TriCore. Both rely on per-tensor or per-channel scale factors and zero-points. The GPU side leans on Tensor Core IMMA instructions for the actual INT8 matrix-multiply-accumulate; the embedded side does the multiplication in scalar integer ALUs. Calibration philosophy transfers; kernel implementation does not.

§ XIII. Real-Time & the Automotive Split

A GPU is fundamentally a throughput device. It maximizes work-per-second by overlapping many concurrent operations, hiding memory latency behind useful arithmetic. This is not naturally compatible with hard real-time deadlines, where the question is not how much work fits in a second but can this exact thing finish before time t.

Latency versus throughput

Throughput-optimized hardware achieves its numbers by deep pipelining and high in-flight work. The cost is variance: any single operation may take longer than the steady-state average, because it might land behind queued work, miss in cache, contend for memory, or trigger driver-side bookkeeping. CPU-side guarantees common in real-time operating systems — bounded interrupt latency, priority inheritance, deterministic context switch — translate poorly to the GPU programming model.

The mitigations CUDA provides are real but partial. Streams let independent work overlap; events let you synchronize between streams without blocking the host; graph capture (cudaGraph) records a complex sequence of kernel launches and replays it as a single submission, eliminating per-launch overhead and making the schedule more predictable. Persistent kernels — kernels that loop indefinitely on the device, polling for work — eliminate launch latency entirely at the cost of an SM permanently committed.

The architectural split in modern vehicles

The pattern that emerged in the production vehicles of the last five years separates concerns by safety integrity. Roughly:

Layer	Hardware	Workload	Safety class
Actuation, motor control	AURIX TriCore (or similar)	FOC, torque limiting, brake/steer firmware	ASIL-D
Vehicle dynamics, fusion	Mid-range MCU + dedicated SoC	State estimation, low-level supervisory	ASIL-C/D
Perception, planning	NVIDIA DRIVE Orin / Thor (or competitor)	Cameras, LiDAR, NN inference, behaviour	ASIL-B/C
Infotainment, in-cabin	Android-class SoC	UI, voice, OTA management	QM (no ASIL)

The split is functional, not arbitrary. Hard-deterministic cyclic control — the kind of code you write — stays on processors with deterministic timing, lockstep cores, and certifiable toolchains. Probabilistic perception, where a 99.9th-percentile latency in the tens of milliseconds is acceptable and where compute requirements far exceed what an MCU can supply, moves to GPU-class silicon. Communication between the two travels over deterministic buses (CAN, automotive Ethernet with TSN) with carefully bounded payloads. The GPU is not making the safety decision; it is informing one.

loop rate budget: 2.0 ms

Fig. 12. A control-loop budget. Set the target rate and toggle operations into the loop; the bar fills against the deadline. Green = comfortable margin, amber = tight, red = misses deadline. Costs are illustrative for an Orin-class SoC.

Alternative edge accelerators

CUDA dominates because its software stack — toolkit, libraries, frameworks, training pipeline — is unmatched, not because the underlying hardware is uniquely capable. The competitive landscape:

Hailo (Hailo-8, Hailo-15) — pure inference accelerators, very high TOPS-per-watt, no general-purpose programmability, framework integration via their own SDK and ONNX. Excellent fit when the workload is fixed inference of standard architectures.

Qualcomm — the Snapdragon Ride and Snapdragon Cockpit families ship with dedicated NPUs alongside Adreno GPUs, primarily targeting the same automotive niches as DRIVE.

AMD / Xilinx Versal — adaptive SoCs combining Arm cores, FPGA fabric, and AI engines. The flexibility cuts both ways: extraordinary at custom dataflows, harder to program than CUDA.

Intel Movidius / Habana — Movidius chips at the milliwatt edge, Gaudi at the datacenter side; the middle is thin.

Google Edge TPU — small, cheap, INT8-only, locked to TensorFlow Lite. Fine for fixed pipelines, restrictive for research.

Apple Neural Engine, ARM Ethos — embedded NPUs increasingly standard in consumer SoCs; relevant if your robot ships on a phone-class platform.

Engineering choice. Pick the accelerator that matches your software trajectory, not the one with the highest spec-sheet TOPS. A 50 TOPS device with a mature compiler and the libraries you need beats a 200 TOPS device whose toolchain you will spend a year fighting.

§ XIV. Practice & Further Reading

A short discipline of habits will save you weeks of confusion. Adopt them early.

Habits worth forming

Always check return codes. A macro CUDA_CHECK(call) that wraps every runtime call and aborts with file/line on error is non-negotiable in real code. Kernel launches are checked with cudaGetLastError() immediately after the launch and cudaDeviceSynchronize() if you want to surface asynchronous failures.

Profile before optimizing. Every CUDA programmer's first instinct on a slow kernel is wrong. Run Nsight Compute. Read the occupancy section, the memory section, and the stall reasons. Optimize what the profiler says is the bottleneck, not what your intuition fingers.

Validate against a CPU reference. The best CUDA kernel is the one whose output matches a slow, correct CPU implementation to within floating-point tolerance. Write that CPU version and keep it as a test.

Build small, then scale. Get the kernel correct on N=64 with one block of 32 threads. Then scale the data, then the launch configuration. Bugs in CUDA at full scale are excruciating to localize.

Keep launch parameters configurable. Block size and grid size should be variables, not magic numbers. The optimum varies by GPU; let your code adapt.

Common pitfalls. Forgetting to check kernel launch errors and chasing a phantom data bug. Mixing host and device pointers. Using printf from a kernel without flushing or running fewer threads than expected — printf works on device, but is rate-limited. Calling cudaDeviceSynchronize() in a tight loop and wondering why nothing overlaps. Allocating in cudaMallocManaged for benchmarking and reporting "fast" numbers that are really first-touch page migrations.

A learning path

For the engineer with C and parallelism background, a workable order: write vector-add until launch configuration and indexing feel automatic; write a reduction (sum of N floats) and learn warp shuffles; write the tiled matmul above and compare against cuBLAS; profile something realistic with Nsight Compute and read every metric until you understand it; pick a real project — a finite-difference solver, a small inference engine, an image filter — and ship it.

References

NVIDIA Corporation. CUDA C++ Programming Guide. The canonical reference, updated per CUDA release. docs.nvidia.com/cuda/cuda-c-programming-guide.
NVIDIA Corporation. CUDA C++ Best Practices Guide. The companion volume to the Programming Guide, focused on performance. docs.nvidia.com/cuda/cuda-c-best-practices-guide.
Kirk, D. B., and Hwu, W. W. Programming Massively Parallel Processors: A Hands-On Approach, 4th ed., Morgan Kaufmann, 2022. The standard textbook; chapters on tiling and reduction repay rereading.
Sanders, J., and Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2010. Older but still the gentlest starting point.
Cheng, J., Grossman, M., and McKercher, T. Professional CUDA C Programming, Wrox, 2014. Strong on the runtime, streams, and multi-GPU.
NVIDIA Developer Blog. An Even Easier Introduction to CUDA, by Mark Harris. developer.nvidia.com/blog/even-easier-introduction-cuda.
NVIDIA. Nsight Systems and Nsight Compute documentation and training videos. developer.nvidia.com/nsight-compute.
Tillet, P., Kung, H. T., Cox, D. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MAPL 2019. The motivation behind Triton, useful context if you intend to escape CUDA C++.
Volkov, V. Better Performance at Lower Occupancy, GTC 2010. The classic counter-argument to the occupancy-as-goal mindset.
Harris, M. Optimizing Parallel Reduction in CUDA, NVIDIA technical report. The seven-version tour through reduction optimization is required reading.
NVIDIA Corporation. Jetson Developer Documentation and NVIDIA DRIVE Documentation. Hardware datasheets, power-mode references, JetPack release notes. developer.nvidia.com/embedded.
NVIDIA Corporation. TensorRT Developer Guide. The deployment-side companion to the Programming Guide; covers engine building, calibration, layer fusion, and the Python/C++ APIs. docs.nvidia.com/deeplearning/tensorrt.
NVIDIA Corporation. Isaac ROS Documentation and cuRobo: CUDA-Accelerated Robot Library. Reference implementations for GPU-accelerated perception and manipulation. nvidia-isaac-ros.github.io.
Williams, G., Drews, P., Goldfain, B., Rehg, J. M., Theodorou, E. A. Aggressive Driving with Model Predictive Path Integral Control, ICRA 2016. The reference application of GPU-batched MPPI to vehicle control.
Jacob, B., et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR 2018. The PTQ formulation underlying TensorRT INT8 calibration.
Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper, 2018. Practical PTQ vs QAT guidance.
ISO. ISO 26262: Road vehicles — Functional safety. The ASIL framework referenced in §XIII; required reading for any production automotive software role.

A Primer on CUDA

Contents

§ I. Why a GPU at All

An architectural sketch

What problems suit the GPU

§ II. The Mental Model

The host–device dance

Kernels, in one sentence

Memory spaces

§ III. The Thread Hierarchy

The three tiers

Why two tiers?

§ IV. A First Kernel

Compiling and running

§ V. The Memory Hierarchy

The tiers, from fast to slow

Registers

Shared memory

Constant and texture memory

Global memory

Local memory (a misleading name)

Host memory

§ VI. Warps & SIMT Execution

Branch divergence

What costs warp efficiency

Warp-level primitives

§ VII. Memory Coalescing

What this implies for code

§ VIII. Shared Memory & Tiling

Tiled matrix multiplication

Two barriers, both essential

Bank conflicts

§ IX. Toolkit & Ecosystem

The compiler

Profiling: the only honest source of truth

Occupancy

Libraries: don't write what NVIDIA already wrote

Languages other than C++

§ X. From Datacenter to Dashboard

The Jetson family, in lineage

What changes at the edge

What stays the same

§ XI. Robotics on the GPU

Perception

Planning

Control

Simulation

§ XII. The Inference Pipeline

The road from PyTorch to Orin

What TensorRT actually does

Numerical formats

PTQ versus QAT

§ XIII. Real-Time & the Automotive Split

Latency versus throughput

The architectural split in modern vehicles

Alternative edge accelerators

§ XIV. Practice & Further Reading

Habits worth forming

A learning path

References