How a stack of matrix multiplications, a clever weighting trick called attention, and an ocean of text combine to produce machines that write. A working tour — with live demos you can poke — of what a transformer actually computes.
A large language model (LLM) does one deceptively simple thing: given a sequence of text, it predicts the next token. Everything else — answering questions, writing code, summarising, translating — is an emergent consequence of doing that one task extraordinarily well, across trillions of examples, with a network large enough to absorb the statistical structure of language.
The dominant architecture behind today's LLMs is the transformer, introduced in 2017. Its central innovation, self-attention, lets every position in a sequence look directly at every other position in a single step, with no recurrence and no fixed window. That property is what makes the model both parallelizable on GPUs and remarkably good at capturing long-range dependencies.
Before diving into the mathematics, here is the whole pipeline at a glance. Text enters on the left; a probability distribution over the next token leaves on the right.
The model has no memory between calls beyond the text in its context window. "Conversation" is an illusion created by re-feeding the entire transcript each turn. Everything the model "knows" in a session lives in those tokens plus its frozen weights.
Models don't see characters or whole words — they see tokens, sub-word
fragments produced by an algorithm such as Byte-Pair Encoding (BPE).
Common words become single tokens; rare words split into pieces. The word
tokenization might become token + ization; an emoji or
an unusual name may fragment into several bytes.
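To make the splitting concrete, here is a minimal sketch of how a BPE-style encoder applies its merge rules. The merge table `MERGES` is hypothetical and hand-written for this one example; real tokenizers learn tens of thousands of merges from a corpus and usually operate on bytes, and the function name `bpe_encode` is likewise just illustrative.

```python
# Hypothetical merge table, in priority order. Invented for illustration only.
MERGES = [
    ("t", "o"), ("k", "e"), ("to", "ke"), ("toke", "n"),
    ("i", "o"), ("io", "n"), ("a", "t"), ("i", "z"),
    ("at", "ion"), ("iz", "ation"),
]

def bpe_encode(word, merges):
    """Split a word into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse the pair (left, right) wherever the two symbols sit side by side.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_encode("tokenization", MERGES))  # ['token', 'ization']
print(bpe_encode("xyz", MERGES))           # ['x', 'y', 'z'] (no rule applies)
```

A string that matches no merge rule falls all the way back to single characters, which is the toy analogue of a rare name fragmenting into several pieces.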
Each token ID indexes a row of the embedding matrix — a learned table of vectors, typically 768 to 12,288 numbers wide. This vector is the token's starting "meaning," a point in a high-dimensional space where semantically related tokens sit near one another.
Token count — not word count — drives cost and context limits. English averages roughly 0.75 words per token. Code, numbers, and non-Latin scripts tokenize less efficiently, which is why a 1,000-word essay and a 1,000-line config file consume very different budgets.
Attention treats its inputs as a set: by itself it has no notion of word order. "Dog bites man" and "man bites dog" would be indistinguishable. To fix this we inject positional information into the embeddings.
The original transformer used fixed sinusoids of different frequencies, with each sine/cosine pair occupying two adjacent embedding dimensions. Each position gets a unique fingerprint, and the smoothly varying frequencies let the model generalize to relative distances. Drag the controls below to see the pattern.
Each row is one embedding dimension; each column is a sequence position. Colour = value of sin/cos at that frequency.
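The pattern in the figure can be reproduced in a few lines. This is a sketch of the sinusoidal scheme described above, using the base of 10000 from the original transformer paper; the function names are illustrative. The last lines check the "relative distance" property: the dot product of two position vectors depends only on how far apart the positions are.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Positional vector for one position: sin/cos pairs at decreasing frequencies."""
    vec = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        vec.append(math.sin(angle))  # even dimension
        vec.append(math.cos(angle))  # odd dimension, same frequency
    return vec[:d_model]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 16
# Positions 3 and 7 are 4 apart; so are positions 10 and 14.
near = dot(sinusoidal_encoding(3, d_model), sinusoidal_encoding(7, d_model))
far = dot(sinusoidal_encoding(10, d_model), sinusoidal_encoding(14, d_model))
print(abs(near - far) < 1e-9)  # True
```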
Modern models often replace this with rotary position embeddings (RoPE), which rotate the query and key vectors by an angle proportional to position, or with ALiBi, which adds a distance-based penalty directly to attention scores. Both encode relative position and extrapolate to longer sequences more gracefully.
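A small sketch shows why rotating queries and keys encodes relative position. It assumes one common RoPE convention (adjacent dimension pairs rotated together, base 10000, even vector length); real implementations differ in how they pair dimensions, and the name `rope` is only illustrative.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to position."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, -0.5, 0.3]
k = [0.4, -1.0, 2.0, 0.7]
# Query 3 positions after the key, at two different absolute offsets.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True
```

The attention score comes out the same wherever the pair sits in the sequence, as long as the gap between query and key is unchanged.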
Here is the heart of the whole machine. For each token, attention asks: which other tokens should I pull information from, and how much? It answers using three learned projections of every token's vector:
Query (Q) — what this token is looking for. Key (K) — what each token offers. Value (V) — the actual content each token will hand over. The relevance of token j to token i is the dot product of i's query with j's key. High dot product → strong match → large weight on j's value.
The demo below uses a toy sentence. Click any token to make it the query; the bars show how strongly it attends to every token (including itself). Notice how pronouns reach back to their referents and verbs attend to their subjects: the kind of structure a trained model discovers on its own.
Click a token to set the query. Bars = attention weights (they sum to 1).
Attention is a soft, content-based lookup — like a dictionary where every key matches the query a little, and the result is a weighted blend of all values rather than a single hit. The weights are computed fresh for every token at every layer.
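The soft lookup can be written out for a single query in plain Python. This is a minimal sketch: the learned Q/K/V projections are left out, so the query, keys, and values are handed in directly, and the vectors are made-up numbers.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query against all keys and values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Output is the weighted blend of the value vectors.
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend(query, keys, values)
```

Every key receives some weight, but the first key, which points the same way as the query, receives the most, so the output leans toward the first value vector.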
The raw dot products (logits) can be any real numbers. Softmax turns them into a probability distribution: exponentiate, then normalize so they sum to one. Larger scores get exponentially more weight, but nothing ever goes negative or exceeds one.
Why divide by √dₖ inside attention? In high dimensions, the dot product of two
random vectors (with roughly unit-variance components) has a variance that grows with the dimension, so typical logits grow like √dₖ. Without scaling, the logits get large, softmax saturates
into a near one-hot spike, and gradients vanish. Dividing by the square root of the key dimension
keeps the distribution well-behaved. The slider below shows the same effect via temperature:
low τ sharpens, high τ flattens.
Fixed logits, reshaped by temperature. Low τ → confident & peaky. High τ → diffuse & uncertain.
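Both effects are easy to check numerically. The sketch below implements softmax with a temperature parameter, then measures the spread of dot products between random vectors with and without the √dₖ scaling. The helper `dot_product_std`, the sample count, and the seed are arbitrary choices for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_std(dim, scale, samples=1000, seed=0):
    """Empirical standard deviation of (q . k) / scale for standard-normal q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0, 1) for _ in range(dim)]
        k = [rng.gauss(0, 1) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)) / scale)
    mean = sum(dots) / samples
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / samples)

logits = [2.0, 1.0, 0.5, -1.0]
peaky = softmax(logits, temperature=0.5)    # low temperature: mass piles onto the top logit
diffuse = softmax(logits, temperature=2.0)  # high temperature: mass spreads out

unscaled = dot_product_std(256, scale=1.0)           # close to sqrt(256) = 16
scaled = dot_product_std(256, scale=math.sqrt(256))  # close to 1
```

Dividing by √dₖ plays the same role as a fixed temperature chosen to cancel the growth of the logits with dimension.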
One attention computation captures one kind of relationship. Real models run many in parallel, called heads, each with its own learned Q/K/V projections into a smaller subspace. One head might track subject–verb agreement, another coreference, another local syntax. Their outputs are concatenated and mixed by a final linear layer.
The matrix below is a full attention map for one head over our toy sentence: row = query token, column = key token, darker shade = higher weight. Because this is a decoder, the upper triangle is masked (a token cannot attend to the future), which produces the staircase. Hover any cell to read the weight.
Row attends to column. Dark = high weight. Upper-right is masked: no peeking ahead.
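The staircase comes from a single trick: before softmax, every score that points at a future token is set to negative infinity, so it receives exactly zero weight. A minimal sketch, assuming the raw scores are already computed as a square matrix:

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw logit of query i against key j (square matrix)."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Keys with j > i lie in the future; -inf makes exp() return 0 for them.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)  # finite, because position i always sees itself
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With all-equal scores, each row spreads its weight evenly over the visible past.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print(row)
```

The first row can only see itself, the second splits its weight between two positions, and so on down the triangle.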
The formula in §04 is a dataflow. Drawn as a circuit, scaled dot-product attention is a short vertical pipeline: the query and key streams meet in a matrix multiply, get scaled and (in a decoder) masked, pass through softmax, and the resulting weights multiply the value stream. With Q, K ∈ ℝ^(n×dₖ) and V ∈ ℝ^(n×dᵥ), the output lies in ℝ^(n×dᵥ): one refined vector per token.
The atomic unit. Q and K produce weights; those weights blend V.
Multi-head attention wraps this unit. The input is projected by h independent sets of linear layers into h lower-dimensional Q/K/V triples; each runs its own scaled dot-product attention in parallel; the results are concatenated and passed through a final output projection. Each head is free to specialize on a different relationship.
h parallel attention heads, each with its own projections, then concat + mix.
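The wiring in the diagram can be sketched with plain lists standing in for matrices. Everything here is illustrative: the weights are random rather than learned, the sizes (3 tokens, model width 8, 2 heads) are arbitrary, and masking is omitted to keep the focus on the project, attend, concatenate, mix sequence.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for every query row in q."""
    d_k = len(q[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) projection triples, one per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the heads along the feature axis, token by token.
    concat = [[value for head in head_outputs for value in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)  # final output projection mixes the heads

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads
x = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
w_o = random_matrix(n_heads * d_head, d_model, rng)
y = multi_head_attention(x, heads, w_o)  # same shape as x: 3 tokens by 8 features
```

Each head works in a 4-dimensional subspace, so running two of them costs about the same as one full-width attention.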
Attention is only half of each layer. A complete transformer block wraps it with three more ingredients that make deep stacks trainable:
Every sub-layer adds its output back to its input: x + Sublayer(x). This gives
gradients a clean highway to flow backward through dozens of layers and lets each layer learn a
small refinement rather than a full transformation.
Before (or after) each sub-layer, layer normalization rescales each token's activations to zero mean and unit variance, then applies a learned scale and shift. This stabilizes training across the depth of the network.
After attention mixes information across tokens, a position-wise FFN processes each token independently: expand to ~4× width, apply a nonlinearity (GELU/SwiGLU), project back. This is where much of the model's raw knowledge and capacity lives — often two-thirds of all parameters.
Drawn out, the two residual highways (rust, dashed) are the key to depth: each sub-block only has to learn a correction to the signal flowing past it.
Signal flows top to bottom. Residual lanes carry the input around each sub-block to the ⊕ adders.
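Put together, one block is only a few lines of control flow. The sketch below assumes the pre-norm arrangement (normalize before each sub-layer) and GELU as the nonlinearity. To stay short it takes the attention step as a function argument, `mix_tokens`, and reuses one set of normalization parameters for both sub-layers; real models keep separate ones. All names are illustrative.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gain, bias)]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def feed_forward(x, w_in, w_out):
    """Position-wise FFN: expand with w_in, apply GELU, project back with w_out."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_in)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_out)]

def transformer_block(tokens, mix_tokens, w_in, w_out, gain, bias):
    # Sub-layer 1: attention (supplied as mix_tokens), added back to its input.
    mixed = mix_tokens([layer_norm(t, gain, bias) for t in tokens])
    tokens = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
    # Sub-layer 2: feed-forward network on each token independently, added back again.
    out = []
    for t in tokens:
        update = feed_forward(layer_norm(t, gain, bias), w_in, w_out)
        out.append([a + b for a, b in zip(t, update)])
    return out
```

If both sub-layers output zeros, the block returns its input unchanged. That is the residual highway in code: each sub-layer only contributes a correction.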
Stack the block N times — 12 in early models, 80+ in the largest — and you have the body of the network. Information flows upward: lower layers tend to capture surface patterns and syntax; higher layers assemble meaning, task structure, and abstraction.
Three architectural families exist. Encoder-only models (BERT) see the whole sequence bidirectionally and excel at understanding tasks. Encoder-decoder models (T5) read an input and write an output, ideal for translation. Today's generative chat models are almost all decoder-only: a single stack with causal masking so each position attends only to itself and the past, trained purely to predict the next token.
One stack, one objective, trivially parallel training, and the same machinery handles every task by framing it as text continuation. Simplicity scales — and scaling, more than clever architecture, has driven most recent capability gains.
Everything so far has been self-attention: the query, key, and value all come from the same sequence — tokens attending to other tokens in the same stream. Cross-attention changes one thing: the queries come from one sequence while the keys and values come from another. It is the bridge that lets one stream read a different one.
This is the mechanism inside classic encoder-decoder models. The encoder reads the source (say, an English sentence) and produces a set of context vectors — its memory. Each decoder block then runs masked self-attention over what it has generated so far, and a cross-attention layer whose queries (from the decoder) probe the encoder's keys and values. That is how a translation decoder decides which source words to look at while emitting each target word.
The decoder's queries (Q) read the encoder's keys and values (K, V) — the rust links.
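In code, the only change from self-attention is where the inputs come from. This sketch omits the learned projections and uses the encoder states directly as both keys and values; the vectors are made-up numbers and `cross_attention` is an illustrative name.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values come from the encoder."""
    d_k = len(decoder_states[0])
    weights, outputs = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in encoder_states]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, encoder_states))
                        for c in range(len(encoder_states[0]))])
    return weights, outputs

decoder_states = [[4.0, 0.0], [0.0, 4.0]]               # 2 target positions
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # 3 source positions
weights, outputs = cross_attention(decoder_states, encoder_states)
# weights is 2 x 3: one row per target position, one column per source position.
```

The two sequences can have different lengths, and no causal mask is needed on the encoder side because the whole source is already available.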
Cross-attention reaches well beyond translation. Speech models like Whisper let a text decoder cross-attend to encoded audio. Vision-language models use it so text tokens can query image-patch embeddings: the queries come from words, the keys and values from image patches. It is the general recipe whenever one modality or sequence must condition on another.
Today's decoder-only chat models (GPT-style) have no cross-attention — they fold everything into one self-attention stream by simply concatenating context and prompt. Cross-attention re-appears mainly in encoder-decoder and multimodal designs, where keeping the two streams separate is an advantage.
The final layer projects each token's vector onto the vocabulary, producing a logit per possible next token. Softmax turns these into probabilities, and a decoding strategy chooses one. Greedy decoding always takes the top token, which is deterministic but tends to produce repetitive text; sampling adds controlled randomness.
Three knobs shape the output. Temperature reshapes the distribution (as above). Top-k keeps only the k most likely tokens. Top-p (nucleus) keeps the smallest set whose cumulative probability exceeds p. The demo below applies temperature and top-k to a toy next-token distribution — greyed bars are discarded before sampling.
Toy distribution for the prompt "The weather today is ___". Adjust τ and k; greyed tokens are cut.
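The three knobs compose into one small filtering function. This is a sketch under assumptions: the logits are invented for the toy prompt, the function names are illustrative, filters are applied in the order temperature, top-k, top-p, and the nucleus cut-off stops as soon as the cumulative probability reaches p.

```python
import math
import random

def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """logits: dict of token -> logit. Returns dict of token -> probability."""
    if temperature == 0:
        best = max(logits, key=logits.get)  # temperature 0 treated as greedy
        return {best: 1.0}
    m = max(logits.values())
    exps = {t: math.exp((l - m) / temperature) for t, l in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((e / total, t) for t, e in exps.items()), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, t in ranked:
            kept.append((p, t))
            cumulative += p
            if cumulative >= top_p:  # smallest set reaching probability mass p
                break
        ranked = kept
    norm = sum(p for p, _ in ranked)
    return {t: p / norm for p, t in ranked}  # renormalize the survivors

def sample(dist, rng):
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

# Invented logits for "The weather today is ___".
logits = {"sunny": 2.0, "cloudy": 1.5, "rainy": 1.0, "cold": 0.2, "purple": -3.0}
dist = filtered_distribution(logits, temperature=0.8, top_k=3)
token = sample(dist, random.Random(0))
```

Tokens cut by top-k or top-p get probability zero, and the remaining probabilities are rescaled to sum to one before sampling.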
For factual or coding tasks, where you want determinism and precision, lower the temperature (0–0.3). For brainstorming or creative writing, raise it (0.7–1.0). Temperature 0 is effectively greedy decoding: reproducible, but prone to loops.
Capability is built in stages.
The model reads a vast corpus and is optimized, token by token, to predict the next one (cross-entropy loss). This is where it absorbs grammar, facts, reasoning patterns, and world structure — the overwhelming majority of compute is spent here, often across thousands of GPUs for weeks.
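The cross-entropy objective at a single position is just the negative log-probability the model assigned to the token that actually came next. A minimal sketch, with an illustrative function name and made-up logits:

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log softmax(logits)[target_index]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# A model with no preference over a 4-token vocabulary pays log(4) per token.
uniform = next_token_loss([0.0, 0.0, 0.0, 0.0], target_index=2)
# A confident model pays little when right and a lot when wrong.
right = next_token_loss([10.0, 0.0, 0.0], target_index=0)
wrong = next_token_loss([10.0, 0.0, 0.0], target_index=1)
```

Training averages this quantity over every position in every sequence and nudges the weights to reduce it.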
A pre-trained model continues text but doesn't naturally follow instructions. Supervised fine-tuning (SFT) trains it on curated prompt–response pairs so it learns the helpful-assistant format.
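Humans rank competing responses; a reward model learns those preferences; the LLM is then optimized (e.g. with PPO) to produce outputs the reward model scores highly. Alternatives such as DPO skip the separate reward model and optimize the LLM directly on the preference pairs. Either way, the goal is to align tone, helpfulness, and safety with human judgement.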
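A reward model is commonly trained on those rankings with a pairwise loss: for each pair, push the score of the preferred response above the score of the rejected one. The sketch below shows that loss for a single pair, assuming the two scalar scores have already been produced by the reward model; the function name is illustrative.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

undecided = preference_loss(0.0, 0.0)  # log(2): the model cannot tell the pair apart
agrees = preference_loss(2.0, 0.0)     # smaller: ranking matches the human label
disagrees = preference_loss(0.0, 2.0)  # larger: ranking contradicts the human label
```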
Attention compares every token with every other, so compute and memory scale as O(n²) in sequence length n. Doubling the context roughly quadruples the attention cost — the main reason long context windows are expensive.
During generation, the keys and values of past tokens never change. Caching them means each new token costs O(n) instead of O(n²) recomputation — the single most important inference optimization. The flip side: the cache grows with context and dominates GPU memory at long lengths.
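A sketch of the idea: keep the keys and values in two growing lists, and at each generation step append one entry and attend over the lists. Projections are omitted, the class name is illustrative, and the `dot_products` counter exists only to make the per-step cost visible.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    d_k = len(query)
    w = softmax([sum(a * b for a, b in zip(query, k)) / math.sqrt(d_k) for k in keys])
    return [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]

class CachedAttention:
    def __init__(self):
        self.keys = []
        self.values = []
        self.dot_products = 0  # bookkeeping for this example only

    def step(self, query, key, value):
        """Process one new token: store its key and value, attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        self.dot_products += len(self.keys)  # one score per cached key: O(n) per step
        return attend(query, self.keys, self.values)

qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
cache = CachedAttention()
outputs = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
```

Each step reuses the stored keys and values instead of recomputing them, and the lists grow by one entry per generated token, which is the memory cost mentioned above.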
FlashAttention reorganizes the computation to avoid materializing the full n×n matrix in slow memory, giving large speed and memory wins with identical math. Grouped-query and multi-query attention (GQA/MQA) share keys and values across heads, shrinking the KV cache dramatically.
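The effect of sharing keys and values is simple arithmetic. The sketch below estimates cache size for one sequence; the configuration (32 layers, head dimension 128, 4,096 tokens, 16-bit values) is a hypothetical example rather than any specific model, and the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 counts one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
multi_query = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)

print(full / GIB)         # 2.0
print(grouped / GIB)      # 0.5
print(multi_query / GIB)  # 0.0625
```

The cache scales linearly with the number of key/value heads, so going from 32 to 8 cuts it by a factor of four without changing the number of query heads.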
Weights stored in 16-bit can often be compressed to 8- or 4-bit with minor quality loss, cutting memory and increasing throughput. This is what makes capable models runnable on a single GPU or even a laptop.
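The simplest form of this is absmax quantization: scale a group of weights so the largest magnitude maps to 127, round to integers, and store the integers plus one scale factor. This is a sketch of that basic scheme only; production methods use per-channel or per-block scales and more careful rounding, and the function names and weights here are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.4, -0.65]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
# Rounding moves each weight by at most half a quantization step (scale / 2).
```

Small weights lose the most relative precision, which is one reason practical schemes quantize in small blocks, each with its own scale.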
· Context is working memory, not knowledge. Put the facts you need in
the prompt (retrieval) rather than trusting recall.
· Models hallucinate confidently. Next-token prediction optimizes plausibility,
not truth — verify anything load-bearing.
· Position matters. Information at the very start and end of a long context is
used more reliably than material buried in the middle.
· Prompt structure is real engineering. Clear instructions, examples, and
explicit output formats measurably change behaviour.