A Scientific Monograph · Machine Intelligence

Inside the
Language Model:
Attention & Architecture

How a stack of matrix multiplications, a clever weighting trick called attention, and an ocean of text combine to produce machines that write. A working tour — with live demos you can poke — of what a transformer actually computes.

By Majid Mazouchi · Interactive Edition · ~30 min read

§ 01 — Orientation

The big picture

A large language model (LLM) does one deceptively simple thing: given a sequence of text, it predicts the next token. Everything else (answering questions, writing code, summarising, translating) is an emergent consequence of doing that one task extraordinarily well, across trillions of tokens of training text, with a network large enough to absorb the statistical structure of language.

The dominant architecture behind today's LLMs is the transformer, introduced in 2017. Its central innovation, self-attention, lets every position in a sequence look directly at every other position in a single step, with no recurrence and no fixed window. That property is what makes the model both parallelizable on GPUs and remarkably good at capturing long-range dependencies.

Before diving into the mathematics, here is the whole pipeline at a glance. Text enters on the left; a probability distribution over the next token leaves on the right.

Tokenize

Split text into sub-word units; map each to an integer ID.

Embed

Look up a learned vector per token; add positional information.

Transformer ×N

Attention + feed-forward layers refine each vector using context.

Project

Map the final vector to a score (logit) for every vocabulary word.

Sample

Softmax into probabilities; pick the next token; repeat.

◆ Practical note

The model has no memory between calls beyond the text in its context window. "Conversation" is an illusion created by re-feeding the entire transcript each turn. Everything the model "knows" in a session lives in those tokens plus its frozen weights.
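The re-feeding loop can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `toy_model` is a hypothetical stand-in for a stateless model call, and the `User:` / `Assistant:` transcript format is an assumption made for the example.

```python
def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a stateless LLM call: it sees only `prompt`."""
    # Report how many user turns are visible in the text it was handed.
    return f"I can see {prompt.count('User:')} user turn(s)."


def chat_turn(transcript: list, user_message: str) -> str:
    """Append the user message, re-send the whole transcript, store the reply."""
    transcript.append(f"User: {user_message}")
    prompt = "\n".join(transcript) + "\nAssistant:"
    reply = toy_model(prompt)  # the model receives the entire history every time
    transcript.append(f"Assistant: {reply}")
    return reply


transcript = []
first = chat_turn(transcript, "Hello")
second = chat_turn(transcript, "Do you remember me?")
print(first)
print(second)
```

The second reply can refer to the first turn only because the caller sent it again; drop the earlier lines from `transcript` and that "memory" is gone.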

§ 02 — Representation

Tokens & embeddings

Models don't see characters or whole words — they see tokens, sub-word fragments produced by an algorithm such as Byte-Pair Encoding (BPE). Common words become single tokens; rare words split into pieces. The word tokenization might become token + ization; an emoji or an unusual name may fragment into several bytes.
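To make the splitting concrete, here is a minimal sketch of the encoding half of BPE: start from single characters and repeatedly apply the highest-priority merge that is present. The merge table below is invented for this example so that it reproduces the token + ization split; a real tokenizer learns tens of thousands of merges from a corpus and usually works on bytes.

```python
def bpe_encode(word, merges):
    """Split `word` into sub-word symbols by applying ranked merges greedily."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, highest priority first (illustration only).
merges = [
    ("t", "o"), ("k", "e"), ("to", "ke"), ("toke", "n"),
    ("i", "o"), ("io", "n"), ("a", "t"), ("at", "ion"),
    ("i", "z"), ("iz", "ation"),
]
pieces = bpe_encode("tokenization", merges)
unknown = bpe_encode("xyz", merges)
print(pieces, unknown)
```

A string with no applicable merges, like `"xyz"` here, falls back to one symbol per character, which is the small-scale version of rare words fragmenting into many tokens.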

Each token ID indexes a row of the embedding matrix — a learned table of vectors, typically 768 to 12,288 numbers wide. This vector is the token's starting "meaning," a point in a high-dimensional space where semantically related tokens sit near one another.

Embedding lookup x_i = E[ token_id_i ] + p_i where E ∈ ℝ^{vocab × d}, p_i = positional vector
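As a rough sketch of the lookup, assuming a tiny vocabulary and a learned positional table: the sizes, the random initial values, and the names `E`, `P`, and `embed` are all illustrative stand-ins for trained parameters.

```python
import random

rng = random.Random(0)
vocab_size, d_model, max_len = 10, 4, 8  # toy sizes, chosen for the example

# E: one row per token ID; P: one row per position (random stand-ins for learned values).
E = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
P = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(max_len)]


def embed(token_ids):
    """x_i = E[token_id_i] + p_i for every position i."""
    return [
        [e + p for e, p in zip(E[token_id], P[i])]
        for i, token_id in enumerate(token_ids)
    ]


x = embed([3, 7, 3])  # token 3 appears at positions 0 and 2
print(len(x), len(x[0]))
```

The same token ID at two positions yields two different vectors, because the positional term differs.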

◆ Practical note

Token count — not word count — drives cost and context limits. English averages roughly 0.75 words per token. Code, numbers, and non-Latin scripts tokenize less efficiently, which is why a 1,000-word essay and a 1,000-line config file consume very different budgets.

§ 03 — Order

Positional encoding

Attention treats its inputs as a set: by itself it has no notion of word order. "Dog bites man" and "man bites dog" would be indistinguishable. To fix this we inject positional information into the embeddings.

The original transformer used fixed sinusoids of different frequencies, with each frequency contributing one sine/cosine pair that occupies two embedding dimensions. Each position gets a unique fingerprint, and the smoothly varying frequencies let the model generalize to relative distances. Drag the controls below to see the pattern.

Sinusoidal positional encoding

Each row is one embedding dimension; each column is a sequence position. Colour = value of sin/cos at that frequency.

Sequence length 48

Dimensions shown 32
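The pattern in the demo can be reproduced with a short function. This sketch follows the formulation in Vaswani et al. (2017), where even dimensions hold sin(pos / 10000^(2i/d)) and odd dimensions hold the matching cosine; the function name and the list-of-lists layout are choices made for this example.

```python
import math


def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position vectors."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for even in range(0, d_model, 2):
            # `even` plays the role of 2i, so the exponent is 2i / d_model.
            angle = pos / (10000 ** (even / d_model))
            table[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                table[pos][even + 1] = math.cos(angle)
    return table


pe = sinusoidal_encoding(seq_len=48, d_model=32)
print(pe[1][:4])
```

Low-index dimensions oscillate quickly with position and high-index ones slowly, which is what produces the banded picture.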

Modern models often replace this with rotary position embeddings (RoPE), which rotate the query and key vectors by an angle proportional to position, or with ALiBi, which adds a distance-based penalty directly to attention scores. Both encode relative position and extrapolate to longer sequences more gracefully.
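The relative-position property of RoPE can be checked numerically. The sketch below rotates adjacent pairs of dimensions; production implementations differ in how they pair dimensions, and the base of 10000 and the example vectors are assumptions for illustration.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)  # assumed even
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3 positions apart) at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

The two scores agree up to floating-point error: after rotation, the query-key dot product depends only on the distance between the positions.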

§ 04 — The core idea

Attention, from scratch

Here is the heart of the whole machine. For each token, attention asks: which other tokens should I pull information from, and how much? It answers using three learned projections of every token's vector:

Query (Q) — what this token is looking for. Key (K) — what each token offers. Value (V) — the actual content each token will hand over. The relevance of token j to token i is the dot product of i's query with j's key. High dot product → strong match → large weight on j's value.

Scaled dot-product attention Attention(Q,K,V) = softmax( QK^⊤ / √d_k ) V
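The formula translates almost line for line into code. This is a minimal, unoptimized sketch on plain Python lists; the tiny Q, K, V matrices are made-up values, and a real model would use batched tensor operations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per key, summing to 1
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
print(weights[0])
```

The first query matches the first and third keys equally well and the second key less, so its output is a blend weighted toward their values.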

The demo below uses a toy sentence. Click any token to make it the query; the bars show how strongly it attends to every token (including itself). Notice how pronouns reach back to their referents and verbs attend to their subjects, the kind of structure a trained model discovers on its own.

Self-attention explorer

Click a token to set the query. Bars = attention weights (they sum to 1).


◆ Intuition

Attention is a soft, content-based lookup — like a dictionary where every key matches the query a little, and the result is a weighted blend of all values rather than a single hit. The weights are computed fresh for every token at every layer.

§ 05 — The weighting

Softmax & the √d_k scaling

The raw dot products (logits) can be any real numbers. Softmax turns them into a probability distribution: exponentiate, then normalize so they sum to one. Larger scores get exponentially more weight, but nothing ever goes negative or exceeds one.

Softmax with temperature τ softmax(z)_i = exp(z_i/τ) / Σ_j exp(z_j/τ)
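A small sketch of the temperature formula, using made-up logits. The function name `softmax_t` is chosen for this example, and τ must be strictly positive here.

```python
import math


def softmax_t(logits, tau=1.0):
    """softmax(z)_i = exp(z_i / tau) / sum_j exp(z_j / tau), computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_t(logits, tau=0.5)
base = softmax_t(logits, tau=1.0)
flat = softmax_t(logits, tau=2.0)
print(sharp, base, flat)
```

Lowering τ moves probability mass toward the largest logit; raising it pushes the distribution toward uniform.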

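Why divide by √d_k inside attention? In high dimensions, dot products of random vectors grow with the dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so its typical magnitude is √d_k. Without scaling, the logits get large, softmax saturates into a near one-hot spike, and gradients vanish. Dividing by the square root of the key dimension brings the variance back to roughly one and keeps the distribution well-behaved. The slider below shows the same effect via temperature: low τ sharpens, high τ flattens.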

Softmax temperature

Fixed logits, reshaped by temperature. Low τ → confident & peaky. High τ → diffuse & uncertain.

Temperature τ 1.00
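The growth of dot products with dimension is easy to confirm empirically. The sketch below assumes query and key components drawn independently from a standard normal, which is an idealization of real activations; the sample sizes and seed are arbitrary.

```python
import math
import random


def dot_product_variance(d_k, trials, rng, scale=False):
    """Estimate the variance of q.k (optionally divided by sqrt(d_k))."""
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        samples.append(s / math.sqrt(d_k) if scale else s)
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials


rng = random.Random(0)
raw = {d: dot_product_variance(d, 2000, rng) for d in (4, 64)}
scaled = {d: dot_product_variance(d, 2000, rng, scale=True) for d in (4, 64)}
print(raw, scaled)
```

Under these assumptions the unscaled variance tracks d_k, while the scaled version stays near 1 regardless of dimension.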

§ 06 — Parallelism

Multi-head attention

One attention computation captures one kind of relationship. Real models run many in parallel — heads — each with its own learned Q/K/V projections into a smaller subspace. One head might track subject–verb agreement, another may follow coreference, another local syntax. Their outputs are concatenated and mixed by a final linear layer.

Multi-head attention head_h = Attention(QW^Q_h, KW^K_h, VW^V_h)
MHA = Concat(head₁, …, head_H) W^O
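The two formulas above can be sketched as follows for the self-attention case, where Q, K, and V are all the same input X. The dimensions, the random projection matrices, and the helper names are assumptions for illustration; trained models learn these weights and compute all heads in one batched operation.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, list(zip(*K)))  # Q K^T, shape n x n
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head vectors for each token, then mix with W_O.
    concat = [[v for out in head_outputs for v in out[i]] for i in range(len(X))]
    return matmul(concat, W_O)


rng = random.Random(0)
n, d_model, H = 3, 8, 2
d_head = d_model // H
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(H)]
W_O = random_matrix(d_model, d_model, rng)
out = multi_head_attention(X, heads, W_O)
print(len(out), len(out[0]))
```

Each head works in a d_model / H dimensional subspace, so the concatenated result has the same width as the input and the block can be stacked.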

The matrix below is a full attention map for one head over our toy sentence: row = query token, column = key token, shading = weight (darker is higher). Because this is a decoder, the upper triangle is masked (a token cannot attend to the future), hence the staircase. Hover any cell to read the weight.

Causal attention matrix

Row attends to column. Dark = high weight. Upper-right is masked: no peeking ahead.
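One common way to implement the mask, sketched here on plain lists, is to set every score above the diagonal to negative infinity before the softmax so that those positions receive exactly zero weight. The all-zero score matrix is a made-up input chosen to make the staircase obvious.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # equal raw scores everywhere
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

With equal scores, row i spreads its weight evenly over the i + 1 visible positions, and everything to the right of the diagonal is zero.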

§ 07 — Schematics

Anatomy of the attention block

The formula in §04 is a dataflow. Drawn as a circuit, scaled dot-product attention is a short vertical pipeline: the query and key streams meet in a matrix multiply, get scaled and (in a decoder) masked, pass through softmax, and the resulting weights multiply the value stream. With Q, K ∈ ℝ^{n×d_k} and V ∈ ℝ^{n×d_v}, the output is in ℝ^{n×d_v}: one refined vector per token.

Fig. 1 — Scaled dot-product attention

The atomic unit. Q and K produce weights; those weights blend V.

Legend: input / output · attention op · optional / mask

Multi-head attention wraps this unit. The input is projected by h independent sets of linear layers into h lower-dimensional Q/K/V triples; each runs its own scaled dot-product attention in parallel; the results are concatenated and passed through a final output projection. Each head is free to specialize on a different relationship.

Fig. 2 — Multi-head attention

h parallel attention heads, each with its own projections, then concat + mix.

Legend: linear / projection · attention

§ 08 — The repeating unit

The transformer block

Attention is only half of each layer. A complete transformer block wraps it with three more ingredients that make deep stacks trainable:

1 · Residual connections

Every sub-layer adds its output back to its input: x + Sublayer(x). This gives gradients a clean highway to flow backward through dozens of layers and lets each layer learn a small refinement rather than a full transformation.

2 · Layer normalization

Before (or after) each sub-layer, activations are normalized to zero mean and unit variance, then rescaled by learned parameters. This stabilizes training across the depth of the network.

3 · The feed-forward network

After attention mixes information across tokens, a position-wise FFN processes each token independently: expand to ~4× width, apply a nonlinearity (GELU/SwiGLU), project back. This is where much of the model's raw knowledge and capacity lives: with a 4× expansion, the FFN holds about two-thirds of each block's weights.
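The two-thirds figure follows from a quick count of weight matrices. This sketch assumes the plain two-matrix FFN with a 4× hidden width and four d × d attention projections (Q, K, V, output), and it ignores biases, normalization parameters, and embeddings; the width of 768 is just an example.

```python
def block_parameter_split(d_model, expansion=4):
    """Count weight-matrix parameters in one block (biases and norms ignored)."""
    attention_params = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
    ffn_params = 2 * d_model * (expansion * d_model)  # up-projection + down-projection
    total = attention_params + ffn_params
    return attention_params, ffn_params, ffn_params / total


attn, ffn, ffn_fraction = block_parameter_split(768)
print(attn, ffn, round(ffn_fraction, 3))
```

The ratio is 8d² / 12d² = 2/3 for any width d under these assumptions.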

One block (pre-norm form) x ← x + MHA( LayerNorm(x) )
x ← x + FFN( LayerNorm(x) )
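The two update lines can be sketched as a function that takes the sub-layers as arguments. Here `mha` and `ffn` are placeholder callables rather than real attention and feed-forward layers, and the layer norm omits the learned gain and bias for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transformer_block(X, mha, ffn):
    """Pre-norm block: x <- x + MHA(LN(x)), then x <- x + FFN(LN(x))."""
    X = add(X, mha([layer_norm(x) for x in X]))    # mixes information across tokens
    X = add(X, [ffn(layer_norm(x)) for x in X])    # applied to each token separately
    return X


# Placeholder sub-layers (assumptions for the demo, not real layers).
def mean_mixer(rows):
    """Stand-in for attention: every token receives the average of all tokens."""
    avg = [sum(col) / len(rows) for col in zip(*rows)]
    return [avg[:] for _ in rows]


def relu_ffn(x):
    """Stand-in for the feed-forward network: element-wise ReLU."""
    return [max(0.0, v) for v in x]


X = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5]]
Y = transformer_block(X, mean_mixer, relu_ffn)
identity = transformer_block(X, lambda rows: [[0.0] * 4 for _ in rows], lambda x: [0.0] * 4)
print(Y)
```

When both sub-layers output zeros the block returns its input unchanged, which is the sense in which each layer only has to learn a correction.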

Drawn out, the two residual highways (rust, dashed) are the key to depth: each sub-block only has to learn a correction to the signal flowing past it.

Fig. 3 — A decoder block (pre-norm)

Signal flows top to bottom. Residual lanes carry the input around each sub-block to the ⊕ adders.

Legend: norm / I-O · attention · feed-forward · residual / add

§ 09 — Depth

Stacking & the decoder

Stack the block N times — 12 in early models, 80+ in the largest — and you have the body of the network. Information flows upward: lower layers tend to capture surface patterns and syntax; higher layers assemble meaning, task structure, and abstraction.

Three architectural families exist. Encoder-only models (BERT) see the whole sequence bidirectionally and excel at understanding tasks. Encoder-decoder models (T5) read an input and write an output, ideal for translation. Today's generative chat models are almost all decoder-only: a single stack with causal masking so each position attends only to itself and the past, trained purely to predict the next token.

◆ Why decoder-only won

One stack, one objective, trivially parallel training, and the same machinery handles every task by framing it as text continuation. Simplicity scales — and scaling, more than clever architecture, has driven most recent capability gains.

§ 10 — Two streams

Cross-attention

Everything so far has been self-attention: the query, key, and value all come from the same sequence — tokens attending to other tokens in the same stream. Cross-attention changes one thing: the queries come from one sequence while the keys and values come from another. It is the bridge that lets one stream read a different one.

Self vs. cross self: Q,K,V ← same X
cross: Q ← decoder · K,V ← encoder output

This is the mechanism inside classic encoder-decoder models. The encoder reads the source (say, an English sentence) and produces a set of context vectors — its memory. Each decoder block then runs masked self-attention over what it has generated so far, and a cross-attention layer whose queries (from the decoder) probe the encoder's keys and values. That is how a translation decoder decides which source words to look at while emitting each target word.
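The only change from self-attention is where the inputs come from, which a short sketch makes visible. For brevity it uses the raw vectors as queries, keys, and values (identity projections, an assumption of this example); real layers apply learned W_Q, W_K, and W_V first.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_output):
    """Queries come from the decoder; keys and values come from the encoder."""
    d_k = len(encoder_output[0])
    outputs, weights = [], []
    for q in decoder_states:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in encoder_output
        ]
        w = softmax(scores)  # no causal mask: the whole source is visible
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, encoder_output)) for c in range(d_k)]
        )
    return outputs, weights


decoder_states = [[1.0, 0.0], [0.0, 1.0]]              # 2 target positions
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
outputs, weights = cross_attention(decoder_states, encoder_output)
print(len(weights), len(weights[0]))
```

The weight matrix is rectangular, one row per target position and one column per source position, and the output has one vector per decoder position.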

Fig. 4 — Cross-attention in an encoder-decoder

The decoder's queries (Q) read the encoder's keys and values (K, V) — the rust links.

Legend: attention · feed-forward · cross link (K, V from encoder)

Cross-attention reaches well beyond translation. Speech models like Whisper let a text decoder cross-attend to encoded audio. Vision-language models use it so text tokens can query image-patch embeddings — the queries are words, the keys and values are pixels. It is the general recipe whenever one modality or sequence must condition on another.

◆ Practical note

Today's decoder-only chat models (GPT-style) have no cross-attention — they fold everything into one self-attention stream by simply concatenating context and prompt. Cross-attention re-appears mainly in encoder-decoder and multimodal designs, where keeping the two streams separate is an advantage.

§ 11 — Generation

From logits to words

The final layer projects each token's vector onto the vocabulary, producing a logit per possible next token. Softmax turns these into probabilities, and a decoding strategy chooses one. Greedy decoding always takes the top token, which is deterministic but tends to produce repetitive text; sampling adds controlled randomness.

Three knobs shape the output. Temperature reshapes the distribution (as above). Top-k keeps only the k most likely tokens. Top-p (nucleus) keeps the smallest set whose cumulative probability exceeds p. The demo below applies temperature and top-k to a toy next-token distribution — greyed bars are discarded before sampling.
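A minimal sketch of the three knobs, using a made-up vocabulary and logits for the weather prompt. The helper names are chosen for this example; this version of top-p keeps tokens until the running total reaches p, and temperature must be strictly positive.

```python
import math
import random


def softmax(logits, tau=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out discarded tokens and rescale the survivors to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    return renormalize(probs, keep)


tokens = ["sunny", "cloudy", "rainy", "cold", "nice", "purple"]  # hypothetical
logits = [3.0, 2.2, 1.9, 1.0, 0.8, -2.0]                         # hypothetical
probs = softmax(logits, tau=0.8)
top_k_probs = top_k_filter(probs, k=5)
top_p_probs = top_p_filter([0.5, 0.3, 0.1, 0.06, 0.04], p=0.75)

rng = random.Random(0)
next_token = rng.choices(tokens, weights=top_k_probs, k=1)[0]
print(next_token, top_p_probs)
```

Filtering happens before sampling, so a discarded token such as "purple" can never be drawn no matter how the random number falls.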

Decoding playground

Toy distribution for the prompt "The weather today is ___". Adjust τ and k; greyed tokens are cut.

Temperature τ 0.80

Top-k 5

◆ Practical note

For factual or coding tasks, where you want determinism and precision, lower the temperature (0–0.3). For brainstorming or creative writing, raise it (0.7–1.0). Temperature 0 is effectively greedy decoding: reproducible, but prone to loops.

§ 12 — Learning

How models are trained

Capability is built in stages.

Pre-training

The model reads a vast corpus and is optimized, token by token, to predict the next one (cross-entropy loss). This is where it absorbs grammar, facts, reasoning patterns, and world structure — the overwhelming majority of compute is spent here, often across thousands of GPUs for weeks.

Supervised fine-tuning (SFT)

A pre-trained model continues text but doesn't naturally follow instructions. SFT trains it on curated prompt–response pairs so it learns the helpful-assistant format.

Reinforcement learning from human feedback (RLHF)

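Humans rank competing responses; a reward model learns those preferences; the LLM is then optimized (e.g. with PPO) to produce outputs the reward model scores highly. Direct preference optimization (DPO) is a common alternative that skips the explicit reward model and trains on the preference pairs directly. Either way, the goal is to align tone, helpfulness, and safety with human judgement.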

Pre-training objective (next-token cross-entropy) ℒ = − Σ_t log P( x_t | x_1, …, x_{t−1} ; θ )
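A worked instance of the objective, with invented probabilities. Each row of `step_probs` is the model's predicted distribution at one step, and `targets` holds the index of the token that actually came next; both are toy values for illustration.

```python
import math


def next_token_loss(step_probs, targets):
    """Sum over steps of -log P(correct next token | preceding tokens)."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


step_probs = [
    [0.5, 0.25, 0.25],      # step 1: correct token (index 0) gets 0.5
    [0.25, 0.25, 0.5],      # step 2: correct token (index 1) gets 0.25
    [0.125, 0.75, 0.125],   # step 3: correct token (index 2) gets 0.125
]
targets = [0, 1, 2]
loss = next_token_loss(step_probs, targets)
print(loss)
```

The loss here is ln 2 + ln 4 + ln 8 = 6 ln 2, about 4.16 nats. Training nudges the weights so that the probability assigned to each correct token rises and this sum falls.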

§ 13 — Engineering

Practical engineering notes

The quadratic cost of context

Attention compares every token with every other, so compute and memory scale as O(n²) in sequence length n. Doubling the context roughly quadruples the attention cost — the main reason long context windows are expensive.
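The arithmetic behind "doubling quadruples" is short enough to write down. The sketch counts entries in the n × n score matrix for a single head and layer; the 2-byte value size (16-bit floats) is an assumption, and optimized kernels avoid storing the full matrix at all.

```python
def score_matrix_entries(n):
    """Number of query-key pairs compared for a sequence of n tokens."""
    return n * n


def score_matrix_bytes(n, bytes_per_value=2):
    """Memory for one head's full score matrix, if it were stored explicitly."""
    return score_matrix_entries(n) * bytes_per_value


for n in (1024, 2048, 4096):
    print(n, score_matrix_entries(n), score_matrix_bytes(n))

ratio = score_matrix_entries(2048) / score_matrix_entries(1024)
```

Going from 1,024 to 2,048 tokens raises the pair count from about 1 million to about 4.2 million.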

KV caching

During generation, the keys and values of past tokens never change. Caching them means each new token needs only O(n) attention work against the stored keys and values, instead of recomputing attention over the whole prefix at O(n²), which makes it one of the most important inference optimizations. The flip side: the cache grows linearly with context and can dominate GPU memory at long lengths.
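A minimal single-head sketch of the idea: keep the past keys and values in a list, and let each new token attend against that list. The `KVCache` class is a hypothetical helper for this example, and the token vectors double as query, key, and value (identity projections) to keep the code short.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Attention output for a single query against all stored keys/values."""
    d_k = len(q)
    w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])
    return [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]


class KVCache:
    """Stores keys and values of past tokens so they are computed only once."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)  # one pass over the cache: O(n)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in tokens]

# Reference: recompute the last position from scratch over the full prefix.
recomputed_last = attend(tokens[-1], tokens, tokens)
print(cached_outputs[-1], recomputed_last)
```

The cached and recomputed results match, and the cache holds one key and one value per generated token, which is exactly the memory that grows with context.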

Attention variants for efficiency

FlashAttention reorganizes the computation to avoid materializing the full n×n matrix in slow memory, giving large speed and memory wins with identical math. Grouped-query and multi-query attention (GQA/MQA) share keys and values across heads, shrinking the KV cache dramatically.
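To see why sharing keys and values matters, it helps to compute the cache size directly. The model shape below (32 layers, 32 query heads, head dimension 128, 8,192-token context, 16-bit values) is a hypothetical configuration picked for round numbers, not a description of any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values for every layer, KV head, and cached position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


config = dict(layers=32, head_dim=128, seq_len=8192)  # hypothetical model shape

mha = kv_cache_bytes(kv_heads=32, **config)  # one K/V head per query head
gqa = kv_cache_bytes(kv_heads=8, **config)   # groups of 4 query heads share K/V
mqa = kv_cache_bytes(kv_heads=1, **config)   # all query heads share one K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(name, size / 2**30, "GiB")
```

For this shape the cache shrinks from 4 GiB per sequence to 1 GiB with eight KV heads and to 128 MiB with one, in direct proportion to the number of KV heads.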

Quantization

Weights stored in 16-bit can often be compressed to 8- or 4-bit with minor quality loss, cutting memory and increasing throughput — what makes capable models runnable on a single GPU or even a laptop.
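One simple scheme, shown here as a sketch, is symmetric absmax quantization: scale the weights so the largest magnitude maps to 127, round to integers, and multiply back by the scale at use time. The weight values are made up, and practical 4-bit methods are more elaborate (per-group scales, calibration data), so treat this as the basic idea only.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now takes one byte instead of two, and the round-trip error is bounded by half the scale, which is the "minor quality loss" being traded for memory.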

◆ Field notes for builders

· Context is working memory, not knowledge. Put the facts you need in the prompt (retrieval) rather than trusting recall.
· Models hallucinate confidently. Next-token prediction optimizes plausibility, not truth — verify anything load-bearing.
· Position matters. Information at the very start and end of a long context is used more reliably than material buried in the middle.
· Prompt structure is real engineering. Clear instructions, examples, and explicit output formats measurably change behaviour.

§ 14 — Sources

References & further reading

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. — The transformer paper. arxiv.org/abs/1706.03762
Alammar, J. (2018). The Illustrated Transformer. — The classic visual walkthrough. jalammar.github.io/illustrated-transformer
Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers. arxiv.org/abs/1810.04805
Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI.
Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). arxiv.org/abs/2005.14165
Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arxiv.org/abs/2104.09864
Press, O. et al. (2021). Train Short, Test Long: Attention with Linear Biases (ALiBi). arxiv.org/abs/2108.12409
Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention. arxiv.org/abs/2205.14135
Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT / RLHF). arxiv.org/abs/2203.02155
Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models. arxiv.org/abs/2305.13245
Karpathy, A. (2023). Let's build GPT: from scratch, in code, spelled out. — Hands-on video implementation. youtube.com
Liu, N. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172
