A Visual Field Guide · Multimodal Models
Language models never read your pixels. Before a single word is predicted, a picture is quietly turned into tokens — little vectors that sit shoulder to shoulder with text. Here is how that happens, in plain language, with things you can poke.
When you attach a photo to a chatbot, it feels like the model is "looking" at it. Mechanically, something more interesting is going on. The raw image — or its base64 string — is never fed to the language model as text. It is intercepted first, run through a separate vision pipeline, and converted into a compact handful of visual tokens: vectors of numbers that live in the same context window as your words.
So the transformer that writes the answer doesn't see pixels and it doesn't see characters. It sees a sequence like: [text] [text] [img] [img] [img] [text] [text], where every slot is a vector.
The text tokens and the image tokens are the same kind of object — points in a high-dimensional space — so the model's attention can mix freely between "the word cat" and "the part of the picture that looks like a cat." Everything below is just different machinery for producing those [img] slots.
Pixels can't go in directly. A modest 224×224 image is ~50,000 pixels, and attention compares every element to every other element — that's billions of comparisons, hopeless. The Vision Transformer (ViT) trick is to chop the image into a grid of fixed-size patches (say 16×16 pixels each), flatten each patch into a vector, and run a learned linear projection that maps it into a tidy embedding. Each patch becomes one token.
Because attention has no built-in sense of order, a positional embedding is added so the model knows the top-left patch from the bottom-right. Drag the slider and watch the patch grid — and the token count — change.
Interactive · Patchification
patches = (resolution ÷ patch size)² · each → one projected token vector
Notice the tension: smaller patches = more detail but more tokens (and quadratically more compute). This single dial — patch size versus token budget — echoes through every design decision that follows.
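To make the chopping concrete, here is a minimal NumPy sketch of patchification followed by a linear projection. The random image, the 64-wide embedding, and the names `patchify` and `W_proj` are illustrative assumptions; a real ViT learns the projection weights and uses a much wider embedding.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C),
    ordered row by row across the grid.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a real RGB image
patches = patchify(image, patch=16)      # 14 x 14 grid of 16x16x3 patches

embed_dim = 64                           # toy width, chosen for illustration
W_proj = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ W_proj                # one token vector per patch

print(patches.shape, tokens.shape)
```

Halving the patch size to 8 quadruples the patch count on the same image, which is the dial described above.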
Here's a subtlety. Attention (the mechanism that lets every token compare itself to every other token and decide what to focus on) treats its inputs as an unordered bag: shuffle the patches and the raw math gives the same answer. For text that's already a problem (word order matters); for a 2-D image it's worse, because a patch has both a row and a column. So each patch token gets a positional embedding added to it, encoding where it sat in the grid.
Text positions are a single line of numbers. Image positions are two-dimensional — and when you feed an image at a new resolution, those position codes often have to be stretched (interpolated) to fit. Hover the cells to see the difference.
Interactive · 1-D text vs 2-D image positions
TEXT · one index per token
IMAGE · a (row, col) pair per patch
Same patch contents, different position → the model can tell sky-blue (top) from water-blue (bottom).
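A rough sketch of both ideas, adding a 2-D position code and stretching it for a new resolution, might look like this. The table here is random rather than learned, the sizes are toy values, and `resize_pos_embed` is a hypothetical helper that uses plain bilinear interpolation for simplicity; real models may use other schemes.

```python
import numpy as np

def resize_pos_embed(pos: np.ndarray, new_grid: int) -> np.ndarray:
    """Bilinearly interpolate a (G, G, D) position table to (new_grid, new_grid, D)."""
    old_grid = pos.shape[0]
    # Sample points expressed in the old grid's coordinates (corners aligned).
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows first, then along columns.
    rows = pos[lo] * (1 - frac)[:, None, None] + pos[hi] * frac[:, None, None]
    return rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]

rng = np.random.default_rng(0)
grid, dim = 14, 8
pos_table = rng.normal(size=(grid, grid, dim))     # stand-in for a learned table
patch_tokens = rng.normal(size=(grid * grid, dim))

# Each patch token gets the code for its (row, col) cell added to it.
tokens_with_pos = patch_tokens + pos_table.reshape(grid * grid, dim)

# A larger input yields a 24 x 24 grid, so the table is stretched to fit.
stretched = resize_pos_embed(pos_table, 24)
print(tokens_with_pos.shape, stretched.shape)
```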
Now we have visual tokens. How do they actually reach the LLM? There are four families: projectors, resamplers, cross-attention, and discrete tokens, described in that order below. The first dominates today's chat models; the others are worth knowing because they reveal the trade-offs.
Interactive · Method explorer
A frozen image encoder (usually a CLIP Vision Transformer) reads the image and outputs patch features. A small trainable projector — often just a two-layer MLP — reshapes those features so they land in the LLM's own word-embedding space. The result is a row of visual tokens that get interleaved with the text tokens and fed in as one sequence. This is "early fusion": the model treats picture and prose as one stream.
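As a sketch of that projector step, assuming toy widths and random weights in place of trained ones, a two-layer MLP and the interleaving could look like this. The names `project`, `text_before`, and `text_after` are made up for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(patch_features, w1, b1, w2, b2):
    """Two-layer MLP mapping vision features to the LLM's embedding width."""
    return gelu(patch_features @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
vision_dim, llm_dim = 32, 48                        # toy widths
patch_features = rng.normal(size=(9, vision_dim))   # frozen encoder output, 9 patches

w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)
visual_tokens = project(patch_features, w1, b1, w2, b2)

# Text embeddings on either side of the image, already in the LLM's space.
text_before = rng.normal(size=(4, llm_dim))
text_after = rng.normal(size=(6, llm_dim))
sequence = np.concatenate([text_before, visual_tokens, text_after], axis=0)

print(sequence.shape)   # 4 text + 9 image + 6 text rows
```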
A high-res image can yield hundreds of patch features. A resampler introduces a small set of learnable "query" tokens — say 32 or 64 — that use attention to pull the salient information out of all those features and pack it into a fixed token budget, no matter how big the image was. Flamingo's Perceiver Resampler does this blind; BLIP-2's Q-Former also peeks at the text query, so it can keep the bits relevant to your question.
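The fixed-budget idea can be sketched with a single cross-attention step in which a fixed set of query vectors reads from however many patch features arrive. This is a simplification under stated assumptions: one head, one layer, random weights, and no text conditioning, whereas real resamplers stack several layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(features, queries, wq, wk, wv):
    """Single-head cross-attention: queries read from patch features.

    features: (N, D) patch features, where N varies with image size
    queries:  (M, D) query vectors, where M is fixed
    returns:  (M, D) summary tokens
    """
    q, k, v = queries @ wq, features @ wk, features @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (M, N)
    return weights @ v

rng = np.random.default_rng(0)
dim, num_queries = 32, 64
queries = rng.normal(size=(num_queries, dim))   # learned in a real model
wq, wk, wv = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

for num_patches in (196, 576, 2304):            # small, medium, large image
    features = rng.normal(size=(num_patches, dim))
    summary = resample(features, queries, wq, wk, wv)
    print(num_patches, "->", summary.shape)
```

Whatever the patch count, the output has one row per query, which is why the token bill stays flat.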
Instead of inserting visual tokens into the input stream, the image features stay separate and the LLM reaches out to them through new cross-attention layers slotted between its frozen blocks. A clever detail: each new layer starts behind a tanh gate initialized at zero, so at the very beginning the model behaves exactly like the original text-only LLM and only gradually learns to let vision in — protecting its language skills.
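The zero-initialized gate is easy to check numerically. The sketch below is a single-head toy with random weights, and `gated_cross_attention` is an illustrative name rather than any library's API; it only shows why a gate of zero leaves the text path untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(hidden, image_features, wq, wk, wv, gate):
    """Add image information to text hidden states through a tanh gate.

    hidden:         (T, D) hidden states coming out of a frozen LLM block
    image_features: (N, D) features from the vision encoder
    gate:           scalar, learned in training and initialized at 0
    """
    q, k, v = hidden @ wq, image_features @ wk, image_features @ wv
    attended = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return hidden + np.tanh(gate) * attended

rng = np.random.default_rng(0)
dim = 16
hidden = rng.normal(size=(5, dim))
image_features = rng.normal(size=(49, dim))
wq, wk, wv = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

at_init = gated_cross_attention(hidden, image_features, wq, wk, wv, gate=0.0)
opened = gated_cross_attention(hidden, image_features, wq, wk, wv, gate=0.5)

print(np.array_equal(at_init, hidden))   # True: tanh(0) = 0, so vision adds nothing
print(np.abs(opened - hidden).max() > 0) # True: a nonzero gate lets vision in
```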
The most radical option. A VQ-VAE-style tokenizer maps each image region to the nearest entry in a learned codebook, turning the picture into a sequence of discrete integer tokens — exactly like words in a dictionary. Now one transformer can be trained from scratch to predict text and image tokens with the same machinery, which is what unlocks models that generate images token by token, not just read them.
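The nearest-entry lookup at the heart of that tokenizer fits in a few lines. This sketch assumes a random 512-entry codebook and synthetic patch vectors; a real VQ-VAE learns both the encoder and the codebook.

```python
import numpy as np

def quantize(patch_vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each patch vector."""
    # Squared Euclidean distance between every vector and every entry.
    diffs = patch_vectors[:, None, :] - codebook[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))            # stand-in for a learned codebook

# Synthetic patches: slightly noisy copies of known entries.
true_ids = rng.integers(0, 512, size=10)
patch_vectors = codebook[true_ids] + 0.01 * rng.normal(size=(10, 16))

ids = quantize(patch_vectors, codebook)          # discrete integer tokens
reconstructed = codebook[ids]                    # what a decoder would start from

print(ids)
print(np.abs(reconstructed - patch_vectors).max())   # small rounding error
```

The integers in `ids` are the image's tokens, and the nonzero error on the last line is the detail that rounding throws away.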
| Method | Where vision enters | Token cost | Seen in |
|---|---|---|---|
| Projector | Interleaved in the input | Scales with image size | LLaVA, most chat VLMs |
| Resampler | Input, but fixed-length | Constant (e.g. 64) | Flamingo, BLIP-2 |
| Cross-attention | Inside the LLM layers | Off the input budget | Flamingo |
| Discrete tokens | Shared vocabulary | Many per image | Chameleon |
Once visual and text tokens share a sequence, attention lets a word "reach into" the picture. When the model processes the word sun, the patches covering the bright disc light up; on grass, the green foreground dominates. This is a stylised mock-up — real attention is messier and spread across many heads and layers — but the intuition holds. Tap a word.
Interactive · word → patch attention (illustrative)
Brighter cells = higher attention weight from that word onto that patch.
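The heat map boils down to one softmax over dot products. In this toy sketch the patch keys are random unit vectors and the word's query is deliberately aligned with one patch (pretend it covers the sun), so the result is contrived by construction rather than learned.

```python
import numpy as np

def attention_weights(query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Softmax of scaled dot products between one query and many keys."""
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()      # for numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)
grid, dim = 4, 16
patch_keys = rng.normal(size=(grid * grid, dim))
patch_keys /= np.linalg.norm(patch_keys, axis=1, keepdims=True)   # unit length

sun_patch = 5                               # hypothetical: this patch shows the sun
word_query = 8.0 * patch_keys[sun_patch]    # a query pointing the same way

weights = attention_weights(word_query, patch_keys).reshape(grid, grid)
brightest = np.unravel_index(weights.argmax(), weights.shape)

print(np.round(weights, 3))
print(brightest)    # row and column of the most-attended patch
```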
The word "token" hides an important fork. Most vision-language models that read images use continuous (soft) tokens: each is a raw vector of floats, never rounded to anything. Models built to also create images often use discrete tokens: each patch is snapped to one integer ID from a fixed codebook, just like a word. Toggle to feel the difference.
Interactive · Two flavours of "token"
Because images become tokens, they cost money and context space — sometimes a lot. Different providers count differently, usually by tiling the image and charging per tile (Claude is the exception — it uses a smooth area formula). Drop in your own image to read its real dimensions, see the tile grid, and compare. (Figures follow each provider's published method; always re-check current docs.)
Interactive · Image token calculator
A single fixed grid of patches is fine for a snapshot, but it smears the fine print on a receipt or the labels on a dashboard. The modern fix, used by LLaVA-NeXT, InternVL and others and often called AnyRes ("any resolution"), is to look twice: once at a shrunk global thumbnail for overall layout, and again at several full-resolution sub-tiles for detail. Each piece is encoded separately and all their tokens are concatenated.
Dynamic high-res, schematically
tokens(global) + Σ tokens(sub-tiles) → fine text becomes legible, at a higher token bill
This is exactly why a high-resolution screenshot can cost several times the tokens of a casual photo: you're paying for the thumbnail plus every detail tile. It's the resolution-versus-budget dial from §02, scaled up to whole images.
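A back-of-the-envelope version of that bill, under loudly stated assumptions: 336 px tiles, 14 px patches (so 576 tokens per view), and a simple ceil-based grid. Real systems choose from a fixed menu of grid shapes and may pool or drop tokens, so treat `anyres_tokens` as an illustration only.

```python
import math

def tokens_per_view(side: int, patch: int) -> int:
    """Tokens produced by encoding one square view of the given side."""
    return (side // patch) ** 2

def anyres_tokens(width: int, height: int, tile: int = 336, patch: int = 14):
    """Estimate tokens for one global thumbnail plus a grid of detail tiles.

    Returns (total_tokens, number_of_detail_tiles).
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    per_view = tokens_per_view(tile, patch)
    num_tiles = rows * cols
    return per_view + num_tiles * per_view, num_tiles

for size in [(336, 336), (672, 672), (1344, 672)]:
    total, tiles = anyres_tokens(*size)
    print(size, "tiles:", tiles, "tokens:", total)
```

Under these assumptions a 672 × 672 image costs 2,880 tokens, five times the 576 of a single view.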
To ground all of this in one real system: when you send Claude an image through its API, the bytes are decoded, the picture is resized if it's large, and then it's turned into visual tokens. Claude's published estimate skips per-tile arithmetic and uses a clean area formula: tokens ≈ (width × height) / 750, with both dimensions in pixels, which is the ratio the worked numbers below follow.
Two practical rules come with it. First, if the image's long edge is over ~1568 px (or the image exceeds roughly 1.15 megapixels overall), Claude scales it down first, preserving aspect ratio, so sending a giant file mostly adds upload latency, not fidelity. (Newer Opus models can keep more pixels and therefore spend more tokens for sharper detail when you want it.) Second, the picture is padded slightly so its dimensions are a multiple of 28 px, which matters if you ask Claude for pixel coordinates: they come back relative to the resized image. Some worked numbers:
| Image size | Megapixels | ≈ tokens | Note |
|---|---|---|---|
| 200 × 200 | 0.04 MP | ~54 | tiny — may lose detail |
| 1000 × 1000 | 1.0 MP | ~1,334 | no resize needed |
| 1092 × 1092 | 1.19 MP | ~1,590 | near the sweet spot |
| 3000 × 2000 | 6.0 MP | capped after resize | scaled to ≤1568 long edge first |
Try the calculator in §07 with the Claude style setting to reproduce these. Figures track Anthropic's vision documentation; verify current numbers there before quoting them.
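The table's numbers can be reproduced with a few lines. This sketch assumes a divisor of 750 pixels per token, which is inferred from the worked rows above, and applies only the long-edge rule described in the text; the service may enforce other limits, so the function names and constants here are illustrative rather than an official calculator.

```python
import math

MAX_LONG_EDGE = 1568     # long-edge limit quoted in the text
PIXELS_PER_TOKEN = 750   # assumption: inferred from the table's worked numbers

def resize_to_long_edge(width: int, height: int, max_edge: int = MAX_LONG_EDGE):
    """Scale down, preserving aspect ratio, until the long edge fits."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)

def estimate_tokens(width: int, height: int) -> int:
    """Area-based estimate, rounded up to a whole token."""
    return math.ceil(width * height / PIXELS_PER_TOKEN)

for size in [(200, 200), (1000, 1000), (1092, 1092)]:
    print(size, "->", estimate_tokens(*size))

print(resize_to_long_edge(3000, 2000))   # dimensions after the long-edge rule
```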
When you hand an image to an API instead of a URL, you usually inline it as a base64 string — the image's raw bytes rewritten using 64 plain-text characters (A–Z a–z 0–9 + /). It isn't compression and it isn't encryption; it's just a way to smuggle binary data through text-only channels like a JSON body. Which raises the famous worry: does that giant string get counted as text tokens?
It does not. The base64 is transport, not content. At the door, the server decodes it back into image bytes, hands the pixels to the vision pipeline, and the string itself is discarded before the language model ever runs. Press play and watch the box get unwrapped and thrown away.
Interactive · base64 → bytes → tokens
/9j/4AAQSkZJRgABAQ…
AAD/2wBDAAYEBQYFBAY…
GBwYIChAKDQ8WFRESF…
same image as a URL or as base64 → identical token cost
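You can watch the wrapping and unwrapping with Python's standard `base64` module. The random bytes below stand in for a real image file; the point is that decoding returns the exact original bytes, and that the encoded string is about a third longer because every 3 bytes become 4 characters.

```python
import base64
import os

image_bytes = os.urandom(3000)   # stand-in for a JPEG or PNG file's raw bytes

# What the client puts in the JSON body.
encoded = base64.b64encode(image_bytes).decode("ascii")

# What the server does at the door, before any vision processing.
decoded = base64.b64decode(encoded)

print(decoded == image_bytes)            # True: nothing lost, nothing compressed
print(len(image_bytes), len(encoded))    # 3000 4000
```

So the string is a bigger payload to upload, even though its length never reaches the token counter.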
Not tokens — but it isn't free either. Two things to keep in mind:
Everything so far turns pixels into tokens. Flip the arrow and you get image generation. There are two dominant ways to run it backwards, and they map onto the soft-vs-discrete split from §06.
Interactive · two engines for generation
Autoregressive models (e.g. Chameleon) treat an image as a sequence of discrete codebook tokens and predict them one after another, exactly like writing a sentence, so a single transformer can read text and emit a picture in the same breath. (A codebook is a fixed dictionary of image "words": a VQ-VAE maps each patch to its nearest entry, and the integer index is the token.) Diffusion models instead start from noise in a continuous latent space and denoise toward an image; some recent systems lean this way to dodge the quantization blur of discrete tokens. The trade is familiar: discrete is unified and simple but coarser; diffusion is sharper but a different beast bolted alongside the language model.
Knowing how images become tokens explains most of the failures you'll hit. Each one traces back to the pipeline above.
Asked "how many sheep?", models often miscount. Patches fragment objects, and there's no explicit tally — just attention over a token soup.
"Is the cup left of the laptop?" Position embeddings are approximate, so precise left/right/above logic is shaky, especially after resizing.
If the resolution or tiling can't resolve small glyphs, the text never makes it into a clean token. Fix with higher res or a crop (§08).
Discrete-token pipelines round each patch to a codebook entry, discarding fine gradients and texture before the model ever sees them.
A small crafted sticker can hijack the patch embeddings and flip the model's read of the whole scene — the token view is exploitable.
When detail is missing, the language half happily fills the gap with a plausible guess, stated with the same confidence as a fact.
If you're building rather than just curious, the method usually follows from one question.
Interactive · decision guide
More pixels help only when the answer lives in fine detail — small text, gauge needles, license plates. For "what's in this scene?" a downscaled image is just as good and far cheaper. Many APIs expose a low/high detail switch; the low mode often charges a flat, tiny token count.
If you only care about one region (a chart, a label, a face), crop to it. You spend tokens on signal instead of sky. This beats sending a 12-megapixel frame and hoping the model zooms in.
Tiling is what lets models read high-resolution images at all — but token cost grows roughly with the number of tiles, i.e. with area. A 2× wider and taller image is ~4× the tokens. Budget accordingly when batching.
A long base64 image string does not "use up" text tokens — see §10. The bytes are decoded and routed to the vision pipeline before tokenization, so cost comes from patches and tiles, never from the length of the encoded string.
Reading and reasoning about images? A projector-style VLM is the simple, strong default. Long video or dozens of frames? A resampler keeps the token count flat. Want one model to both understand and draw? That's where discrete-token, native-multimodal designs earn their keep.