
A Visual Field Guide · Multimodal Models

How LLMs See an Image

Language models never read your pixels. Before a single word is predicted, a picture is quietly turned into tokens — little vectors that sit shoulder to shoulder with text. Here is how that happens, in plain language, with things you can poke.

§ 01 The big idea, in one breath

When you attach a photo to a chatbot, it feels like the model is "looking" at it. Mechanically, something more interesting is going on. The raw image — or its base64 string — is never fed to the language model as text. It is intercepted first, run through a separate vision pipeline, and converted into a compact handful of visual tokens: vectors of numbers that live in the same context window as your words.

So the transformer that writes the answer doesn't see pixels and it doesn't see characters. It sees a sequence like:

[word] [word] [word] [img] [img] [img] [img] … [img] [word] [word] ?

The text tokens and the image tokens are the same kind of object — points in a high-dimensional space — so the model's attention can mix freely between "the word cat" and "the part of the picture that looks like a cat." Everything below is just different machinery for producing those [img] slots.

Mental model: A vision system is a translator. Its only job is to turn a grid of pixels into the language model's native currency: token vectors.

§ 02 Step one, almost always: cut it into patches

Pixels can't go in directly. A modest 224×224 image is ~50,000 pixels, and attention compares every element to every other element — that's billions of comparisons, hopeless. The Vision Transformer (ViT) trick is to chop the image into a grid of fixed-size patches (say 16×16 pixels each), flatten each patch into a vector, and run a learned linear projection that maps it into a tidy embedding. Each patch becomes one token.

Because attention has no built-in sense of order, a positional embedding is added so the model knows the top-left patch from the bottom-right. Drag the slider and watch the patch grid — and the token count — change.

Interactive · Patchification

Example readout (224 px image, 16 px patches): 196 patches → 196 image tokens × 768 dims per token

patches = (resolution ÷ patch size)²  ·  each → one projected token vector

Notice the tension: smaller patches = more detail but more tokens (and quadratically more compute). This single dial — patch size versus token budget — echoes through every design decision that follows.
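
The patch step can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's code: the random image, the random projection weights `W` and the 768-dimensional embedding size are stand-ins chosen to match the 224 px / 16 px example above.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "resolution must divide by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for real pixels
patches = patchify(image, patch=16)        # one row per patch

# Hypothetical learned projection; random weights stand in for trained ones.
embed_dim = 768
W = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
b = np.zeros(embed_dim)
tokens = patches @ W + b                   # one token vector per patch

print(patches.shape, tokens.shape)         # (196, 768) (196, 768)
```

Halving the patch size to 8 would give 784 patches from the same image, which is the detail-versus-budget tension in action.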

§ 03 Giving each patch a place

Here's a subtlety. Attention (the mechanism that lets every token compare itself to every other token and decide what to focus on) has no built-in sense of order. It treats its inputs as an unordered bag: shuffle the patches and the raw math produces the same outputs, just shuffled the same way. For text that's already a problem (word order matters); for a 2-D image it's worse, because a patch has both a row and a column. So each patch token gets a positional embedding added to it, encoding where it sat in the grid.

Text positions are a single line of numbers. Image positions are two-dimensional — and when you feed an image at a new resolution, those position codes often have to be stretched (interpolated) to fit. Hover the cells to see the difference.

Interactive · 1-D text vs 2-D image positions

TEXT · one index per token

IMAGE · a (row, col) pair per patch

Same patch contents, different position → the model can tell sky-blue (top) from water-blue (bottom).
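
The "stretching" of position codes can be sketched as a plain 2-D resample. The function name `resize_pos_embed`, the 8-dimensional vectors and the 14×14 → 24×24 grid sizes are illustrative assumptions; bilinear interpolation is used here for brevity, and real libraries differ in the exact interpolation mode they apply.

```python
import numpy as np

def resize_pos_embed(pos: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Bilinearly resample an (H, W, D) grid of position vectors to (new_h, new_w, D)."""
    h, w, _ = pos.shape
    # Sample locations expressed in the old grid's coordinates (corners aligned).
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]          # vertical blend weights
    wx = (xs - x0)[None, :, None]          # horizontal blend weights
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bottom = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
old_grid = rng.standard_normal((14, 14, 8))     # e.g. 224 px image, 16 px patches
new_grid = resize_pos_embed(old_grid, 24, 24)   # e.g. 384 px image, same patch size

print(old_grid.shape, "->", new_grid.shape)
```

The corner positions survive unchanged, and every patch in between gets a blend of its nearest original neighbours, so "top-left" still means top-left at the new resolution.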

§ 04 Four ways to plug vision into the language model

Now we have visual tokens. How do they actually reach the LLM? There are four families. The first dominates today's chat models; the others are worth knowing because they reveal the trade-offs. Tap to explore.

Interactive · Method explorer

Projector — splice visual tokens straight into the prompt

LLaVA · most modern open VLMs · the default

A frozen image encoder (usually a CLIP Vision Transformer) reads the image and outputs patch features. A small trainable projector — often just a two-layer MLP — reshapes those features so they land in the LLM's own word-embedding space. The result is a row of visual tokens that get interleaved with the text tokens and fed in as one sequence. This is "early fusion": the model treats picture and prose as one stream.

image → CLIP encoder → projector (MLP) → visual tokens → LLM
Strengths
  • Dead simple; cheap to train (often only the projector)
  • Reuses powerful off-the-shelf encoders + LLMs
Costs
  • Visual tokens eat your context window
  • High-res images blow up the token count fast
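
A toy version of the projector idea, assuming deliberately small made-up sizes (64-dim vision features, a 128-dim LLM embedding space, 16 patches) and random weights in place of trained ones. It sketches the shape of the computation, not LLaVA's actual code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
vision_dim, llm_dim, n_patches = 64, 128, 16    # toy sizes, chosen for illustration

# Stand-in for the frozen image encoder's patch features.
patch_features = rng.standard_normal((n_patches, vision_dim))

# The trainable part: a two-layer MLP (random weights stand in for learned ones).
W1 = rng.standard_normal((vision_dim, llm_dim)) * 0.02
b1 = np.zeros(llm_dim)
W2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
b2 = np.zeros(llm_dim)

def project(features):
    """Map encoder features into the LLM's embedding space."""
    return gelu(features @ W1 + b1) @ W2 + b2

visual_tokens = project(patch_features)

# Splice the visual tokens between two runs of (stand-in) text embeddings.
text_before = rng.standard_normal((5, llm_dim))
text_after = rng.standard_normal((7, llm_dim))
sequence = np.concatenate([text_before, visual_tokens, text_after], axis=0)

print(sequence.shape)   # (28, 128): 5 text + 16 image + 7 text positions
```

Every image patch takes up one position in `sequence`, which is exactly why visual tokens eat into the context window.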

Resampler — squeeze many patches into a fixed few

Flamingo (Perceiver Resampler) · BLIP-2 (Q-Former)

A high-res image can yield hundreds of patch features. A resampler introduces a small set of learnable "query" tokens (say 32 or 64) that use attention to pull the salient information out of all those features and pack it into a fixed token budget, no matter how big the image was. Flamingo's Perceiver Resampler does this blind; BLIP-2's Q-Former is trained alongside text, and instruction-aware variants also feed the prompt in, so the queries can keep the bits relevant to your question.

many features → 64 learned queries (attention) → 64 fixed tokens → LLM
Strengths
  • Constant token cost — great for video / many images
  • Throttles redundancy in big inputs
Costs
  • Compression can drop fine detail
  • Extra module to train and tune
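
The fixed-budget trick is easiest to see in code. This is a single-head, single-layer sketch with random weights and made-up sizes (64 queries, 32 dims); real resamplers stack several such layers and add feed-forward blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(features, queries, Wq, Wk, Wv):
    """Cross-attention: each learned query gathers a weighted mix of the features."""
    q = queries @ Wq                      # (n_queries, dim)
    k = features @ Wk                     # (n_features, dim)
    v = features @ Wv                     # (n_features, dim)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v                    # (n_queries, dim), whatever n_features was

rng = np.random.default_rng(0)
dim, n_queries = 32, 64                   # illustrative sizes
queries = rng.standard_normal((n_queries, dim))   # learned in a real model
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))

small_image = rng.standard_normal((196, dim))     # few patch features
large_image = rng.standard_normal((2304, dim))    # many patch features

out_small = resample(small_image, queries, Wq, Wk, Wv)
out_large = resample(large_image, queries, Wq, Wk, Wv)
print(out_small.shape, out_large.shape)   # (64, 32) (64, 32)
```

The output length is set by the number of queries, not by the image, which is where the constant token cost comes from.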

Cross-attention — keep vision on the side, peek when needed

Flamingo (gated xattn-dense)

Instead of inserting visual tokens into the input stream, the image features stay separate and the LLM reaches out to them through new cross-attention layers slotted between its frozen blocks. A clever detail: each new layer starts behind a tanh gate initialized at zero, so at the very beginning the model behaves exactly like the original text-only LLM and only gradually learns to let vision in — protecting its language skills.

LLM block → gated cross-attn ❄→🔥 → LLM block  (↑ image features attended as keys/values)
Strengths
  • Doesn't consume input token slots
  • Gating preserves the base model's fluency
Costs
  • Adds many new parameters inside the LLM
  • Largely set aside in newer VLMs for the simpler projector
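
The zero-initialized gate is worth seeing in miniature. The sketch below assumes a single attention head, a scalar gate and random weights; Flamingo's actual layer also gates a feed-forward block, which is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_hidden, image_features, Wq, Wk, Wv, gate):
    """Text positions query the image; tanh(gate) scales how much vision leaks in."""
    q = text_hidden @ Wq
    k = image_features @ Wk
    v = image_features @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return text_hidden + np.tanh(gate) * (attn @ v)   # residual connection

rng = np.random.default_rng(0)
dim = 16                                              # illustrative size
text_hidden = rng.standard_normal((10, dim))          # 10 text positions
image_features = rng.standard_normal((64, dim))       # kept outside the input sequence
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))

at_init = gated_cross_attention(text_hidden, image_features, Wq, Wk, Wv, gate=0.0)
after_training = gated_cross_attention(text_hidden, image_features, Wq, Wk, Wv, gate=1.0)

print(np.allclose(at_init, text_hidden))         # True: the layer starts as a no-op
print(np.allclose(after_training, text_hidden))  # False: vision now contributes
```

Note that the sequence stays 10 positions long in both cases: the image never occupies input slots.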

Discrete tokens — make pixels literally part of the vocabulary

Chameleon · VQ-VAE / VQGAN tokenizers · "native" multimodal

The most radical option. A VQ-VAE-style tokenizer maps each image region to the nearest entry in a learned codebook, turning the picture into a sequence of discrete integer tokens — exactly like words in a dictionary. Now one transformer can be trained from scratch to predict text and image tokens with the same machinery, which is what unlocks models that generate images token by token, not just read them.

image patch → nearest codebook entry → token #4827 → one shared transformer
Strengths
  • Unifies understanding + generation in one model
  • Images obey the same autoregressive rules as text
Costs
  • Quantization throws away detail (information bottleneck)
  • Often needs many tokens per image; codebooks are finicky to train
Method | Where vision enters | Token cost | Seen in
Projector | Interleaved in the input | Scales with image size | LLaVA, most chat VLMs
Resampler | Input, but fixed-length | Constant (e.g. 64) | Flamingo, BLIP-2
Cross-attention | Inside the LLM layers | Off the input budget | Flamingo
Discrete tokens | Shared vocabulary | Many per image | Chameleon

§ 05 What the model actually looks at

Once visual and text tokens share a sequence, attention lets a word "reach into" the picture. When the model processes the word sun, the patches covering the bright disc light up; on grass, the green foreground dominates. This is a stylised mock-up — real attention is messier and spread across many heads and layers — but the intuition holds. Tap a word.

Interactive · word → patch attention (illustrative)

Brighter cells = higher attention weight from that word onto that patch.

§ 06 Soft tokens vs. discrete tokens

The word "token" hides an important fork. Most vision-language models that read images use continuous (soft) tokens: each is a raw vector of floats, never rounded to anything. Models built to also create images often use discrete tokens: each patch is snapped to one integer ID from a fixed codebook, just like a word. Toggle to feel the difference.

Interactive · Two flavours of "token"

Why it matters: Soft tokens preserve more nuance and are easier to use for understanding. Discrete tokens are clumsier (quantization loses detail) but let a single model treat image generation as "predict the next token," which is why they show up in native image-generating systems.
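
The difference between the two flavours fits in a few lines. This sketch uses a random 512-entry codebook and 16-dimensional patch vectors, both made-up sizes; a real VQ-VAE learns its codebook jointly with an encoder and decoder.

```python
import numpy as np

def quantize(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance via ||a||^2 - 2ab + ||b||^2, shape (n, K).
    d2 = ((vectors ** 2).sum(axis=1, keepdims=True)
          - 2.0 * vectors @ codebook.T
          + (codebook ** 2).sum(axis=1))
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 16))      # stand-in for a learned codebook
soft_tokens = rng.standard_normal((256, 16))   # continuous patch vectors

token_ids = quantize(soft_tokens, codebook)    # discrete tokens: one integer each
snapped = codebook[token_ids]                  # what the model gets back after lookup
loss = np.mean((soft_tokens - snapped) ** 2)   # the detail that rounding threw away

print(token_ids[:8])                           # small integers, like word IDs
print(f"mean squared quantization error: {loss:.3f}")
```

The soft path hands `soft_tokens` to the model as-is; the discrete path hands over only `token_ids`, and the nonzero error is the price of that rounding.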

§ 07 What an image actually costs you

Because images become tokens, they cost money and context space — sometimes a lot. Different providers count differently, usually by tiling the image and charging per tile (Claude is the exception — it uses a smooth area formula). Drop in your own image to read its real dimensions, see the tile grid, and compare. (Figures follow each provider's published method; always re-check current docs.)

Interactive · Image token calculator

Example readout: 765 image tokens (4 tiles × 170 + 85) · 0.6% of a 128k window · ≈ $0.0077 in

Gotcha: Some models (GPT-4o, Claude) auto-downscale large images before counting, so pre-resizing mostly saves latency, not tokens. Others (notably Gemini) count more literally, so shrinking a giant photo there can cut the bill dramatically.
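
Tile-based counting can be sketched as follows. The 170-per-tile and 85 base figures come from the readout above; the resize rules (fit within 2048 px, then shrink the short side to 768 px, then count 512 px tiles) are an assumption modelled on OpenAI's published high-detail description, and the function name is made up. Treat it as an estimate and re-check current docs.

```python
import math

def tiled_image_tokens(width: int, height: int,
                       base: int = 85, per_tile: int = 170,
                       tile: int = 512, max_side: int = 2048,
                       short_side: int = 768) -> int:
    """Estimate tokens for a tile-and-charge scheme (assumed GPT-4o-style rules)."""
    # 1. Shrink to fit inside a max_side x max_side square (never upscale).
    scale = min(1.0, max_side / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # 2. Shrink again so the shorter side is at most short_side.
    scale = min(1.0, short_side / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # 3. Count the tiles needed to cover the result, then charge per tile.
    tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return base + per_tile * tiles

print(tiled_image_tokens(1024, 1024))   # 765  (4 tiles)
print(tiled_image_tokens(2048, 4096))   # 1105 (6 tiles)
print(tiled_image_tokens(512, 512))     # 255  (1 tile)
```

Because of the two downscaling steps, a 4096-px-tall photo and a much smaller one can land on the same bill, which is the auto-downscale behaviour the gotcha describes.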

§ 08 Reading high-resolution: dynamic tiling

A single fixed grid of patches is fine for a snapshot, but it smears the fine print on a receipt or the labels on a dashboard. The modern fix, used by LLaVA-NeXT, InternVL and others and often called AnyRes ("any resolution"), is to look twice: once at a shrunk global thumbnail for overall layout, and again at several full-resolution sub-tiles for detail. Each piece is encoded separately and all their tokens are concatenated.

Dynamic high-res, schematically

global thumbnail — gist & layout
sub-tile 1 — full detail
sub-tile 2 — full detail
sub-tile 3 — full detail
sub-tile 4 — full detail

tokens(global) + Σ tokens(sub-tiles) → fine text becomes legible, at a higher token bill

This is exactly why a high-resolution screenshot can cost several times the tokens of a casual photo: you're paying for the thumbnail plus every detail tile. It's the resolution-versus-budget dial from §02, scaled up to whole images.
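
The bill for "look twice" is simple arithmetic. The sketch below assumes 336 px tiles and 14 px patches (so 576 tokens per view) and a plain ceil-based grid; real systems pick from a fixed menu of grid shapes and may pool or trim tokens, so read this as an upper-level estimate with a hypothetical function name.

```python
import math

def anyres_tokens(width: int, height: int, tile: int = 336, patch: int = 14) -> int:
    """Estimate tokens for one global thumbnail plus a grid of full-res sub-tiles."""
    tokens_per_view = (tile // patch) ** 2        # each view is encoded as one patch grid
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return tokens_per_view * (1 + rows * cols)    # thumbnail + every sub-tile

print(anyres_tokens(672, 672))     # 2880: thumbnail + 2x2 tiles, 5 views of 576
print(anyres_tokens(1344, 672))    # 5184: thumbnail + 4x2 tiles, 9 views of 576
```

Doubling the width alone nearly doubles the token count, because every extra tile is a full extra pass through the encoder.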

§ 09 How Claude sees an image, concretely

To ground all of this in one real system: when you send Claude an image through its API, the bytes are decoded, the picture is resized if it's large, and then it's turned into visual tokens. Claude's published estimate skips per-tile arithmetic and uses a clean area formula:

image tokens ≈ (width px × height px) ÷ 750

Two practical rules come with it. First, if the image's long edge is over ~1568 px (roughly 1.15 megapixels), Claude scales it down first, preserving aspect ratio — so sending a giant file mostly adds upload latency, not fidelity. (Newer Opus models can keep more pixels and therefore spend more tokens for sharper detail when you want it.) Second, the picture is padded slightly so its dimensions are a multiple of 28 px, which matters if you ask Claude for pixel coordinates — they come back relative to the resized image. Some worked numbers:

Image size | Megapixels | ≈ tokens | Note
200 × 200 | 0.04 MP | ~54 | tiny; may lose detail
1000 × 1000 | 1.0 MP | ~1,334 | no resize needed
1092 × 1092 | 1.19 MP | ~1,590 | near the sweet spot
3000 × 2000 | 6.0 MP | capped after resize | scaled to ≤1568 long edge first

Try the calculator in §07 with the Claude style setting to reproduce these. Figures track Anthropic's vision documentation; verify current numbers there before quoting them.
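
The area formula and the long-edge rule translate directly into a small estimator. The function name is hypothetical, rounding up is an assumption chosen because it reproduces the table's figures, and the 28 px padding is ignored; the real service may also shrink large images further than this sketch does.

```python
import math

def claude_image_tokens(width: int, height: int, max_long_edge: int = 1568):
    """Return (resized_width, resized_height, estimated_tokens)."""
    long_edge = max(width, height)
    if long_edge > max_long_edge:
        # Scale down, preserving aspect ratio, until the long edge fits.
        scale = max_long_edge / long_edge
        width, height = round(width * scale), round(height * scale)
    tokens = math.ceil(width * height / 750)      # the published area estimate
    return width, height, tokens

for size in [(200, 200), (1000, 1000), (1092, 1092), (3000, 2000)]:
    w, h, tokens = claude_image_tokens(*size)
    print(f"{size[0]} x {size[1]} -> {w} x {h}, ~{tokens} tokens")
```

The first three rows match the table; for the 3000 × 2000 case the sketch only shows the long-edge resize to 1568 × 1045, so treat its token figure as a rough ceiling rather than a quoted price.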

§ 10 The base64 envelope

When you hand an image to an API instead of a URL, you usually inline it as a base64 string — the image's raw bytes rewritten using 64 plain-text characters (A–Z a–z 0–9 + /). It isn't compression and it isn't encryption; it's just a way to smuggle binary data through text-only channels like a JSON body. Which raises the famous worry: does that giant string get counted as text tokens?

It does not. The base64 is transport, not content. At the door, the server decodes it back into image bytes, hands the pixels to the vision pipeline, and the string itself is discarded before the language model ever runs. Press play and watch the box get unwrapped and thrown away.

Interactive · base64 → bytes → tokens

1 · base64 string (text): /9j/4AAQSkZJRgABAQ… AAD/2wBDAAYEBQYFBAY… GBwYIChAKDQ8WFRESF… (~33% bigger than the file, pure ASCII)
2 · decode at the door: decode() runs, then the string is thrown away and never reaches the model
3 · raw pixels (image): the bytes the box was carrying
4 · visual tokens (to LLM): cost = tiles, not string length

The myth, if base64 were text: ~1,365,000 tokens for a 1 MB photo (≈1.37M chars ÷ ~1 char/tok)
Reality, counted as an image: ~765 tokens for a 1024×1024 image, GPT-4o-style tiling
same image as a URL or as base64 → identical token cost

So what does base64 affect?

Not tokens — but it isn't free either. Two things to keep in mind:

Watch out
  • Payload size: base64 inflates the bytes by ~33% (3 bytes → 4 characters), so a big image can bump against an API's maximum request-body size. URLs sidestep that.
  • Bandwidth & latency: a fatter request takes longer to upload, which can slow time-to-first-token.
Good to know
  • Data URIs (data:image/png;base64,…) are the exact same thing, just wrapped for the browser/HTML.
  • It's lossless: decoding gives back the original file byte-for-byte — base64 changes the spelling, not the picture.
The one-liner: base64 is the shipping box. The door unwraps it to pixels, throws the box away, and only the pixels go on to become tokens. The label's length never enters the bill.
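
Both claims, the ~33% inflation and the lossless round trip, are easy to check with Python's standard library. Random bytes stand in for a real image file here; the 300,000-byte size is arbitrary.

```python
import base64
import os

raw = os.urandom(300_000)                         # stand-in for an image file's bytes
encoded = base64.b64encode(raw).decode("ascii")   # what travels inside the JSON body
decoded = base64.b64decode(encoded)               # what the server recovers at the door

print(len(raw), "bytes ->", len(encoded), "characters")   # 300000 bytes -> 400000 characters
print(f"inflation: {len(encoded) / len(raw) - 1:.0%}")    # inflation: 33%
print("lossless:", decoded == raw)                        # lossless: True
```

Every 3 bytes become 4 characters, so the request body grows by a third, yet the decoded bytes are identical to what was sent. Only those bytes go on to the vision pipeline.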

§ 11 The reverse: making images, not just reading them

Everything so far turns pixels into tokens. Flip the arrow and you get image generation. There are two dominant ways to run it backwards, and they map onto the soft-vs-discrete split from §06.

Interactive · two engines for generation

Autoregressive models (e.g. Chameleon) treat an image as a sequence of discrete codebook tokens and predict them one after another, exactly like writing a sentence, so a single transformer can read text and emit a picture in the same breath. (The codebook is a fixed dictionary of image "words": a VQ-VAE maps each patch to its nearest entry, and the integer index is the token.) Diffusion models instead start from noise in a continuous latent space and denoise toward an image; some recent systems lean this way to dodge the quantization blur of discrete tokens. The trade is familiar: discrete is unified and simple but coarser; diffusion is sharper but a different beast bolted alongside the language model.

§ 12 Where vision quietly breaks

Knowing how images become tokens explains most of the failures you'll hit. Each one traces back to the pipeline above.

🔢
Counting

Asked "how many sheep?", models often miscount. Patches fragment objects, and there's no explicit tally — just attention over a token soup.

🧭
Spatial relations

"Is the cup left of the laptop?" Position embeddings are approximate, so precise left/right/above logic is shaky, especially after resizing.

🔬
Tiny text

If the resolution or tiling can't resolve small glyphs, the text never makes it into a clean token. Fix with higher res or a crop (§08).

🌫️
Quantization loss

Discrete-token pipelines round each patch to a codebook entry, discarding fine gradients and texture before the model ever sees them.

🎭
Adversarial patches

A small crafted sticker can hijack the patch embeddings and flip the model's read of the whole scene — the token view is exploitable.

🪞
Confident hallucination

When detail is missing, the language half happily fills the gap with a plausible guess, stated with the same confidence as a fact.

§ 13 Choosing an approach

If you're building rather than just curious, the method usually follows from one question.

Interactive · decision guide

What's your primary goal?

§ 14 Practical notes for builders

Resolution is a dial, not a virtue

More pixels help only when the answer lives in fine detail — small text, gauge needles, license plates. For "what's in this scene?" a downscaled image is just as good and far cheaper. Many APIs expose a low/high detail switch; the low mode often charges a flat, tiny token count.

Crop before you send

If you only care about one region (a chart, a label, a face), crop to it. You spend tokens on signal instead of sky. This beats sending a 12-megapixel frame and hoping the model zooms in.

Tiling is your friend and your enemy

Tiling is what lets models read high-resolution images at all — but token cost grows roughly with the number of tiles, i.e. with area. A 2× wider and taller image is ~4× the tokens. Budget accordingly when batching.

The base64 myth

A long base64 image string does not "use up" text tokens — see §10. The bytes are decoded and routed to the vision pipeline before tokenization, so cost comes from patches and tiles, never from the length of the encoded string.

Match the method to the job

Reading and reasoning about images? A projector-style VLM is the simple, strong default. Long video or dozens of frames? A resampler keeps the token count flat. Want one model to both understand and draw? That's where discrete-token, native-multimodal designs earn their keep.

One-line summary: Split into patches → project (or quantize) into tokens → add positions → feed alongside text. Every vision-language model is a variation on that sentence.

§ 15 References & further reading

  1. Dosovitskiy et al., An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT), 2020. arXiv:2010.11929.
  2. Vaswani et al., Attention Is All You Need, 2017. arXiv:1706.03762.
  3. Radford et al., Learning Transferable Visual Models from Natural Language Supervision (CLIP), 2021. arXiv:2103.00020.
  4. Liu et al., Visual Instruction Tuning (LLaVA), 2023. arXiv:2304.08485.
  5. Alayrac et al., Flamingo: a Visual Language Model for Few-Shot Learning, 2022. arXiv:2204.14198.
  6. Li et al., BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs, 2023. arXiv:2301.12597.
  7. Chameleon Team (Meta), Chameleon: Mixed-Modal Early-Fusion Foundation Models, 2024. arXiv:2405.09818.
  8. van den Oord et al., Neural Discrete Representation Learning (VQ-VAE), 2017. arXiv:1711.00937.
  9. Yu et al., An Image is Worth 32 Tokens for Reconstruction and Generation (TiTok), 2024. arXiv:2406.07550.
  10. Liu et al., LLaVA-NeXT (dynamic high-res / AnyRes), 2024. Project page, llava-vl.github.io.
  11. Chen et al., InternVL 1.5 — dynamic high-resolution image inputs, 2024. arXiv:2404.16821.
  12. Zhou et al., Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model, 2024. arXiv:2408.11039.
  13. Anthropic, Vision — image input, token cost, and resizing. docs.claude.com/en/docs/build-with-claude/vision.
  14. OpenAI — Vision / image-input token counting documentation (platform.openai.com).
  15. Google — Gemini API documentation, image token counting.
  16. Hugging Face, Design choices for Vision Language Models (overview of resampling, cross-attention, interleaving).