A Visual Field Guide · Multimodal Models
Sound is a wiggling line. Video is a stack of pictures with a clock attached. Neither can enter a language model as-is — both get translated into the same currency as words: tokens. Here's how a waveform and a film reel become something a transformer can read.
Companion to “How LLMs See an Image.” Same idea — pixels, samples, and frames all end up as tokens.
A language model only really knows how to do one thing: take a sequence of tokens — points in a high-dimensional space — and predict what comes next. Text, images, sound, and video all reach it through the same back door. Each modality has its own little translator that chops the raw signal into pieces, turns each piece into a vector, and lines those vectors up next to the words.
For images, the pieces are square patches. For sound, they're slices of time. For video, they're slices of time and space at once. The shape of the chopping changes; the destination — a row of tokens in the context window — never does.
A microphone records sound as amplitude over time: one number, many thousands of times a second. Speech is usually sampled (the wave's height is measured at fixed intervals) at 16 kHz, meaning 16,000 measurements per second. So a mere ten seconds of audio is 160,000 raw numbers, far too long and too low-level to feed an attention mechanism directly, the same way raw pixels are hopeless for images.
Worse, a single sample carries almost no meaning on its own. What matters is patterns of frequencies over time — that's what makes an "ee" sound different from an "oo," or a violin from a snare. So step one is almost never the raw wave.
Interactive · a raw waveform
That's all sound is to a computer: a list of heights. Meaning lives in how those heights change, which is hard to read off the wiggle directly.
The trick that unlocks almost everything: slice the wave into short overlapping windows (say 25 ms wide, stepping 10 ms at a time) and, for each window, measure how much energy sits at each pitch. Stack those slices side by side and you get a spectrogram — time on the horizontal axis, frequency on the vertical, brightness for loudness. Sound has become a 2-D image.
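The slicing step can be sketched in a few lines of plain Python. This is a minimal illustration, not how production systems do it: it assumes a synthetic 440 Hz test tone, uses a slow naive DFT instead of an FFT, and the names `frame_signal` and `power_spectrum` are made up for this example.

```python
import cmath
import math

def frame_signal(samples, sample_rate=16_000, win_ms=25, hop_ms=10):
    """Cut a signal into overlapping windows (25 ms wide, 10 ms step by default)."""
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    return [samples[s:s + win] for s in range(0, len(samples) - win + 1, hop)]

def power_spectrum(frame):
    """Energy per frequency bin for one window (naive DFT with a Hann taper)."""
    n = len(frame)
    tapered = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
               for i, x in enumerate(frame)]
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(tapered))
        bins.append(abs(acc) ** 2)
    return bins

# Assumed test input: 0.1 s of a pure 440 Hz tone sampled at 16 kHz.
sample_rate = 16_000
tone_hz = 440.0
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
          for t in range(sample_rate // 10)]

frames = frame_signal(signal, sample_rate)
spectrogram = [power_spectrum(f) for f in frames]   # one column per window

# The brightest cell in the first column should sit at the tone's pitch.
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
peak_hz = peak_bin * sample_rate / len(frames[0])
print(len(frames), len(spectrogram[0]), peak_hz)
```

Each inner list is one vertical strip of the spectrogram. With a 400-sample window the bins are 40 Hz apart, so the 440 Hz tone lands cleanly in bin 11.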
Two refinements make it model-friendly. The frequency axis is squeezed onto the mel scale, a warped frequency axis that matches human pitch perception and emphasises the ~0–8 kHz range where speech lives. The loudness is then run through a logarithm so a whisper and a shout sit in a similar numeric range. The result, a log-mel spectrogram, is exactly what Whisper and most speech systems consume. From here, the playbook is the image playbook: a small convolutional front end turns the spectrogram into a sequence of embeddings, then a transformer takes over.
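Both refinements are small formulas. The sketch below assumes one common mel convention, the HTK-style formula `2595 * log10(1 + f / 700)`. Real toolkits differ in the exact variant and in how they build the filter bank, so treat the helper names and defaults here as illustrative.

```python
import math

def hz_to_mel(hz):
    """HTK-style mel formula (one of several conventions in use)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_bin_centers(n_mels=80, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of n_mels bins spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

def log_compress(energy, floor=1e-10):
    """Log-scale an energy value; the floor avoids log(0) on silent cells."""
    return math.log10(max(energy, floor))

centers = mel_bin_centers()
low_gap = centers[1] - centers[0]      # spacing between the two lowest bins, in Hz
high_gap = centers[-1] - centers[-2]   # spacing between the two highest bins, in Hz
print(round(low_gap, 1), round(high_gap, 1))

# A million-fold difference in energy becomes a difference of 6 after the log.
print(log_compress(1e6) - log_compress(1.0))
```

The bins are packed tightly at low frequencies and spread out at high ones, which is the "warp" the mel scale applies.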
Interactive · waveform → log-mel spectrogram
Vertical = pitch (mel bins, low at the bottom). Horizontal = time. Bright cells = strong energy. The curving bright bands are the kind of pattern that distinguishes one vowel from another.
Whisper's pipeline, schematically
A 30-second chunk becomes an 80×3000 grid (80 mel bins, one column per 10 ms), which the conv stem halves to 1500 feature frames before the encoder. That is 480,000 raw samples reduced to 1,500 positions, with the meaning preserved.
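The shape arithmetic is easy to check. The sketch below assumes the halving comes from a stride-2 convolution with kernel size 3 and padding 1. The output-length formula is the standard one for 1-D convolutions, and the helper names are made up for this example.

```python
def mel_frames(seconds, hop_ms=10):
    """Number of spectrogram columns when one column is produced per hop."""
    return seconds * 1000 // hop_ms

def conv1d_out_len(length, kernel=3, stride=2, padding=1):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1

n_mels = 80
frames = mel_frames(30)              # 30 s at one column per 10 ms
after_stem = conv1d_out_len(frames)  # the stride-2 layer halves the time axis
print((n_mels, frames), after_stem)
```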
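When audio is turned into discrete tokens (rather than soft vectors), there's a fork that shapes everything downstream, and it mirrors the soft-vs-discrete split from the images guide. The two families answer different questions: semantic tokens capture what is being said or played (content and long-range structure), while acoustic tokens capture how it sounds (the fine detail needed for faithful reconstruction).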
Interactive · semantic vs. acoustic tokens
The cleverest systems use both: AudioLM, a Google model, lays down semantic tokens for long-range structure, then conditions acoustic-token generation on them, giving coherence and fidelity at once. Newer codecs like SpeechTokenizer even fold semantics into the first layer of the acoustic stack.
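The "stack" in an acoustic codec is commonly built by residual vector quantization: each layer picks the nearest entry in its own codebook, and the next layer encodes whatever error is left. Here is a toy sketch of that idea. The two tiny codebooks are hand-picked for illustration, whereas real codecs learn much larger ones.

```python
import math

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)))

def rvq_encode(codebooks, vec):
    """Turn one vector into one integer token per layer."""
    tokens, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Pass on what this layer failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(codebooks, tokens):
    """Rebuild an approximation by summing the chosen entries."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical toy codebooks: a coarse layer and a fine layer.
coarse = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
fine = [(0.0, 0.0), (-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25)]
codebooks = [coarse, fine]

frame = (0.8, -1.2)                      # stand-in for one encoded audio frame
tokens = rvq_encode(codebooks, frame)
one_layer = rvq_decode(codebooks[:1], tokens[:1])
two_layers = rvq_decode(codebooks, tokens)
print(tokens, one_layer, two_layers)
print(math.dist(frame, one_layer), math.dist(frame, two_layers))
```

The first token carries the broad shape and later tokens add fidelity, which is why it is natural to talk about the "first layer" of the stack as the most meaningful one.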
Two routes dominate, and they're the audio twins of the image methods.
| Route | How audio enters | Best for | Seen in |
|---|---|---|---|
| Encoder + projector | An audio encoder (Whisper-style) makes soft features; a small projector maps them into the LLM's token space, interleaved with text | Understanding — transcription, Q&A, reasoning about sound | Qwen-Audio, SALMONN, Audio Flamingo |
| Discrete codec tokens | A neural codec turns audio into integer codebook tokens in a shared vocabulary, predicted like words | Generation — speech, music, full-duplex voice | AudioLM, VALL-E, MusicGen, Moshi |
The first keeps the rich continuous detail and is simplest to bolt onto an existing text LLM. The second makes the model able to speak, because emitting an audio token is the same operation as emitting a word — at the cost of quantization blur and longer sequences.
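The encoder-plus-projector route comes down to a matrix multiply followed by concatenation. The toy sketch below uses made-up numbers and tiny widths (2-dimensional audio features, 3-dimensional LLM embeddings). In a real system the projector weights are learned, and it may be a small MLP rather than a single linear layer.

```python
def project(features, weights):
    """Map each audio feature vector to the LLM embedding width (linear, no bias).

    weights has shape [audio_dim][model_dim].
    """
    columns = list(zip(*weights))
    return [[sum(f * w for f, w in zip(feat, col)) for col in columns]
            for feat in features]

# Hypothetical values: 3 audio frames of width 2, projector of shape 2 x 3.
audio_features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
projector = [[1.0, 0.0, 1.0],
             [0.0, 1.0, 1.0]]

# Stand-ins for the embeddings of the text tokens around the audio.
text_before = [[0.1, 0.1, 0.1]]
text_after = [[0.2, 0.2, 0.2], [0.3, 0.3, 0.3]]

audio_tokens = project(audio_features, projector)
sequence = text_before + audio_tokens + text_after   # one row per position
print(audio_tokens)
print(len(sequence), {len(row) for row in sequence})
```

Once projected, the audio rows have the same width as the text rows, so the transformer treats them as just more positions in the sequence.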
Generation runs the pipeline in reverse, and there are two common exits. A model can predict a spectrogram and then hand it to a vocoder, a neural network that reconstructs the audible waveform from the frequency picture (the inverse of the spectrogram step). Or, in codec-based systems, the model predicts a stream of codec tokens and the codec's decoder turns them straight back into sound. Either way: tokens in, waveform out, by inverting the same translator that read the audio in the first place.
Video looks intimidating but it's just three things you've already met, bundled: a sequence of images (frames), a clock telling you when each frame happens, and usually an audio track. The audio is handled exactly as in §02–§06. The frames are handled as images. The genuinely new ingredient is time — and time is what makes video expensive.
The simplest economy is to throw most frames away. Consecutive frames are nearly identical, so models sub-sample — many default to about one frame per second (Gemini's default, for instance). Each kept frame is encoded into image tokens; the rest are discarded. Slide the rate and watch the frame count — and the token bill — move.
Interactive · frame sampling over a 10-second clip
tokens ≈ frames × (~258 tokens/frame) · illustrative
Low rates are fine for "what's happening in this video?" but disastrous for fast motion — a tennis serve or a diving routine lives between the sampled frames. That's why research pushes toward higher frame rates (16 fps and beyond) paired with aggressive compression, rather than just sampling sparsely.
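A small sketch makes the trade-off concrete. It assumes a 30 fps source clip and reuses the illustrative ~258 tokens per frame. The half-second "event" spanning frames 31 to 45 is invented for the example.

```python
TOKENS_PER_FRAME = 258  # illustrative figure, not a guaranteed rate

def sampled_frame_indices(duration_s, native_fps, sample_fps):
    """Indices of the source frames that survive sampling at sample_fps."""
    n_kept = int(duration_s * sample_fps)
    step = native_fps / sample_fps
    return [int(i * step) for i in range(n_kept)]

def sees_event(kept, first_frame, last_frame):
    """True if at least one kept frame falls inside the event."""
    return any(first_frame <= i <= last_frame for i in kept)

sparse = sampled_frame_indices(10, native_fps=30, sample_fps=1)
dense = sampled_frame_indices(10, native_fps=30, sample_fps=16)

# Hypothetical fast event: a half-second serve covering source frames 31..45.
for name, kept in (("1 fps", sparse), ("16 fps", dense)):
    print(name, len(kept), len(kept) * TOKENS_PER_FRAME, sees_event(kept, 31, 45))
```

At 1 fps the clip costs 2,580 tokens but the serve falls entirely between kept frames. At 16 fps it is captured, at sixteen times the token bill.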
Treating each frame independently throws away motion. A richer idea, from video transformers like ViViT and VideoMAE, is the tubelet: instead of a flat 2-D patch, you cut a small 3-D block that is a few pixels wide, a few pixels tall, and a few frames deep. Each tubelet becomes one token that already encodes a sliver of movement, not just a still.
Interactive · a tubelet spans frames
The highlighted cells — same spot, three moments — collapse into a single token. The model gets motion baked in, and the token count drops versus encoding every frame in full.
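The saving is simple to count. The sketch below assumes a 16-frame clip at 224×224 with 16×16 patches and tubelets two or four frames deep. These sizes are illustrative choices, not taken from any particular model's configuration.

```python
def per_frame_patch_tokens(frames, height, width, patch=16):
    """Tokens when every frame is cut into flat 2-D patches."""
    return frames * (height // patch) * (width // patch)

def tubelet_tokens(frames, height, width, depth=2, patch=16):
    """Tokens when patches also span `depth` consecutive frames."""
    return (frames // depth) * (height // patch) * (width // patch)

flat = per_frame_patch_tokens(16, 224, 224)
tubes_2 = tubelet_tokens(16, 224, 224, depth=2)
tubes_4 = tubelet_tokens(16, 224, 224, depth=4)
print(flat, tubes_2, tubes_4)
```

The token count shrinks in direct proportion to the tubelet depth, while each token now covers a short stretch of motion.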
Alongside the spatial position from §03 of the images guide, each of these tokens also carries a temporal position — which moment it came from — so the model can reason about order and timing.
Even with sampling and tubelets, video produces an avalanche of tokens. The remaining tools, such as pooling and merging tokens, should look familiar from the images guide, now aimed at the time axis too.
The numbers are striking: recent work shows you can keep only a small fraction of video tokens and retain most of the understanding performance — proof of just how redundant raw frames are.
Audio and video are both billed by their natural unit (seconds of audio, sampled frames of video), and a long clip adds up fast. Estimate the order of magnitude below. (Rates follow common provider defaults and are illustrative; check current docs for exact numbers.)
Interactive · audio + video token estimator
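A back-of-the-envelope version of the estimator fits in a few lines. Both rates are assumptions for illustration: ~258 tokens per sampled frame, as used earlier in this guide, and 32 tokens per second of audio. Check your provider's current documentation before relying on either.

```python
VIDEO_TOKENS_PER_FRAME = 258    # illustrative
AUDIO_TOKENS_PER_SECOND = 32    # assumed illustrative rate

def estimate_tokens(duration_s, sample_fps=1.0, with_audio=True):
    """Rough token count for a clip: sampled frames plus an optional audio track."""
    frames = int(duration_s * sample_fps)
    video = frames * VIDEO_TOKENS_PER_FRAME
    audio = int(duration_s * AUDIO_TOKENS_PER_SECOND) if with_audio else 0
    return {"frames": frames, "video": video, "audio": audio, "total": video + audio}

one_minute = estimate_tokens(60)
one_hour = estimate_tokens(3600)
audio_only_hour = estimate_tokens(3600, sample_fps=0)
print(one_minute)
print(one_hour)
print(audio_only_hour)
```

Under these assumed rates an hour of video at 1 fps comes to roughly a million tokens, while the audio track alone is about a tenth of that.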
As with images, the failures fall out of the pipeline.
Sub-sampling means anything that happens between sampled frames — a quick gesture, a fast ball — simply isn't in the tokens.
"At what second does X happen?" is shaky: temporal positions are coarse, especially after pooling and merging.
A spectrogram mixes everyone together. Separating two voices or speech-in-noise stresses the encoder hard.
Discrete acoustic tokens round away fine texture, so generated audio can sound slightly synthetic or lose subtle detail.
Lip-sync and "who said what when" demand tight audio–video alignment that loose fusion can miss; models may over-trust one stream.
An hour of video is a huge token sequence; details from the opening minutes fade by the end, just like long text.
Frame rate is the master dial for video cost and capability. Gist and dialogue? Sample low. Sports, sign language, anything fast? Push the rate up and budget for it, or use a model built for high-fps clips.
For lectures, meetings, and interviews, the speech carries most of the meaning at a fraction of the token cost of the picture. Transcribe-then-reason is often cheaper and better than watching every frame.
Understanding wants soft encoder features (Whisper-style). Generation wants discrete codec tokens. Mixing the right semantic and acoustic streams is what makes voice models both coherent and natural.
Clip to the relevant window before sending, and split very long media so the important parts don't drown in a sea of redundant frames or minutes of silence.
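Clipping and splitting are just interval arithmetic on timestamps. The sketch below only computes the (start, end) spans in seconds. Cutting the media itself is left to whatever tool you already use, and the function names are made up for this example.

```python
def clip_window(start_s, end_s, total_s):
    """Clamp a requested window to the bounds of the recording."""
    return max(0, start_s), min(total_s, end_s)

def split_spans(total_s, chunk_s):
    """Split a recording into back-to-back spans of at most chunk_s seconds."""
    spans = []
    start = 0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        start = end
    return spans

print(clip_window(-5, 40, total_s=95))
print(split_spans(95, 30))
```

Each span can then be sent as its own request, so a question about minute forty does not have to compete with the tokens from minute one.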