A Visual Field Guide · Multimodal Models

How LLMs Hear & Watch

Sound is a wiggling line. Video is a stack of pictures with a clock attached. Neither can enter a language model as-is — both get translated into the same currency as words: tokens. Here's how a waveform and a film reel become something a transformer can read.

By Majid Mazouchi · Interactive monograph

Companion to “How LLMs See an Image.” Same idea — pixels, samples, and frames all end up as tokens.

§ 01 The one shared trick

A language model only really knows how to do one thing: take a sequence of tokens — points in a high-dimensional space — and predict what comes next. Text, images, sound, and video all reach it through the same back door. Each modality has its own little translator that chops the raw signal into pieces, turns each piece into a vector, and lines those vectors up next to the words.

For images, the pieces are square patches. For sound, they're slices of time. For video, they're slices of time and space at once. The shape of the chopping changes; the destination — a row of tokens in the context window — never does.

[word] [word] [audio] [audio] [audio] … [frame▸patches] [frame▸patches] … [word] ?

Mental model: Every modality is a translator with the same target language: token vectors. Learn the chopping rule for each, and you understand the whole thing.

§ 02 Sound is a wave, and that's the problem

A microphone records sound as amplitude over time: one number, many thousands of times a second. Speech is usually sampled at 16 kHz, meaning the wave's height is measured at fixed intervals 16,000 times per second. So a mere ten seconds of audio is 160,000 raw numbers: far too long and too low-level to feed an attention mechanism directly, just as raw pixels are hopeless for images.
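
To make the arithmetic concrete, here is a minimal sketch that builds a synthetic tone as a plain list of heights. The function name `sine_wave` and the 440 Hz pitch are illustrative choices, not anything from a particular library.

```python
import math

def sine_wave(freq_hz, duration_s, sample_rate_hz=16_000):
    """Return a pure tone as a list of amplitudes, one per sample."""
    n = round(duration_s * sample_rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]

wave = sine_wave(440.0, 10.0)   # ten seconds at 16 kHz
print(len(wave))                # 160000 raw numbers
```

Every entry is just one height of the wave; nothing in a single value says which pitch or vowel it belongs to.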

Worse, a single sample carries almost no meaning on its own. What matters is patterns of frequencies over time — that's what makes an "ee" sound different from an "oo," or a violin from a snare. So step one is almost never the raw wave.

Interactive · a raw waveform

That's all sound is to a computer: a list of heights. Meaning lives in how those heights change, which is hard to read off the wiggle directly.

§ 03 The spectrogram move: turn sound into a picture

The trick that unlocks almost everything: slice the wave into short overlapping windows (say 25 ms wide, stepping 10 ms at a time) and, for each window, measure how much energy sits at each pitch. Stack those slices side by side and you get a spectrogram — time on the horizontal axis, frequency on the vertical, brightness for loudness. Sound has become a 2-D image.
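
The windowing step can be sketched in a few lines of plain Python. This is a teaching sketch under stated assumptions: it uses a Hann window and a deliberately naive (slow) discrete Fourier transform, whereas real systems use an FFT library. The helper names `frame_signal` and `power_spectrum` are made up for this example.

```python
import cmath
import math

def frame_signal(samples, sample_rate_hz=16_000, win_ms=25, hop_ms=10):
    """Cut the wave into overlapping windows (25 ms wide, stepping 10 ms)."""
    win = sample_rate_hz * win_ms // 1000   # 400 samples per window
    hop = sample_rate_hz * hop_ms // 1000   # 160 samples per step
    return [samples[s:s + win] for s in range(0, len(samples) - win + 1, hop)]

def power_spectrum(frame):
    """Energy per frequency bin for one window (naive DFT, for clarity only)."""
    n = len(frame)
    # A Hann window tapers the edges so the slice does not start or end abruptly.
    tapered = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
               for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(tapered))
        spectrum.append(abs(acc) ** 2)
    return spectrum

# A 0.1-second, 1 kHz test tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 1000 * i / 16_000) for i in range(1600)]
frames = frame_signal(tone)
spectrogram = [power_spectrum(f) for f in frames]   # one column per window

bin_hz = 16_000 / 400                               # each bin spans 40 Hz
peak_bins = [max(range(len(col)), key=col.__getitem__) for col in spectrogram]
print(len(frames), peak_bins[0] * bin_hz)           # 8 windows, peak at 1000.0 Hz
```

Each column of `spectrogram` is one time slice; the brightest cell in every column sits at the tone's pitch, which is exactly the kind of pattern that is invisible in the raw list of heights.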

Two refinements make it model-friendly. The frequency axis is squeezed onto the mel scale, a warped frequency axis that mimics human pitch perception and emphasises the ~0–8 kHz range where speech lives. The loudness is run through a logarithm so a whisper and a shout sit in a similar numeric range. The result, a log-mel spectrogram, is what Whisper and most speech systems consume. From here, the playbook is the image playbook: a small convolutional front end turns the spectrogram into a sequence of embeddings, then a transformer takes over.
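
Both refinements are small formulas. The sketch below assumes one common mel convention (the "HTK" formula, mel = 2595 · log10(1 + f/700)); libraries differ in the exact variant, and the 1e-10 floor in `log_compress` is an illustrative choice to avoid taking the log of zero.

```python
import math

def hz_to_mel(f_hz):
    """Warp a frequency in Hz onto the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centres(n_bins=80, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of n_bins bands spaced evenly in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bins + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bins)]

def log_compress(power, floor=1e-10):
    """Log-scale the energy so quiet and loud sounds share a numeric range."""
    return math.log10(max(power, floor))

centres = mel_centres()
print(centres[1] - centres[0] < centres[-1] - centres[-2])  # True: bands widen with pitch
print(log_compress(1e6) - log_compress(1.0))                # a million-fold energy gap becomes 6.0
```

Equal steps in mel are narrow at low pitch and wide at high pitch, so the 80 bins spend most of their resolution where speech carries the most information.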

Interactive · waveform → log-mel spectrogram

Vertical = pitch (mel bins, low at the bottom). Horizontal = time. Bright cells = strong energy. The curving bright bands are the kind of pattern that distinguishes one vowel from another.

Whisper's pipeline, schematically

16 kHz wave → log-mel 80 × 3000 → conv stem (↓ 2×) → ~1500 audio frames → transformer

A 30-second chunk becomes an 80×3000 grid (80 mel bins, one column per 10 ms), which the conv stem halves to 1500 feature frames before the encoder. That is 480,000 raw samples reduced to 1500 positions, with the meaning preserved.
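
The shape bookkeeping above can be checked with a few lines of arithmetic. The function name and parameter names are illustrative; the defaults are the figures quoted in the text.

```python
def whisper_style_shapes(chunk_s=30, sample_rate_hz=16_000, hop_ms=10,
                         n_mels=80, conv_stride=2):
    """Follow a chunk from raw samples to encoder positions."""
    raw_samples = chunk_s * sample_rate_hz          # numbers in the waveform
    columns = chunk_s * 1000 // hop_ms              # one spectrogram column per hop
    encoder_frames = columns // conv_stride         # the conv stem halves the time axis
    return {
        "raw_samples": raw_samples,
        "log_mel_shape": (n_mels, columns),
        "encoder_frames": encoder_frames,
        "reduction": raw_samples // encoder_frames,
    }

print(whisper_style_shapes())
# {'raw_samples': 480000, 'log_mel_shape': (80, 3000), 'encoder_frames': 1500, 'reduction': 320}
```

Each encoder position therefore stands for 20 ms of audio, or 320 raw samples.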

§ 04 Two kinds of audio token

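When audio is turned into discrete tokens (rather than soft vectors), there's a fork that shapes everything downstream, and it mirrors the soft-vs-discrete split from the images guide. The two families answer different questions: semantic tokens capture what is being said and how it is structured over time, while acoustic tokens capture how it actually sounds, in enough detail to reconstruct the waveform.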

Interactive · semantic vs. acoustic tokens

The cleverest systems use both: AudioLM, a Google model, first lays down semantic tokens for long-range structure, then conditions acoustic-token generation on them, giving coherence and fidelity at once. Newer codecs like SpeechTokenizer even fold semantics into the first layer of the acoustic stack.
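
The "layers" of an acoustic stack come from residual vector quantization, the scheme used by codecs such as SoundStream and EnCodec: each layer quantizes whatever error the previous layers left behind. The toy below is a sketch under heavy simplification. It quantizes a single number rather than a learned feature vector, and the three hand-picked codebooks are purely illustrative.

```python
# Three tiny codebooks, coarse to fine (illustrative values, not learned).
CODEBOOKS = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.04, -0.02, 0.0, 0.02, 0.04],
]

def rvq_encode(x, codebooks=CODEBOOKS):
    """Return one integer token per layer; each layer encodes the leftover error."""
    tokens = []
    residual = x
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        tokens.append(idx)
        residual -= book[idx]
    return tokens

def rvq_decode(tokens, codebooks=CODEBOOKS):
    """Sum the chosen entries; fewer tokens give a coarser reconstruction."""
    return sum(book[i] for book, i in zip(codebooks, tokens))

tokens = rvq_encode(0.36)
print(tokens)                                  # [3, 1, 0]
for depth in (1, 2, 3):
    approx = rvq_decode(tokens[:depth])
    print(depth, round(abs(0.36 - approx), 6)) # error shrinks: 0.14, 0.04, 0.0
```

The first layer carries the coarse shape and later layers add detail, which is why a system can treat the first layer differently from the rest.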

§ 05 Plugging audio into the language model

Two routes dominate, and they're the audio twins of the image methods.

Route	How audio enters	Best for	Seen in
Encoder + projector	An audio encoder (Whisper-style) makes soft features; a small projector maps them into the LLM's token space, interleaved with text	Understanding — transcription, Q&A, reasoning about sound	Qwen-Audio, SALMONN, Audio Flamingo
Discrete codec tokens	A neural codec turns audio into integer codebook tokens in a shared vocabulary, predicted like words	Generation — speech, music, full-duplex voice	AudioLM, VALL-E, MusicGen, Moshi

The first keeps the rich continuous detail and is simplest to bolt onto an existing text LLM. The second makes the model able to speak, because emitting an audio token is the same operation as emitting a word — at the cost of quantization blur and longer sequences.
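
The encoder-plus-projector route can be sketched as a single linear map followed by concatenation. Everything here is a toy assumption: real projectors are learned (a linear layer, a small MLP, or a resampler), and the dimensions, weights, and function names below are hand-set for illustration.

```python
def project(features, weight, bias):
    """Map each audio feature vector into the LLM's embedding size: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]

def interleave(text_before, audio_embeds, text_after):
    """Place projected audio vectors between text embeddings in one sequence."""
    return text_before + audio_embeds + text_after

# Toy sizes: audio features have 3 dimensions, the LLM uses 2.
audio_features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # two audio frames
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]              # shape (2, 3)
bias = [0.0, 0.5]

audio_embeds = project(audio_features, weight, bias)
sequence = interleave([[0.1, 0.2], [0.3, 0.4]], audio_embeds, [[0.5, 0.6]])
print(audio_embeds)      # [[1.0, 5.5], [4.0, 11.5]]
print(len(sequence))     # 5 positions: 2 text + 2 audio + 1 text
```

After projection the transformer cannot tell which positions started life as sound; they are all vectors of the same width.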

§ 06 Making sound back

Generation runs the pipeline in reverse, and there are two common exits. A model can predict a spectrogram and then hand it to a vocoder, a neural network that reconstructs the audible waveform from the frequency picture (the inverse of the spectrogram step). Or, in codec-based systems, the model predicts a stream of codec tokens and the codec's decoder turns them straight back into sound. Either way: tokens in, waveform out, by inverting the same translator that read the audio in the first place.

§ 07 Video = frames + time + sound

Video looks intimidating but it's just three things you've already met, bundled: a sequence of images (frames), a clock telling you when each frame happens, and usually an audio track. The audio is handled exactly as in §02–§06. The frames are handled as images. The genuinely new ingredient is time — and time is what makes video expensive.

The core tension: A single 1080p frame can be hundreds of image tokens. A one-minute clip at 30 frames per second is 1,800 frames. Multiply it out and a short video would swamp the entire context window. Every video model is, at heart, a strategy for not doing that.
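
Multiplying it out takes three lines. The per-frame figure of 258 tokens is the illustrative rate this guide uses in §08, and the 128,000-token window is an assumed round number; actual values depend on the model.

```python
def naive_video_tokens(duration_s, fps, tokens_per_frame=258):
    """Token count if every single frame were encoded (no sampling, no merging)."""
    return duration_s * fps * tokens_per_frame

tokens = naive_video_tokens(60, 30)     # one minute at 30 fps
windows = tokens / 128_000              # how many 128k windows that would fill
print(tokens, round(windows, 2))        # 464400 tokens, about 3.63 windows
```

One minute of unsampled video overflows the window more than three times over, before any text or audio is added.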

§ 08 Sampling frames: the first economy

The simplest economy is to throw most frames away. Consecutive frames are nearly identical, so models sub-sample — many default to about one frame per second (Gemini's default, for instance). Each kept frame is encoded into image tokens; the rest are discarded. Slide the rate and watch the frame count — and the token bill — move.

Interactive · frame sampling over a 10-second clip

Example readout: at a sampling rate of 1 fps over the 10-second clip, 10 frames are kept, for ≈ 2,580 visual tokens (tokens ≈ frames × ~258 tokens/frame; illustrative).

Low rates are fine for "what's happening in this video?" but disastrous for fast motion — a tennis serve or a diving routine lives between the sampled frames. That's why research pushes toward higher frame rates (16 fps and beyond) paired with aggressive compression, rather than just sampling sparsely.
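
The sampling economy is easy to reproduce. The sketch below uses the same illustrative ~258 tokens per frame; the helper names are made up for this example, and it simply keeps one frame every 1/fps seconds starting at time zero.

```python
def sampled_timestamps(duration_s, sample_fps):
    """Times (in seconds) of the frames that survive sub-sampling."""
    count = int(duration_s * sample_fps)
    return [i / sample_fps for i in range(count)]

def visual_tokens(duration_s, sample_fps, tokens_per_frame=258):
    """Approximate token bill: kept frames times tokens per frame."""
    return len(sampled_timestamps(duration_s, sample_fps)) * tokens_per_frame

for fps in (0.5, 1, 16):
    kept = len(sampled_timestamps(10, fps))
    print(fps, kept, visual_tokens(10, fps))
# 0.5 fps ->   5 frames ->  1290 tokens
# 1 fps   ->  10 frames ->  2580 tokens
# 16 fps  -> 160 frames -> 41280 tokens
```

At 1 fps, anything that starts and finishes between two consecutive timestamps (a full second apart) never reaches the model, which is the failure the paragraph above describes.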

§ 09 Tokens across space and time

Treating each frame independently throws away motion. A richer idea, from video transformers like ViViT and VideoMAE, is the tubelet: instead of a flat 2-D patch, you cut a little 3-D block that spans a few pixels wide, a few tall, and a few frames deep. Each tubelet becomes one token that already encodes a sliver of movement, not just a still.

Interactive · a tubelet spans frames

t = 0, t = 1, t = 2 → 1 token (one tubelet)

The highlighted cells — same spot, three moments — collapse into a single token. The model gets motion baked in, and the token count drops versus encoding every frame in full.
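
Cutting tubelets is a triple loop over time, height, and width. This sketch works on a tiny nested-list "video" with one number per pixel; the sizes (a 4×4×4 clip, tubelets 2 frames deep and 2×2 pixels wide) are illustrative, and a real model would follow this with a learned embedding of each block.

```python
def tubelets(video, depth, patch):
    """Split video[t][y][x] into non-overlapping blocks of depth x patch x patch."""
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    blocks = []
    for t0 in range(0, n_t - depth + 1, depth):
        for y0 in range(0, n_y - patch + 1, patch):
            for x0 in range(0, n_x - patch + 1, patch):
                blocks.append([
                    video[t0 + dt][y0 + dy][x0 + dx]
                    for dt in range(depth)
                    for dy in range(patch)
                    for dx in range(patch)
                ])
    return blocks

# Pixel value encodes its own coordinates: 100*t + 10*y + x.
video = [[[100 * t + 10 * y + x for x in range(4)] for y in range(4)] for t in range(4)]

blocks = tubelets(video, depth=2, patch=2)
per_frame_patches = 4 * (4 // 2) * (4 // 2)   # encoding every frame with 2x2 patches
print(len(blocks), per_frame_patches)         # 8 tubelet tokens vs 16 patch tokens
print(blocks[0])                              # [0, 1, 10, 11, 100, 101, 110, 111]
```

The first block holds the same 2×2 corner at t = 0 and t = 1, so a single token sees how that corner changed. With a depth of 2 the token count is half that of per-frame patching.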

Alongside the spatial position from §03 of the images guide, each of these tokens also carries a temporal position — which moment it came from — so the model can reason about order and timing.

§ 10 Taming the token flood

Even with sampling and tubelets, video produces an avalanche of tokens. The remaining tools should look familiar from the images guide, now aimed at the time axis too:

Compression tactics

Pooling / merging nearby tokens that look alike
Resampler / Q-Former to a fixed budget per frame or clip
Keyframe selection — keep only frames that change

What you trade

Aggressive merging can drop the one detail a question needs
Fixed budgets blur dense or fast scenes
Tuning the retention ratio is its own art

The numbers are striking: recent work shows you can keep only a small fraction of video tokens and retain most of the understanding performance, evidence of just how redundant raw frames are.
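
Of the tactics listed above, keyframe selection is the simplest to sketch. The version below is a hypothetical minimal rule, not any particular model's method: keep a frame only when its mean absolute pixel difference from the last kept frame exceeds a threshold. Frames are flat lists of pixel values, and the threshold is an arbitrary illustrative number.

```python
def mean_abs_diff(a, b):
    """Average per-pixel change between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold):
    """Indices of frames that differ enough from the previously kept frame."""
    if not frames:
        return []
    kept = [0]                                   # always keep the first frame
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],        # nearly identical to frame 0
    [10, 10, 10, 10],    # scene changes
    [10, 10, 10, 11],    # nearly identical to frame 2
    [50, 50, 50, 50],    # scene changes again
]
print(select_keyframes(frames, threshold=2.0))   # [0, 2, 4]
```

The trade-off from the list above shows up directly: raise the threshold and a small but important change (the one-pixel differences here) is silently dropped.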

§ 11 What audio and video actually cost

Both bill by their natural unit — seconds of audio and sampled frames of video — and a long clip adds up fast. Estimate the order of magnitude below. (Rates follow common provider defaults and are illustrative; check current docs for exact numbers.)

Interactive · audio + video token estimator

Example readout: 1,920 total tokens, or 1.5% of a 128k window.

Gotcha: Video cost is dominated by frame rate, not resolution. Doubling fps doubles the bill; an hour of footage at even 1 fps is 3,600 frames. When you only need the gist, sample sparsely and lean on the audio track: speech is far cheaper per second than pictures.
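
A back-of-the-envelope estimator makes the trade-off visible. All rates here are illustrative assumptions: ~258 tokens per sampled frame as used earlier in this guide, an assumed 32 audio tokens per second, and a 128,000-token window. Check your provider's current documentation before relying on any of them.

```python
def estimate_tokens(length_s, sample_fps=1.0, tokens_per_frame=258,
                    audio_tokens_per_s=32, window=128_000):
    """Rough audio + video token bill for a clip, with assumed default rates."""
    frames = int(length_s * sample_fps)
    video_tokens = frames * tokens_per_frame
    audio_tokens = round(length_s * audio_tokens_per_s)
    total = video_tokens + audio_tokens
    return {
        "frames": frames,
        "video_tokens": video_tokens,
        "audio_tokens": audio_tokens,
        "total": total,
        "window_pct": round(100 * total / window, 1),
    }

print(estimate_tokens(60))                 # one minute, picture + sound
print(estimate_tokens(60, sample_fps=0))   # one minute, audio track only
print(estimate_tokens(3600)["total"])      # one hour at 1 fps: 1044000
```

Under these assumptions a minute of audio alone costs 1,920 tokens (1.5% of the window), while adding the picture at 1 fps raises it to 17,400 (13.6%). The sampled frames account for almost all of the bill.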

§ 12 Where hearing and watching break

As with images, the failures fall out of the pipeline.

🎞️

Between the frames

Sub-sampling means anything that happens between sampled frames — a quick gesture, a fast ball — simply isn't in the tokens.

⏱️

Precise timing

"At what second does X happen?" is shaky: temporal positions are coarse, especially after pooling and merging.

🗣️

Overlapping speakers

A spectrogram mixes everyone together. Separating two voices or speech-in-noise stresses the encoder hard.

🌫️

Codec blur

Discrete acoustic tokens round away fine texture, so generated audio can sound slightly synthetic or lose subtle detail.

🔇

Sync & modality bias

Lip-sync and "who said what when" demand tight audio–video alignment that loose fusion can miss; models may over-trust one stream.

🧩

Long-context drift

An hour of video is a huge token sequence; details from the opening minutes fade by the end, just like long text.

§ 13 Practical notes for builders

Pick your frame rate deliberately

It's the master dial for video cost and capability. Gist and dialogue? Sample low. Sports, sign language, anything fast? Push the rate up and budget for it, or use a model built for high-fps clips.

Don't ignore the audio track

For lectures, meetings, and interviews, the speech carries most of the meaning at a fraction of the token cost of the picture. Transcribe-then-reason is often cheaper and better than watching every frame.

Match the token type to the task

Understanding wants soft encoder features (Whisper-style). Generation wants discrete codec tokens. Mixing the right semantic and acoustic streams is what makes voice models both coherent and natural.

Pre-trim and pre-segment

Clip to the relevant window before sending, and split very long media so the important parts don't drown in a sea of redundant frames or minutes of silence.
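
Trimming and segmenting come down to computing time spans before any media is sent. The helper below is a hypothetical sketch: it splits a trimmed window into fixed-length chunks with an optional overlap, so that an event on a boundary appears whole in at least one chunk. The chunk length and overlap are choices you would tune for your own media.

```python
def segment(start_s, end_s, chunk_s, overlap_s=0):
    """Split [start_s, end_s] into (begin, end) spans of at most chunk_s seconds."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be longer than overlap_s")
    spans = []
    t = start_s
    while t < end_s:
        spans.append((t, min(t + chunk_s, end_s)))
        if t + chunk_s >= end_s:
            break                      # the last span already reaches the end
        t += chunk_s - overlap_s
    return spans

# Keep only minutes 2 to 5 of a recording, in one-minute pieces.
print(segment(120, 300, 60))
# [(120, 180), (180, 240), (240, 300)]

# A 100-second clip in 30-second chunks that overlap by 5 seconds.
print(segment(0, 100, 30, overlap_s=5))
# [(0, 30), (25, 55), (50, 80), (75, 100)]
```

Each span can then be cut with your media tool of choice and sent as its own request, keeping every call well inside the context window.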

One-line summary: Sound → spectrogram or codec tokens; video → sampled frames (or tubelets) + audio + time. Chop, embed, add position, feed alongside text. Same recipe, new axes.

§ 14 References & further reading

Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (Whisper), 2022. arXiv:2212.04356.
Zeghidour et al., SoundStream: An End-to-End Neural Audio Codec, 2021. arXiv:2107.03312.
Défossez et al., High Fidelity Neural Audio Compression (EnCodec), 2022. arXiv:2210.13438.
Borsos et al., AudioLM: a Language Modeling Approach to Audio Generation, 2022. arXiv:2209.03143.
Hsu et al., HuBERT: Self-Supervised Speech Representation Learning, 2021. arXiv:2106.07447.
Wang et al., Neural Codec Language Models are Zero-Shot TTS Synthesizers (VALL-E), 2023. arXiv:2301.02111.
Chu et al., Qwen-Audio, 2023. arXiv:2311.07919.
Tang et al., SALMONN, 2023. arXiv:2310.13289.
Arnab et al., ViViT: A Video Vision Transformer, 2021. arXiv:2103.15691.
Tong et al., VideoMAE, 2022. arXiv:2203.12602.
Zhang et al., Video-LLaMA, 2023. arXiv:2306.02858.
Li et al., Improving LLM Video Understanding with 16 Frames Per Second (F-16), 2025. arXiv:2503.13956.
Google, Gemini API — audio & video understanding documentation (default 1 fps video sampling). ai.google.dev.