The full account: the kinds of knowledge a model draws on, the kinds of memory it operates over, how they trade off, how an agent fuses them through a knowledge graph — and the engineering, failure modes, and governance that make it work in production. Threaded throughout by one example: root-causing a current oscillation in an EV traction drive.
Every time a model produces an answer, two distinct questions are at play: where did the information come from, and how long does it persist. These are not the same axis, and conflating them is the most common source of confusion in applied LLM work.
Knowledge sources describe the origin of what the model uses: baked into the weights, supplied in the prompt, retrieved from a database, or fetched live from a tool. Memory describes the retention of information: a single forward pass, one conversation, or permanently.
A retrieved document is a knowledge source and a form of short-term memory while it sits in context. A fine-tuned fact is a knowledge source and long-term parametric memory. Holding both axes in mind keeps system design honest.
A model does not have a single brain. It has a stack of overlapping stores, each with its own latency, cost, freshness, and failure mode.
A traction inverter reports current oscillation on driftmode exit at roughly 3000 RPM and 750 Nm. Nobody hands you the answer. You have control theory in your head, calibration files on a server, live dyno logs behind an API, a stack of past failure reports, and a recalibration procedure that has worked before. Each store below is shown doing its part of this one investigation.
Ordered roughly from most-baked-in to most-external. The trade-off runs along a spine: deeper integration means lower latency but worse freshness and traceability; more external means higher latency but auditable, updatable knowledge.
What the model absorbed during pretraining and stored implicitly in its parameters — grammar, world facts, reasoning patterns, code idioms. Fast (no lookup) but frozen at the training cutoff, hard to attribute, and prone to confident error when stretched beyond what it actually saw.
Practical: Treat parametric facts as a strong prior, never a citation. Hallucination lives here.
Anything you place in the context window: the question, system instructions, pasted documents, few-shot examples, prior turns. The highest-fidelity channel — the model treats supplied text as ground truth far more reliably than its own recall.
Practical: Long contexts suffer "lost in the middle." Put the most critical material at the edges.
An external corpus is chunked, embedded, and indexed; relevant chunks are retrieved into the prompt at query time. Makes knowledge updatable, attributable, and unbounded by the window, without retraining.
Practical: Retrieval quality caps answer quality. Section 16 opens the pipeline up.
The model emits a structured call to an external system — search, calculator, SQL, an internal API — and the result returns into context. Covers knowledge that is live, computed, or impossible to memorize.
Practical: Tools turn "what the model knows" into "what it can find out." Validate and sandbox every call.
Continued training on curated data — full fine-tuning or parameter-efficient methods like LoRA. The right tool for how the model behaves; a weaker tool for injecting facts.
Practical: Fine-tune for form and behavior, retrieve for facts.
Knowledge graphs, relational tables, and ontologies provide explicit, queryable relationships the model traverses via tools (GraphRAG, text-to-SQL). They answer multi-hop and aggregate questions prose chunks cannot. Figure E walks one node by node.
Practical: Use graphs when relationships and provenance matter as much as content.
A persistent instruction block framing every turn: role, constraints, format, safety rules, durable facts. Technically in-context, but the highest-authority, most-stable supplied knowledge in a deployment.
Practical: Keep it stable and ordered — constraints first, examples last.
| Source | Freshness | Updatable? | Attributable? | Best for |
|---|---|---|---|---|
| Parametric | Frozen at cutoff | Retrain only | No | General reasoning, language, common facts |
| In-context | As fresh as input | Per request | Yes | Grounding on supplied material |
| RAG | As fresh as index | Re-index | Yes | Private / large / changing corpora |
| Tools / APIs | Real-time | Always live | Yes | Live data, computation, actions |
| Fine-tuning | Frozen at train | Retrain | No | Style, format, behavior, narrow domain |
| Knowledge graph | As fresh as graph | Update graph | Yes | Multi-hop, relational, provenance |
The pragmatic stack: parametric reasoning + a stable system prompt + RAG for facts + tools for live actions, with fine-tuning reserved for behavior prompting can't pin down.
Memory borrows its vocabulary from cognitive science, but the mechanisms are engineered. The dividing line is retention horizon: a single forward pass, one session, or persistence across sessions.
The model's only true working memory: the token span it can attend to right now. When the conversation outgrows it, the oldest tokens fall off — the model doesn't forget gracefully, it stops seeing them.
Practical: Bigger windows aren't free; recall in the middle degrades. Manage the budget — Section 15.
The cached key/value tensors of already-processed tokens, so generation doesn't recompute the sequence each step. Invisible to users, but it governs throughput, latency, and prefix caching.
Practical: Put stable content (system prompt) at the front so it can be prefix-cached.
Knowledge frozen into the weights — the model's long-term semantic memory. Permanent within a version, not editable per-user, not introspectable. The same store as "parametric knowledge," seen through the memory lens.
A record of specific past interactions — what was said, when. Stored as transcripts or summaries and re-injected later. The model has none of this natively; it is an application-layer construct.
Practical: Store summaries plus key facts, not raw transcripts. Set a retention/deletion policy up front.
Distilled, durable facts — preferences, profile, project context — in a structured store, surfaced when relevant. Episodic is "what happened," semantic is "what is true."
Practical: Separate extraction from retrieval. The full lifecycle is Section 17.
Knowledge of how to do things — tool-use patterns, multi-step workflows, learned routines. Partly in weights, partly external artifacts an agent loads when a task recurs. Section 10 develops this as skills.
Practical: Capture recurring successful procedures as named, reusable artifacts.
Workspace memory agents write to and read from during a task: notes, intermediate results, files, a running plan. Decouples reasoning from the window limit, enabling long-horizon tasks.
Practical: Persist intermediate state so a long task survives truncation, retries, or a fresh session.
| Memory | Horizon | Native or built? | Holds |
|---|---|---|---|
| Context window | Single request | Native | Active working set of tokens |
| KV cache | Within a generation | Native | Processed-token tensors (speed) |
| Parametric | Model lifetime | Native | General world knowledge |
| Episodic | Across sessions | Built | What happened, when |
| Semantic store | Across sessions | Built | Durable facts & preferences |
| Procedural | Mixed | Both | How to perform tasks |
| External scratchpad | Task duration+ | Built | Intermediate state, plans, files |
The base model ships with only the native rows. Every "memory" feature users notice — remembering preferences, recalling old chats, resuming long tasks — is the built rows, engineered on top.
Origin runs from internal (in the weights) to external (fetched at runtime). Retention runs from transient (gone after one request) to permanent. Every store sits somewhere on this plane — and that position predicts its cost, freshness, and failure mode.
No store is best; each wins a specific contest. The art is knowing which contest you are in.
Zero-latency reasoning over general principles. The model already knows differential inductance falls under saturation — no lookup to form the hypothesis.
The actual PI gains, this motor's flux table, last quarter's failure log — none in the weights. RAG makes them present, citable, updatable.
Paste the dyno trace and it's ground truth this turn. Change input, change answer — no training cycle, full traceability.
To make every diagnosis emit the same structured report reliably, bend the weights. Form lives better in parameters than in a growing prompt.
Today's telemetry, an exact ripple FFT, a SQL count of affected VINs. Time-sensitive or computed → tool call, never recall.
Reasoning, framing, well-known physics need no network call. A tool call to recall Ohm's law is latency for nothing.
Parametric knowledge forms the hypothesis. Retrieval and tools test it against reality. Memory remembers how it turned out.
A practical router. Answer a few questions about the information you need and the tree lands on the store that can actually serve it.
A single-shot chat mostly uses parametric + in-context. An agentic workflow plans, calls tools, retrieves, reasons, and writes back — touching every store in sequence. Step through the fault investigation; each step names its source and memory and lights up the matching nodes in the graph (Figure E).
Internal stores frame the problem, external stores ground it, and memory closes the loop by retaining the resolved case so the next investigation starts smarter. Skip retrieval and tools and you get a fluent guess; skip the write-back and the agent never learns.
A skill is reusable know-how — a named, packaged procedure the agent loads when a task type recurs. Semantic memory holds facts, episodic holds what happened, procedural holds how to act.
In the EV case, step seven of the loop doesn't reinvent a fix. It invokes a skill — call it recal-saturation-instability — that encodes a validated procedure: re-measure the flux map at high current, smooth the table edges, refit the PI gains against the saturated differential inductance rather than the nominal value, check the discretization margin ωe·Ta, and re-run the dyno point.
Three benefits. Reliability — the procedure ran and worked, so reuse removes a class of reasoning errors. Composability — skills call tools, pull retrieval, and invoke other skills, so a complex job becomes an orchestration of trusted units. Economy — loading a tested playbook beats burning context and tokens re-deriving the workflow every run.
A skill is partly in the weights (the model broadly knows FOC tuning) and partly external (the version-controlled playbook). That split is exactly why procedural memory sits mid-origin: learned and stored.
Knowledge is what you get when sources are fused. Toggle inputs to build a diagnosis up from nothing — or hit Select all and remove one to run a fault-injection: watch which removal breaks the conclusion.
No sources selected — the agent has nothing to reason over.
A knowledge graph is where fusion becomes structure. Symptoms, operating points, subsystems, causes, evidence, physics, history, and fixes become typed nodes; relationships become traversable edges. Each node is sourced from a different store — the graph doesn't replace them, it indexes them. Click a node, or step the loop above to see it drive this graph.
An agent that never records what it learned re-solves the same fault forever. Write-back is how today's investigation becomes tomorrow's head start: each resolved case adds a node to the graph and an entry to memory, and the next similar diagnosis starts further along.
The discipline matters as much as the mechanism: write back distilled outcomes, not raw transcripts; reconcile a new case against existing nodes rather than appending blindly; and let low-value memories decay so retrieval stays sharp.
Put it together and a capable diagnostic agent is a small society of stores with clear roles. None is sufficient; the architecture is the value.
| Layer | Store | Role in the loop | EV case |
|---|---|---|---|
| Reasoning | Parametric | Forms hypotheses from principles | Knows Ldiff drops under saturation |
| Framing | System prompt / in-context | Sets role, constraints, the live query | "You are a VHM diagnostic agent" |
| Evidence | Tools | Pulls live, precise, computed data | Dyno log, current FFT, VIN count |
| Facts | RAG | Grounds in calibration & docs | PI gains, flux table, prior reports |
| Structure | Knowledge graph | Fuses sources into a causal chain | Symptom→cause→physics→fix |
| Recall | Episodic memory | Surfaces analogous past cases | Prior YASA NVH signature |
| Action | Procedural / skill | Executes validated remediation | Recalibration playbook |
| Working set | Context window | Holds the active investigation | All of the above, this turn |
The hybrid pattern in vehicle-health and similar high-stakes domains layers these deliberately: a physics layer (parametric + structured models) proposes mechanisms, an ML/data layer (tools + retrieval over telemetry) tests them against measurement, and an LLM layer orchestrates — planning, fusing, narrating, and writing memory back. The LLM is the conductor, not the whole orchestra.
An agent's intelligence is less about any one store than about the discipline of routing each question to the store that can actually answer it.
The context window is a finite resource shared by every in-context source. Context engineering is the craft of allocating it: what to include, what to compress, what to drop, and what to offload — so the model has room left to actually reason.
Four moves recur. Prioritize: put the highest-value material at the edges, where recall is best. Compact: replace old turns with rolling summaries so history costs less. Evict: drop the least-relevant retrieved chunks rather than stuffing the top-50. Offload: move bulky intermediate state to a scratchpad and pull it back on demand. Underneath, the KV cache rewards keeping the stable prefix fixed so it can be reused.
The failure this prevents is quiet: an over-full window doesn't error, it truncates or buries — so the model simply stops attending to material you assumed it had.
"RAG" hides a pipeline, and every stage caps the quality of the next. The LLM at the end can only be as good as what reaches it.
Chunking decides what a "unit of knowledge" is — too small and you lose context, too large and you dilute relevance; semantic or structure-aware splitting beats fixed windows for technical documents like calibration specs. Embeddings must match the domain: a general model may not separate "differential inductance" from "leakage inductance" the way a motor-control corpus needs. Hybrid search pairs dense vectors (meaning) with keyword/BM25 (exact terms, part numbers, symbols) so a query for Ld doesn't get lost. Reranking with a cross-encoder reorders the shortlist by true relevance before it spends context budget. Query rewriting (and HyDE-style hypothetical answers) bridges the gap between how a user asks and how the corpus is written.
The live debate: with long context windows, why retrieve at all? Because retrieval is cheaper, fresher, and citable — and dumping a whole corpus into context invites "lost in the middle" and cost blowups. Cache-augmented generation (preloading a small, stable corpus into the cached prefix) is a middle path when the knowledge is bounded and rarely changes.
Memory types (Section 4) describe what's stored; the lifecycle describes how it lives. A store that only grows becomes noise.
Formation — distill raw interaction into clean facts or cases (extraction), not verbatim dumps. Storage — write to the right place: a fact to the semantic store, a case to episodic, a procedure to a skill, a relationship to the graph. Retrieval — surface the right item at the right moment, the same precision problem as RAG. Conflict resolution — when a new fact contradicts an old one ("gains were retuned in the April release"), decide which wins, usually the most recent and best-sourced. Decay — let low-value, stale, or superseded memories fade so retrieval stays sharp. Correction — give users and the system a way to fix or delete a wrong memory; a confidently stored error is worse than a gap.
The resolved oscillation becomes a distilled case (formation), filed to episodic memory and a graph node (storage). Months later a similar signature retrieves it (retrieval). But the calibration has since changed — so the agent must reconcile the old fix against the new gains (conflict), retire the obsolete detail (decay), and update the record when an engineer corrects a misattributed cause (correction).
The graph in Figure E doesn't materialize by itself. It's built — and the build choices decide whether it helps or just adds latency.
Schema / ontology comes first: define the node types (symptom, subsystem, cause, evidence, physics, fix) and the edge types (localizes-to, explained-by, evidenced-by, resolved-by). A loose schema yields a graph you can't query; an over-strict one rejects real-world messiness. Extraction populates it — increasingly LLM-assisted, reading reports and specs to propose entities and relations, with human or rule-based validation on high-stakes edges. Linking resolves that "magnetic saturation" in one report and "core saturation" in another are the same node. Maintenance keeps it current as new cases write back.
When is the graph worth it over plain RAG? Use the rule: if your questions are lookup ("what are the gains for motor X"), vector RAG suffices. If they're multi-hop or relational ("which faults share a root cause with this one, and what fixed them"), the graph earns its cost. Many systems run both — RAG for passages, the graph for structure — which is exactly the GraphRAG pattern.
Most knowledge has a clock. Treating facts as timeless is a quiet, dangerous bug in any diagnostic or operational system.
A useful distinction is valid time (when a fact was true in the world) versus ingestion time (when the system learned it). The PI gains in effect at the moment of the fault may differ from the gains in the latest calibration release — and answering "why did it oscillate" requires the then-current values, not today's. Point-in-time queries ("what did we know, and what was deployed, as of the fault timestamp") are essential, and they require stamping retrieved facts and graph edges with versions, not just storing the latest.
This reaches back into memory too: episodic cases should record the configuration under which they happened, so a retrieved fix isn't applied to a vehicle whose calibration has since moved on. Versioning is what lets the system reason about change instead of being silently wrong about it.
Each store fails in a characteristic way. Knowing the signature is half the mitigation.
| Store | Failure mode | Symptom | Mitigation |
|---|---|---|---|
| Parametric | Hallucination | Fluent, confident, wrong — esp. on niche or post-cutoff facts | Ground with retrieval/tools; never cite the weights |
| In-context | Lost in the middle | Ignores material buried mid-prompt | Put key content at the edges; trim noise |
| RAG | Retrieval miss / poison | Wrong or irrelevant chunks; injected text in a doc | Better chunking + rerank; sanitize and bound retrieved content |
| Tools | Bad arguments / blind trust | Malformed call, or output trusted without checks | Validate args; sandbox; verify results |
| Fine-tuning | Brittle facts / drift | Memorized facts can't update; behavior over-fits | Fine-tune for form; keep facts in retrieval |
| Knowledge graph | Schema rot | Stale edges, duplicate nodes, broken links | Linking + validation + maintenance on write-back |
| Memory (any) | Stale / conflicting / leaky | Confidently wrong recall; cross-user bleed | Conflict resolution, decay, correction, isolation |
| System prompt | Instruction conflict | Contradictory rules degrade reliability | Order constraints first; keep coherent |
One cross-cutting threat deserves naming: indirect prompt injection. Hostile text can arrive through a retrieved document or a tool result and hijack the agent. Treat all retrieved and tool-returned content as untrusted input, not as instructions.
A correct-sounding answer from the wrong source is a latent failure. Evaluation has to test the plumbing, not just the prose.
Four measures matter. Groundedness / faithfulness: is every claim supported by the retrieved evidence, or did the model wander back into parametric guessing? Attribution accuracy: do the citations actually point to the sentences that support the claim? Retrieval quality: precision and recall of the chunks fetched, measured independently of the generator — because a perfect model can't fix a bad retrieval. Answer correctness: the end-to-end result against a trusted reference. Pipelines like RAGAS operationalize the first three so you can tell which stage failed.
For an agent, add trajectory evaluation: did it call the right tools, retrieve the right facts, and route each sub-question to the appropriate store? In the EV case, an answer that names the right fix but skipped the dyno log got lucky, not right — and luck doesn't generalize.
In a high-stakes domain, an answer you can't trace is an answer you can't trust. Governance is what makes the stack auditable.
Provenance: tag every retrieved chunk, graph edge, and memory with where it came from and when, so any conclusion can be traced to its sources. Access control: retrieval and memory must respect permissions — a diagnostic agent should surface only the data and documents the requester is cleared for, enforced at the row/document level, not by hoping the prompt holds. PII & data handling: minimize what you store, honor deletion, and keep export-controlled or personal data out of stores that weren't designed for it. Audit trail: log what was retrieved, which tools ran with what arguments, and what was written back — both for debugging and for accountability.
These aren't bolt-ons. Provenance is also what powers attribution; access control is also what prevents the cross-user memory leak in Section 20; the audit trail is also your evaluation dataset. Good governance and good engineering are the same investment seen twice.
The hardest skill is knowing when not to answer. A confident root cause built on thin evidence is the most expensive kind of wrong.
Signal of confidence comes largely from agreement across stores — the fusion meter in Figure D made this concrete: parametric hypothesis confirmed by live telemetry, grounded in retrieved calibration, structured into a causal chain, and matched to a prior case is a strong answer; the same hypothesis with no data behind it is a guess. When the sources don't converge, the right move is to abstain or escalate: say what is and isn't supported, request the missing measurement, or hand off to a human.
Design for it explicitly: thresholds below which the agent refuses to recommend an action, a clear "insufficient evidence" path, and a handoff that carries the assembled graph and evidence so the human starts where the agent left off. An agent that escalates well is more trustworthy than one that always has an answer.
Calibrated doubt is a feature. The store that says "I don't have enough to call this" has done its job.