A Field Guide to Machine Cognition · Expanded Edition

Knowledge & Memory in Large Language Models

The full account: the kinds of knowledge a model draws on, the kinds of memory it operates over, how they trade off, how an agent fuses them through a knowledge graph — and the engineering, failure modes, and governance that make it work in production. Threaded throughout by one example: root-causing a current oscillation in an EV traction drive.

By Majid Mazouchi · Knowledge Systems Series

PART ONE

The Taxonomy

§ 01

The two questions

Every time a model produces an answer, two distinct questions are at play: where did the information come from, and how long does it persist. These are not the same axis, and conflating them is the most common source of confusion in applied LLM work.

Knowledge sources describe the origin of what the model uses: baked into the weights, supplied in the prompt, retrieved from a database, or fetched live from a tool. Memory describes how long information is retained: for a single forward pass, for one conversation, or permanently.

A retrieved document is a knowledge source and a form of short-term memory while it sits in context. A fine-tuned fact is a knowledge source and long-term parametric memory. Holding both axes in mind keeps system design honest.

A model does not have a single brain. It has a stack of overlapping stores, each with its own latency, cost, freshness, and failure mode.

▸ The running case, used throughout

A traction inverter reports current oscillation on drift-mode exit at roughly 3000 RPM and 750 Nm. Nobody hands you the answer. You have control theory in your head, calibration files on a server, live dyno logs behind an API, a stack of past failure reports, and a recalibration procedure that has worked before. Each store below is shown doing its part of this one investigation.

§ 02

Types of knowledge sources

Ordered roughly from most-baked-in to most-external. The trade-off runs along a spine: deeper integration means lower latency but worse freshness and traceability; more external means higher latency but auditable, updatable knowledge.

Parametric knowledge · in the weights

What the model absorbed during pretraining and stored implicitly in its parameters — grammar, world facts, reasoning patterns, code idioms. Fast (no lookup) but frozen at the training cutoff, hard to attribute, and prone to confident error when stretched beyond what it actually saw.

Practical: Treat parametric facts as a strong prior, never a citation. Hallucination lives here.

In-context knowledge · the prompt

Anything you place in the context window: the question, system instructions, pasted documents, few-shot examples, prior turns. The highest-fidelity channel — the model treats supplied text as ground truth far more reliably than its own recall.

Practical: Long contexts suffer "lost in the middle." Put the most critical material at the edges.

Retrieval-augmented (RAG) · vector / hybrid search

An external corpus is chunked, embedded, and indexed; relevant chunks are retrieved into the prompt at query time. Makes knowledge updatable, attributable, and unbounded by the window, without retraining.

Practical: Retrieval quality caps answer quality. Section 16 opens the pipeline up.

Tool / function calling · live, dynamic

The model emits a structured call to an external system — search, calculator, SQL, an internal API — and the result returns into context. Covers knowledge that is live, computed, or impossible to memorize.

Practical: Tools turn "what the model knows" into "what it can find out." Validate and sandbox every call.

Fine-tuning & adapters · weight updates

Continued training on curated data — full fine-tuning or parameter-efficient methods like LoRA. The right tool for how the model behaves; a weaker tool for injecting facts.

Practical: Fine-tune for form and behavior, retrieve for facts.

Structured knowledge: graphs & databases · symbolic

Knowledge graphs, relational tables, and ontologies provide explicit, queryable relationships the model traverses via tools (GraphRAG, text-to-SQL). They answer multi-hop and aggregate questions prose chunks cannot. Figure E walks one node by node.

Practical: Use graphs when relationships and provenance matter as much as content.

System prompts & instructions · policy layer

A persistent instruction block framing every turn: role, constraints, format, safety rules, durable facts. Technically in-context, but the highest-authority, most-stable supplied knowledge in a deployment.

Practical: Keep it stable and ordered — constraints first, examples last.

§ 03

Sources at a glance

Source	Freshness	Updatable?	Attributable?	Best for
Parametric	Frozen at cutoff	Retrain only	No	General reasoning, language, common facts
In-context	As fresh as input	Per request	Yes	Grounding on supplied material
RAG	As fresh as index	Re-index	Yes	Private / large / changing corpora
Tools / APIs	Real-time	Always live	Yes	Live data, computation, actions
Fine-tuning	Frozen at train	Retrain	No	Style, format, behavior, narrow domain
Knowledge graph	As fresh as graph	Update graph	Yes	Multi-hop, relational, provenance

The pragmatic stack: parametric reasoning + a stable system prompt + RAG for facts + tools for live actions, with fine-tuning reserved for behavior prompting can't pin down.

§ 04

Types of memory

Memory borrows its vocabulary from cognitive science, but the mechanisms are engineered. The dividing line is retention horizon: a single forward pass, one session, or persistence across sessions.

Short-term: the working memory

Context window · working memory

The model's only true working memory: the token span it can attend to right now. When the conversation outgrows it, the oldest tokens fall off — the model doesn't forget gracefully, it stops seeing them.

Practical: Bigger windows aren't free; recall in the middle degrades. Manage the budget — Section 15.

KV cache · operational

The cached key/value tensors of already-processed tokens, so generation doesn't recompute the sequence each step. Invisible to users, but it governs throughput, latency, and prefix caching.

Practical: Put stable content (system prompt) at the front so it can be prefix-cached.

Long-term: persisted across the model's life or the deployment's

Parametric (semantic) memory · long-term

Knowledge frozen into the weights — the model's long-term semantic memory. Permanent within a version, not editable per-user, not introspectable. The same store as "parametric knowledge," seen through the memory lens.

Episodic memory · conversation history

A record of specific past interactions — what was said, when. Stored as transcripts or summaries and re-injected later. The model has none of this natively; it is an application-layer construct.

Practical: Store summaries plus key facts, not raw transcripts. Set a retention/deletion policy up front.

Semantic memory store · facts about the user/world

Distilled, durable facts — preferences, profile, project context — in a structured store, surfaced when relevant. Episodic is "what happened," semantic is "what is true."

Practical: Separate extraction from retrieval. The full lifecycle is Section 17.

Procedural memory · skills & workflows

Knowledge of how to do things — tool-use patterns, multi-step workflows, learned routines. Partly in weights, partly external artifacts an agent loads when a task recurs. Section 10 develops this as skills.

Practical: Capture recurring successful procedures as named, reusable artifacts.

External / scratchpad memory · agentic

Workspace memory agents write to and read from during a task: notes, intermediate results, files, a running plan. Decouples reasoning from the window limit, enabling long-horizon tasks.

Practical: Persist intermediate state so a long task survives truncation, retries, or a fresh session.

§ 05

Memory at a glance

Memory	Horizon	Native or built?	Holds
Context window	Single request	Native	Active working set of tokens
KV cache	Within a generation	Native	Processed-token tensors (speed)
Parametric	Model lifetime	Native	General world knowledge
Episodic	Across sessions	Built	What happened, when
Semantic store	Across sessions	Built	Durable facts & preferences
Procedural	Mixed	Both	How to perform tasks
External scratchpad	Task duration+	Built	Intermediate state, plans, files

The base model ships with only the native rows. Every "memory" feature users notice (remembering preferences, recalling old chats, resuming long tasks) comes from the built rows, engineered on top.

PART TWO

The Agentic Loop

§ 06

The map: origin × retention

Origin runs from internal (in the weights) to external (fetched at runtime). Retention runs from transient (gone after one request) to permanent. Every store sits somewhere on this plane — and that position predicts its cost, freshness, and failure mode.

Figure A · Interactive map (tap a node): select a store to see what it is, what it costs, and where it shows up in the EV fault case.

The same plane holds both knowledge sources and memory stores — they are two readings of the same objects.

§ 07

Benefits, head to head

No store is best; each wins a specific contest. The art is knowing which contest you are in.

Parametric vs. Retrieval

Parametric wins

Breadth & speed

Zero-latency reasoning over general principles. The model already knows differential inductance falls under saturation — no lookup to form the hypothesis.

Retrieval wins

Specific & current facts

The actual PI gains, this motor's flux table, last quarter's failure log — none in the weights. RAG makes them present, citable, updatable.

In-context vs. Fine-tuning

In-context wins

Facts & flexibility

Paste the dyno trace and it's ground truth this turn. Change input, change answer — no training cycle, full traceability.

Fine-tuning wins

Behavior & form

To make every diagnosis emit the same structured report reliably, bend the weights. Form lives better in parameters than in a growing prompt.

Tools vs. everything internal

Tools win

Live, precise, actionable

Today's telemetry, an exact ripple FFT, a SQL count of affected VINs. Time-sensitive or computed → tool call, never recall.

Internal wins

No round-trip, no dependency

Reasoning, framing, well-known physics need no network call. A tool call to recall Ohm's law is latency for nothing.

Parametric knowledge forms the hypothesis. Retrieval and tools test it against reality. Memory remembers how it turned out.

§ 08

Which store? — decide

A practical router. Answer a few questions about the information you need and the tree lands on the store that can actually serve it.

Figure B · Interactive decision tree (answer to walk it)

The same routing discipline an agent applies internally on every sub-question of a task.
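The routing questions can be sketched as a small function. This is a minimal sketch in Python, assuming four yes/no traits per sub-question; the trait names, the `Need` type, and the order of the checks are illustrative assumptions, not the actual tree in Figure B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Need:
    """Traits of one sub-question (all field names are hypothetical)."""
    live_or_computed: bool = False     # must be measured or calculated now
    recurring_procedure: bool = False  # a known, validated way to act exists
    relational: bool = False           # multi-hop or "which X share Y" shape
    private_or_changing: bool = False  # lives in docs, specs, or logs


def route(need: Need) -> str:
    """Return the name of the store that can actually serve this need."""
    if need.live_or_computed:
        return "tool"
    if need.recurring_procedure:
        return "skill"
    if need.relational:
        return "knowledge_graph"
    if need.private_or_changing:
        return "rag"
    return "parametric"  # general principles need no lookup


print(route(Need(live_or_computed=True)))     # today's dyno log
print(route(Need(private_or_changing=True)))  # this motor's PI gains
print(route(Need()))                          # why saturation lowers L_diff
```

The checks run from most external to most internal, so a question that is both relational and document-bound lands on the graph rather than on plain retrieval.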

§ 09

The agentic diagnosis loop

A single-shot chat mostly uses parametric + in-context. An agentic workflow plans, calls tools, retrieves, reasons, and writes back — touching every store in sequence. Step through the fault investigation; each step names its source and memory and lights up the matching nodes in the graph (Figure E).

Figure C · Interactive walkthrough (wired to Figure E)

The loop is what distinguishes an agent from a chatbot: it doesn't answer from one store, it orchestrates all of them.

Internal stores frame the problem, external stores ground it, and memory closes the loop by retaining the resolved case so the next investigation starts smarter. Skip retrieval and tools and you get a fluent guess; skip the write-back and the agent never learns.

§ 10

Skills: the procedural memory

A skill is reusable know-how — a named, packaged procedure the agent loads when a task type recurs. Semantic memory holds facts, episodic holds what happened, procedural holds how to act.

In the EV case, step seven of the loop doesn't reinvent a fix. It invokes a skill — call it recal-saturation-instability — that encodes a validated procedure: re-measure the flux map at high current, smooth the table edges, refit the PI gains against the saturated differential inductance rather than the nominal value, check the discretization margin ωe·Ta, and re-run the dyno point.

Why a skill, and not just a prompt?

Three benefits. Reliability — the procedure ran and worked, so reuse removes a class of reasoning errors. Composability — skills call tools, pull retrieval, and invoke other skills, so a complex job becomes an orchestration of trusted units. Economy — loading a tested playbook beats burning context and tokens re-deriving the workflow every run.

A skill is partly in the weights (the model broadly knows FOC tuning) and partly external (the version-controlled playbook). That split is exactly why procedural memory sits mid-origin: learned and stored.
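The external half of a skill can be as simple as a named, versioned list of steps. The sketch below assumes a plain in-process registry; `Skill`, `REGISTRY`, `run_skill`, and the version string are hypothetical names for illustration, and the step texts paraphrase the procedure described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Skill:
    name: str
    version: str
    steps: List[str]


REGISTRY: Dict[str, Skill] = {}


def register(skill: Skill) -> None:
    REGISTRY[skill.name] = skill


def run_skill(name: str, execute: Callable[[str], bool]) -> List[str]:
    """Run each step in order; stop at the first step that reports failure."""
    done = []
    for step in REGISTRY[name].steps:
        if not execute(step):
            break
        done.append(step)
    return done


register(Skill(
    name="recal-saturation-instability",
    version="1.0.0",  # hypothetical version tag
    steps=[
        "re-measure flux map at high current",
        "smooth table edges",
        "refit PI gains against saturated differential inductance",
        "check discretization margin",
        "re-run dyno point",
    ],
))

# The executor is a stand-in; a real one would call tools for each step.
completed = run_skill("recal-saturation-instability", execute=lambda step: True)
print(len(completed))
```

Returning the list of completed steps gives the agent something concrete to write back if a run stops partway.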

§ 11

Merging sources into knowledge

Knowledge is what you get when sources are fused. Toggle inputs to build a diagnosis up from nothing — or hit Select all and remove one to run a fault-injection: watch which removal breaks the conclusion.

Figure D · Interactive fusion / fault injection (toggle sources, or break it). Initial state: no sources selected, so the agent has nothing to reason over and diagnostic confidence reads 0%.

Fusion has a grammar: tools without framing return uninterpretable numbers; a graph without evidence is an empty schema; a skill without a confirmed cause is a fix in search of a problem.

§ 12

The knowledge graph as glue

A knowledge graph is where fusion becomes structure. Symptoms, operating points, subsystems, causes, evidence, physics, history, and fixes become typed nodes; relationships become traversable edges. Each node is sourced from a different store — the graph doesn't replace them, it indexes them. Click a node, or step the loop above to see it drive this graph.

Figure E · Interactive graph: the fault graph (click a node to trace its role and see which store populates it)

Legend: symptom / state · structure · candidate cause · evidence (tool/RAG) · physics (parametric) · fix (skill)

Plain RAG retrieves a paragraph that mentions saturation; the graph traverses from a live symptom, through the physics that explains it, to the procedure that resolved a past case — carrying provenance the whole way.
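That traversal can be sketched with a breadth-first search over typed edges, where each edge remembers which store populated it. The node names and the provenance labels below are illustrative assumptions, not the contents of Figure E.

```python
from collections import deque

# (source, edge type, target, store that populated the edge)
EDGES = [
    ("current oscillation", "localizes-to", "current controller", "tool"),
    ("current controller", "explained-by", "magnetic saturation", "rag"),
    ("magnetic saturation", "explained-by", "L_diff drops under saturation", "parametric"),
    ("magnetic saturation", "resolved-by", "recal-saturation-instability", "skill"),
]


def find_path(start, goal):
    """Breadth-first search returning the edges from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for src, kind, dst, store in EDGES:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(src, kind, dst, store)]))
    return None


path = find_path("current oscillation", "recal-saturation-instability")
for src, kind, dst, store in path:
    print(f"{src} -[{kind}]-> {dst}  (from {store})")
```

Because every hop carries its store label, the answer can say not only which fix applies but which source vouches for each link in the chain.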

§ 13

Closing the loop: write-back

An agent that never records what it learned re-solves the same fault forever. Write-back is how today's investigation becomes tomorrow's head start: each resolved case adds a node to the graph and an entry to memory, and the next similar diagnosis starts further along.

Figure F · Interactive (run diagnoses, watch memory grow). Readouts: cases learned (initially 0) and time to first hypothesis. A cold knowledge base reasons every case from scratch.

Returns diminish — memory isn't free to grow. Consolidation (merge near-duplicates) and decay (retire stale cases) keep it useful, not just large.

The discipline matters as much as the mechanism: write back distilled outcomes, not raw transcripts; reconcile a new case against existing nodes rather than appending blindly; and let low-value memories decay so retrieval stays sharp.
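Reconciling instead of appending blindly might look like the following. It is a minimal sketch that assumes each case is reduced to a set of tags and that Jaccard overlap is a good enough similarity measure; the 0.6 threshold and the field names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def write_back(memory: list, case: dict, threshold: float = 0.6) -> str:
    """Merge the case into a near-duplicate if one exists, else append it."""
    tags = set(case["tags"])
    for existing in memory:
        if jaccard(tags, set(existing["tags"])) >= threshold:
            existing["tags"] = sorted(tags | set(existing["tags"]))
            existing["count"] += 1
            return "merged"
    memory.append({"tags": sorted(tags), "fix": case["fix"], "count": 1})
    return "appended"


memory = []
first = write_back(memory, {"tags": ["oscillation", "3000rpm", "saturation"], "fix": "recalibrate"})
second = write_back(memory, {"tags": ["oscillation", "3000rpm", "saturation", "750nm"], "fix": "recalibrate"})
third = write_back(memory, {"tags": ["overtemp", "coolant"], "fix": "replace pump"})
print(first, second, third, len(memory))
```

The second case shares three of four tags with the first, so it strengthens the existing entry instead of creating a near-duplicate.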

§ 14

The hybrid framework

Put it together and a capable diagnostic agent is a small society of stores with clear roles. None is sufficient; the architecture is the value.

Layer	Store	Role in the loop	EV case
Reasoning	Parametric	Forms hypotheses from principles	Knows L_diff drops under saturation
Framing	System prompt / in-context	Sets role, constraints, the live query	"You are a VHM diagnostic agent"
Evidence	Tools	Pulls live, precise, computed data	Dyno log, current FFT, VIN count
Facts	RAG	Grounds in calibration & docs	PI gains, flux table, prior reports
Structure	Knowledge graph	Fuses sources into a causal chain	Symptom→cause→physics→fix
Recall	Episodic memory	Surfaces analogous past cases	Prior YASA NVH signature
Action	Procedural / skill	Executes validated remediation	Recalibration playbook
Working set	Context window	Holds the active investigation	All of the above, this turn

The hybrid pattern in vehicle-health and similar high-stakes domains layers these deliberately: a physics layer (parametric + structured models) proposes mechanisms, an ML/data layer (tools + retrieval over telemetry) tests them against measurement, and an LLM layer orchestrates — planning, fusing, narrating, and writing memory back. The LLM is the conductor, not the whole orchestra.

An agent's intelligence is less about any one store than about the discipline of routing each question to the store that can actually answer it.

PART THREE

Engineering, Failure & Governance

§ 15

Context engineering & the budget

The context window is a finite resource shared by every in-context source. Context engineering is the craft of allocating it: what to include, what to compress, what to drop, and what to offload — so the model has room left to actually reason.

Four moves recur. Prioritize: put the highest-value material at the edges, where recall is best. Compact: replace old turns with rolling summaries so history costs less. Evict: drop the least-relevant retrieved chunks rather than stuffing the top-50. Offload: move bulky intermediate state to a scratchpad and pull it back on demand. Underneath, the KV cache rewards keeping the stable prefix fixed so it can be reused.

Figure G · Interactive budget (drag to reallocate)

Headroom is the room the model has to think. Spend the whole window on inputs and there is nothing left for reasoning — the silent cause of many "it ignored half my context" complaints.

The failure this prevents is quiet: an over-full window doesn't error, it truncates or buries — so the model simply stops attending to material you assumed it had.
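The prioritize and evict moves reduce to a packing problem once headroom is reserved up front. The sketch below assumes token counts and relevance scores are already known for each chunk; the function name, the greedy strategy, and all numbers are illustrative.

```python
def pack_context(chunks, window, fixed_tokens, headroom):
    """Greedily keep the highest-scoring chunks that fit the remaining budget.

    chunks: list of (name, tokens, relevance) tuples.
    Returns (kept, evicted) lists of chunk names.
    """
    budget = window - fixed_tokens - headroom
    kept, evicted = [], []
    for name, tokens, _score in sorted(chunks, key=lambda c: c[2], reverse=True):
        if tokens <= budget:
            kept.append(name)
            budget -= tokens
        else:
            evicted.append(name)
    return kept, evicted


chunks = [
    ("dyno trace summary", 3000, 0.95),
    ("PI gain table", 1500, 0.90),
    ("old chat history", 4000, 0.30),
    ("flux map excerpt", 2000, 0.80),
]
# fixed_tokens covers the system prompt and query; headroom is left for reasoning.
kept, evicted = pack_context(chunks, window=16000, fixed_tokens=2000, headroom=6000)
print(kept)
print(evicted)
```

Here the low-relevance chat history is the one item that does not fit, which makes it a natural candidate for compaction into a summary or offloading to a scratchpad.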

§ 16

Retrieval mechanics, opened up

"RAG" hides a pipeline, and every stage caps the quality of the next. The LLM at the end can only be as good as what reaches it.

query → rewrite (expand, disambiguate) → embed → hybrid search (dense + BM25) → rerank (cross-encoder) → top-k chunks → context → generate + cite

Chunking decides what a "unit of knowledge" is — too small and you lose context, too large and you dilute relevance; semantic or structure-aware splitting beats fixed windows for technical documents like calibration specs. Embeddings must match the domain: a general model may not separate "differential inductance" from "leakage inductance" the way a motor-control corpus needs. Hybrid search pairs dense vectors (meaning) with keyword/BM25 (exact terms, part numbers, symbols) so a query for Ld doesn't get lost. Reranking with a cross-encoder reorders the shortlist by true relevance before it spends context budget. Query rewriting (and HyDE-style hypothetical answers) bridges the gap between how a user asks and how the corpus is written.
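One common way to combine the dense and keyword rankings is reciprocal rank fusion (RRF), shown here as a sketch. The text above does not name a fusion method, so RRF is an assumption of this example, as are the document ids and the two input rankings; `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; a doc's score is the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the two retrievers for a query about Ld at saturation.
dense_ranking = ["flux_map_notes", "saturation_report", "thermal_spec"]
keyword_ranking = ["ld_lq_table", "saturation_report", "flux_map_notes"]

fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

Documents that both retrievers surface rise to the top, while the exact-match table that only the keyword search found still makes the shortlist for the reranker.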

The live debate: with long context windows, why retrieve at all? Because retrieval is cheaper, fresher, and citable — and dumping a whole corpus into context invites "lost in the middle" and cost blowups. Cache-augmented generation (preloading a small, stable corpus into the cached prefix) is a middle path when the knowledge is bounded and rarely changes.

§ 17

The memory lifecycle

Memory types (Section 4) describe what's stored; the lifecycle describes how it lives. A store that only grows becomes noise.

Formation — distill raw interaction into clean facts or cases (extraction), not verbatim dumps. Storage — write to the right place: a fact to the semantic store, a case to episodic, a procedure to a skill, a relationship to the graph. Retrieval — surface the right item at the right moment, the same precision problem as RAG. Conflict resolution — when a new fact contradicts an old one ("gains were retuned in the April release"), decide which wins, usually the most recent and best-sourced. Decay — let low-value, stale, or superseded memories fade so retrieval stays sharp. Correction — give users and the system a way to fix or delete a wrong memory; a confidently stored error is worse than a gap.

▸ Lifecycle in the EV case

The resolved oscillation becomes a distilled case (formation), filed to episodic memory and a graph node (storage). Months later a similar signature retrieves it (retrieval). But the calibration has since changed — so the agent must reconcile the old fix against the new gains (conflict), retire the obsolete detail (decay), and update the record when an engineer corrects a misattributed cause (correction).
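Conflict resolution, decay, and correction can be sketched on a tiny semantic store. This example assumes "most recent wins" as the only conflict rule, which is simpler than the recency-plus-source rule described above; the class, the field names, and the gain values are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Fact:
    key: str
    value: str
    source: str
    recorded: date


class SemanticStore:
    def __init__(self):
        self.facts = {}

    def write(self, fact: Fact) -> bool:
        """Conflict resolution: the more recent fact for a key wins."""
        current = self.facts.get(fact.key)
        if current is None or fact.recorded >= current.recorded:
            self.facts[fact.key] = fact
            return True
        return False

    def decay(self, today: date, max_age_days: int) -> list:
        """Drop facts older than max_age_days and return their keys."""
        cutoff = today - timedelta(days=max_age_days)
        stale = [k for k, f in self.facts.items() if f.recorded < cutoff]
        for key in stale:
            del self.facts[key]
        return stale

    def correct(self, key: str) -> None:
        """Correction: delete a wrong memory outright."""
        self.facts.pop(key, None)


store = SemanticStore()
store.write(Fact("pi_gains", "Kp=0.8", "cal release A", date(2024, 1, 10)))
store.write(Fact("pi_gains", "Kp=0.6", "cal release B", date(2024, 4, 2)))
# Arrives last but is older, so it must not overwrite the April value.
store.write(Fact("pi_gains", "Kp=0.9", "old wiki page", date(2023, 6, 1)))
print(store.facts["pi_gains"].value)
```

Ordering by when the fact was recorded, not by when it was written to the store, is what keeps a late-arriving stale document from overwriting a newer calibration.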

§ 18

Building the knowledge graph

The graph in Figure E doesn't materialize by itself. It's built — and the build choices decide whether it helps or just adds latency.

Schema / ontology comes first: define the node types (symptom, subsystem, cause, evidence, physics, fix) and the edge types (localizes-to, explained-by, evidenced-by, resolved-by). A loose schema yields a graph you can't query; an over-strict one rejects real-world messiness. Extraction populates it — increasingly LLM-assisted, reading reports and specs to propose entities and relations, with human or rule-based validation on high-stakes edges. Linking resolves that "magnetic saturation" in one report and "core saturation" in another are the same node. Maintenance keeps it current as new cases write back.

When is the graph worth it over plain RAG? A rule of thumb: if your questions are lookups ("what are the gains for motor X"), vector RAG suffices. If they're multi-hop or relational ("which faults share a root cause with this one, and what fixed them"), the graph earns its cost. Many systems run both, with RAG for passages and the graph for structure, which is the idea behind the GraphRAG pattern.
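Schema enforcement and linking can be sketched in a few lines. The node and edge types below are the ones listed above; the alias table, the `Graph` class, and its methods are hypothetical, and real entity linking would need more than an exact-match lookup.

```python
NODE_TYPES = {"symptom", "subsystem", "cause", "evidence", "physics", "fix"}
EDGE_TYPES = {"localizes-to", "explained-by", "evidenced-by", "resolved-by"}

# Linking: hypothetical alias table mapping surface forms to one canonical node.
ALIASES = {"core saturation": "magnetic saturation"}


def canonical(name: str) -> str:
    key = name.strip().lower()
    return ALIASES.get(key, key)


class Graph:
    def __init__(self):
        self.nodes = {}     # canonical name -> node type
        self.edges = set()  # (source, edge type, target)

    def add_node(self, name: str, node_type: str) -> str:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        node = canonical(name)
        self.nodes.setdefault(node, node_type)
        return node

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        src, dst = canonical(source), canonical(target)
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges.add((src, edge_type, dst))


g = Graph()
g.add_node("current oscillation", "symptom")
g.add_node("Magnetic saturation", "cause")
g.add_node("core saturation", "cause")  # resolves to the existing node
g.add_edge("current oscillation", "explained-by", "core saturation")
print(len(g.nodes), sorted(g.edges))
```

Rejecting unknown types at write time is the strict end of the schema trade-off; a looser build might queue such items for human review instead of raising.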

§ 19

Temporal & versioned knowledge

Most knowledge has a clock. Treating facts as timeless is a quiet, dangerous bug in any diagnostic or operational system.

A useful distinction is valid time (when a fact was true in the world) versus ingestion time (when the system learned it). The PI gains in effect at the moment of the fault may differ from the gains in the latest calibration release — and answering "why did it oscillate" requires the then-current values, not today's. Point-in-time queries ("what did we know, and what was deployed, as of the fault timestamp") are essential, and they require stamping retrieved facts and graph edges with versions, not just storing the latest.

This reaches back into memory too: episodic cases should record the configuration under which they happened, so a retrieved fix isn't applied to a vehicle whose calibration has since moved on. Versioning is what lets the system reason about change instead of being silently wrong about it.
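A point-in-time query is a binary search over a version history. The sketch below assumes each fact is stored as a sorted list of (valid-from date, value) pairs; the history contents and the `as_of` helper are made-up placeholders.

```python
from bisect import bisect_right
from datetime import date

# (valid_from, value) pairs sorted by date; the values are made-up placeholders.
PI_GAIN_HISTORY = [
    (date(2024, 1, 10), "cal-A"),
    (date(2024, 4, 2), "cal-B"),
    (date(2024, 9, 15), "cal-C"),
]


def as_of(history, when: date):
    """Return the value whose valid_from is the latest one not after `when`."""
    dates = [valid_from for valid_from, _ in history]
    index = bisect_right(dates, when)
    if index == 0:
        return None  # nothing was valid yet
    return history[index - 1][1]


fault_time = date(2024, 3, 20)
print(as_of(PI_GAIN_HISTORY, fault_time))        # calibration in effect at the fault
print(as_of(PI_GAIN_HISTORY, date(2024, 12, 1)))  # a later calibration
```

Answering "why did it oscillate" with the first result rather than the second is the whole point: the fault happened under the earlier calibration.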

§ 20

Failure modes & anti-patterns

Each store fails in a characteristic way. Knowing the signature is half the mitigation.

Store	Failure mode	Symptom	Mitigation
Parametric	Hallucination	Fluent, confident, wrong — esp. on niche or post-cutoff facts	Ground with retrieval/tools; never cite the weights
In-context	Lost in the middle	Ignores material buried mid-prompt	Put key content at the edges; trim noise
RAG	Retrieval miss / poison	Wrong or irrelevant chunks; injected text in a doc	Better chunking + rerank; sanitize and bound retrieved content
Tools	Bad arguments / blind trust	Malformed call, or output trusted without checks	Validate args; sandbox; verify results
Fine-tuning	Brittle facts / drift	Memorized facts can't update; behavior over-fits	Fine-tune for form; keep facts in retrieval
Knowledge graph	Schema rot	Stale edges, duplicate nodes, broken links	Linking + validation + maintenance on write-back
Memory (any)	Stale / conflicting / leaky	Confidently wrong recall; cross-user bleed	Conflict resolution, decay, correction, isolation
System prompt	Instruction conflict	Contradictory rules degrade reliability	Order constraints first; keep coherent

One cross-cutting threat deserves naming: indirect prompt injection. Hostile text can arrive through a retrieved document or a tool result and hijack the agent. Treat all retrieved and tool-returned content as untrusted input, not as instructions.
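The "validate args" mitigation from the table can be sketched as an allowlist check that runs before any tool executes. The tool name, argument names, and ranges below are hypothetical, and a check like this is one layer of defense, not a complete answer to injection.

```python
ALLOWED_TOOLS = {
    # tool name -> {argument name: (type, min, max)}; all names are hypothetical
    "fetch_dyno_log": {"rpm": (int, 0, 20000), "torque_nm": (int, 0, 2000)},
}


def validate_call(tool: str, args: dict) -> list:
    """Return a list of problems; an empty list means the call may proceed."""
    spec = ALLOWED_TOOLS.get(tool)
    if spec is None:
        return [f"tool not allowed: {tool}"]
    problems = [f"unexpected argument: {name}" for name in args if name not in spec]
    for name, (kind, low, high) in spec.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], kind) or isinstance(args[name], bool):
            problems.append(f"wrong type for {name}")
        elif not low <= args[name] <= high:
            problems.append(f"{name} out of range")
    return problems


print(validate_call("fetch_dyno_log", {"rpm": 3000, "torque_nm": 750}))
print(validate_call("fetch_dyno_log", {"rpm": "3000; DROP TABLE", "torque_nm": 750}))
print(validate_call("delete_calibration", {}))
```

A call that a hijacked agent invents, such as a tool that is not on the list, is rejected before it can reach any system.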

§ 21

Evaluation: did it use the right source?

A correct-sounding answer from the wrong source is a latent failure. Evaluation has to test the plumbing, not just the prose.

Four measures matter. Groundedness / faithfulness: is every claim supported by the retrieved evidence, or did the model wander back into parametric guessing? Attribution accuracy: do the citations actually point to the sentences that support the claim? Retrieval quality: precision and recall of the chunks fetched, measured independently of the generator, because a perfect model can't fix a bad retrieval. Answer correctness: the end-to-end result against a trusted reference. Frameworks like RAGAS operationalize reference-free versions of several of these, such as faithfulness and context relevance, so you can tell which stage failed.

For an agent, add trajectory evaluation: did it call the right tools, retrieve the right facts, and route each sub-question to the appropriate store? In the EV case, an answer that names the right fix but skipped the dyno log got lucky, not right — and luck doesn't generalize.
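Retrieval quality and a basic trajectory check are both easy to compute once labels exist. The sketch below assumes a hand-labeled set of relevant document ids and a list of required steps; the ids and step names are hypothetical.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved ids against a labeled set."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def missing_steps(trajectory, required):
    """Required tool or store calls that never appear in the agent's trajectory."""
    return [step for step in required if step not in trajectory]


retrieved = ["pi_gains", "thermal_spec", "flux_table", "hr_policy"]
relevant = {"pi_gains", "flux_table", "prior_report"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=4)
print(precision, round(recall, 3))

trajectory = ["rag:calibration", "skill:recal-saturation-instability"]
skipped = missing_steps(trajectory, required=["tool:dyno_log", "rag:calibration"])
print(skipped)
```

The second check flags exactly the lucky case: the agent reached a fix without ever pulling the dyno log.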

§ 22

Governance & provenance

In a high-stakes domain, an answer you can't trace is an answer you can't trust. Governance is what makes the stack auditable.

Provenance: tag every retrieved chunk, graph edge, and memory with where it came from and when, so any conclusion can be traced to its sources. Access control: retrieval and memory must respect permissions — a diagnostic agent should surface only the data and documents the requester is cleared for, enforced at the row/document level, not by hoping the prompt holds. PII & data handling: minimize what you store, honor deletion, and keep export-controlled or personal data out of stores that weren't designed for it. Audit trail: log what was retrieved, which tools ran with what arguments, and what was written back — both for debugging and for accountability.

These aren't bolt-ons. Provenance is also what powers attribution; access control is also what prevents the cross-user memory leak in Section 20; the audit trail is also your evaluation dataset. Good governance and good engineering are the same investment seen twice.
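Document-level access control and the audit trail can share one code path, as in this sketch. It assumes each document carries a set of permitted groups; the documents, group names, and log fields are hypothetical, and relevance ranking is deliberately left out.

```python
import json
from datetime import datetime, timezone

DOCS = [
    # Hypothetical documents; "acl" lists the groups allowed to read each one.
    {"id": "cal-spec-7", "acl": {"powertrain"}, "text": "PI gain table"},
    {"id": "supplier-nda-2", "acl": {"legal"}, "text": "supplier terms"},
]

AUDIT_LOG = []


def retrieve(query: str, user: str, groups: set) -> list:
    """Return only documents the requester may read, and log the access."""
    visible = [doc for doc in DOCS if doc["acl"] & groups]
    AUDIT_LOG.append(json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "query": query,
        "returned": [doc["id"] for doc in visible],
    }))
    return visible


results = retrieve("PI gains for the traction drive", "engineer-1", {"powertrain"})
print([doc["id"] for doc in results])
print(len(AUDIT_LOG))
```

Filtering happens before anything reaches the prompt, so a document the requester cannot read never enters the context window in the first place.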

§ 23

Uncertainty & escalation

The hardest skill is knowing when not to answer. A confident root cause built on thin evidence is the most expensive kind of wrong.

The confidence signal comes largely from agreement across stores, which the fusion meter in Figure D made concrete: a parametric hypothesis that is confirmed by live telemetry, grounded in retrieved calibration, structured into a causal chain, and matched to a prior case is a strong answer; the same hypothesis with no data behind it is a guess. When the sources don't converge, the right move is to abstain or escalate: say what is and isn't supported, request the missing measurement, or hand off to a human.

Design for it explicitly: thresholds below which the agent refuses to recommend an action, a clear "insufficient evidence" path, and a handoff that carries the assembled graph and evidence so the human starts where the agent left off. An agent that escalates well is more trustworthy than one that always has an answer.
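An explicit threshold with an "insufficient evidence" path might be sketched as follows. The weights, the 0.75 threshold, and the evidence names are hypothetical and would need calibration against real outcomes; the point is only the shape of the decision.

```python
# Hypothetical weights for how much each kind of agreement adds to confidence.
WEIGHTS = {
    "parametric_hypothesis": 0.15,
    "live_telemetry": 0.30,
    "retrieved_calibration": 0.20,
    "causal_chain": 0.15,
    "prior_case": 0.20,
}


def decide(evidence: set, act_threshold: float = 0.75) -> tuple:
    """Score agreement across stores and pick an action or an escalation."""
    confidence = round(sum(WEIGHTS[name] for name in evidence), 2)
    if confidence >= act_threshold:
        return confidence, "recommend"
    missing = sorted(set(WEIGHTS) - evidence)
    return confidence, f"escalate: insufficient evidence, missing {missing}"


print(decide({"parametric_hypothesis"}))
print(decide(set(WEIGHTS)))
```

The escalation message names what is missing, so the human or the next tool call knows which measurement would change the verdict.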

Calibrated doubt is a feature. The store that says "I don't have enough to call this" has done its job.

§ 24

Takeaways

What holds up in practice

Origin and retention are the two real axes. Locate any store on that plane and its cost, freshness, and failure mode follow.
Parametric proposes, external disposes. Weights form hypotheses; tools and retrieval decide whether they survive contact with data. Never cite the weights.
Supplied text beats recalled text — but only if the model can see it. Manage the context budget so there's headroom left to reason.
Fine-tune for behavior, retrieve for facts. Memorizing a knowledge base in the weights is usually the expensive wrong answer.
RAG is a pipeline, not a step. Chunking, embeddings, hybrid search, and reranking cap quality more than the LLM does.
Agentic ≠ a bigger prompt. The defining move is routing each sub-question to the store that can answer it, in a loop.
Skills are procedural memory you can trust — reliability, composability, economy from packaging a validated workflow once.
The knowledge graph indexes; it doesn't replace. Use it when questions are multi-hop or relational; carry provenance on every edge.
Close the loop, then keep it clean. Write back distilled outcomes; reconcile, decay, and correct — a store that only grows becomes noise.
Knowledge has a clock. Version facts and cases so you answer with what was true then, not just what's latest.
Treat retrieved and tool content as untrusted. Indirect prompt injection arrives through the data, not the user.
Evaluate the plumbing and know when to stop. Test groundedness and trajectory, not just the prose; abstain or escalate when the sources don't converge.

§ 25

References

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. arXiv:2005.11401
Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172
Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761
Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
Park, J. S. et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442
Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130
Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction. arXiv:2004.12832
Gao, L. et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496
Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217
Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173
Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). arXiv:2309.06180
Wang, L. et al. (2024). A Survey on Large Language Model based Autonomous Agents. Frontiers of Computer Science. arXiv:2308.11432