Glossary

Memory and Context Management

Memory management (sometimes "context engineering") is one of the central engineering challenges of agentic AI. A frontier LLM in 2025 has 200k–2M tokens of context, but long-running agents quickly exceed even those bounds. Production systems borrow the memory hierarchy metaphor from operating systems.

Three tiers

Tier Substrate Lifespan Capacity
Working / short-term LLM context window One inference call 200k–2M tokens
Long-term semantic Vector database Permanent Unbounded
Episodic Summarised past conversations Session-scoped or permanent Hundreds–thousands of summaries

Working memory tactics

  1. System prompt + recent turns, the canonical chat layout.
  2. Scratchpad, <thinking>...</thinking> blocks (e.g. chain-of-thought or reasoning models).
  3. Tool result truncation, drop verbose tool outputs after they are summarised.
  4. Dynamic prompting, re-inject critical state at the bottom of context (recency bias).

Long-term memory

Implemented as a vector DB keyed by embeddings:

def remember(text):
    emb = embed(text)
    vector_store.upsert(id=uuid(), embedding=emb, metadata={"text": text})

def recall(query, k=5):
    return vector_store.search(embed(query), top_k=k)

Triggers for storage are typically:

  • User states a fact about themselves.
  • Agent learns a new tool or skill.
  • Conversation crosses a summary boundary.

Episodic memory

For long conversations, agents periodically summarise older turns:

[System] [Summary of turns 1-50] [Verbatim turns 51-100] [Current user message]

Implementations include:

  • MemGPT / Letta (Packer et al. 2023), virtual context paging à la OS virtual memory.
  • Mem0, managed long-term memory service.
  • Zep, temporal knowledge graphs of conversation entities.

Anthropic's Memory Tool (2025)

Claude 4.5+ ships a structured memory tool: a file-system-like store with create_file, read_file, update_file, delete_file operations. The agent decides what to write and when to read; persistence is across sessions. This effectively turns the file system into the long-term memory.

Compaction

When context fills up, compaction rewrites the history into a shorter summary. Claude Code, Codex, and Cursor all implement automatic compaction at ~80% context usage. Naïve compaction loses information; better systems keep recent verbatim plus a structured summary of older turns.

Open problems

  • What to remember, agents store too much (noise) or too little (forgetting).
  • Retrieval drift , embedding similarity ≠ relevance.
  • Memory poisoning, an adversary plants false "memories" via prompt injection.
  • Cross-session identity, should the assistant remember you between accounts? Privacy-vs-utility tradeoff.

Citation

Packer, C. et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.

Related terms: Vector Database, Embeddings APIs, Retrieval-Augmented Generation, Agentic RAG

Discussed in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).