Memory management (sometimes "context engineering") is one of the central engineering challenges of agentic AI. A frontier LLM in 2025 has 200k–2M tokens of context, but long-running agents quickly exceed even those bounds. Production systems borrow the memory hierarchy metaphor from operating systems.
Three tiers
| Tier | Substrate | Lifespan | Capacity |
|---|---|---|---|
| Working / short-term | LLM context window | One inference call | 200k–2M tokens |
| Long-term semantic | Vector database | Permanent | Unbounded |
| Episodic | Summarised past conversations | Session-scoped or permanent | Hundreds–thousands of summaries |
Working memory tactics
- System prompt + recent turns, the canonical chat layout.
- Scratchpad,
<thinking>...</thinking>blocks (e.g. chain-of-thought or reasoning models). - Tool result truncation, drop verbose tool outputs after they are summarised.
- Dynamic prompting, re-inject critical state at the bottom of context (recency bias).
Long-term memory
Implemented as a vector DB keyed by embeddings:
def remember(text):
emb = embed(text)
vector_store.upsert(id=uuid(), embedding=emb, metadata={"text": text})
def recall(query, k=5):
return vector_store.search(embed(query), top_k=k)
Triggers for storage are typically:
- User states a fact about themselves.
- Agent learns a new tool or skill.
- Conversation crosses a summary boundary.
Episodic memory
For long conversations, agents periodically summarise older turns:
[System] [Summary of turns 1-50] [Verbatim turns 51-100] [Current user message]
Implementations include:
- MemGPT / Letta (Packer et al. 2023), virtual context paging à la OS virtual memory.
- Mem0, managed long-term memory service.
- Zep, temporal knowledge graphs of conversation entities.
Anthropic's Memory Tool (2025)
Claude 4.5+ ships a structured memory tool: a file-system-like store with create_file, read_file, update_file, delete_file operations. The agent decides what to write and when to read; persistence is across sessions. This effectively turns the file system into the long-term memory.
Compaction
When context fills up, compaction rewrites the history into a shorter summary. Claude Code, Codex, and Cursor all implement automatic compaction at ~80% context usage. Naïve compaction loses information; better systems keep recent verbatim plus a structured summary of older turns.
Open problems
- What to remember, agents store too much (noise) or too little (forgetting).
- Retrieval drift , embedding similarity ≠ relevance.
- Memory poisoning, an adversary plants false "memories" via prompt injection.
- Cross-session identity, should the assistant remember you between accounts? Privacy-vs-utility tradeoff.
Citation
Packer, C. et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.
Related terms: Vector Database, Embeddings APIs, Retrieval-Augmented Generation, Agentic RAG
Discussed in:
- Chapter 15: Modern AI, Modern AI