Memory for AI Agents
Why Every Orchestration Platform Is Racing to Solve the Same Problem
Different contexts. Same question.
“How can this agent remember anything?”
Over the last six months, I’ve watched the same architectural question surface in every AI project I’ve been close to. Startups building their first agent workflows. Scaleups wrapping compliance around existing automation. Large enterprises deploying AI across regulated environments.
Not remember within a single chat. That’s just a context window. I mean remember across sessions, across days, across hundreds of runs — while staying coherent, efficient, and (increasingly) compliant.
This article explains why agent memory became one of the hottest enterprise infrastructure problems of early 2026.
The Problem That Context Windows Don’t Solve
Every LLM is stateless. GPT-4, Claude, Gemini — they all start each API call with a blank slate. The context window creates an illusion of memory, but it’s really just a very large input buffer.
That works for chatbots. It breaks for agents.
The moment you build an agent that runs repeatedly — a sales analyst that processes reports daily, a support bot that handles tickets across shifts, a compliance monitor that learns from policy violations — you hit the wall.
Context windows reset. Knowledge is lost. The agent makes the same mistakes it made last week. It asks the user the same questions. It doesn’t learn.
Bigger context windows don’t fix this. Models with 128K or even 1M token windows still reset between API calls. Even within a single call, performance degrades over long contexts — models lose track of details buried in the middle. And stuffing every prior interaction into the prompt gets expensive fast.
What you actually need is a memory system: infrastructure that decides what to store, how to retrieve it, when to update it, and when to forget.
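That contract can be written down as an interface. A minimal sketch (the names are illustrative, not any particular product's API):

```python
from typing import Protocol


class MemorySystem(Protocol):
    """The four decisions a memory layer must make (illustrative interface)."""

    def store(self, observation: str) -> None:
        """Decide what, if anything, from this observation is worth keeping."""
        ...

    def retrieve(self, query: str, limit: int = 5) -> list[str]:
        """Decide which stored memories are relevant to this query."""
        ...

    def update(self, memory_id: str, new_fact: str) -> None:
        """Decide when an existing memory should be revised."""
        ...

    def forget(self, memory_id: str) -> None:
        """Decide when a memory should be removed or invalidated."""
        ...
```

Every product discussed below is, at heart, an opinionated implementation of these four methods.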
This is where things get interesting.
Three Types of Memory (That Actually Matter)
The academic literature — particularly the survey “Memory in the Age of AI Agents” (December 2025) and the CoALA framework — converges on three types of long-term memory that production agents need. This isn’t just taxonomy. Each type has different storage patterns, retrieval strategies, and lifecycle rules.
Semantic memory stores what the agent knows — facts, preferences, constraints. “The user prefers Python.” “Our fiscal year starts in April.” “The compliance threshold is €100K.” These are stable facts that hold across sessions and should be updated when they change, not duplicated.
Episodic memory stores what happened — specific interactions, outcomes, decisions. “On February 15, the policy engine denied SQL access because the query contained PII.” “Last Thursday’s report cost €0.42 and used Claude Sonnet.” These are events. They accumulate. They provide context for pattern recognition.
Procedural memory stores how to do things — learned behaviors, workflows, response patterns. “When the user asks for a financial summary, always check PII classification first.” “Use the compact format for Slack responses.” These are rare but powerful — they represent the agent actually improving its own behavior.
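One way to encode the taxonomy in a schema (a sketch, not any product's actual data model):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryType(Enum):
    SEMANTIC = "semantic"      # stable facts and preferences
    EPISODIC = "episodic"      # specific events and outcomes
    PROCEDURAL = "procedural"  # learned behaviors and workflows


@dataclass
class MemoryEntry:
    type: MemoryType
    content: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Semantic: updated in place when the fact changes
pref = MemoryEntry(MemoryType.SEMANTIC, "The user prefers Python.")
# Episodic: append-only; events accumulate over time
event = MemoryEntry(MemoryType.EPISODIC, "2026-02-15: policy engine denied SQL access (PII).")
# Procedural: a rule the agent learned about its own behavior
rule = MemoryEntry(MemoryType.PROCEDURAL, "Check PII classification before financial summaries.")
```

The lifecycle rules differ per type: semantic entries get overwritten, episodic entries accumulate, procedural entries get promoted into prompts or workflows.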
Most memory systems on the market implement some version of this taxonomy, whether they call it that or not. The real differentiation is in what happens after storage: consolidation, retrieval, and lifecycle management.
The Core Pipeline: Extract → Consolidate → Store → Retrieve
If you just append every interaction to a database and search it later, you’ve built a log, not a memory. Logs grow without bound, contain duplicates, hold contradictory facts, and become increasingly expensive to query.
Production memory systems follow a pipeline pattern that mirrors (loosely) how human memory works.
Extraction takes raw conversation data and distills it into structured memory units — facts, observations, preferences. Instead of storing “the user said they switched from JavaScript to Rust last month,” you extract the fact: {preference: "Rust", negated: "JavaScript", timestamp: "2026-02"}.
Consolidation is where the real engineering happens. New facts are compared against existing memories. Duplicates are detected. Contradictions are resolved. Stale entries are invalidated. Without consolidation, your memory fills with noise. With it, storage drops by roughly 60% and retrieval precision improves by over 20%, according to Mem0’s benchmarks on LOCOMO.
Storage persists the processed memories — typically in a vector database for semantic search, sometimes augmented with a graph database for relational reasoning. The choice of backend shapes what kinds of queries you can answer efficiently.
Retrieval fetches relevant memories at query time and injects them into the agent’s prompt. The sophistication here ranges from simple keyword matching to composite scoring that weighs relevance, recency, memory type, and trust.
Every serious memory product implements some version of this pipeline. The differences are in the details.
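Strung together, the pipeline looks roughly like this. All three stages are stubbed so the sketch runs standalone; real systems put an LLM behind extract and consolidate, and vector similarity behind retrieve:

```python
def extract(raw_turn: str) -> list[str]:
    # Stub: real systems prompt an LLM to distill structured facts from the turn.
    return [raw_turn.strip()]


def consolidate(fact: str, store: list[str]) -> list[str]:
    # Stub: exact-match dedup; real systems also resolve contradictions
    # and invalidate stale entries.
    return store if fact in store else store + [fact]


def retrieve(query: str, store: list[str], limit: int = 3) -> list[str]:
    # Stub: keyword overlap; real systems use embeddings plus composite scoring.
    overlap = lambda m: len(set(query.lower().split()) & set(m.lower().split()))
    return sorted(store, key=overlap, reverse=True)[:limit]


store: list[str] = []
for turn in ["User prefers Rust", "User prefers Rust", "Fiscal year starts in April"]:
    for fact in extract(turn):
        store = consolidate(fact, store)
# store now holds 2 entries: the duplicate was collapsed by consolidation
```

Even this toy version shows the point of consolidation: the log had three turns, the memory has two facts.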
The AUDN Cycle: How Mem0 Handles Consolidation
Mem0 — arguably the clearest “memory as a product” offering — popularised what I’ll call the AUDN cycle: Add, Update, Delete, Noop.
For each candidate fact extracted from a conversation, Mem0 retrieves the top-S most similar existing memories using vector similarity. It then presents both the new fact and the existing memories to an LLM through a tool-calling interface. The LLM decides:
Add: genuinely new information, store it
Update: augments an existing memory with more detail
Delete: contradicts an existing memory, remove the old one
Noop: already captured, skip
This is elegant because it offloads conflict resolution to the LLM itself. The model decides whether “prefers Python” should be overwritten by “switched to Rust” or whether both should coexist. No hand-crafted rules needed.
The results are strong. On the LOCOMO benchmark, Mem0 delivers a 26% accuracy uplift over OpenAI’s built-in memory, 91% lower p95 latency compared to full-context baselines, and 90% token cost savings. The graph-enhanced variant (Mem0g) adds entity-relationship extraction for multi-hop reasoning — “what decisions led to this outcome?” becomes answerable.
The limitation is that Mem0 deletes contradicted facts. Once overwritten, the old memory is gone. For a personal assistant, that’s fine. For a regulated enterprise environment, it’s a problem — auditors want to see what the system believed at a given point in time, not just what it believes now.
Temporal Knowledge Graphs: How Zep Thinks About Time
Zep, and its open-source engine Graphiti, take a fundamentally different approach. Where Mem0 is optimised for fast, flat fact retrieval, Zep builds a temporal knowledge graph with bi-temporal semantics.
Every fact in Zep has four timestamps: when the event occurred, when it became invalid, when the system first learned about it, and when the system stopped considering it current.
This dual timeline — event time and ingestion time — enables queries that no flat memory store can handle. “What did the agent know as of last Tuesday?” “Show me how this relationship evolved over the past month.” “When did we first learn that the customer changed their billing address?”
When new information contradicts existing facts, Zep doesn’t delete the old edge. It invalidates it — setting an invalid_at timestamp and preserving the full history. The knowledge graph grows richer over time, not just larger.
On benchmarks, Zep achieves up to 18.5% accuracy improvement on LongMemEval (which tests cross-session reasoning and temporal tasks) and 90% latency reduction compared to baselines. It particularly excels at multi-hop reasoning — connecting facts across multiple sessions and time periods.
The trade-off is complexity. Zep requires a graph database (Neo4j or FalkorDB), embedding infrastructure, and more operational overhead than a simple key-value memory store. The open-source Graphiti framework makes this accessible, but it’s still a heavier commitment than Mem0’s three-line API.
LangMem: Memory as a Library
LangChain’s LangMem takes a more modular approach. Rather than being a standalone memory product, it provides composable primitives — create_memory_manager, create_search_memory_tool, create_prompt_optimizer — that plug into LangGraph’s agent framework.
LangMem separates memory into profiles (structured schemas updated in-place, like user preferences) and collections (unbounded document sets searched semantically). It supports background consolidation through a memory manager that extracts, deduplicates, and updates memories asynchronously.
The key design choice is storage agnosticism. LangMem doesn’t mandate a specific backend — it works with any store that supports save and semantic search. MongoDB, Postgres with pgvector, in-memory stores, or custom implementations all work.
For teams already invested in the LangChain ecosystem, LangMem is the path of least resistance. The trade-off is that you’re assembling pieces rather than getting a turnkey system. There’s no built-in temporal model, no graph-based reasoning, and conflict resolution is delegated to the LLM without the structured AUDN pipeline that Mem0 provides.
Letta (MemGPT): Memory as Agent State
Letta — the production evolution of the MemGPT research project — treats memory as a first-class component of the agent’s state. Agents have explicit core memory blocks (always injected into the prompt — persona, goals, preferences) and archival memory (out-of-context storage searched on demand).
The distinctive feature is that agents can explicitly write, update, and delete their own memory blocks through tool calls. Memory isn’t something that happens to the agent — it’s something the agent actively manages.
This makes Letta particularly well-suited for persistent assistants and local-LLM deployments (it works well with Ollama and vLLM). The agent maintains identity and continuity across restarts, which is critical for long-lived worker agents.
The limitation is that Letta’s memory model is agent-centric. It doesn’t natively handle cross-agent memory sharing, multi-tenant isolation, or compliance-grade audit trails. For single-agent personal assistant scenarios, it’s excellent. For enterprise multi-agent deployments, you’ll need to build additional infrastructure.
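The pattern in miniature: expose the agent's own memory blocks as tools the model can call (a sketch of the idea, not Letta's actual API):

```python
class CoreMemory:
    """Memory blocks always injected into the prompt; the agent edits them via tools."""

    def __init__(self) -> None:
        self.blocks: dict[str, str] = {"persona": "", "human": ""}

    def memory_append(self, block: str, text: str) -> str:
        """Tool: add a new observation to a memory block."""
        self.blocks[block] = (self.blocks[block] + "\n" + text).strip()
        return f"Appended to {block}."

    def memory_replace(self, block: str, old: str, new: str) -> str:
        """Tool: revise part of a memory block in place."""
        self.blocks[block] = self.blocks[block].replace(old, new)
        return f"Updated {block}."


core = CoreMemory()
# In a real deployment these calls are issued by the model as tool calls,
# not by application code:
core.memory_append("human", "Name: Alice. Prefers concise answers.")
core.memory_replace("human", "concise", "detailed")
```

The design choice is the inversion of control: the pipeline approaches above manage memory *for* the agent, while this style hands the agent the pen.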
The Gap Nobody Is Filling: Governed Memory
Here’s what struck me as I surveyed the landscape.
Every product optimises for recall quality. Better accuracy. Lower latency. Fewer tokens. Richer reasoning. Those metrics matter. But they’re all answering the same question: “How well does the agent remember?”
Nobody is answering: “Is it safe for the agent to remember this?”
Consider what happens when an AI agent processes a customer support ticket containing a credit card number, a medical diagnosis, or an employee’s home address. The agent learns from it — stores an observation, updates its memory, maybe adjusts its behavior. But should it?
Under GDPR Article 25, that’s a data protection by design question. Under the EU AI Act (full enforcement August 2026), high-risk AI systems need technical documentation of every decision. Under NIS2, incident response requires reconstructing what the system knew at the time of a breach.
None of the major memory products address this. Mem0 has no PII detection on memory writes. Zep has no policy enforcement layer. LangMem delegates governance entirely to the application developer. Letta stores whatever the agent decides to store.
This isn’t a criticism of those products — they’re solving a different problem. But for European enterprises deploying AI agents in regulated environments, the gap is real.
Dativo Talon
With Dativo Talon, the open-source, compliance-first AI orchestration platform I am currently building, I started from the governance side and worked toward recall quality, rather than the other way around.
Talon’s memory architecture wraps a full governance pipeline around every write operation. Before anything hits the memory database, it passes through PII scanning (25+ EU-specific patterns covering all 27 member states), OPA policy evaluation, category validation against allow/forbid lists, policy override detection, conflict checking, and provenance tracking with trust scores. Every write — and every governance decision — generates an HMAC-signed evidence record.
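The shape of a governed write, reduced to its skeleton. This is illustrative only: a single regex stands in for the full PII scanner, a decision string for the OPA evaluation, and the signing key is inlined rather than coming from a secrets manager:

```python
import hashlib
import hmac
import json
import re
from datetime import datetime, timezone

SECRET = b"evidence-signing-key"  # illustration only; use a KMS in production
PII_PATTERNS = [re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")]  # IBAN-like
ALLOWED_CATEGORIES = {"semantic", "episodic", "procedural"}


def governed_write(content: str, category: str, db: list[dict]) -> dict:
    decision = "allow"
    if any(p.search(content) for p in PII_PATTERNS):
        decision = "deny:pii"
    elif category not in ALLOWED_CATEGORIES:
        decision = "deny:category"

    # Every write AND every denial produces a signed evidence record
    evidence = {"content_hash": hashlib.sha256(content.encode()).hexdigest(),
                "decision": decision,
                "at": datetime.now(timezone.utc).isoformat()}
    evidence["signature"] = hmac.new(
        SECRET, json.dumps(evidence, sort_keys=True).encode(), hashlib.sha256
    ).hexdigest()

    if decision == "allow":
        db.append({"content": content, "category": category})
    return evidence


db: list[dict] = []
ok = governed_write("Fiscal year starts in April", "semantic", db)
blocked = governed_write("Customer IBAN: DE44500105175407324931", "semantic", db)
```

Note that the denied write still yields a signed record: the audit trail covers what the agent was *prevented* from remembering, not just what it stored.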
The storage layer uses SQLite with FTS5 for full-text search, progressive disclosure (lightweight index entries for prompt injection, full detail on demand), and AI-compressed observations that reduce raw agent runs from thousands of tokens down to roughly 500-token structured summaries. (N.B. I love SQLite and I truly believe it is an excellent solution without unnecessary overhead.)
What I am working on right now — and what motivated this article — is the consolidation layer. I am working on a governed AUDN cycle inspired by Mem0’s approach, but with a critical difference: invalidated entries are preserved (Zep-style temporal invalidation), not deleted. Every consolidation decision — add, update, invalidate, noop — is governed and audited. We’re adding bi-temporal queries so any auditor can reconstruct what the agent knew at any point in time.
We’re also moving from flat timestamp retrieval to relevance-scored retrieval that weighs keyword relevance, recency, memory type (semantic/episodic/procedural), and trust score — matching the retrieval sophistication of Mem0 while preserving the audit trail.
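A sketch of what such a composite score can look like. The weights, the half-life, and the blend are invented for illustration; the point is that ranking becomes a product of independent signals rather than a single timestamp sort:

```python
import math

# Invented weights: procedural rules outrank raw episodic events
TYPE_WEIGHT = {"procedural": 1.2, "semantic": 1.0, "episodic": 0.8}


def composite_score(keyword_relevance: float,  # 0..1, e.g. normalised BM25
                    age_days: float,
                    memory_type: str,
                    trust: float,              # 0..1 provenance trust score
                    half_life_days: float = 30.0) -> float:
    # Recency decays exponentially, halving every `half_life_days`
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    # Blend so old memories are discounted, never zeroed out entirely
    return keyword_relevance * TYPE_WEIGHT[memory_type] * trust * (0.5 + 0.5 * recency)


# A fresh, trusted procedural memory outranks a stale, low-trust episodic one
fresh = composite_score(0.8, age_days=1, memory_type="procedural", trust=0.9)
stale = composite_score(0.8, age_days=180, memory_type="episodic", trust=0.5)
```

Keeping the trust factor multiplicative means a low-provenance memory can never win on relevance alone, which is exactly the property an audited system wants.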
The goal is to build a memory system where an agent’s learning is both high-quality and compliance-grade. One where governed memory is a compliance asset, not just a developer convenience.
When You Don’t Need Persistent Memory
To be clear: memory isn’t always the answer.
If your agent handles single-turn queries — “translate this,” “summarise that document,” “generate a report” — the context window is sufficient. Adding a memory layer introduces complexity, latency, and storage costs with minimal benefit.
If your agent runs the same static workflow every time (process this CSV, send this email), procedural memory might help but semantic and episodic memory probably won’t.
If your data is already in a well-structured knowledge base, retrieval-augmented generation (RAG) over that knowledge base is likely a better fit than agent memory. Memory shines when the knowledge comes from the interactions themselves — not from pre-existing documents.
Memory pays off when agents run repeatedly, learn from outcomes, serve multiple users, or operate in environments where context evolves over time.
Final Thought
Agent memory is having its “Iceberg moment.”
Just as Iceberg standardised the table format for data lakes — separating storage from compute and making data engine-independent — the memory layer is becoming the standard infrastructure for making AI agents stateful, efficient, and persistent.
The products differ in approach. Mem0 optimises for speed and simplicity. Zep optimises for temporal reasoning and relational depth. LangMem optimises for composability. Letta optimises for agent autonomy.
But the underlying pattern is converging: extract, consolidate, store, retrieve. With scoring. With lifecycle management. With conflict resolution.
What’s still missing — and what I believe will matter enormously as AI regulation matures in Europe and worldwide — is governance around that pipeline. Not as an afterthought. Not as a compliance checkbox. But as a first-class architectural concern, where every memory write is scanned, evaluated, and signed.
If you’re building AI agents and thinking about memory architecture, I’d love to hear how you’re approaching it. Drop a comment or reach out — this is a fast-moving space and I’m learning from every conversation.
(1) Dativo Talon is open-source and available on GitHub.