Engineering

Why AI Agent Frameworks Ship Without Real Memory

I watched a production agent system contradict itself yesterday. The agent had assured a customer that their subscription renews on the 15th, but two weeks later—same user, different session—it confidently stated renewals happen on the 21st. Both statements came from the same training data, the same system prompt, the same LLM. What changed? Nothing in the model or the prompt. The agent had simply forgotten.

This is the hidden tax of today's agent frameworks. LangChain, CrewAI, AutoGen, OpenAI's SDK, Claude Agent SDK—they're engineered for orchestration. They excel at tool routing, state machines, and multi-step workflows. But they ship with memory that's barely a notch above a Slack channel keeping chat history. And that's not an oversight. It's a deliberate design choice that stems from how AI agent platforms are built, funded, and scoped.

I'll walk you through why this gap exists, what it means for your systems, and how teams are closing it in production. By the end, you'll understand the exact memory problems your agent framework can't solve and which ones matter most to your use case.

Why Frameworks Prioritize Orchestration Over Memory

Agent frameworks are primarily task orchestrators, not memory systems. Their core value proposition is routing, planning, and tool calling—translating user intent into structured API calls and managing multi-step workflows. Memory, by contrast, is a persistence and retrieval problem that sits orthogonal to task execution.

This separation matters architecturally. A framework that tries to own memory ends up overcomplicating its core loops. It becomes responsible for storage, indexing, temporal reasoning, and retrieval—each of which has hard engineering tradeoffs. Should memory be in-process or remote? Lossy or lossless? Per-session or cross-session? Per-user or per-agent? These decisions don't affect how well a framework routes a tool call, so framework authors punt them downstream.

There's also a business reality: frameworks compete on ease of use and quick wins. A developer can build a working agent in an afternoon. Getting memory right takes weeks. So frameworks ship a basic session-based store—enough to make demos work—and let users discover the pain only after they've shipped to production. By then, switching costs are high. You're already integrated; you'll try to retrofit memory on top.

How Major Frameworks Handle Memory Today

LangChain / LangGraph

LangChain's memory abstractions (ConversationSummaryMemory, ConversationBufferWindowMemory) are designed for single-session chatbots, not agent systems operating over weeks or months. The framework provides a straightforward abstraction: feed chat history into the prompt context at each turn. For retrieval-augmented generation (RAG), LangChain delegates to vector stores, which means you're responsible for ingesting documents, managing embeddings, and handling stale vectors yourself.
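
To see how little this buys you, here's what a window-buffer memory reduces to, sketched in plain Python rather than LangChain's actual classes: keep the last k turns, prepend them to every prompt, and silently drop everything older.

```python
# Plain-Python sketch of what a window-buffer memory reduces to
# (illustrative; not LangChain's actual implementation).
from collections import deque

class WindowBufferMemory:
    def __init__(self, k: int = 5):
        self.turns = deque(maxlen=k)  # (user_msg, ai_msg) pairs; older turns vanish

    def save(self, user_msg: str, ai_msg: str) -> None:
        self.turns.append((user_msg, ai_msg))

    def as_prompt_context(self) -> str:
        # The "memory" is just history re-stuffed into the next prompt.
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)

memory = WindowBufferMemory(k=2)
memory.save("My renewal is on the 15th.", "Got it, the 15th.")
memory.save("What plans do you offer?", "Basic and Pro.")
memory.save("How do I upgrade?", "From the billing page.")
print(memory.as_prompt_context())  # the renewal date has already scrolled out
```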

LangGraph—the newer state machine layer—is more powerful but still memory-agnostic. You can build stateful workflows, but the state itself is a JSON blob you manage. There's no native concept of "what should the agent remember about this user?" or "which facts changed since the last session?" If you need temporal reasoning (understanding that a fact was true yesterday but isn't today), you're writing that logic by hand.
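
If you do need timestamps and expiry, the bookkeeping is yours. A minimal sketch of that hand-rolled logic, where the fact schema and TTL policy are my assumptions, not anything LangGraph provides:

```python
# Hand-rolled temporal bookkeeping for LangGraph-style state.
# The fact schema and TTL policy are assumptions; LangGraph gives you
# the state container, not any of this logic.
from datetime import datetime, timedelta, timezone

def utcnow() -> datetime:
    return datetime.now(timezone.utc)

state = {
    "facts": [
        {"claim": "Q2 budget is $50k",
         "asserted_at": utcnow() - timedelta(days=120),
         "ttl_days": 90},  # a quarterly budget should not outlive the quarter
    ]
}

def valid_facts(state: dict) -> list[dict]:
    # Filter out anything past its time-to-live before it reaches the prompt.
    return [f for f in state["facts"]
            if utcnow() - f["asserted_at"] < timedelta(days=f["ttl_days"])]

print(valid_facts(state))  # [] -- the stale budget no longer surfaces
```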

CrewAI

CrewAI positions memory as part of its agent architecture, but the implementation amounts to context injection. Each agent gets a "memory module" that stores interaction history, then that history gets stuffed into the system prompt. For production systems processing hundreds of users, this approach hits hard limits: token budgets explode, retrieval becomes naive (you either include all history or none), and there's no way to distinguish signal from noise across long conversations.

CrewAI also doesn't solve the cross-agent memory problem. If you have three agents (analyst, planner, executor), none of them share a unified view of what the team has learned about the user. Each maintains its own memory, leading to redundant work and contradictory conclusions.

AutoGen / AG2

AutoGen's conversable agents framework focuses on agent-to-agent dialogue, not persistent memory. Agents can see the conversation thread within a session, but there's no built-in mechanism to query past sessions or learn from them. If you run the same agent tomorrow with the same user, it wakes up with zero context about what happened today.

The framework provides hooks for custom memory implementations (you can subclass and override), but this is a low-level escape hatch, not a first-class feature. Most teams end up bolting memory onto AutoGen through database calls embedded in tool definitions, which is flexible but fragile.
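
The bolt-on usually looks something like the sketch below: a local database wrapped in plain functions that get exposed as tools. The SQLite schema and function names are mine; AutoGen supplies only the registration hooks (register_for_llm / register_for_execution), not the storage.

```python
# The usual bolt-on: a local database wrapped in functions that get
# registered as tools. Schema and names are mine, not AutoGen's.
import sqlite3

db = sqlite3.connect("agent_memory.db")
db.execute("CREATE TABLE IF NOT EXISTS facts (user_id TEXT, fact TEXT)")

def remember(user_id: str, fact: str) -> str:
    """Persist a fact so tomorrow's session can see it."""
    db.execute("INSERT INTO facts VALUES (?, ?)", (user_id, fact))
    db.commit()
    return "stored"

def recall(user_id: str) -> list[str]:
    """Load everything previously stored about a user."""
    rows = db.execute("SELECT fact FROM facts WHERE user_id = ?", (user_id,))
    return [fact for (fact,) in rows.fetchall()]

# These get exposed to the agent via AutoGen's registration hooks. The
# fragile part isn't the hooks -- it's the schema, dedup, and sync logic,
# which are all yours.
```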

OpenAI Agents SDK

OpenAI's agent stack (the Agents SDK plus the Assistants API) has some built-in storage: the Assistants API maintains conversation history within a thread, and you can attach files to an assistant. But threads are session-scoped; they don't carry context across threads. If you want to remember something a user said in thread #1 and apply it in thread #2, you're persisting it in your own store and re-injecting the context manually.
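
In practice, that carry-over looks like the sketch below. The user_profiles store and both helpers are hypothetical stand-ins for your own persistence layer; nothing here is an OpenAI API.

```python
# Manual cross-thread carry-over. The user_profiles store and both helpers
# are hypothetical stand-ins for your own persistence layer.
user_profiles: dict[str, list[str]] = {}

def close_thread(user_id: str, learned_facts: list[str]) -> None:
    # After a session ends, persist what mattered -- the platform won't.
    user_profiles.setdefault(user_id, []).extend(learned_facts)

def open_thread(user_id: str) -> str:
    # Before the next session, re-inject it as synthetic opening context.
    facts = user_profiles.get(user_id, [])
    return "Known about this user:\n" + "\n".join(f"- {f}" for f in facts)

close_thread("u_42", ["subscription renews on the 15th"])
print(open_thread("u_42"))  # becomes the first message of thread #2
```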

Claude Agent SDK

Claude's Agent SDK prioritizes a clean tool-calling loop and a stateless architecture. Each invocation of the agent is independent; there's no built-in session management. Memory is entirely your responsibility—you provide the full context (including chat history and user profile) at each turn. This is actually a strength: it forces you to think clearly about what memory should be injected, rather than hiding memory problems behind framework abstractions. But it also means you own the complexity.
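
A minimal sketch of what that ownership looks like, assuming the real anthropic Python client but a hypothetical load_profile helper standing in for your storage layer (the model name is a placeholder; substitute your own):

```python
# Per-turn context assembly with the anthropic Python client.
# load_profile is a hypothetical stand-in for your storage layer.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def load_profile(user_id: str) -> str:
    # Hypothetical: fetch durable facts from wherever you persist them.
    return "Industry: automotive. Subscription renews on the 15th."

def run_turn(user_id: str, user_msg: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        system=f"Facts about this user:\n{load_profile(user_id)}",
        messages=[{"role": "user", "content": user_msg}],
    )
    return response.content[0].text
```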

Google ADK

Google's Agent Development Kit focuses on tool orchestration and safety guardrails. Like Claude's SDK, it's largely memory-agnostic; it excels at routing and execution safety but delegates persistence to you. The framework provides clear extension points for custom memory handlers, but no batteries-included solution.

The Five Memory Gaps Every Framework Shares

Every major agent framework shares five critical memory failures. Understanding these gaps will help you identify where your system will break first.

No Cross-Session Persistence. Most frameworks have no native way to store facts learned in one session and retrieve them in the next. If your agent learns that a user is in the automotive industry, and you want that fact available six months later without the user re-stating it, you're building custom persistence. Frameworks might offer session history storage, but that's not the same as extracting, summarizing, and indexing the facts within that history for future retrieval. See why AI agents forget users after one session for the real production impact.
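
The difference is easy to see in miniature. Below, extract_facts is a placeholder for an LLM-based extraction step; the point is that facts get indexed at session close, so later retrieval is a lookup rather than a transcript replay.

```python
# Transcript storage vs. fact extraction, in miniature. extract_facts is a
# placeholder for an LLM-based extraction step; the schema is an assumption.
from datetime import datetime, timezone

fact_index: dict[str, list[dict]] = {}  # user_id -> extracted, indexed facts

def extract_facts(transcript: str) -> list[str]:
    # Placeholder: in practice, an LLM call that pulls out durable claims.
    return ["user is in the automotive industry"]

def close_session(user_id: str, transcript: str) -> None:
    for claim in extract_facts(transcript):
        fact_index.setdefault(user_id, []).append(
            {"claim": claim, "learned_at": datetime.now(timezone.utc)}
        )

close_session("u_42", "...a long conversation...")
# Six months later: one lookup, no transcript replay, no re-asking the user.
print(fact_index["u_42"][0]["claim"])
```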

No Temporal Reasoning. When did the user say this? Is it still true? Facts decay. A user's budget for Q2 isn't their budget for Q4. An agent that doesn't understand time can't maintain accurate knowledge. Most frameworks store conversation history as a flat list; they don't tag assertions with timestamps, certainty levels, or expiration. Your agent can't reason "this was true two months ago but probably isn't now." See context windows are not memory for why stuffing old history into prompts actually makes agents worse at temporal reasoning.

No User-Level Personalization. Frameworks treat every user as a fresh start. If you run the same agent for 10,000 users, none of them benefit from what the agent learned about the other 9,999. In machine learning terms, you have zero transfer learning. In production terms, you're solving the same problem ten thousand times. Some frameworks offer "user scoping" for session history (each user gets their own session), but that's just isolation, not learning.

No Self-Improving Retrieval. Vector databases have become the default memory backend for agents. You embed documents, store them, and retrieve by semantic similarity. But this is a static retrieval strategy. The same query will always return the same documents, even if those documents were wrong or irrelevant last time. A real memory system should track which retrieved facts led to good decisions and up-weight them, or track failures and filter them out. Most framework-plus-vector-database combinations don't do this.
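
Here's one way feedback-aware retrieval could work, as a toy sketch (the scoring rule and update factors are illustrative assumptions, not any product's algorithm): score each memory by similarity times a learned weight, and move the weight with outcomes.

```python
# Toy feedback-aware retrieval: score = similarity * learned weight,
# and the weight moves with outcomes. Update factors are illustrative.
memories = [
    {"text": "prefers email over calls", "weight": 1.0},
    {"text": "renewal is on the 21st",   "weight": 1.0},  # later proven wrong
]

def retrieve(similarities: list[float], k: int = 1) -> list[dict]:
    scored = sorted(zip(memories, similarities),
                    key=lambda pair: pair[0]["weight"] * pair[1], reverse=True)
    return [m for m, _ in scored[:k]]

def feedback(memory: dict, helped: bool) -> None:
    # Multiplicative update: useful memories rise, misleading ones sink.
    memory["weight"] *= 1.2 if helped else 0.5

feedback(memories[1], helped=False)  # the "21st" answer caused a contradiction
print(retrieve([0.8, 0.9]))          # the weaker match now outranks the bad fact
```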

No Native Ingestion from Real-World Apps. Your users live in Gmail, Slack, Salesforce, calendars, CRMs, and spreadsheets. The agent should have access to continuous, current context from these systems. But frameworks have no standard way to say "keep this agent up-to-date with whatever changed in Salesforce since the last run." You're manually running batch jobs to sync data, which leaves context stale and pipelines brittle. See chat history is not enough for context for why historical conversation context is only a fraction of what an agent needs to remember.

What Production Memory Actually Requires

Real agent memory—the kind that ships to production and stays there—is fundamentally different from framework-included history storage. It has to solve five hard problems.

Unified retrieval across time and context. The agent needs to query facts learned months ago, but only those that are still valid. This requires temporal metadata (when was this learned? when does it expire?), relevance scoring (how confident are we in this fact?), and fast retrieval (sub-100ms lookups even at scale). A vector database alone can't do this. You need a layer that understands recency, decay, and retrieval quality.
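
One simple way to fold recency into retrieval is exponential decay, sketched below with an assumed 30-day half-life. A production ranking function needs more than this, but it shows why similarity alone isn't enough.

```python
# Folding recency into retrieval with exponential decay, assuming a
# 30-day half-life. A scoring sketch, not a production ranking function.
import math
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 30.0  # assumed: a fact loses half its weight per month

def score(similarity: float, learned_at: datetime) -> float:
    age_days = (datetime.now(timezone.utc) - learned_at).days
    return similarity * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

fresh = score(0.7, datetime.now(timezone.utc) - timedelta(days=2))
stale = score(0.9, datetime.now(timezone.utc) - timedelta(days=120))
print(fresh > stale)  # True: a weak but recent match beats a stale strong one
```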

Cross-session continuity. Every new user interaction should have access to everything the agent learned about that user across all prior sessions. This requires persistent storage tied to user identity, fast loading (you can't re-compute embeddings or summaries on every request), and versioning (you need to know what changed between sessions). Most agents today miss this entirely; they start fresh or rely on crude session logs.

Collaborative learning across agents. If your system has multiple agents (a planner, an executor, an auditor), they need to share a knowledge base about the user. When one agent learns a fact, the others should be able to retrieve it. This requires a shared, queryable knowledge graph, not siloed memory stores. Frameworks ship with agent-scoped memory, not team-scoped.
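
In miniature, team-scoped memory is just one store that every agent reads and writes, keyed by user. The sketch below assumes nothing from any framework.

```python
# Team-scoped memory in miniature: one store, many agents, keyed by user.
class SharedTeamMemory:
    def __init__(self):
        self._facts: dict[str, dict[str, str]] = {}  # user_id -> {key: value}

    def learn(self, user_id: str, key: str, value: str) -> None:
        self._facts.setdefault(user_id, {})[key] = value

    def lookup(self, user_id: str, key: str):
        return self._facts.get(user_id, {}).get(key)

team_memory = SharedTeamMemory()
team_memory.learn("u_42", "industry", "automotive")  # the analyst learns it...
print(team_memory.lookup("u_42", "industry"))        # ...the planner reuses it
```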

Adversarial consistency checking. An agent that contradicts itself across sessions erodes trust and breaks workflows. Production memory has to surface these contradictions and flag them for resolution. Most systems never even detect when they've said conflicting things to the same user.
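
Detection can start embarrassingly simple. A sketch: refuse to silently overwrite a stored fact with a conflicting value, and surface the conflict instead (the single-key model here is a deliberate simplification).

```python
# Minimal contradiction surfacing: refuse to silently overwrite a stored
# fact with a conflicting value. The single-key model is a simplification.
def assert_fact(store: dict, key: str, value: str) -> None:
    if key in store and store[key] != value:
        raise ValueError(
            f"Contradiction on {key!r}: stored {store[key]!r}, got {value!r}")
    store[key] = value

facts = {}
assert_fact(facts, "renewal_day", "15th")
try:
    assert_fact(facts, "renewal_day", "21st")
except ValueError as conflict:
    print(conflict)  # flagged for resolution instead of shipped to the user
```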

Integration with external truth sources. The agent's memory should be aware of what's in the CRM, the database, the API. When it retrieves a stored fact, it should cross-check against current state. If the database says the contract expired yesterday, but the agent's memory says it's still valid, that's a bug. The memory layer has to understand its scope: where truth comes from, when to trust storage versus live queries.
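
One pattern for that scoping decision, sketched with assumed names throughout: declare which fields the external system owns, and route reads accordingly.

```python
# Scope-aware reads: prefer the system of record for volatile fields and
# fall back to memory only for facts it alone can know. All names assumed.
VOLATILE_KEYS = {"contract_status", "plan_tier"}  # owned by the CRM/database

def read_fact(key: str, memory: dict, live_lookup):
    if key in VOLATILE_KEYS:
        return live_lookup(key)  # truth lives in the external system
    return memory.get(key)       # durable, memory-only knowledge

print(read_fact("contract_status",
                {"contract_status": "active"},  # stale memory
                lambda k: "expired"))           # live query wins: "expired"
```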

How Teams Are Filling the Gap

Across the industry, I'm seeing three patterns emerge.

Custom-built memory layers. Many teams at large companies simply build their own. Stripe has one. Anthropic has one. You'd be amazed how often the right answer for a mature system is a hand-rolled storage and retrieval layer tailored to your exact domain. The downside: this takes 3–6 engineers and 4+ months. You also won't ship a great temporal reasoning layer or user-level learning on the first try. You'll iterate into it. The upside: you own the semantics entirely. You can embed domain logic directly into retrieval.

Open-source memory frameworks. Projects like Mem0, Zep, and Letta are trying to standardize memory abstractions. Mem0 focuses on personal memory (facts about a user). Zep handles multi-agent memory and temporal reasoning. Letta is a full agent runtime with memory baked in from the start. These are young but improving fast. The advantage is you're not alone; you get community scrutiny and faster iteration. The disadvantage is you're betting on their design choices and accepting their data model.

Managed memory platforms. A newer category—platforms like HydraDB—treats memory as a first-class service. They handle the hard problems (temporal reasoning, consistency, retrieval quality) so you don't have to. You attach your agent framework to the memory platform and stop thinking about storage. The LongMemEval-s benchmark from ICLR 2025 measured this directly: 500 questions, each grounded in a conversation history of roughly 115K tokens. On temporal reasoning tasks, HydraDB achieved 90.97% accuracy versus 81.95% for open-source alternatives and 62.4% for baseline vector databases. Overall, HydraDB hit 90.79% versus 85.20% for alternatives and 60.2% for context-only baselines. The tradeoff: you're adopting another service, another integration point, and another vendor relationship. But for teams that have shipped agents and then discovered memory is the constraint, this gets you unblocked fast.

Each approach has its place. Custom is best if you have the team and timeline. Open-source is best if you want to own the code but want a head start. Managed platforms are best if memory is blocking your roadmap right now. For most teams shipping production agents in 2026, the pattern is: start with a framework, discover memory is the problem in month two of production, and then add a layer. The cost of that discovery is real, which is why I'm writing this now—so you can see it coming.

Frequently Asked Questions

Can LangChain maintain memory across sessions?

LangChain has session history and conversation summaries, but these are prompt-injection techniques, not true memory. A conversation buffer from six months ago gets mixed with yesterday's context, making retrieval noisy and blowing up token budgets. For cross-session memory done well, you need a separate layer that can reason about time, relevance, and decay. LangChain alone can't do it.

Which framework has the best built-in memory?

None of them. Every major framework—LangChain, CrewAI, AutoGen, OpenAI's SDK, Claude's SDK, Google's ADK—ships with basic session history at best. They're not trying to compete on memory; they're trying to be good at orchestration. The frameworks that try hardest to include memory (CrewAI) actually make the problem worse by hiding the complexity inside abstractions that don't scale. Better to accept the gap and fill it explicitly.

Do I need a separate memory layer if my context window is large?

Context windows are not memory. A 200K-token window lets you stuff a lot of history into prompts, but you're still solving the retrieval problem incorrectly. The agent has to read all of that context every request (slow), redundant facts dominate the window (wastes tokens), and temporal reasoning becomes guesswork. See context windows are not memory for the full argument. Even with a large window, you need smart retrieval.

How do memory platforms integrate with existing frameworks?

There are two patterns. The first: replace the framework's memory store with a memory platform API call. Instead of storing history in a list, you call memoryplatform.store() and memoryplatform.retrieve(). This works with LangChain (custom memory class), Claude SDK (custom retrieval in your prompt), and others. The second: use the memory platform as an agentic tool. The agent calls memory.recall() and memory.update() as part of its tool set, which is cleaner for multi-agent systems but slower (adds latency). Both work; the first is faster for single-agent, the second is cleaner architecturally.
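
Pattern one, sketched below with an entirely hypothetical client interface (substitute your platform's actual SDK): the framework-facing adapter keeps the same save/load shape an in-process buffer would have, but the data lives in the service.

```python
# Pattern one in miniature. InMemoryClient stands in for a hypothetical
# platform SDK; the adapter keeps the save/load shape of a local buffer.
class InMemoryClient:
    def __init__(self):
        self._items: dict[str, list[str]] = {}

    def store(self, user_id: str, text: str) -> None:
        self._items.setdefault(user_id, []).append(text)

    def retrieve(self, user_id: str, query: str, k: int = 5) -> list[str]:
        # Naive recency cut; a real service would rank against the query.
        return self._items.get(user_id, [])[-k:]

class PlatformBackedMemory:
    """Drop-in replacement for a framework's in-process history list."""
    def __init__(self, client: InMemoryClient, user_id: str):
        self.client, self.user_id = client, user_id

    def save(self, user_msg: str, ai_msg: str) -> None:
        self.client.store(self.user_id, f"Human: {user_msg}\nAI: {ai_msg}")

    def load_context(self, query: str) -> str:
        return "\n".join(self.client.retrieve(self.user_id, query))
```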

What's the difference between agent state and agent memory?

Agent state is transient: the current values of variables in this execution, the tools the agent has called so far, the plan it's building. State dies when the agent finishes. Memory is what persists: facts about the user, decisions the agent has made, outcomes from past runs. State is session-scoped; memory is user-scoped or system-scoped. Confusing the two is a common mistake. I've seen teams try to solve memory problems by adding more state variables. It doesn't work.

Conclusion

Agent frameworks are shipping without real memory because memory is orthogonal to their core value: orchestration. They're excellent at routing, planning, and tool calling. What they're not excellent at is knowing which facts matter three months from now, or understanding that a user's context changed since yesterday, or reasoning about time at all.

The gap is real and it will bite you. Your agent will contradict itself. It will re-learn the same facts for each user. It will fetch irrelevant context and miss crucial facts. You'll discover this in production, not before. And then you'll add memory—either custom, open-source, or managed.

Start now. Read stateful AI agents to understand what proper memory architecture looks like. Choose your framework for orchestration alone, not memory. Design your memory strategy in parallel: will you build, adopt open-source, or integrate a managed platform? The sooner you make that choice, the fewer contradictions your agent will commit before you ship.
