claude-agentsApril 26, 2026

Building Multi-Turn Claude Agents with Memory: A Guide to Stateful Conversation Management

Learn how to architect Claude agents that remember conversations without burning through tokens. Covers memory patterns, summarization strategies, and hybrid storage approaches for production-ready stateful agents.

Why Most Claude Agents Forget Everything (And How to Fix That)

Here's something I learned the hard way: building a Claude agent that can actually remember what happened three exchanges ago is trickier than it sounds. You'd think it'd be straightforward, right? Just keep feeding the conversation history back into the model. But then your token costs spiral, your latency shoots up, and suddenly your "smart" agent can barely remember what the user said two minutes ago.

The problem? Context windows are large but not infinite. Claude 3.5 Sonnet gives you 200K tokens—sounds generous until you're managing a customer support conversation that's been going on for twenty minutes with attached documents, previous ticket references, and a detailed product catalog. Suddenly you're playing a high-stakes game of what to keep and what to forget.

What's interesting is that the solution isn't just technical. It's architectural. You need to think about memory the way humans do—some things go into short-term memory, others get summarized and filed away, and the truly important stuff? That gets indexed for quick retrieval.

The Three Memory Patterns That Actually Work

I've seen teams try dozens of approaches to stateful conversations. Most fail. Three patterns consistently work in production.

In-Memory State: Fast but Fragile

The simplest approach. Keep everything in RAM. For single-server deployments or short-lived conversations, this works beautifully. You maintain a dictionary or object that tracks conversation history, user preferences, and relevant context. Access is instantaneous—no database roundtrips, no serialization overhead.

But here's the thing: it doesn't scale. Your agent crashes? Gone. Load balancer routes the next request to a different server? Also gone. I mean, you could use sticky sessions, but that's sort of missing the point of distributed systems.

In my experience, in-memory state makes sense for prototyping or when you're building something like a research assistant where sessions are naturally bounded—maybe 30 minutes max. A Python dictionary tracking the last 10 exchanges plus a summary of earlier context gets you 80% of the way there with minimal complexity.

Vector Databases: When Context Gets Complicated

Now we're talking about real persistence. Vector databases like Pinecone, Weaviate, or Qdrant let you store conversation snippets as embeddings and retrieve semantically relevant context.

This is where things get interesting. Instead of maintaining a linear conversation history, you're building a searchable knowledge graph of everything that's been discussed. User mentions they're allergic to shellfish in message 3? That embedding sits there waiting to be retrieved when they ask about restaurant recommendations in message 47.

The pattern looks something like this: every exchange gets embedded and stored with metadata (timestamp, user ID, conversation ID, tags). When a new message comes in, you embed it, query for the top 5-10 most relevant previous exchanges, and inject those into Claude's context along with the immediate conversation history. You're basically giving Claude both short-term memory (recent messages) and long-term memory (semantically relevant past context).

That said, vector search isn't free. You're adding 50-150ms per request for the embedding generation and search. For a customer support bot where accuracy matters more than speed? Totally worth it. For a real-time coding assistant? Maybe reconsider.

Hybrid Approaches: The Production-Ready Solution

Here's what actually works at scale. You combine multiple storage strategies based on access patterns.

Recent conversation history—say the last 5-10 turns—lives in a fast key-value store like Redis. This gives you sub-millisecond access to immediate context. Older conversations get summarized (more on this in a minute) and the summaries go into your primary database—PostgreSQL, MongoDB, whatever you're already running. Semantically important information gets embedded and indexed in your vector database.

When a request comes in, you pull from all three: the last few turns from Redis, the conversation summary from your database, and relevant long-term context from vectors. This layered approach mirrors how human memory actually works, which turns out to be pretty efficient.

Conversation Summarization: Your Secret Weapon Against Token Bloat

Let's talk about the elephant in the room. Token costs.

A detailed customer support conversation can easily hit 10K tokens in the first fifteen minutes. Keep appending every exchange and you'll burn through your context window faster than you can say "budget overrun." Summarization changes the game entirely.

The basic technique? Every N turns (I usually start with 5-7), you ask Claude to summarize the conversation so far while preserving critical details. Key facts, user preferences, decisions made, action items—these get compressed into a dense paragraph or two. Then you drop the older detailed exchanges and keep only the summary plus recent context.

Here's a pattern I've found works well: maintain a rolling summary that gets updated, not replaced. Each summarization pass incorporates the previous summary plus new exchanges. This creates a hierarchical compression where very old context is highly compressed, recent context is lightly compressed, and immediate context is uncompressed. Think of it like a lossy compression algorithm that preserves what matters.

One gotcha—and this took me a while to figure out—is that you need to be explicit about what to preserve. If you're building a sales agent, things like budget constraints, decision-makers, timeline, and pain points cannot get lost in summarization. I literally include a system prompt that says "CRITICAL FACTS THAT MUST BE PRESERVED" with categories. Sounds paranoid, but it works.

Token Optimization Strategies Beyond Summarization

Summarization helps, but there's more you can do.

Selective Context Injection

Not every piece of historical context matters for every query. If the user asks "What's the weather like?" you probably don't need to inject the entire history of their product preferences. Obvious, right? But implementing this intelligently requires some thought.

I've had success with a simple classifier (can be another lightweight LLM call, or even Claude itself with a structured output) that categorizes incoming messages—factual question, continuation of previous topic, new topic, meta-conversation, etc. Based on the category, you adjust how much context to inject. New topic? Maybe just the summary. Continuation? Full recent history plus relevant vectors.

This dynamic context loading can cut your token usage by 40-60% without meaningfully impacting conversation quality. The key is being aggressive about dropping irrelevant context while paranoid about preserving relevant context.

Prompt Compression Techniques

Your system prompts probably contain a lot of redundancy. Do you really need to remind Claude of its role and constraints in every single turn? In practice, you can often frontload the detailed instructions in turn 1, then use abbreviated prompts for subsequent turns.

Some teams use prompt templating where you swap in shorter variants after the conversation is established. "You are a helpful customer support agent for Acme Corp. You should be friendly, professional, and solution-oriented..." becomes just "Continue as Acme support agent." Claude maintains the context from the earlier detailed instruction.

Does this always work? Not always. But for stateful conversations where the agent's role is established early, it's basically free token savings.

Real-World Scenario: Building a Customer Support Bot

Let me walk through a concrete example because abstract patterns only get you so far.

You're building a customer support agent for a SaaS company. Users come in with questions about billing, features, technical issues. Conversations can span days—user starts a chat, gets interrupted, comes back six hours later expecting the agent to remember everything.

Here's the architecture I'd recommend:

Session initialization: User starts a chat. You create a session ID and initialize a Redis entry with empty conversation history. You also load any existing customer data from your CRM—their plan, previous tickets, account age, etc. This goes into a structured context object.

Turn 1-5: Each exchange gets appended to Redis. You're sending Claude the full history plus the customer context. Total token count is reasonable—maybe 2K-3K tokens including your system prompt.

Turn 6: You hit your summarization threshold. Make a summarization call: "Summarize this conversation preserving all critical facts about the customer's issue, any solutions attempted, and next steps." Store this summary in PostgreSQL linked to the session ID. Clear the detailed history from Redis, keeping only the summary and last 2 turns.

Turn 7-10: You're now sending: customer context + conversation summary + last 2-3 turns. Token count stays controlled around 2.5K-3.5K.

Turn 11+: Another summarization pass. The new summary incorporates the old summary plus turns 6-10. You're maintaining a rolling compressed history that keeps growing slowly instead of linearly.

Vector enhancement: Simultaneously, every exchange gets embedded and stored in Pinecone with metadata (customer ID, session ID, timestamp, topic tags). When complex questions come in—"What did I ask about billing last week?"—you do a vector search scoped to that customer's history and inject relevant snippets.

This hybrid approach handles conversations that span hours or days, maintains context across server restarts, keeps token costs predictable, and provides both short-term and long-term memory. It's not simple, but it works.

Research Assistant Agents: A Different Memory Challenge

Customer support is one pattern. Research assistants are another beast entirely.

The core difference? Information density. A research conversation might involve analyzing multiple papers, extracting key findings, building connections across sources, and maintaining a growing knowledge base throughout the session. You're not just tracking what was said—you're building a dynamic knowledge graph.

Here's what I've learned building these: you need a document store separate from conversation memory. When the user uploads a paper or you retrieve articles, those get chunked, embedded, and indexed independently. The conversation memory references these documents but doesn't duplicate their content.

So your context assembly looks different. You maintain: (1) conversation summary showing the research thread, (2) last few exchanges, (3) vector search results from your document store based on the current query, and (4) any user-created annotations or highlights.

The summarization strategy also shifts. Instead of just summarizing dialogue, you're maintaining a running research summary—key findings, hypotheses being explored, connections discovered, open questions. This summary gets updated more frequently (every 3-4 turns maybe) because it's the backbone of the research process.

Token management becomes critical here because you're often hitting context limits. I've found that aggressive chunking plus citation-based retrieval works well. Instead of injecting full document sections, you inject only the most relevant paragraphs with citations. Claude can say "Based on Section 3.2 of Paper A..." without needing the entire paper in context. You're essentially building a RAG system but with conversation memory layered on top.

Sales Agents: Where Memory Directly Impacts Revenue

Sales agents are fascinating because memory quality directly correlates with conversion rates. Forget a prospect's budget constraint? You just wasted everyone's time. Fail to remember they mentioned a competitor? You lose credibility.

The memory architecture here needs to be paranoid about retention. Everything that could impact deal qualification gets tagged as critical and stored redundantly. I'm talking budget, timeline, decision-makers, pain points, competitors mentioned, objections raised, commitments made—all of this goes into both your vector database AND a structured relational database.

Why both? Because you need semantic search for conversation flow ("What did they say about implementation challenges?") but you also need reliable structured retrieval for qualification logic ("Is this deal qualified based on our BANT criteria?").

Here's a pattern that works: maintain a structured deal object that gets updated after every conversation turn. Use Claude with structured output to extract entities and facts from each exchange, then merge those into your deal object. Meanwhile, the conversation itself uses the hybrid memory approach—recent turns in Redis, summaries in your database, vectors for semantic retrieval.

Before each new turn, you inject both the conversation context AND the current state of the deal object. This gives Claude both the conversational flow and the hard facts needed to guide the sales process.

One more thing—and this is crucial—sales conversations often have long gaps. Prospect says they'll think about it, then comes back two weeks later. Your memory system needs to handle these dormant periods gracefully. When they return, you need to provide Claude with a concise recap: "Last conversation was two weeks ago. Here's what was discussed, where we left off, and what was supposed to happen." This re-orientation prompt makes the conversation feel continuous even across large time gaps.

Choosing Your Storage Strategy: A Decision Framework

So which approach do you actually use? Depends on your constraints.

Use in-memory state when: You're prototyping, sessions are short (under 30 minutes), you're running single-server, or you're building something that naturally resets (like a one-off analysis tool).

Use vector databases when: You need semantic search over conversation history, conversations are long and complex, you're building knowledge-intensive agents (research, expert advisors), or you need to find relevant context from weeks or months ago.

Use hybrid approaches when: You're building production systems, conversations span hours or days, you need both speed and persistence, or you're managing thousands of concurrent conversations.

Here's the thing nobody tells you: you can start simple and migrate. Begin with in-memory state to validate your agent works. Once you've got product-market fit, add Redis for persistence. When conversations get complex, layer in vector search. You don't need the full architecture on day one.

That said, if you're building something you know will scale—enterprise customer support, sales automation, anything handling real user data—just build the hybrid approach from the start. The incremental complexity is worth avoiding a painful migration later.

Implementation Gotchas I Wish Someone Had Told Me

Let me save you some debugging time.

Race Conditions in Concurrent Conversations

User sends two messages in quick succession. Both hit your API, both try to update conversation state, one overwrites the other. Suddenly Claude thinks a message never happened. Use optimistic locking or atomic operations for state updates. Redis WATCH/MULTI/EXEC works well for this.

Embedding Drift Over Time

You change embedding models six months in. All your historical vectors are now in a different vector space than new ones. Semantic search returns garbage. Either commit to an embedding model long-term or build in a migration strategy from day one. I've seen teams have to re-embed millions of conversation snippets. Not fun.

The Summarization Cascade Problem

You summarize summaries which get summarized again. After enough iterations, you've lost critical details through repeated compression. Mitigate this by periodically going back to raw conversation logs when critically important topics emerge, or maintain a parallel "critical facts" store that never gets summarized, only appended to.

Cost Monitoring Blindness

Vector operations and summarization calls add up fast. You think you're making one API call per user message, but actually you're making three—one for summarization, one for embedding, one for the actual response. Instrument everything and monitor your actual token consumption and API call patterns. I've seen costs triple because summarization was running more aggressively than expected.

Performance Benchmarks You Should Actually Care About

Let's talk numbers. Because "it works" isn't good enough when you're spending real money on tokens.

For a typical customer support conversation (15-20 turns, moderate complexity), here's what good looks like: average token count per turn should stay under 4K including system prompt and context. If you're hitting 8K-10K, your summarization isn't aggressive enough or you're injecting too much context.

Latency budgets matter too. In-memory state retrieval should be under 5ms. Redis roundtrip for recent history: 10-30ms. Vector search: 50-150ms depending on your database and index size. Full relational database query for conversation summaries: 50-200ms. If you're adding 500ms of overhead before Claude even starts processing, users will notice.

Here's something interesting: conversation quality often improves when you're slightly aggressive with summarization. Forcing the model to work with compressed context seems to make responses more focused. We ran A/B tests and found that users slightly preferred agents using summarization over agents with full history, even though theoretically the full history provides more information. Less noise, basically.

The Future: What's Coming in Stateful Agent Design

The landscape is shifting fast. Claude's context windows keep expanding—we went from 100K to 200K and there's no reason to think that's the limit. At some point, do we even need aggressive summarization? Maybe. Token costs scale with context size, so even with infinite windows, you're paying for what you include.

What I'm watching: native stateful APIs from Anthropic or other providers. Right now, we're building memory systems on top of stateless models. But there's no fundamental reason the providers couldn't offer built-in conversation state management. Upload a conversation ID, they maintain the history server-side, you just send new messages. Some smaller model providers are already experimenting with this.

Agentic workflows are also changing the game. When your Claude agent can orchestrate calls to other tools, memory becomes more complex. You're not just tracking conversation—you're tracking actions taken, data retrieved, workflows in progress. The memory systems I've described here are a foundation, but they'll need to evolve to handle multi-step agent processes.

Vector databases are getting faster and cheaper. What cost $500/month two years ago is now $50. This makes hybrid approaches more accessible to smaller teams. I expect we'll see vector storage become standard infrastructure, like Redis is today.

Building Your First Stateful Claude Agent: A Starting Point

If you're ready to build, here's where I'd start.

Pick a specific use case. Don't build a generic conversational agent—build a customer support bot for your product, or a research assistant for a specific domain. Concrete use cases clarify your memory requirements.

Start with the simplest architecture that could work. In-memory state for your first prototype. Get the conversation flow right before optimizing for scale. You need to understand what context actually matters before building elaborate storage systems.

Instrument from day one. Log every token count, every API call, every context assembly step. You'll need this data to optimize later, and retroactively adding instrumentation is painful.

Build summarization before you need it. Even if you're starting with in-memory state, implement the summarization logic early. It's easier to add when your conversations are simple than after they've gotten complex.

Test with real users quickly. Your intuition about what context matters is probably wrong. I know mine usually is. Real conversation patterns will surprise you and guide your architecture decisions.

Most importantly: remember that memory systems exist to make conversations better, not to showcase technical sophistication. If your elaborate vector database setup isn't measurably improving user experience, simplify. The goal is Claude agents that remember what matters and forget what doesn't—kind of like us, actually.

Frequently Asked Questions

How do I stop my Claude agent from losing context after a few exchanges?+

The key is using a hybrid memory approach instead of just keeping full conversation history. Keep recent exchanges (last 5-10 turns) in a fast key-value store like Redis, summarize older conversations and store them in your primary database, and use vector databases to index semantically important information. When a request comes in, pull from all three layers. This mirrors how human memory works and prevents context from getting lost while managing token costs.

What's the best way to handle summarization without losing critical details?+

Use a rolling summary approach where each summarization pass incorporates the previous summary plus new exchanges, creating hierarchical compression. More importantly, be explicit about what must be preserved by including a system prompt that specifies critical facts—like "CRITICAL FACTS THAT MUST BE PRESERVED" with categories relevant to your use case (budget constraints, decision-makers, timeline, etc.). This prevents important details from getting lost during compression.

Can I cut down token usage by being selective about what context I inject?+

Yes, you can reduce token usage by 40-60% using a simple classifier that categorizes incoming messages as factual questions, continuations, new topics, or meta-conversation. Based on the category, adjust how much context to inject—a new topic might need just the summary, while a continuation needs full recent history plus relevant vectors. The trick is being aggressive about dropping irrelevant context while staying paranoid about preserving relevant context.

Is in-memory state a good approach for keeping conversation history?+

In-memory state (keeping everything in RAM) is fast and has zero database overhead, making it great for prototyping or short sessions under 30 minutes. However, it doesn't scale—if your agent crashes or a load balancer routes requests to a different server, all context is lost. It works for bounded sessions like research assistants with naturally limited timeframes, but for production systems needing persistence, you need external storage.

How much does using a vector database add to request latency?+

Vector databases add about 50-150ms per request for embedding generation and search. For use cases where accuracy matters more than speed—like customer support bots—it's totally worth the tradeoff. For real-time systems like coding assistants, you might want to reconsider or use vector search more selectively.

What's different about memory management for a research assistant versus a customer support bot?+

Research assistants have much higher information density and need a document store separate from conversation memory. Instead of just tracking dialogue, you're building a dynamic knowledge graph. Your context assembly includes conversation summary, last few exchanges, vector search results from your document store, and user annotations. Summarization shifts from just dialogue to maintaining a running research summary with key findings, hypotheses, connections, and open questions—updated more frequently than customer support.

How should I structure a customer support bot to handle conversations that span multiple days?+

Use a hybrid architecture: store the customer context and data upfront, keep the last 5-10 exchanges in Redis for fast access, and at summarization points (every 5-7 turns), create a summary and store it in PostgreSQL while clearing old detailed history from Redis. Continue appending new exchanges, and create a new summary that incorporates the previous summary. Simultaneously embed all exchanges in a vector database like Pinecone scoped to that customer for handling complex historical queries. This keeps token costs predictable while maintaining both short and long-term memory.

Written by

Daniel S.

Business AI Specialist & Author

Daniel is an AI strategist and practitioner with 30+ years in IT, specialising in autonomous agents and end-to-end AI systems for small and medium-sized businesses. He writes on the practical application of AI — helping organisations automate intelligently, optimise performance, and adopt AI responsibly. Certified in Agile, ITIL, AWS, Security, and PMP.

← Back to Blog

// Stay in the loop

AI Agents, Weekly

New agents, tutorials, and automation ideas — straight to your inbox.

No spam. Unsubscribe any time.