Cross-session context: the problem every agent framework gets wrong


Every agent framework in 2026 has the same blind spot. They nail the single-session experience — tool calling, reasoning, chain-of-thought — and then just… stop. Session ends, state evaporates, next session starts from zero.

I keep watching framework after framework ship increasingly sophisticated agent capabilities while punting on the most basic question: what does your agent remember tomorrow?

The three wrong answers

Most frameworks pick one of these, and all three break eventually.

Option 1: Nothing. The agent is stateless. Each session is independent. This is technically clean and practically useless for anything beyond one-shot tasks. Your agent can write code, search the web, manage files — but ask it about the conversation you had yesterday and you get a blank stare.

Option 2: Conversation history. Append every message to a growing log. Feed it back on startup. This works for about a week. Then your agent is spending 40% of its context window reading through old chat transcripts to find the one line where you said “I prefer tabs over spaces.” At $0.003 per 1K input tokens, that’s real money burning on irrelevant context.

Option 3: RAG. Build a full retrieval pipeline. Chunk documents, embed them, store in a vector database, write retrieval logic, tune chunk sizes, manage index updates. For what? So your agent can remember that you like TypeScript?

RAG solves a different problem. It’s for querying large document collections — codebases, knowledge bases, documentation. Using it for agent memory is like buying a forklift to move a chair.

What “memory” actually means for agents

Here’s what I think most framework authors miss: agent memory isn’t a data retrieval problem. It’s a context curation problem.

An agent doesn’t need to search through documents. It needs to recall that you corrected it about your deployment target three weeks ago. It needs to know that last Tuesday’s research session turned up a useful API endpoint. It needs your preferences, your project context, the mistakes it made and the lessons it learned.

This is a fundamentally different retrieval pattern from RAG:

  • Small, atomic pieces of information (not document chunks)
  • Weighted by importance (a correction matters more than a casual observation)
  • Scoped by project or context (backend memories shouldn’t pollute frontend work)
  • Queried by meaning, not keywords (“what does the user prefer for formatting?” should surface the tabs-vs-spaces correction)
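The four properties above can be sketched in a few lines. This is a toy model, not MemoClaw's implementation: the bag-of-words "embedding" is a stand-in for a real embedding model, and the names (`Memory`, `recall`) are illustrative, but the ranking logic — semantic similarity weighted by importance, filtered by namespace — is the pattern the list describes.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Memory:
    text: str                  # small, atomic piece of information
    importance: float          # 0.0-1.0; a correction scores high
    namespace: str = "default" # scope: backend memories stay out of frontend work
    tags: tuple = ()


def _vec(text: str) -> Counter:
    # Stand-in for a real embedding: bag-of-words token counts.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recall(store, query, namespace=None, top_k=3):
    # Query by meaning, not keywords: rank by similarity * importance,
    # scoped to a namespace when one is given.
    qv = _vec(query)
    candidates = [m for m in store if namespace is None or m.namespace == namespace]
    scored = [(_cosine(qv, _vec(m.text)) * m.importance, m) for m in candidates]
    return [m for s, m in sorted(scored, key=lambda p: p[0], reverse=True) if s > 0][:top_k]


store = [
    Memory("User prefers tabs over spaces for formatting", 0.9, "frontend"),
    Memory("Backend API uses Hono on Cloudflare Workers", 0.6, "backend"),
    Memory("User mentioned liking coffee", 0.2, "frontend"),
]

hits = recall(store, "what does the user prefer for formatting", namespace="frontend")
```

With a real embedding model the similarity scores would be far better, but the shape of the problem is the same: a handful of weighted, scoped records, not document chunks.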

The MEMORY.md plateau

OpenClaw users know this pattern well. You start with a MEMORY.md file. The agent reads it on startup, appends to it on shutdown. Simple, works great.

Then it grows: 50 entries, 100, 200. Somewhere around 200, you hit what I think of as the plateau: the file is big enough to waste context but not organized enough to be useful. Your agent loads 15,000 tokens of old notes every single turn, and maybe 3% of it is relevant to what you’re actually doing.

You try to fix it. Split into multiple files. Add headers and sections. Write cleanup scripts. Congratulations, you’re now a memory janitor. The agent is supposed to help you, not create a new maintenance burden.

The middle ground nobody talks about

Between “nothing” and “full RAG pipeline” there’s a simpler answer: a memory service that understands what things mean.

Store a memory: a short piece of text with an importance score, some tags, maybe a namespace. Recall memories: describe what you need in natural language, get back the relevant ones ranked by semantic similarity and importance.

That’s it. No chunking strategy. No index management. No retrieval pipeline to maintain.

With MemoClaw, this looks like:

# Store a correction (high importance)
memoclaw store "User prefers pnpm over npm for all projects" --importance 0.9 --tags "preferences,tooling"

# Store a project detail (moderate importance)
memoclaw store "Backend API uses Hono on Cloudflare Workers" --importance 0.6 --namespace backend --tags "architecture"

# Later, in a different session:
memoclaw recall "what package manager does the user prefer"
# → Returns the pnpm preference, scored by relevance

The agent’s session starts with a targeted recall instead of loading an entire file. It gets 3-5 relevant memories in maybe 200ms, using a fraction of the tokens. Everything else stays out of the context window until it’s actually needed.

Why frameworks keep getting this wrong

I think there are two reasons.

First, memory is boring infrastructure. Framework authors want to ship impressive demos — multi-step reasoning, tool orchestration, agent-to-agent communication. Memory doesn’t make for a good demo. It makes for a good product, which is a different thing entirely.

Second, memory is hard to abstract cleanly. Every user’s memory needs are slightly different. Some want simple key-value pairs. Some want full semantic search. Some need multi-agent shared memory. Rather than pick an abstraction and commit, most frameworks punt and say “bring your own persistence layer.”

Which is fine, honestly. Frameworks shouldn’t bundle memory. But someone needs to build the memory layer that frameworks connect to. That’s where MemoClaw fits — not as a framework feature, but as infrastructure that any framework can use.

What this looks like in practice

Here’s a real OpenClaw agent setup. The agent’s AGENTS.md tells it to use MemoClaw instead of file-based memory:

  • On session start: recall relevant memories for the current task
  • On session end: store important learnings, corrections, and context
  • Never load MEMORY.md — use semantic recall instead
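That protocol can be sketched as a thin session wrapper. Everything here is hypothetical scaffolding: `store_fn` and `recall_fn` stand in for whatever backend the agent actually uses (e.g. shelling out to the MemoClaw CLI), and are injected so the flow itself is clear.

```python
def run_session(task, agent_fn, store_fn, recall_fn):
    # On session start: pull only the memories relevant to this task,
    # instead of loading a whole MEMORY.md file.
    context = recall_fn(task)
    result, learnings = agent_fn(task, context)
    # On session end: persist learnings, corrections, and context
    # so the next session starts warm.
    for item in learnings:
        store_fn(item)
    return result


# Toy backend for illustration.
memories = ["User prefers pnpm over npm"]

result = run_session(
    task="set up package scripts",
    agent_fn=lambda task, ctx: (f"done with context: {ctx}", ["Project uses pnpm workspaces"]),
    store_fn=memories.append,
    recall_fn=lambda q: [m for m in memories if "package" in q],
)
```

The point of the shape: memory access happens at the session boundaries, so the agent's turns stay free of bookkeeping.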

Day one, the agent has nothing. Cold start. By day three, it knows your coding style, your project structure, your preferred tools. By day ten, it’s recalling decisions you made last week and applying them without being asked.

The difference between day one and day thirty isn’t more code or better prompts. It’s accumulated memory, recalled at the right time.

The cost question

People ask about this, so here’s the math. MemoClaw charges $0.005 per store or recall. If your agent stores 10 memories and recalls 5 per session, that’s $0.075 per session.

Compare that to loading a 30KB MEMORY.md every turn. At ~7,500 tokens and $0.003/1K tokens, that’s $0.0225 per turn. Over a 20-turn session, you’ve spent $0.45 just on reading the same file repeatedly — and most of that context was irrelevant.
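The arithmetic above, written out. Prices come from the post itself; the 4-characters-per-token ratio is a common rule of thumb, not an exact conversion.

```python
# MemoClaw side: per-operation pricing.
OP_PRICE = 0.005                                  # $ per store or recall
stores, recalls = 10, 5
memoclaw_session = (stores + recalls) * OP_PRICE  # 15 ops -> $0.075 per session

# File-loading side: re-reading a 30KB MEMORY.md every turn.
FILE_BYTES = 30_000
tokens = FILE_BYTES / 4                           # ~7,500 tokens (rule of thumb)
INPUT_PRICE = 0.003 / 1000                        # $ per input token
per_turn = tokens * INPUT_PRICE                   # $0.0225 per turn
file_session = per_turn * 20                      # $0.45 over a 20-turn session
```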

Selective recall is cheaper than brute-force loading. It’s also faster and produces better results because the agent sees only what matters.

Where this is going

The frameworks will catch up eventually. Some already are — LangGraph has checkpointers, CrewAI added memory, AutoGen has state management. But these are all framework-specific implementations. Your memories are locked to whatever framework you picked.

The bet I’m making: memory should be a service, not a framework feature. Same way you don’t build your own database, you shouldn’t build your own memory layer. Store memories through an API, recall them from anywhere. Switch frameworks without losing what your agent learned.

Your agent’s knowledge shouldn’t be coupled to its runtime. The memories persist. The framework is just the current way of using them.