Memory-aware error recovery: teaching agents to learn from failures
Your agent hits an error. It retries. Same error. It retries again. Same error. Eventually it gives up and asks you for help. You tell it the fix. Tomorrow it hits the same error and the whole cycle repeats.
This is what stateless error handling looks like. The agent has no memory of past failures, so every error is a first encounter. It can’t learn from experience because it doesn’t have experience — it has a fresh context window.
MemoClaw fixes this by giving your agent a searchable history of what went wrong and what worked. When an error shows up, the agent checks its memory before retrying blindly.
The error memory pattern
The basic idea: when your agent encounters and resolves an error, it stores a structured memory of the incident. When a similar error appears later, it recalls that memory and applies the known fix.
Here’s what an error memory looks like:
memoclaw store "ECONNREFUSED on port 5432: PostgreSQL wasn't running after system restart. Fix: sudo systemctl start postgresql. Root cause: postgresql service not enabled for auto-start. Permanent fix: sudo systemctl enable postgresql." \
--importance 0.85 \
--tags error,postgres,connection,infrastructure \
--memory-type correction \
--namespace errors
A few design choices here:
Memory type: correction. Corrections have a 180-day decay half-life in MemoClaw. Error knowledge stays relevant for a long time. A fix that worked in January probably still works in June.
Dedicated namespace. Keeping errors in their own namespace means recall queries don’t have to compete with general project context. When your agent searches for “connection refused postgres”, it gets error resolutions, not architecture decisions.
Structured content. The memory includes the error, the immediate fix, the root cause, and the permanent fix. Your agent can decide which level of fix to apply depending on context.
High importance. Error resolutions are high-value knowledge. Setting importance to 0.8+ ensures they surface above lower-confidence memories.
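To make the structure concrete, here's a sketch of the fields an error memory carries, modeled as a Python dataclass. The class and field names are illustrative (not a MemoClaw type); MemoClaw stores the flattened string shown above.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorMemory:
    """Illustrative shape of a structured error memory (not a MemoClaw type)."""
    error: str            # the observable symptom
    immediate_fix: str    # gets you unblocked now
    root_cause: str       # why it happened
    permanent_fix: str    # prevents recurrence
    importance: float = 0.85
    tags: list = field(default_factory=list)

    def to_content(self) -> str:
        # Flatten into the single string passed to `memoclaw store`
        return (f"{self.error}. Fix: {self.immediate_fix}. "
                f"Root cause: {self.root_cause}. "
                f"Permanent fix: {self.permanent_fix}.")

mem = ErrorMemory(
    error="ECONNREFUSED on port 5432: PostgreSQL wasn't running after system restart",
    immediate_fix="sudo systemctl start postgresql",
    root_cause="postgresql service not enabled for auto-start",
    permanent_fix="sudo systemctl enable postgresql",
    tags=["error", "postgres", "connection", "infrastructure"],
)
print(mem.to_content())
```

Keeping all four levels in one memory means a single recall gives the agent the choice between the quick unblock and the permanent fix.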
The recovery workflow
When your OpenClaw agent encounters an error, the workflow looks like this:
Recall before retrying.
# Agent encounters: ECONNREFUSED 127.0.0.1:5432
memoclaw recall "ECONNREFUSED port 5432 connection refused" \
--namespace errors \
--limit 3
If there’s a matching memory, the agent gets back the previous resolution. It can try the known fix instead of guessing.
Apply the fix. The agent reads the stored resolution and applies it. If the memory says “postgresql service not running, start it with systemctl,” the agent does that.
Store new knowledge if needed. If the error is new (no recall matches), the agent works through it the old way — retrying, debugging, maybe asking for help. Once resolved, it stores the resolution:
memoclaw store "npm install fails with ERESOLVE: peer dependency conflict between react 18 and react-dom 17. Fix: update react-dom to 18. If you can't update, use --legacy-peer-deps but document why." \
--importance 0.8 \
--tags error,npm,dependencies \
--memory-type correction \
--namespace errors
If an existing fix didn’t work (the error matched a memory but the stored fix failed), the agent updates or stores a new memory with the corrected resolution:
memoclaw store "ECONNREFUSED 5432 — previous fix was 'start postgresql' but actual issue was pg_hba.conf rejecting local connections after config change. Fix: check pg_hba.conf for correct local auth method." \
--importance 0.9 \
--tags error,postgres,connection,config \
--memory-type correction \
--namespace errors
The corrected memory gets higher importance than the original, so it surfaces first next time.
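Put together, the recall → apply → store loop can be sketched in Python. The in-memory list and keyword matching below are crude stand-ins for `memoclaw recall` and `memoclaw store` (real recall is semantic), and `handle_error`, `try_fix`, and `debug_from_scratch` are hypothetical names:

```python
# Stand-in for the errors namespace; each entry mirrors a stored resolution.
memories = [
    {"content": "ECONNREFUSED 5432: start postgresql via systemctl",
     "importance": 0.85},
]

def recall(query, limit=3):
    """Return memories sharing keywords with the query, highest importance first."""
    words = set(query.lower().split())
    hits = [m for m in memories if words & set(m["content"].lower().split())]
    return sorted(hits, key=lambda m: m["importance"], reverse=True)[:limit]

def store(content, importance):
    memories.append({"content": content, "importance": importance})

def handle_error(error, try_fix, debug_from_scratch):
    # 1. Recall before retrying: check for a known resolution.
    for mem in recall(error):
        if try_fix(mem["content"]):          # 2. Apply the known fix
            return mem["content"]
    # 3. New error (or stale fix): debug the hard way, then store the result
    #    at higher importance so it outranks any stale memory next time.
    fix = debug_from_scratch(error)
    store(f"{error}. Fix: {fix}", importance=0.9)
    return fix
```

A known error short-circuits to the stored fix; an unknown one falls through to debugging and leaves a new memory behind for next time.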
Importance scoring for error severity
Not all errors are equal. A typo in a config file is annoying. A production database connection failure is urgent. Use importance scoring to reflect this.
# Critical — production impact
memoclaw store "Production API returning 502: nginx upstream timeout. Root cause: API server OOM killed. Fix: increase memory limit in docker-compose, add memory monitoring alert." \
--importance 0.95 --tags error,production,nginx,oom \
--memory-type correction --namespace errors --pinned true
# Moderate — dev workflow friction
memoclaw store "TypeScript build fails after upgrading to 5.4: new strictNullChecks behavior on optional chaining. Fix: update affected type guards." \
--importance 0.7 --tags error,typescript,build \
--memory-type correction --namespace errors
# Low — cosmetic or one-off
memoclaw store "ESLint warning about unused import in test file. Suppressed with eslint-disable-next-line." \
--importance 0.3 --tags error,eslint,minor \
--memory-type observation --namespace errors
Notice the low-severity one uses the observation memory type (14-day decay) instead of correction. Minor issues fade naturally. The production outage is pinned: you never want to forget how to fix that.
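The decay mechanics behind these choices can be sketched. The half-lives (180 days for corrections, 14 for observations) and the pinning behavior come from the text above; the exact exponential curve is an assumption, not documented MemoClaw internals:

```python
HALF_LIFE_DAYS = {"correction": 180, "observation": 14}

def effective_score(importance, memory_type, age_days, pinned=False):
    """Importance decayed by the type's half-life; pinned memories never fade.
    (Exponential decay is an assumed model, not MemoClaw's documented formula.)"""
    if pinned:
        return importance
    half_life = HALF_LIFE_DAYS[memory_type]
    return importance * 0.5 ** (age_days / half_life)

# Ninety days later:
print(effective_score(0.95, "correction", 90, pinned=True))  # 0.95 — pinned outage
print(round(effective_score(0.7, "correction", 90), 3))      # 0.495 — still strong
print(effective_score(0.3, "observation", 90) < 0.01)        # True — faded away
```

Three months in, the pinned production fix is untouched, the TypeScript fix has barely decayed, and the ESLint note has effectively vanished, which is exactly the ordering you want at recall time.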
Building the self-improvement loop
The real value shows up over time. As your agent accumulates error memories, it gets faster at resolving issues.
Daily error audit (via cron or heartbeat):
# Check what errors were stored recently
memoclaw list --namespace errors --limit 20
# Look for patterns — multiple errors with similar tags
memoclaw recall "recurring errors this week" --namespace errors --limit 10
Consolidation for recurring issues. If the same type of error keeps showing up, consolidate the memories into a single, comprehensive resolution:
# Merge multiple Docker-related error memories
memoclaw consolidate --namespace errors --dry-run
Review the dry-run output. If the clusters make sense, merge them. This keeps your error namespace from getting noisy with variations of the same fix.
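What a consolidation dry run might surface can be approximated by clustering memories on shared tags. This union-find sketch is a crude stand-in; MemoClaw's actual clustering is presumably semantic, not tag-based:

```python
from collections import defaultdict
from itertools import combinations

memories = [
    {"id": 1, "tags": {"error", "docker", "build"}},
    {"id": 2, "tags": {"error", "docker", "network"}},
    {"id": 3, "tags": {"error", "npm", "dependencies"}},
    {"id": 4, "tags": {"error", "docker", "volume"}},
]

def dry_run_clusters(memories, min_shared=2):
    """Group memories sharing >= min_shared tags (a stand-in for semantic clustering)."""
    parent = {m["id"]: m["id"] for m in memories}
    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(memories, 2):
        if len(a["tags"] & b["tags"]) >= min_shared:
            parent[find(a["id"])] = find(b["id"])
    clusters = defaultdict(list)
    for m in memories:
        clusters[find(m["id"])].append(m["id"])
    # Only multi-member clusters are consolidation candidates
    return [ids for ids in clusters.values() if len(ids) > 1]

print(dry_run_clusters(memories))  # the three docker memories cluster; npm stands alone
```

Here the three Docker memories would be proposed as one merge candidate, mirroring what you'd eyeball in the dry-run output.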
Proactive prevention. Once your agent has enough error history, it can start checking for known issues before they happen:
# Before deploying, recall deployment-related errors
memoclaw recall "deployment errors and gotchas" --namespace errors --limit 5
# Before upgrading dependencies, check for upgrade-related issues
memoclaw recall "dependency upgrade problems" --namespace errors --limit 5
This turns reactive error handling into proactive risk awareness. The agent doesn’t just fix problems — it anticipates them based on past experience.
What to store vs. what to skip
Not every error deserves a memory.
Store errors that took more than one attempt to fix, errors with non-obvious root causes, errors likely to recur (infrastructure, config, dependencies), production incidents (always), and errors where the fix differed from what you’d naively try.
Skip typos and syntax errors, one-time network blips, errors from code that’s already been deleted, and anything with PII or secrets in the error message (MemoClaw isn’t for secrets).
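These rules are simple enough to encode as a pre-store check. A sketch with illustrative field names (none of them come from MemoClaw):

```python
def should_store(error):
    """Encode the store/skip rules above. Field names are illustrative."""
    if error.get("contains_secrets"):          # never store PII or secrets
        return False
    if error.get("production_incident"):       # production incidents: always
        return True
    if error.get("attempts_to_fix", 1) > 1:    # took more than one attempt
        return True
    if error.get("root_cause_obvious", True) is False:
        return True
    if error.get("likely_to_recur"):           # infra, config, dependencies
        return True
    return False                               # typos, blips, deleted code

print(should_store({"production_incident": True}))                  # True
print(should_store({"attempts_to_fix": 3, "contains_secrets": True}))  # False
```

Note the secrets check runs first: an error message containing credentials is skipped even if it would otherwise qualify as a production incident.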
The payoff
An agent without error memory makes the same mistakes forever. An agent with error memory converges toward reliability.
After a month of storing error resolutions, most OpenClaw agents I’ve seen hit a pattern where 60-70% of errors match existing memories. The agent goes from “unknown error, let me debug” to “I’ve seen this before, here’s the fix” in a single recall call at $0.005.
The errors namespace becomes your agent’s institutional knowledge — the stuff a senior engineer carries around in their head. Except it’s searchable, it persists across sessions, and it doesn’t quit when it gets a better offer.