How I think about agent memory
A draft. Feedback welcome — ashwanijha04@gmail.com.
Every LLM agent I've shipped in the last two years has had the same fatal property: it forgets. You can paper over it with longer context windows. You can RAG over a vector store and call that memory. Neither of those is what humans mean when they say memory, and neither is what an agent needs to be useful past a single session.
This is the model I keep coming back to. It's the basis of Extremis, an open-source memory layer I'm building that drops into the Anthropic and OpenAI SDKs with one import change. But the architecture is the interesting part, not the package.
What memory is not
Three things keep getting confused with memory.
Context window. Bigger windows are great for one long conversation. They're useless for "what did the user tell me last week?" because the window only contains what this call carries. Stuffing prior conversations into the prompt is a brute-force workaround, and it scales with token cost, not with relevance.
Vector RAG over chat logs. Embed every turn, retrieve top-k by cosine, hand the chunks to the model. This is the default "memory" most production agents ship today. It works for QA-shaped lookups ("what did we say about pricing?") and fails for everything else — there's no notion of importance, no concept of time, no learning from whether the retrieval was actually useful, and crucially, no way to distinguish "the user prefers Indian food" from "the user mentioned Indian food once three months ago."
Long-term fine-tuning. Updating model weights on user data is a research project, not infrastructure. Even when it works, it's all-or-nothing — you can't forget a fact, you can't show the user what the model thinks it remembers, and you can't roll back.
Memory is none of these things. Memory is a structured, queryable, editable substrate that sits next to the model and feeds it the relevant slice of the past on every call.
Four layers, not one
When I started, I tried the single-layer design: one big embedding store, one retrieval pass, ranked by recency-weighted similarity. It works for demos. It falls apart in production because different kinds of memory have different access patterns.
Borrowing from cognitive science (and the brain-region naming I use in Friday), I now think about memory in four layers:
- Episodic — what happened. Raw conversation turns, events, observations. Append-only. Cheap to write, slow to retrieve at scale. This is what most vector-RAG systems are.
- Semantic — what's true. Facts distilled out of episodes. "The user is a senior engineer at Property Finder." "They prefer code examples in TypeScript." Smaller than episodic, much more queryable. Updated by consolidation, not append.
- Procedural — how to do things. Learned routines and skills. "When the user asks about deployments, check the Vercel dashboard first." Used for action selection, not just question answering.
- Identity — what's stable. The agent's own values, voice, persistent goals. Rarely written. Read on almost every call to keep behavior consistent.
The mistake I made for too long was treating these as the same store with different tags. They're not. They have different read frequencies (identity > semantic >> procedural >> episodic), different write rules (episodic is a firehose, identity is curated), and different retrieval needs. Once I split them, ranking got dramatically simpler.
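A minimal sketch of the shape this split takes, kept deliberately generic. The layer names mirror the list above; the fields, defaults, and store layout are illustrative, not Extremis's actual schema. The later sketches in this post build on these two types.

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Layer(Enum):
    EPISODIC = "episodic"      # what happened: raw turns, append-only firehose
    SEMANTIC = "semantic"      # what's true: facts distilled by consolidation
    PROCEDURAL = "procedural"  # how to do things: learned routines
    IDENTITY = "identity"      # what's stable: values, voice, persistent goals


@dataclass
class Memory:
    content: str
    layer: Layer
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_used_at: datetime = field(default_factory=datetime.utcnow)
    score: float = 0.0                 # learned usefulness weight (next section)
    use_count: int = 0                 # how often retrieval has surfaced it
    superseded_by: str | None = None   # set when a newer fact contradicts this one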
Recall should explain itself
In a single-layer RAG system, the answer to "why did you tell me X?" is "cosine similarity was 0.87." That's not an answer. It tells you nothing about whether the result is relevant, recently useful, or stale.
Every memory in Extremis ships with a reason field on retrieval:
results = mem.recall("what is the user building?")
for r in results:
    print(r.memory.content)
    print(r.reason)
    # "similarity 0.91 · score +2.0 · used 5× · 3d old"
That string is the entire ranking decision laid out. Similarity is one input. Score is a learned weight that goes up when this memory has been useful and down when it's been ignored or contradicted. Usage count is how often it's been retrieved. Age is recency.
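A sketch of how a composite rank like that might be computed, using the Memory sketch above. The weights and the 30-day time constant are placeholders, not the formula Extremis uses:

import math


def rank(memory: Memory, similarity: float) -> float:
    # Similarity is one input among several; the rest come from the memory's own history.
    age_days = (datetime.utcnow() - memory.last_used_at).days
    recency = math.exp(-age_days / 30)       # exponential recency decay, ~30-day time constant
    usage = math.log1p(memory.use_count)     # diminishing returns on raw retrieval count
    return similarity + 0.5 * memory.score + 0.2 * usage + 0.3 * recency


def explain(memory: Memory, similarity: float) -> str:
    # The reason string is just the ranking inputs, spelled out.
    age_days = (datetime.utcnow() - memory.last_used_at).days
    return (f"similarity {similarity:.2f} · score {memory.score:+.1f} · "
            f"used {memory.use_count}× · {age_days}d old")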
Two things follow from making this visible:
- You can debug bad answers. If the agent said something wrong, you can ask which memory it pulled and why that memory ranked where it did. Almost always the answer is "stale memory ranked over fresh one because score had drifted high" — and now you can fix it.
- You can build a feedback loop. If the user thumbs-up or the downstream action succeeds, score up. If they correct the agent, score down. Useful memories rise; bad ones decay. The system gets less wrong over time without retraining.
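The loop itself can be very small. Assuming the score and usage fields from the sketches above, something like:

def mark_retrieved(memory: Memory) -> None:
    # Bookkeeping at recall time; feeds the usage and recency terms above.
    memory.use_count += 1
    memory.last_used_at = datetime.utcnow()


def reinforce(memory: Memory, delta: float = 1.0) -> None:
    # The user thumbs-up, or the downstream action succeeded: this memory earned its rank.
    memory.score += delta


def penalize(memory: Memory, delta: float = 1.0) -> None:
    # The user corrected the agent, or the memory was contradicted: push it down.
    memory.score -= delta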
Explainable recall is the single most important property of an agent-grade memory system, and it's the one almost every "we built memory for our LLM!" announcement skips.
Consolidation matters more than retrieval
Most of the engineering effort in vector RAG goes into the retrieval side: better embeddings, hybrid sparse-dense ranking, re-rankers. That's the wrong end of the pipe to optimize.
The bigger lever is consolidation — the process that takes the raw episodic firehose and writes useful memories into the semantic and procedural layers. If you do this well, retrieval gets easy because there's less garbage to rank. If you do it badly, no amount of re-ranking saves you.
Concretely: at the end of every session (or on a schedule), an agent should:
- Read its recent episodic memories.
- Extract durable facts that are likely to matter next session ("the user moved teams" — not "the user said hi").
- De-duplicate against existing semantic memory. If a new fact contradicts an old one, write the contradiction explicitly — don't just overwrite.
- Promote routines it executed successfully into procedural memory.
This is roughly what your hippocampus does during sleep, which is why Friday's consolidation module is named after it. The biology isn't a gimmick — it's a useful framing for what should run when.
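A sketch of what one such consolidation pass might look like. The extract_facts and find_contradiction callables are hypothetical stand-ins for whatever distillation you run (an LLM call, rules, or both); nothing here is Extremis's actual pipeline.

from typing import Callable


def consolidate(
    episodic: list[Memory],
    semantic: list[Memory],
    extract_facts: Callable[[list[Memory]], list[str]],               # hypothetical: LLM or rule-based distillation
    find_contradiction: Callable[[str, list[Memory]], Memory | None],  # hypothetical: same subject, conflicting claim
) -> list[Memory]:
    # One consolidation pass: episodic firehose in, durable semantic facts out.
    new_facts: list[Memory] = []
    for fact in extract_facts(episodic):
        old = find_contradiction(fact, semantic)
        if old is not None:
            old.superseded_by = fact              # mark, don't delete: the history stays auditable
        if all(fact != m.content for m in semantic):  # de-duplicate against existing facts
            new_facts.append(Memory(content=fact, layer=Layer.SEMANTIC))
    # Routines that executed successfully get promoted to the procedural layer the same way (omitted here).
    return new_facts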
Forgetting is a feature
The hardest part of running a memory system in production isn't writing, retrieving, or ranking. It's forgetting.
People change. They move jobs, change preferences, get over the breakup. An agent that remembers everything indefinitely becomes a creep — surfacing facts the user no longer identifies with, treating six-month-old preferences as current, refusing to update its model of who you are.
A useful memory system has three forgetting modes:
- Decay — passive. Scores drift down for unused memories. Eventually they fall out of the top-k.
- Contradiction — active. When a new fact contradicts an old one, the new one wins by default and the old one is marked superseded (not deleted — auditable).
- Explicit deletion — user-driven. The user can see what the agent remembers and remove specific entries. This is non-negotiable for trust. Build the UI.
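All three modes fall out of the fields used in the earlier sketches. A sketch, again with illustrative constants rather than anything Extremis actually ships:

def decay(memories: list[Memory], half_life_days: float = 45.0) -> None:
    # Passive: scores of unused memories drift toward zero until they fall out of the top-k.
    now = datetime.utcnow()
    for m in memories:
        idle_days = (now - m.last_used_at).days
        m.score *= 0.5 ** (idle_days / half_life_days)


def supersede(old: Memory, new: Memory) -> None:
    # Active: the new fact wins by default; the old one is kept, marked, and auditable.
    old.superseded_by = new.content


def delete(memories: list[Memory], target: Memory) -> None:
    # Explicit, user-driven: actually remove the entry rather than just down-ranking it.
    memories.remove(target)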
The temptation when you've built memory infrastructure is to remember more. Resist it. The job is to remember the right things.
Why this needs to be a library
The whole point of the agent boom is that the model is the easy part. What's hard is the surrounding infrastructure — retrieval, tools, state, observability — and every team is currently rebuilding that infrastructure from scratch, badly.
Memory is the most reusable piece of that infrastructure. It doesn't depend on your domain or your model. It has well-defined inputs (conversations) and well-defined outputs (relevant context, with reasons). It's exactly the shape of thing that should be a library you install and forget about.
That's the bet with Extremis. Change one import:
from extremis.wrap import Anthropic
from extremis import Extremis
client = Anthropic(api_key="sk-ant-...", memory=Extremis())
…and every call automatically recalls relevant memories before the request goes out, persists new ones after it returns, and surfaces why it retrieved what it retrieved.
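Assuming the wrapper keeps the standard Anthropic messages API intact (the model name below is a placeholder), calls look exactly the way they would without the memory layer:

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": "Pick up where we left off on the deploy script."}],
)
# Relevant memories were recalled and injected before this call went out;
# anything durable from the exchange is persisted after it returns.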
If you're building agent products and you've been bolting together your own memory layer, I'd love to hear what's been working and what hasn't. The library is open source, and I'm dogfooding it inside Friday — every week of usage uncovers an assumption that didn't survive contact with reality.