Agent Memory ((and tool use) Are All You Need)
The current paradigm for “long”-horizon agentic tasks (where “long” still mostly means minutes) isn’t a single monolithic agent. Instead, it’s a system of specialized sub-agents, coordinated by top-level planners, with feedback and error-correction layers stacked on top.
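For concreteness, here is a minimal sketch of that shape: a planner decomposes the task, specialized sub-agents execute steps, and a critic acts as the error-correction layer. All names and signatures below are hypothetical placeholders, not any particular framework's API.

```python
# Minimal sketch of the planner / sub-agents / critic shape described above.
# Every callable here is a placeholder the caller supplies.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    agent: str          # name of the sub-agent that should handle this step
    instruction: str    # what it should do

def run_task(
    task: str,
    planner: Callable[[str], list[Step]],          # task -> ordered steps
    subagents: dict[str, Callable[[str], str]],    # specialized workers
    critic: Callable[[Step, str], bool],           # feedback / error-correction layer
    max_retries: int = 2,
) -> list[str]:
    results = []
    for step in planner(task):
        for _ in range(max_retries + 1):
            output = subagents[step.agent](step.instruction)
            if critic(step, output):               # accept, or retry the step
                results.append(output)
                break
    return results
```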
Around the jump from GPT-3.5 to 4, I started to think orchestration scaffolds might quickly become obsolete, with GPT-5/6 potentially becoming coherent and nuanced enough, and having large enough context windows, to one-shot most tasks. Especially since multi-agent systems have obvious downsides, not unlike human coordination problems: dropped context, misalignment, communication loss.
I thought that if models got good enough, you could just give one all the data and it could reason its way to the end without the coordination penalty, and the performance gap between monolithic and multi-agent setups would close or even flip.
But the trend since then has mostly been that whenever the foundation models improve, you can replace the old model inside your agents and get an immediate lift. If your multi-agent system built on GPT-4o performs better than a monolithic 4o agent, then after swapping 4o for 4.5 the multi-agent system will still beat a monolithic 4.5 agent. In other words, the gains from the scaffolding continue to outweigh the communication penalties.
It’s worth noting that we’re at the beginning of a new scaling paradigm: R1/o1 are the GPT-2 of RL scaling. I expect significantly improved capabilities soon that could render many pieces of multi-agent systems obsolete. In one specific dimension this has already happened: prompt engineering isn’t as important anymore. We used to craft a specific prompt for each of [summary, Q&A, tool use, etc.], and now the attitude has shifted to “ehhh, the reasoning models can handle this”.
But I think it’s plausible that very advanced reasoning models will still need systems for memory (and tool use), and maybe that’s all we’ll need going into AGI.
What’s missing?
In most multi-agent systems today, you wouldn’t pass all the intermediary data an agent used to reach an answer (e.g. the search results it read) on to the next agent. Instead you pass along just some synthesised ‘output’ of that agent (e.g. what ChatGPT gives you back after doing a web search).
This is done because passing all the intermediate data is slow, expensive and noisy. Agent communication using the synthesised outputs works to a point, but information loses fidelity quickly, and there is little ability to backtrack and ask for a different way of synthesising the information without needing to redo the work.
It’s like a researcher who, after writing their paper knows only what’s written down in the paper, and when asked a follow up question needs to do the research all over again.
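A rough way to picture the difference: what gets handed to the next agent today, versus a handoff that keeps pointers back to the underlying work. The field names below are illustrative assumptions, not an existing schema.

```python
# Illustrative contrast between a lossy handoff (what usually happens today)
# and a handoff that keeps references to the intermediate material.
from dataclasses import dataclass, field

@dataclass
class LossyHandoff:
    answer: str                        # the synthesised output; all the next agent sees

@dataclass
class TraceableHandoff:
    answer: str                                        # the synthesised output, as before
    sources: list[str] = field(default_factory=list)   # e.g. URLs / document IDs actually read
    trace_id: str | None = None                        # pointer to the stored intermediate trace,
                                                       # so a later agent can re-synthesise it
                                                       # differently without redoing the search
```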
So we don’t just need communication; we need better memory. While agent memory systems exist (often already split into long-term, short-term, episodic, semantic, and instructional memories), it is still unclear:
- At what fidelity do we store memories? Should we store all previous conversations? What about their intermediary data?
- Should we extract important memories into a ‘different tier’? What should that extraction contain? How would agents know what’s available?
- How do we keep the memory relevant? Do we time-decay or invalidate memories?
- When should memories be retrieved? Should it be an active ‘tool use’ or a passive prompt injection? (Both styles are sketched below.)
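To make that last question concrete, here is a rough sketch of the two retrieval styles. The `memory.search(query, k)` store, the `llm(messages)` callable, and the "SEARCH:" convention are all hypothetical; no real library API is implied.

```python
def passive_injection(llm, memory, user_msg: str) -> str:
    # Passive: relevant memories are retrieved up front and injected into the
    # prompt, whether or not the model would have asked for them.
    recalled = memory.search(user_msg, k=5)
    system = "Relevant memories:\n" + "\n".join(recalled)
    return llm([{"role": "system", "content": system},
                {"role": "user", "content": user_msg}])

def active_tool_use(llm, memory, user_msg: str, max_turns: int = 3) -> str:
    # Active: the model decides when to look something up by emitting a
    # (made-up) "SEARCH: <query>" line; we run the search and feed results back.
    messages = [{"role": "user", "content": user_msg}]
    reply = llm(messages)
    for _ in range(max_turns):
        if not reply.startswith("SEARCH:"):
            return reply                                   # answered without retrieval
        recalled = memory.search(reply.removeprefix("SEARCH:").strip(), k=5)
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "Results:\n" + "\n".join(recalled)}]
        reply = llm(messages)
    return reply
```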
Naively reasoning about how my own memory works, it seems to preallocate compute: it heavily compresses information into the format that matches its best guess of how that memory will be used later.
Here’s how I think it should work (a rough code sketch follows the list):
- The long-term persistence store contains all the historical interactions with agents, without intermediary data, but with some pointers to important parts of how it came to a conclusion (like how I don’t remember the equation for x but I remember it’s in chapter y of my textbook)
- The intermediary data for recent conversations may be stored in full until some process has run over it to extract pointers and maybe some useful high-level concepts
- Higher-level memory should be the equivalent of things we can always bring to mind without effort: where I work, who my friends are, what food I like.
- Updates should be done in a few ways:
- Memories themselves should be updated: existing ones that now conflict with new data (e.g. I no longer like black coffee) should be overwritten, ideally leaving a trace of the history in case it’s needed
- The mechanism or rules that determine what to save should be updateable passively through interactions, i.e. it should learn over time that I like machine learning and cooking, and save more of what relates to them than, say, TV shows, even if I do watch a lot of TV and talk about it
- Toy examples might be:
- I ask for a new paper to read: it should know from high-level memories that I’m interested in agents and interp. It should then search through long-term memory to find out what I’ve read in the past. From there it should have enough information to run the search.
- If I ask for new paper recommendations often, it might want to update its memory-saving rules so that there’s a piece of high-level memory listing all the papers I’ve read, and update that memory whenever I mention a new one or it recommends one.
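Pulling those pieces together, here is a rough sketch of how the tiers and update rules might fit. Everything in it (class and field names, the consolidation step, the save-rule weights) is an illustrative assumption, not a reference implementation.

```python
# Hypothetical tiered memory: high-level always-available facts, a long-term
# store of past interactions with pointers, and raw intermediates kept only
# until a consolidation pass runs over them.
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    created_at: float = field(default_factory=time.time)
    superseded_by: str | None = None                     # updates keep a trace, not a hard delete
    pointers: list[str] = field(default_factory=list)    # "it's in chapter y" style references

class TieredMemory:
    def __init__(self):
        self.high_level: dict[str, Memory] = {}   # always injected: where I work, what I like
        self.long_term: list[Memory] = []         # all past interactions, minus intermediates
        self.recent_raw: list[str] = []           # full intermediate data, pending consolidation
        self.save_topics: dict[str, float] = {}   # learned weights for what is worth saving

    def consolidate(self):
        # Periodic pass over recent raw data: keep a short summary plus a
        # pointer back to the full trace, then drop the bulk.
        for raw in self.recent_raw:
            self.long_term.append(Memory(text=raw[:200], pointers=[f"raw:{hash(raw)}"]))
        self.recent_raw.clear()

    def update(self, key: str, new_text: str):
        # Conflicting memories are overwritten, but the old one is kept as history.
        old = self.high_level.get(key)
        if old:
            old.superseded_by = new_text
            self.long_term.append(old)
        self.high_level[key] = Memory(text=new_text)

    def reinforce_save_rule(self, topic: str, amount: float = 1.0):
        # Passively learned: topics I keep coming back to get saved more eagerly.
        self.save_topics[topic] = self.save_topics.get(topic, 0.0) + amount
```

In the paper-recommendation toy example, frequent requests would keep bumping `reinforce_save_rule("papers")` until a “papers I’ve read” entry gets promoted into `high_level` and is updated on every mention.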
Do we really need all this?
Maybe not. One obvious way we wouldn’t need any memory setup is if models could efficiently and accurately attend to billions of tokens.
But if you look around, a good chunk of what people do in the world seems to be looking for the right piece of information and relaying it to the right people in the right format.