Isaac Hagoel

Why LLM Memory Still Fails - A Field Guide for Builders

It's an open secret that despite the immense power of Large Language Models, the AI revolution hasn't swept through every industry - and not for lack of trying. We were warned AI would take our jobs and replace every app, but that hasn't happened. Why?

Some would say "hallucinations," but let’s be honest - people hallucinate too, and often more than modern LLMs. The real missing piece, the thing standing in the way of the AI tsunami, is memory: the ability to learn, grow, and evolve over time.

Imagine hiring an AI as a new team member. You wouldn’t expect them to know everything on day one. They’d need to learn the role, get to know the team, understand your business logic, make mistakes, get feedback, and improve. All of that learning happens over time.

LLMs as they exist today, even when equipped with the best available tools, can’t do any of that. They are stateless and frozen in time. 

This isn’t a theoretical overview - I rolled up my sleeves and tested real systems to see what actually works.


Stateless Intelligence

The only way to introduce new information or "teach" an LLM new skills is by providing it with examples, instructions, and all the accumulated relevant information repeatedly with every invocation (prompt). This blob of tokens given to the model is known as "context", and it has a clear failure mode - "context rot": the more you stuff into the prompt, the harder it becomes for the model to separate signal from noise and know what to attend to. This is true even for models with large context windows, e.g. 1M tokens for GPT-4.1 or Gemini 2.5 Pro. We all know this first-hand - that moment when you realise you need to start a new chat because the model gets all confused - and it's also backed by research. In other words, the idea of dumping "all the datas" into the context window fails the test of reality.
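To make the statelessness concrete, here is roughly what every call looks like from the application side. This is just a sketch - llm() stands in for any chat-completion style API:

```python
# Statelessness in practice: the model only ever sees what you send it, so the
# application has to re-pack the instructions, examples and accumulated history
# on every single call. llm() is a stand-in for any chat-completion style API.
history: list[dict] = []

def ask(user_msg: str, system_prompt: str, llm) -> str:
    history.append({"role": "user", "content": user_msg})
    # The entire context is shipped again, from scratch, on each invocation.
    messages = [{"role": "system", "content": system_prompt}, *history]
    reply = llm(messages)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Nothing persists between calls except whatever you choose to resend - and that pile only grows.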

Because of that, Context Engineering is the "make or break" pillar of any sufficiently complex LLM-centered feature/app and the most important skill for engineers building with AI.

This is why there is an entire industry around "how to get relevant information into the context" (and flush it out or compact it when it becomes less relevant). There is a whole slew of commercial and open-source offerings, all of which revolve around different flavours of RAG (e.g. Agentic RAG, Graph RAG).

The real confusion begins when these offerings start using the name "memory" for their RAG-based solutions - sometimes going as far as splitting it into categories like "episodic memory" or "semantic memory". This is smart marketing but creates false expectations by analogy.

Another thing you'd notice if you start playing with these "memory" frameworks/libraries is that they focus on interactions with a single user - the "chatbot" use case we know today. That is definitely not what a real agent continuously operating in an "open world" environment (like the AI worker we discussed before) requires.

What About The Memory Feature In ChatGPT?

When OpenAI say they added memory to ChatGPT, what they actually mean is that they gave the model the ability to store a flat list of textual blobs of information about the user, and to search over them (presumably using RAG). As before, the scope is a single user and it suffers from all the normal limitations of RAG, which we will discuss next.


“Memory” Systems That Aren’t 

I explored most of the major "memory" implementations out there. As I said before, all of them are search tools with the label "memory" slapped on top.

Most fall into one of three camps:

  • RAG (Retrieval-Augmented Generation): Vector search over external notes or structured memories. It's decent when you want to retrieve a few relevant examples for a specific fact or topic, but it's not designed to surface every occurrence or reason about them in aggregate. RAG retrieves semantically similar, often out-of-context chunks, expecting the LLM to stitch them together, which can lead to incomplete or inaccurate results - sometimes even hallucinations when gaps are filled incorrectly. It also struggles with large datasets, where relevant information gets buried under noisy matches, and with complex, multi-hop queries requiring reasoning (e.g., analyzing trends or causality). For example, if you ingest "Lord of the Rings" and ask for all disagreements between characters, RAG might surface vaguely related scenes rather than a comprehensive list. It's also poor at associative tasks - e.g., a user says "Today is my anniversary," and RAG retrieves generic anniversary info instead of memories tied to the user's relationship. This isn't surprising given how it works - vectorizing the query and searching for the nearest text chunks in a flat list.

    The great post here provides a solid breakdown of these shortcomings, though I don’t fully agree with their proposed fix (agentic RAG - see below).

Most modern systems don't rely on RAG alone - they pair it with structured metadata, keyword indices, or hybrid approaches (see here) to try to make up for its limitations. The strengths of RAG are its simplicity and unstructured nature (easy ingestion), but they are also its downfall. When it comes to implementing the kind of memory true persistent agents need - RAG won't do. (A minimal sketch of the vanilla retrieval step follows.)
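For reference, here is roughly what that retrieval step looks like. This is a minimal sketch, not any particular library's API; embed() is a stand-in for whatever embedding model you use:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    # Rank every stored chunk by similarity to the query and keep the top k.
    q_vec = embed(query)
    scored = sorted(((cosine(q_vec, embed(c)), c) for c in chunks), reverse=True)
    return [c for _, c in scored[:k]]

def build_prompt(query: str, chunks: list[str], embed) -> str:
    # Whatever survives the top-k cut is all the model will ever see.
    context = "\n---\n".join(retrieve(query, chunks, embed))
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```

Only the k nearest chunks ever make it into the prompt, which is exactly why "list every disagreement in Lord of the Rings" style questions fall apart.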

  • Agentic RAG: The main idea behind agentic RAG is to take a standard RAG system and allow the LLM to query it multiple times in a loop - refining its queries and accumulating context until it has what it needs to answer. This enables more sophisticated reasoning and planning. Unfortunately, it inherits the same core limitations: the underlying retrieval is still vector-based RAG, so it suffers from context fragmentation, relevance drift, and shallow matches. The iterative nature also makes it computationally expensive, token-hungry, and often too slow for real-time use. It can get stuck in "loops of doom" or terminate prematurely without finding the necessary information (a rough sketch of the loop appears below).

    Although it improves upon RAG, Agentic RAG still falls short of what we intuitively think of as memory. That’s why I turned to something more structured...

To be clear, not all use cases require real memory the way it's defined in this post. If your goal is to retrieve a page from documentation or pull up a few helpful examples - RAG can be perfectly sufficient. Its simplicity makes it fast to implement and often "good enough" in practice. But once your system needs to reason over time, adapt to new experiences, or manage overlapping context - you're outside RAG territory. That's where the memory gap becomes painfully obvious.
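Before moving on to graphs, here is a rough sketch of that agentic loop - an assumption-laden illustration, not a real framework. retrieve() is any retrieval function (e.g. the vector search sketched above) and llm() stands in for a model call:

```python
# Agentic RAG in miniature: the model keeps reformulating queries and
# accumulating context until it decides it can answer, with a hard cap on
# iterations to avoid the "loops of doom" mentioned above.
def agentic_rag(question: str, retrieve, llm, max_steps: int = 5) -> str:
    gathered: list[str] = []
    query = question
    for _ in range(max_steps):
        gathered.extend(retrieve(query))
        decision = llm(
            f"Question: {question}\n"
            "Context so far:\n" + "\n---\n".join(gathered) + "\n"
            "Reply with ANSWER: <answer> if the context is sufficient, "
            "or QUERY: <a refined search query> if you need more."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        if decision.startswith("QUERY:"):
            query = decision[len("QUERY:"):].strip()
    # Budget exhausted - the "terminate prematurely" failure mode from above.
    return llm("Answer as best you can.\nQuestion: " + question +
               "\nContext:\n" + "\n---\n".join(gathered))
```

More tokens, more latency, and the retrieval underneath is still the same flat vector search.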

  • Graph RAG: In a previous post, I described how I tried to compensate for RAG's limitations using agent-generated SQL queries over structured metadata. The core insight was simple: RAG lacks structure. So what if we added it?

    Graph-RAG attempts exactly that. During ingestion, a large language model extracts entities and relationships from text and encodes them as nodes and edges in a graph. Later, retrieval happens by traversing that graph - e.g. walking outward from a node, filtering by relationship types, running path algorithms, and so on (a minimal sketch follows this list). Some frameworks even add a temporal dimension, which brings it a little closer to how we imagine human memory.

    It sounds promising on paper. After all, remembering is often associative - one idea leads to another. Graphs seem like a natural fit.

    However, this sophistication comes at a steep cost: the simplicity of traditional RAG is lost. Operations grow complex - entity resolution becomes a puzzle (e.g., does "the king," "king Arthur," or "He" refer to an existing "Arthur" node or a new entity?), and disambiguation is tricky (e.g., distinguishing between multiple Arthurs like the king, his father, or a peasant). Beyond that, challenges like conflict resolution, data invalidation (when new information arrives), and compaction arise.

    These are solvable, maybe... the real blocker is schema design:

    How do you decide upfront what types of nodes and edges are relevant? In domains with rigid structure, like business workflows or e-commerce, you can get away with it, but for modeling generic memory? The kind of evolving, messy, contextual knowledge humans have? It falls apart.

    Memory is not a tree of concepts. It's a living web of hypotheses, contradictions, associations, and revisions.

    (PS: There are variants of Graph-RAG like this one that builds a tree structure, summarizing information as you move up. I didn’t explore it deeply since it felt even less suited for memory use cases.)

  • Agentic Graph RAG: Not sure it exists but no reason not to try :) 
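To make the Graph-RAG mechanics concrete, here is a minimal sketch - not Graphiti or any specific framework's API, just raw Cypher through the official neo4j Python driver, with the LLM-based extraction step assumed to have already produced (subject, relation, object) triples. The connection details are placeholders:

```python
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ingest_triples(triples):
    # triples: e.g. [("Arthur", "RULES", "Camelot"), ("Arthur", "SON_OF", "Uther")]
    # Relationship types can't be parameterized in Cypher, so this assumes the
    # extraction step emits safe, identifier-like relation names.
    with driver.session() as session:
        for subj, rel, obj in triples:
            session.run(
                "MERGE (a:Entity {name: $subj}) "
                "MERGE (b:Entity {name: $obj}) "
                f"MERGE (a)-[:{rel}]->(b)",
                subj=subj, obj=obj,
            )

def neighbourhood(entity: str, depth: int = 2):
    # Retrieval = walking outward from a node instead of a flat vector search.
    # Variable-length bounds can't be parameterized either, hence the formatting.
    with driver.session() as session:
        result = session.run(
            f"MATCH (a:Entity {{name: $name}})-[r*1..{int(depth)}]-(b) RETURN a, r, b",
            name=entity,
        )
        return [record.data() for record in result]
```

Even in this toy form the hard problems show up immediately: is "the king" the same node as "Arthur"? Which Arthur? Nothing in the storage layer answers that for you.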

As a software engineer, I was heavily attracted to Graph RAG and spent a long time playing with Graphiti. When I realised it was built with a chat between a single user and a single agent in mind, I even tried to implement my own customised version on top of Neo4j, tailored to my needs. But defining a good schema and ingesting long-form text into coherent, evolving memory graphs? That turned out to be really hard. Humans build memory by revising beliefs, forming hypotheses, forgetting selectively, and reading between the lines - none of which maps cleanly onto a fixed schema.

Take something as ordinary as a chat log. Here's a real (simplified) example:

Person A: "Oh, I'm gonna be so late..."

Person B: "What happened?"

Person A: "Ah, too embarrassed to ssy"

Person A: "*say"

Person B: "Lol, I bet you it's that roommate of yours again"

Person A: "That dude always forgets where he put our keys :("

This tiny exchange contains a surprising amount of context and information: there's a roommate, the keys were lost, lateness resulted, and there's an ongoing joke or shared memory. Humans pick this up instinctively, but encoding it into a graph - resolving entities, inferring causality, surfacing associations - is non-trivial. Where do you even start?

Then, after grappling with it for a few days, it hit me: this was Symbolic AI all over again.

Like Symbolic AI, Graph RAG gives you the illusion of control: explicit structure, clean logic, tidy representations. But it breaks down the moment things get ambiguous, nuanced, or evolve over time. That neatness just doesn't hold up in the real world. I pondered: what was the antidote to symbolic AI? Deep learning and the transformer architecture...

And then, something clicked.

There already exists a system with exceptional recall - something that can store associative, fuzzy, contextual information and resurface it later. LLMs, when it comes to their pre-training data, already behave like they have memory.

But the magic is in how they remember. LLMs don’t store records in a database. They don't store records at all. The knowledge they absorb during training becomes embedded in their weights. And when you query them, the right patterns get activated.

Try it: ask ChatGPT about something obscure, temporal, or requiring synthesis and instruct it to answer without using tools (no web search!). The results can be eerie. That thing remembers A LOT. It has zero trouble with timelines, contradictory information or anything else that traditional systems struggle with. Here’s an example. Here is another one.

So what’s the problem? The weights are fixed, right? Once training ends, the model’s knowledge is frozen in time.

Or is it?

I remembered that fine-tuning updates model weights post-training - usually to match a tone, format, or domain. It basically continues the training process after it was complete. So why not use it to add memory? Why not encode new experiences or user knowledge directly into the weights?


Memory in the Weights? Maybe

Unfortunately, fine-tuning can't do it 😕. Turns out there's a well-known issue in continual learning called catastrophic forgetting: when you fine-tune a model on new knowledge, it inevitably overwrites older capabilities - the more you fine-tune, the more of the original knowledge you lose. Not ideal if you're trying to simulate a persistent, growing memory.
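A toy illustration of the effect (nothing to do with LLMs specifically, just a few lines of PyTorch): train a small network on one task, keep training it on a second task only, and watch performance on the first collapse as the shared weights get overwritten.

```python
# Catastrophic forgetting in miniature: continued training on new data only
# destroys what the shared weights previously encoded.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.MSELoss()

# Two synthetic "tasks": the same inputs mapped through different random projections.
x = torch.randn(512, 10)
task_a = x @ torch.randn(10, 10)
task_b = x @ torch.randn(10, 10)

def train(target, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), target).backward()
        opt.step()

train(task_a)
print("loss on A after training A:", loss_fn(model(x), task_a).item())

train(task_b)  # "fine-tuning" on the new task only
print("loss on A after training B:", loss_fn(model(x), task_a).item())  # much worse
print("loss on B after training B:", loss_fn(model(x), task_b).item())
```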

That realization sent me down a rabbit hole of academic papers. Unsurprisingly, I wasn’t the first person to chase this idea and I quickly found some genuinely exciting research that tries to do what fine-tuning can’t. Two papers stood out: MemoryLLM and MEGa.

The MEGa paper was fascinating and advanced, but its authors didn't release any code. MemoryLLM, on the other hand, did, and their approach was clever: rather than modifying the entire model, they introduced a dedicated memory region within the weights. The base model stays untouched, while memory is isolated, updated, and read from dynamically at inference. They even accounted for "forgetting" older, less frequently used information.

And the most beautiful thing:

Since the memories were encoded in the weights - none of the context window limitations apply.

I instantly knew I had to try it first-hand.


Getting Hands-On: MemoryLLM

I cloned the repo and got it running on a remote machine with a beefy GPU (after burning a few hours on trial and error, fighting with Python dependencies, etc.).

When I went over the codebase, one thing immediately stood out: the researchers had actually modified the inference logic of the model to support reading from and writing to memory. Not the training pipeline - the live inference code. That's a part of the stack we developers never go near. But seeing it altered made something click for me:

We’re used to thinking of models as black boxes that you train or fine-tune. But you can also intervene in how they run, almost like patching application code. That’s powerful and honestly under-explored by engineers.

Then I saw they only supported Llama 3 - a relatively old, weak model - and that highlighted something else:

The research mindset is very different from the engineering mindset.

Researchers prototype with the goal of publishing a paper. That means small models, clean baselines, simple benchmarks, limited scope. Engineers, on the other hand, build with the goal of reaching a working, usable POC that can be used in the real world - not in a lab. They reach for the most powerful tools they can find (for open models that would be LLaMA 4, DeepSeek R1, or Kimi at the time of writing this). The last thing we engineers want is to be bottlenecked by a weak model. We instinctively ask: "will this scale?"

But here's the tradeoff: those stronger models often come with much more complex internals - Mixture of Experts, longer pipelines, finicky tokenizers, harder fine-tuning. They're not well documented and not easy to poke at unless you have serious time, compute, and domain knowledge. The researcher is making a pragmatic choice, one that makes sense for them but leaves us wanting.

Still, I had a plan: if MemoryLLM could reliably store and retrieve memory (even at small scale), I could wrap it with a smarter agent that decides what's worth remembering, when to save it, and how to use it in the future. I could scale horizontally with multiple instances. I didn't need perfection, just a signal that the core mechanism worked.
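Roughly, the wrapper I had in mind looked like this. Every name here is hypothetical - inject()/query() are not MemoryLLM's real API, just whatever interface a memory backend would expose - and llm() is the "smarter agent" model:

```python
# Sketch of a memory-gating wrapper: a strong LLM decides what is worth
# remembering, and only distilled "episodes" get pushed into the backend.
from typing import Protocol

class MemoryBackend(Protocol):      # stand-in for MemoryLLM or any other backend
    def inject(self, text: str) -> None: ...
    def query(self, prompt: str) -> str: ...

def observe(event: str, memory: MemoryBackend, llm) -> None:
    decision = llm(
        "You curate an agent's long-term memory. Given the event below, reply "
        "SAVE: <a short, self-contained summary worth remembering> or SKIP.\n\n"
        f"Event:\n{event}"
    )
    if decision.startswith("SAVE:"):
        memory.inject(decision[len("SAVE:"):].strip())

def answer(question: str, memory: MemoryBackend, llm) -> str:
    recalled = memory.query(question)   # the read side: whatever the backend returns
    return llm(f"Relevant memories:\n{recalled}\n\nQuestion: {question}")
```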

And if it did work, it would open up a whole new avenue: not just tweaking prompts or retraining models, but actually engineering memory systems by intervening in the model’s runtime itself.


Reality Check: Does MemoryLLM Work?

I had high hopes. The benchmarks in the paper looked great, but I was about to swallow some bitter medicine.

MemoryLLM offers two modes: chat and mplus. The latter increases memory storage and improves retrieval, but it isn’t optimized for conversational flows - it tends to keep generating past the user’s question as if continuing the chat history. I tested both.

I knew that in the paper they ingested short snippets, but for my use case that wasn't practical. Consider the same example we looked at before:

Person A: "Oh, I'm gonna be so late..."

Person B: "What happened?"

Person A: "Ah, too embarrassed to ssy"

Person A: "*say"

Person B: "Lol, I bet you it's that roommate of yours again"

Person A: "That dude always forgets where he put our keys :("

LLMs have no problem understanding this the way a human would, but only if it is ingested as a whole. If we ingest it line by line, it loses all meaning. Ingesting full "episodes" was possible in my previous experiments, when I was playing with Graphiti, but not quite so with MemoryLLM.

When I tried to ingest examples like the conversation above, the results were disappointing. Sometimes the input was ignored entirely. Other times, the responses were hallucinated or incoherent. When I carefully mimicked the benchmark setup, including their custom attention masks, and drastically shortened the inputs - I could get semi-coherent replies, but still nothing close to usable in an actual system.

Digging into the benchmark code revealed why: the test setup was highly unrealistic. Simple, isolated prompt-response pairs. No real dialogue. No sustained context. In short, nothing that resembled real-world use.

This isn’t a knock on the researchers - they weren’t trying to build a production agent and I do think their ideas are remarkable. But it served as a bitter reminder: synthetic benchmarks can look impressive while masking critical limitations.

To be fair, I’m not an AI researcher. Maybe I missed something. But after days of config tweaks, prompt engineering, and test cases, I’m pretty confident: this is an exciting idea but it’s still in its infancy. Not ready for real-world agents.


The Contrast

Here’s the kicker: I was using Claude 4 and o3 inside my IDE to help write, test, and troubleshoot all of this. The difference was staggering. These models were grounded and nuanced. I’d paste in the same kinds of messy conversations I tested MemoryLLM on, and they’d instantly parse the implied context, draw the right conclusions, and respond meaningfully.

It drove the following point home:

We’re sitting on a goldmine that’s frozen in time. 


So What Now?

I still believe that embedding memory in the model itself is the long-term path. It's the only direction that could eventually support agents that learn, grow, and evolve the way humans do. But I now better understand why real-world teams still rely on RAG, vector DBs, and graph overlays: they're accessible, composable, debuggable, and well understood. You can build something useful without re-architecting a transformer.

Still, I wonder if we, as engineers, should begin crossing that boundary - learning to intervene not just at the API layer, but in the internals of open-weight models - bringing our engineering mindset to the table. I mean, how difficult can it be? 😉

For now, I’m keeping one foot in each world: shipping practical tools with what’s available, and probing the frontier to see what might be possible.

If I find anything that shifts the landscape, I’ll write a follow-up.


PS: If you're working on any similar ideas, I’d love to hear from you. Let's compare scars.

Top comments (1)

Umang Suthar

This is such a brilliant breakdown of the real bottleneck in LLMs' memory. It’s refreshing to see someone go beyond the buzzwords and dig into what actually works (and what doesn’t). We’ve been thinking about this problem a lot, too, especially from a systems perspective.

One thing we’re exploring is whether memory should live closer to the compute layer itself, where context isn’t just retrieved but natively processed alongside AI tasks. It feels like the current RAG-based 'memory' approaches are duct tape solutions, and the next breakthrough will come from rethinking the infrastructure beneath it.

Would love to hear your thoughts, especially if you’ve considered how blockchain-like transparency or distributed compute might change how we think about AI memory.