Memory Is Not a Database

By Erhan Bilal, PhD - CSO, Enkira AI
May 9, 2026

AI memory, agents, retrieval, neuroscience, PSA

[Figure: Delay cells experiment]

Part I: The Problem of What Comes to Mind

Why the hard part of AI memory is deciding what should come to mind

In the early 1980s, working in a lab at Yale, Patricia Goldman-Rakic noticed something that should have ended a long-running argument about what computers and minds have in common. She had trained a macaque on a delayed-response task: a flash of light somewhere on a screen, several seconds of nothing, and a saccade back to where the light had been. The interesting moment was the seconds of nothing. The animal stared at a blank screen and held its place. Inside its prefrontal cortex, throughout the silence, a small population of neurons fired in a pattern locked to the location it was supposed to remember.

Goldman-Rakic called them delay cells. As far as the recordings showed, they were the cellular substrate of keeping something in mind.

What is striking about delay cells is the negative space around them. While the firing continues, there is nowhere else in the brain that the remembered location can be said to be sitting. Cut the activity, and there is no memory to recover. The neurons that represent the location are the same neurons that keep it alive, doing one continuous thing that we, from the outside, have decided to describe as if it were two.

That detail keeps recurring in the brain when you look closely, and it keeps failing to recur in the systems we are currently calling AI agents.

The standard story is that those agents have a memory problem, which is true, but the framing is misleading. Calling them too forgetful misses what is actually wrong. What they store is the wrong shape for the questions an agent ends up needing to ask. The dominant metaphor in agent design is still the database: store the conversation, embed it, retrieve the nearest matching chunk, summarize whatever overflows, expand the context window if you need more room. That works extremely well for the things a database-shaped memory is good at, which are demos, factual lookup, and span-of-history retrieval. It does not work, except superficially, for the kind of collaborator we are increasingly trying to build: an agent expected to sustain a coherent working relationship with a project or a team or a person over many months. The reason has less to do with how clever the retrieval is than with the architecture underneath.

A bargain that became a bill

When John von Neumann sketched the architecture that would later carry his name, in 1945, he did the practical thing. He separated the components: a processor that did the arithmetic and the logic, a memory that held the instructions and the data, a bus that shuttled bits back and forth.

It was a brilliant engineering decision. It was also, in a quiet way, a metaphysical one. It declared that information was one kind of thing and that operations on information were another, and it built a world in which the two could live in different physical places, exchange messages, and produce thought.

For the workloads it was designed to handle, the decision turned out to be one of the great bargains of the century. It gave us general-purpose software, and with it spreadsheets, compilers, web servers, games, and most of what people now have in mind when they say the word computer. The bill, when it came, came late.

For numerical computing, the bill is what computer architects call the memory wall. Wulf and McKee saw it coming in a 1995 paper bluntly titled "Hitting the Memory Wall: Implications of the Obvious." The gap between how fast a processor can compute and how fast its memory can feed it the next operand has only widened since. On memory-bound workloads, modern processors spend much of their time waiting for data.

The whole tower of memory around a modern processor (L1, L2, and L3 caches on the chip, then HBM or DRAM, then NVMe storage) is, in effect, three decades of trying to keep that bill paid down. Caches aren't a botched repair on a bad design; they are brilliant engineering, and they are also the reason the bill never quite gets paid in full. Each new level of the hierarchy exists because, somewhere in the system, computation and storage are still in different physical places, and the next operation has to wait while the gap is closed.

For agents, a similar bill has begun to arrive in a different currency. It is paid not in clock cycles but in continuity, in judgment, and in whatever capacity the system has to adapt as it accumulates experience. Bigger context windows and more elaborate retrieval stacks are the agent world's version of a cache hierarchy, and they have the same character: they help a great deal, and they leave the underlying problem in place. A larger context window widens the bus. The bus is still there.

The brain made a different bargain. It refused to put memory on one side of the system and thought on the other.

Memory as an ongoing act

Delay cells are the cleanest experimental case of that refusal, but they are not the only one, and they are not even the strongest. Long-term memory plainly persists through sleep, anesthesia, distraction, and years of inattention, so the firing-keeps-it-alive picture cannot be the whole story. But the architectural lesson is the same wherever the neuroscience has been able to look. In the brain, the substrate that stores is the substrate that computes.

The same pattern shows up at three quite different time scales.

The simplest is the synapse. A synapse is not really a wire so much as a small, history-dependent piece of physical machinery: a multiplication whose coefficient was written by prior use, so that whatever computation it performs the next time current crosses it is conditioned on what came before. Donald Hebb proposed in 1949 that one neuron repeatedly taking part in firing another should strengthen the connection between them. (Carla Shatz later compressed Hebb's claim, more catchily, into "cells that fire together, wire together.") The substrate of the calculation and the substrate of the memory are, at this level, the same piece of physical machinery, and the relationship between them is mutual. The computation writes the memory, and the memory shapes the next computation.
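To make the "coefficient written by prior use" idea concrete, here is a minimal sketch of the textbook Hebbian rule in plain Python. The function names, the learning rate, and the activity values are illustrative assumptions, not a model of any real synapse; the plain rule also lets weights grow without bound, which real tissue does not.

```python
def transmit(weight, pre):
    # The "computation": scale the incoming signal by the stored coefficient.
    return weight * pre


def hebbian_step(weight, pre, post, rate=0.1):
    # The "memory": strengthen the weight when pre and post are active together.
    return weight + rate * pre * post


weight = 0.5  # assumed starting strength
# Hypothetical (pre, post) activity pairs; only the first two are joint firing.
history = [(1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]

before = transmit(weight, 1.0)
for pre, post in history:
    weight = hebbian_step(weight, pre, post)
after = transmit(weight, 1.0)

print(before, after)
```

The same input produces a larger output after the joint firing than before it. Nothing was looked up; the history is simply present in how the multiplication now behaves.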

Move up a level, and the same pattern reappears in what neuroscientists now call the engram. A consolidated memory is not a file in a folder; it is a distributed set of physical and chemical changes spread across populations of neurons, sometimes across multiple brain regions, that get reactivated together when the right cue arrives. Recalling a memory, on this picture, is not so much opening an archive as reigniting a pattern. The hardware that did the original thinking is the hardware that fires again, conditioned now by everything the brain has learned in the meantime, which is why the second firing is never an exact copy of the first. It is a fresh event shaped by the original. This is broadly why eyewitnesses become unreliable, and why memory-disrupting therapies for trauma can work at all.

A third version of the pattern shows up at the level of cognitive control. Earl Miller and Jonathan Cohen described the prefrontal cortex's job as the active maintenance of patterns that represent goals, and the use of those patterns to bias perception, action, and the rest of the brain's memory systems toward whatever the current task requires. Mark Stokes and others have shown more recently that this maintenance is more flexible than the original delay-cell picture allowed. Information can be held in hidden network states, with no continuous spiking activity, ready to be picked up again when it becomes relevant. Memory, at this level, can be available without being on.

What this leaves us with is one architectural pattern at three different time scales. The synapse stores a learned coefficient that has been refined across many small uses. The engram is more elaborate: a distributed pattern, subtly rewritten every time it is recalled. Working memory is more transient still, the configuration of activity the system happens to be holding open in the moment. None of these resembles a warehouse with a query interface. Each is a piece of computational work that, once done, becomes a constraint on whatever the system gets up to next.

What we want from memory is not the past preserved but the past made actionable in the present.

What this means for the LLM-shaped thing on your desk

A large language model knows a great deal. The knowledge sits in the weights, and the weights are read by computation. So far, the picture is genuinely brain-like.

Everything else about the running system still looks like the 1945 EDVAC bargain: computation on one side, memory on the other, and a bus between them.

The conversation you had with the model yesterday is not in the model itself; it is in a JSON file somewhere on a disk. To use any part of it, the file has to be loaded into a context window, which is a temporary scratchpad that exists for the duration of one inference and is discarded the moment the inference ends. When the window fills up, work is summarized and written back out to another file, and when you open a new session, that file is read again from scratch.

Properly, what we have here is not memory but paging, dressed up in the vocabulary of memory, which is exactly what makes the failure mode feel uncanny. The model speaks fluently about what it remembers, but inside the running system there is nothing that corresponds to a delay cell, an engram, or a shaped synapse. When the file is not loaded, there is no faded version of yesterday's conversation still humming somewhere in the network. There is no version at all.
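The load, overflow, summarize cycle described above can be sketched in a few lines. This is a toy under stated assumptions: `fit_to_window` and `placeholder_summary` are hypothetical names, the window is counted in turns rather than tokens, and the summarizer is a stand-in for whatever model call a real system would make.

```python
def placeholder_summary(turns):
    # Stand-in for a real summarizer; it only records how much was collapsed.
    return f"[summary of {len(turns)} earlier turns]"


def fit_to_window(turns, max_turns, summarize=placeholder_summary):
    # Everything fits: load the history unchanged.
    if len(turns) <= max_turns:
        return list(turns)
    # Otherwise keep the newest turns and page the rest out as one summary entry.
    keep = max_turns - 1
    split = len(turns) - keep
    overflow, recent = turns[:split], turns[split:]
    return [summarize(overflow)] + recent


history = ["turn 1", "turn 2", "turn 3", "turn 4", "turn 5"]
window = fit_to_window(history, max_turns=3)
print(window)
```

Nothing in this loop changes the model. The history moves between a file and a scratchpad, and whatever falls outside the window reaches the next inference only through the summary.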

The deeper problem isn't capacity (the hard drives are effectively infinite) but selection: of all the things the system technically has access to, which ones get to come back. Semantic search by itself can't decide that, because relevance to a current task does not always look like surface similarity. A debugging note can be exactly what you need during a release, even if no string in it overlaps with anything in your prompt. A failed experiment with one model can be the precise lesson you need when you evaluate a different one, because the failure was about the evaluation setup, not the model. Cosine distance is blind to all of that. Real relevance is a matter of cause, procedure, timing, risk, and personal history, none of which the embedding was trained to encode.
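A toy version of the debugging-note case shows the gap between surface similarity and relevance. The sketch below uses bag-of-words vectors as a crude stand-in for learned embeddings, which capture far more than string overlap; the prompt and notes are invented for illustration, and the point is only that the score measures resemblance, not usefulness.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Crude stand-in for an embedding: word counts, nothing learned.
    return Counter(text.lower().split())


def cosine(a, b):
    # Standard cosine similarity between two sparse count vectors.
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical prompt and stored notes.
prompt = "prepare the release checklist for version two"
notes = {
    "debug_note": "flaky timeout traced to stale cache on staging box",
    "old_checklist": "release checklist for version one",
}

query = bag_of_words(prompt)
scores = {name: cosine(query, bag_of_words(text)) for name, text in notes.items()}
print(scores)
```

The old checklist scores well because it looks like the prompt. The debugging note shares no words with it and scores zero, even though a stale cache on staging may be the one thing worth knowing before a release.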

What is missing in most agent stacks is not more storage but something closer to what the prefrontal cortex appears to do: a layer that asks, given what we are doing now, which fragments of the past should be allowed to shape attention.

This is the working hypothesis behind PSA, the Persistent Semantic Atlas: a local memory layer for AI agents that tries to make past work available as structured context rather than as a pile of searchable transcripts. Instead of storing an indefinite conversation log and running semantic search over it on every turn, PSA breaks experience into typed memories: episodes, procedures, failures, tool-use notes, and semantic facts. It groups those memories into atlas regions that can be opened or kept closed, learns from real usage which regions tend to matter, and runs forgetting as part of maintenance rather than treating storage as permanent. It is a small system that runs on a laptop. It does not solve agency, but it is our attempt to build the missing layer in software.
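As a rough illustration of what typed memories and gated regions might look like in code, consider the sketch below. Every name in it (`MemoryType`, `Memory`, `Region`, `open_regions`, the `usefulness` score and its update rule) is a hypothetical chosen for this example; it is not PSA's actual data model or API, and a real gate would presumably also condition on the current task.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryType(Enum):
    # The five memory types named in the text.
    EPISODE = "episode"
    PROCEDURE = "procedure"
    FAILURE = "failure"
    TOOL_NOTE = "tool_note"
    FACT = "fact"


@dataclass
class Memory:
    kind: MemoryType
    text: str


@dataclass
class Region:
    name: str
    memories: list = field(default_factory=list)
    usefulness: float = 0.5  # hypothetical score, nudged by feedback

    def record_feedback(self, helped: bool, rate: float = 0.2) -> None:
        # Move the score part of the way toward 1 (helped) or 0 (did not help).
        target = 1.0 if helped else 0.0
        self.usefulness += rate * (target - self.usefulness)


def open_regions(regions, threshold=0.5):
    # Only regions whose learned score clears the threshold are opened.
    return [r for r in regions if r.usefulness >= threshold]


release = Region("release-process", [Memory(MemoryType.FAILURE, "rollback skipped a migration")])
scratch = Region("scratch-logs", [Memory(MemoryType.TOOL_NOTE, "temporary debug flags")])
release.record_feedback(helped=True)
scratch.record_feedback(helped=False)
opened = [r.name for r in open_regions([release, scratch])]
print(opened)
```

After one round of feedback, only the release region stays open. The design choice worth noticing is that the gate is learned from use rather than recomputed from text similarity on every turn.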

Forgetting as a feature

Databases do not forget in the cognitive sense. Brains do, and the forgetting is not the bug it gets described as.

A mind that preserved every sensory detail forever would not be intelligent. It would be buried. Memory in a useful animal is adaptive compression: it keeps what is likely to help future action, fades what no longer does, and consolidates repeated episodes into something more general. Sleep is one of the places this seems to happen. Tononi and Cirelli's synaptic homeostasis hypothesis proposes that one function of sleep is to weaken synaptic strengths globally, so that the important ones stand out more clearly when the system wakes. Memory without forgetting is noise. A system that never decays its associations becomes less capable of recall over time, not more, which is roughly the opposite of what the "infinite memory" pitch from the agent-startup ecosystem assumes.
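A maintenance pass in this spirit might look like the sketch below: weaken every association by the same factor, then drop whatever falls under a floor. This is a loose software analogy, not a model of the synaptic homeostasis hypothesis; `downscale`, the factor, the floor, and the example strengths are all arbitrary assumptions.

```python
def downscale(strengths, factor=0.5, floor=0.1):
    # Weaken every association by the same factor...
    scaled = {key: value * factor for key, value in strengths.items()}
    # ...then forget the ones that have fallen below the floor.
    return {key: value for key, value in scaled.items() if value >= floor}


# Hypothetical association strengths accumulated during a working session.
strengths = {
    "deploy-procedure": 0.9,
    "one-off-debug-log": 0.15,
    "stale-assumption": 0.12,
}

after_maintenance = downscale(strengths)
print(after_maintenance)
```

Only the strongly reinforced association survives the pass. Anything that matters would have to be strengthened again through use before the next pass, which is what makes forgetting part of selection rather than a loss.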

Agents have, in this respect, a particularly noisy life: tool traces, half-formed plans, failed commands, temporary assumptions, contradictory instructions, debug logs that mattered for ten minutes before becoming irrelevant. Treat all of that as equal memory and the agent doesn't get wiser so much as heavier.

The right question is not only what to remember but what should keep influencing future thought.

The real test

The right way to test an agent's memory is not by asking whether it can answer "what did we discuss on March 12?" That is a search problem, and even the database-shaped systems can do it adequately.

The harder test is whether, six weeks later, working on a different task, the agent quietly avoids a mistake because it has internalized a relevant prior failure; whether it asks the right clarifying question because it has noticed before what kind of ambiguity tends to cause trouble in this codebase; whether it loads the correct project context without forcing the user to rebuild the entire relationship from scratch. That is what memory is for, when memory is doing its job.

No cache hierarchy papers over an agent that does not know what it knows, and no retrieval stack does either. A larger context window is a wider bus, not a different architecture. The von Neumann bargain was a brilliant trade for the workloads it served, but the bill it has begun to leave the agent stack is not paid in cycles.

Brains seem to have solved this, to whatever extent it is solved, by refusing the separation in the first place. Memory is what the network is doing right now, or what it has been changed to do later. Computation is what makes that memory continue to exist. The stored trace and the operating system are the same tissue.

We are not going to grow biological neurons in a data center, and that is fine. We can be honest about what we are imitating, and where we are still cheating, and we can build typed memories instead of flat chunks, semantic regions that open and close, gating that learns from use, forgetting that runs on its own, and feedback loops in which retrieval shapes future retrieval. PSA is gesturing at all of that. It is the shape of a hypothesis.

Memory is not the past preserved. It is the past made available to judgment, attention, and action. The agents that matter will not be the ones that store the most history, but the ones that can let the right parts of that history shape what they do next.