When Memory Spans Months: Benchmarking How Personalized Agents Remember, Update, and Forget

Jun 25, 2026

AI agents are moving from single-session tools toward long-term companions: tutors that follow students across a semester, assistants that hold context for months, conversational partners people return to over years. What makes any of this work is memory. And memory here means more than recalling what was said last Tuesday — it means updating that information when circumstances change, and letting go of it when it no longer applies.

This is the part current benchmarks rarely measure. Most of them test recall: was a fact mentioned earlier in the history, and can the model retrieve it? But recall turns out to be one of the easier things to get right. The genuinely hard problem is keeping a coherent picture of who the user is today, as their preferences shift, their plans change, and their old goals fall away — and that is exactly what existing evaluations leave untested.

In this work, we introduce Memora, a long-term memory benchmark built to reflect how real users actually behave. Over weeks and months, information arrives across many separate conversations, gets revised and contradicted, and gradually goes out of date, and a useful agent has to keep up with all of it. When we put today's systems under these conditions, the results are humbling. Across frontier LLMs — GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Pro Preview, with follow-up spot checks on newer models including Claude Fable 5, GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro Preview — and six dedicated memory agents, even the strongest systems reach only the low 20s out of 100 when asked to remember information at the quarterly horizon. On multi-step reasoning over evolving memory, almost everything scored zero — only Anthropic's newly released Claude Fable 5 managed to put any points on the board at all.

The current state of long-term memory evaluation

Most prior long-term memory benchmarks, including LoCoMo, LongMemEval, PerLTQA, PersonaMem, and MemDaily, frame the task as retrieval. The agent is handed a long conversation history and asked a question whose answer appeared in some earlier session, and the metric simply checks whether the right fact comes back.

The problem is how little these benchmarks actually stress memory over time. In LoCoMo, 94% of the questions can be answered from no more than two prior sessions, and LongMemEval is much the same at 85%. Updates and deletions, which are central to how memory works in practice, barely feature at all, appearing across at most two or three sessions. Underlying all of this is an implicit assumption that information, once stored, stays valid forever. Real users are not like that. They change jobs, abandon goals, and contradict last month's preferences, and an agent that scores well on these benchmarks can still cheerfully recommend a restaurant the user swore off weeks ago.

The Memora benchmark

Building the benchmark. Memora is designed to capture how memory accumulates and changes over a long relationship with a user. We build it through simulation: ten professional personas, each with structured memory of their preferences, activities, and goals, evolve across weekly, monthly, and quarterly horizons under explicit temporal and operational constraints. Because we record the full memory state before and after every session, we know exactly when each piece of information was valid, which gives us a precise ground truth to evaluate against. A multi-agent dialogue process then turns those memory traces into natural conversations, and every conversation is checked by three independent LLM graders, accepted only when all three agree, with 5% of each persona's data verified by hand.

The Memora construction pipeline. The strip along the bottom traces one fact through the system — a preference that changes over time — to show how FAMA credits recalling the current state while penalizing reuse of what the user has moved past.

The result is considerably more demanding than earlier benchmarks. Answering a typical quarterly question means integrating evidence spread across roughly 28 prior sessions while accounting for about 15 updates or deletions along the way, where most existing benchmarks ask the model to consolidate just one or two. Three tasks run against the same memory traces: remembering (recalling a stored fact, such as a current list of project objectives), reasoning (combining several memory elements, for example computing a remaining budget from a series of logged expenses), and recommending (suggesting something that fits the user's current preferences rather than a retracted one).

Measuring forgetting. There is a second, subtler gap in how long-term memory is usually scored. Standard metrics check whether the right information shows up in a response, but they say nothing about whether the wrong information also shows up, so an answer that blends a current preference with one the user has since dropped can still score well. To close this gap we introduce Forgetting-Aware Memory Accuracy (FAMA). FAMA separates the criteria for each question into what should appear (the current memory state) and what should not (the invalidated state), and weights the two by how much of the question depends on each. Three LLM judges score every criterion and we take a majority vote; the setup agrees with human annotators 88.3% of the time, with Cohen's κ between 0.86 and 0.90.

What we found. We evaluated four LLMs (GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro Preview, Qwen3-32B) and six memory agents (A-Mem, LangMem, Mem-0, MemoBase, MemoryOS, Nemori). Three patterns came through clearly.

The first is that performance falls off as the horizon grows. Almost every system does worse at the quarterly scale than the weekly one, and the memory agents — the systems built specifically for this — often slip the most. MemoBase drops from 43.6 to 15.2 on remembering, and MemoryOS from 51.8 to 25.1. An explicit memory store, on its own, does not keep inconsistency from piling up.

The second is that long-term memory is not one capability but several. The memory agents are far ahead on remembering, with an aggregate FAMA of 119.5 against roughly 66 for the LLMs. On recommending, though, the LLMs match or beat them, because the task tolerates approximate memory and lets a model infer current preferences from cues still sitting in its context. On reasoning, everything struggles: the best agents average 27.6 and the LLMs hover near 13. Retrieving a fact and reasoning correctly over many facts are simply different problems.

The third pattern is the one standard metrics hide. Forgetting-aware scoring reveals heavy reliance on information that should have been discarded, and it shows up differently for the two kinds of system. For LLMs, the forgetting penalty actually shrinks at longer horizons — not because they handle updates better, but because the relevant history has fallen out of the context window entirely. For memory agents the opposite happens: the longer the history, the more they lean on stale information they never let go of. The table below reports aggregated memory presence accuracy across the three tasks, with the FAMA penalty in parentheses; a larger penalty means heavier reliance on obsolete memory.

System	Weekly	Monthly	Quarterly
LLMs (without reasoning tokens)
Qwen3-32B	103.6 (−21.3)	83.0 (−9.7)	79.5 (−5.3)
Claude Sonnet 4.5	114.0 (−36.2)	100.2 (−34.1)	94.0 (−23.2)
Gemini 3 Pro Preview	114.2 (−42.0)	110.6 (−39.4)	101.8 (−27.9)
GPT-5.2	115.4 (−30.7)	96.6 (−21.8)	92.5 (−14.7)
LLMs (with reasoning tokens)
Qwen3-32B	98.7 (−18.8)	97.8 (−10.1)	79.5 (−11.6)
Claude Sonnet 4.5	114.0 (−31.0)	104.2 (−21.9)	83.2 (−9.8)
Gemini 3 Pro Preview	113.5 (−43.2)	111.4 (−33.2)	95.8 (−18.6)
GPT-5.2	111.5 (−27.8)	97.4 (−26.5)	91.8 (−14.2)
Long-term memory agents
A-Mem	118.0 (−9.1)	112.0 (−29.5)	118.6 (−37.9)
LangMem	173.0 (−23.0)	132.2 (−31.1)	127.4 (−43.4)
Mem-0	119.4 (−10.4)	78.6 (−21.3)	72.7 (−12.3)
MemoBase	154.4 (−23.8)	107.2 (−21.6)	93.7 (−31.9)
MemoryOS	155.2 (−20.6)	112.8 (−28.4)	99.6 (−25.0)
Nemori	159.4 (−22.8)	105.4 (−15.4)	106.8 (−26.3)

Aggregated memory presence accuracy summed across the three tasks (max 300). Parenthesized values are the FAMA reduction from the forgetting penalty.

A manual error analysis on 75 incorrect predictions tells the same story in detail. Most recommendation errors (64%) come from failing to forget an outdated preference, and every reasoning error we looked at traced back to incomplete retrieval that kept the model from putting the pieces together.

Do newer models close the gap? Since the paper, we have spot-checked four newer frontier models — Anthropic's just-released Claude Fable 5, plus GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro Preview — on the academic_researcher persona across all three horizons. The picture barely moves. Remembering still lands in the low-to-mid 20s at the quarterly scale, and recommending stays in a comparable range. Reasoning is the wall: scores sit at zero across almost every model and horizon, and Claude Fable 5 — which tops the no-reasoning group — is the only one to break zero at all, on just two of its six runs. Newer and stronger general-purpose models, in other words, do not yet solve the problem Memora is built to expose.

Model (per-task FAMA, W / M / Q)	Remembering	Recommending	Reasoning
Without reasoning tokens
Claude Fable 5	22.5 / 26.4 / 22.1	48.0 / 59.6 / 64.9	0 / 10.0 / 0
GPT-5.5	14.9 / 27.0 / 22.2	64.0 / 51.9 / 64.5	0 / 0 / 0
Claude Opus 4.7	21.9 / 29.0 / 23.0	38.7 / 48.5 / 49.3	0 / 0 / 0
Gemini 3.1 Pro Preview	16.6 / 10.8 / 17.7	44.0 / 45.1 / 47.6	0 / 0 / 0
With reasoning tokens
GPT-5.5	26.3 / 17.0 / 23.2	64.0 / 57.9 / 73.1	0 / 0 / 0
Claude Fable 5	26.5 / 27.8 / 25.2	33.3 / 43.8 / 65.9	0 / 0 / 5.0
Claude Opus 4.7	28.6 / 24.4 / 23.7	42.7 / 39.6 / 51.1	0 / 0 / 0
Gemini 3.1 Pro Preview	10.2 / 9.4 / 13.0	52.0 / 57.7 / 58.1	0 / 0 / 0

Per-task FAMA (0–100) on a single persona (academic_researcher), multi-judge. This is a spot check, not a full 10-persona evaluation, so these numbers are not directly comparable to the table above; they are meant to show whether the overall pattern holds for newer models.

Looking ahead

The bottleneck here is not context length or storage capacity. It is the lack of mechanisms that treat consolidation, mutation, and forgetting as first-class problems rather than afterthoughts to retrieval. A few directions look especially promising. The first is memory that revises rather than merely accumulates, so that an agent can invalidate something cleanly without losing the surrounding context. The second is retrieval that is aware of the task it is serving, since remembering, reasoning, and recommending each tolerate partial memory very differently and a single policy will not serve all three. The third is evaluation that moves beyond simulation, because real users introduce implicit and contradictory updates that the next generation of benchmarks will need to capture.

Conclusion

Long-term memory is becoming central to how personalized agents are built, and the way we measure it quietly shapes what gets built. Framed as recall, it rewards systems that look strong on a benchmark while still failing the thing users care about: an accurate, current picture of who they are. Memora is our attempt to make those failures visible — a benchmark that reflects how memory actually evolves over a long relationship, scored by a metric that refuses to reward reliance on information the user has moved past. Current systems, new and old alike, have a long way to go, and we hope this gives the field a clearer target to aim at. We welcome feedback, replications, and collaborations.

Resources

Paper: From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Code and data: github.com/geniesinc/Memora
Publication Venue: ACL 2026

Work by Md Nayem Uddin (Arizona State University, during an internship at Genies), Kumar Shubham (Genies), Eduardo Blanco (University of Arizona), Chitta Baral (Arizona State University), and Gengyu Wang (Genies).