Victor Del Puerto

The Lucy Syndrome: Why LLMs Forget Corrections

This is a visual overview. The full five-part argument (19,600 words, 163 findings, complete lab data) is here.


I corrected the same mistake three times in one month. The correction was documented. The file was loaded. The model read the file. It did not remember reading it.

I run a civil engineering firm in Paraguay — nine business areas, over 530 files across three specialized knowledge bases, all operated daily through Claude Code. The system I built is the equivalent of Henry’s videos in 50 First Dates: every morning, the model reads its instructions, orients itself, and functions. Often brilliantly. But it does not remember watching the video yesterday. And the mistake comes back.

This essay names that gap and proposes a fix that does not require better memory systems, larger context windows, or access to model weights.


The problem has a name

The Lucy Syndrome is the inability of an LLM-based system to retain operational corrections across sessions, resulting in the repetition of errors that have already been identified and fixed by the human operator.

It is not catastrophic forgetting — that is a training-time phenomenon where learning one thing erases another. It is not context degradation — the within-session drift that happens as the window fills. It is the gap between what the system reads at the start of each session and what it would know if it could actually remember.

Andrej Karpathy named a version of this when he called it “anterograde amnesia” at YC AI Startup School in May 2025. Alakuijala and colleagues at Google formalized a related mechanism as the “Memento effect” in their paper Memento No More. The Lucy Syndrome sits underneath both: it is the cross-session persistence failure that forces corrections into prompts in the first place, where they are then subject to the degradation those works describe.

The name comes from the film. Lucy can watch the video and function. She cannot remember watching it. The video is not broken. The memory system is.

The causal loop

The natural assumption is that “the model forgets corrections” is many different problems. It is not. It is one loop.

[Figure: The causal loop. Metacognitive friction (D) leads to false confidence (C), which forks into correction leaking (A) or holding (B), with A feeding back into D.]

Over several months of structured observation — 163 findings extracted from 17 source files spanning two LLM platforms — four categories emerged:

- (A) corrections that leak across sessions
- (B) corrections that hold
- (C) false confidence in what has been retained
- (D) metacognitive friction

These are not four independent problems. They are four positions on one causal chain: D → C → A, with the leak at A feeding back into D the next session. The loop runs because the model never generated the signal that would tell it to stop.

The critical finding from the data: the persistence difference between corrections that hold (B) and corrections that leak (A) is not gradual. It is a cliff. Binary corrections — “never do X” — held at rates above 80%. Proportional corrections — “balance X and Y” — held below 40%. The shape of the rule predicts whether it survives.
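One way to see why the cliff exists: a binary correction reduces to a predicate the next output either passes or fails, while a proportional correction only ever yields a score that still needs a judgment call. A minimal Python sketch (the rules and proxies are invented for illustration, not taken from the essay's dataset):

```python
# Hypothetical illustration: a binary correction is machine-checkable as a
# pass/fail predicate; a proportional correction ("balance X and Y") is not.

def binary_rule(output: str) -> bool:
    """'Never use the phrase "was built"': checkable with no judgment call."""
    return "was built" not in output  # fails the moment the forbidden form appears

def proportional_rule(output: str) -> float:
    """'Balance detail and brevity': only a score, never a verdict."""
    detail = output.count(",")           # crude proxies; the point is that
    words = max(1, len(output.split()))  # any threshold is a judgment call
    return detail / words                # the model must re-make each session

print(binary_rule("The bridge was built in 2009"))  # False: hard fail
print(proportional_rule("Short, dense sentence."))  # a number, not a pass/fail
```

A binary rule survives because nothing about it needs to be re-weighed; a proportional rule forces the model to re-derive the threshold every session, and that derivation is exactly what does not persist.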

Five invariants of a functional scar

If the only intervenable point in the loop is the link between correction and next action, then the fix has to live at that link. Not more information. Not more memory. Something structural.

I call this a functional scar: a correction with five properties that must be present simultaneously. Four out of five is still a correction. It will sometimes hold and sometimes leak. All five, and it holds.

[Figure: The five invariants. Binary rule, durable physical support, structural integration, non-passive technical trigger, refinable activation metric.]

| # | Invariant | What it means |
|---|-----------|---------------|
| 1 | Binary rule | Pass/fail test. No partial credit. “Never do X,” not “balance X and Y.” |
| 2 | Durable physical support | A file in a repo under version control. Not declarative memory. |
| 3 | Structural integration | The output shape changes. Not an appended disclaimer. |
| 4 | Non-passive trigger | A hook fires before generation completes. The model cannot skip it. |
| 5 | Refinable metric | Firings can be counted, thresholds tuned between sessions. |

Invariant 4 is where most memory systems break. Any correction that depends on the model noticing the reminder — reading a line in the system prompt, recalling a stored fact, consulting a retrieved document — is a passive trigger. The model has to choose to weight the reminder over the competing pull of its training distribution. A non-passive trigger is one the model cannot skip: a hook that fires on a tool call, a mandatory consultation the pipeline enforces before the response is produced.
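The distinction can be sketched in a few lines. Everything here (the rule id, the predicate, the pipeline function) is hypothetical rather than the essay's implementation; the point is only that the gate runs unconditionally:

```python
# Hypothetical sketch: the harness enforces the scar, not the model.
# model_generate can "forget" its instructions; the gate still runs.

SCARS = [
    # (rule_id, predicate every draft must satisfy)
    ("R1-no-absolute-paths", lambda draft: "/home/" not in draft),
]

def model_generate(prompt: str) -> str:
    # Stand-in for the LLM call; imagine it ignoring a passive reminder.
    return f"Edit /home/user/config, then rerun. ({prompt})"

def pipeline(prompt: str) -> str:
    draft = model_generate(prompt)
    for rule_id, passes in SCARS:   # fires on every generation,
        if not passes(draft):       # whether or not the model "remembered"
            draft = f"[blocked by {rule_id}] {draft}"
    return draft

print(pipeline("fix the config"))  # the violation is caught unconditionally
```

Nothing in `pipeline` asks the model to notice anything. The check runs whether the model weighted the reminder or not, which is the whole content of invariant 4.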

The essay tests eleven existing approaches against these five invariants: fine-tuning, MemGPT, Cursor Rules, Voyager’s skill library, NeMo Guardrails, Memento No More, and others. Each does something useful. None passes all five. The closest match is Claude Code Hooks, which passes four natively. The missing fifth — a per-rule activation counter — is not an infrastructure gap. It is a discipline gap: the hooks can carry it, but no one has used them that way.
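The missing fifth invariant is only a few lines of persistent state. This sketch assumes a hook can run arbitrary code when it fires; the file name, schema, and rule id are my invention, not part of any shipped tooling:

```python
import json
import pathlib

COUNTER_FILE = pathlib.Path("scar_activations.json")  # hypothetical location

def record_firing(rule_id: str) -> int:
    """Increment the persistent activation count for one scar; return the total."""
    counts = json.loads(COUNTER_FILE.read_text()) if COUNTER_FILE.exists() else {}
    counts[rule_id] = counts.get(rule_id, 0) + 1
    COUNTER_FILE.write_text(json.dumps(counts, indent=2))
    return counts[rule_id]

# A hook would call this each time it blocks or rewrites an output; between
# sessions the operator reads the file and tunes each rule's threshold.
record_firing("R1-no-absolute-paths")
print(record_firing("R1-no-absolute-paths"))  # 2 on a fresh run
```

Because the counts live in a file rather than in the model, they survive the session boundary, which is exactly what makes the metric refinable.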

What the lab found

[Figure: Three layers of implementation. Layer 1, accidental (partial); Layer 2, Phase 1 scars (3/5 invariants); Layer 3, Phase 2 hooks (5/5).]

The framework was not designed top-down. It was extracted from an accidental success.

A long-running project in the same system had accumulated roughly sixty-seven numbered rules over months. Rules that reached a three-surface form — numbered rule, memory entry, and workflow checklist — had a persistence rate close to 100% across sampled sessions. Rules that existed only as verbal corrections had a persistence rate close to zero. The pattern was running in plain sight before the lab existed.

Layer 1 was this accidental validation. Layer 2 was the first deliberate generalization: five explicit scars, each mapped to a documented error pattern, with an auto-detect skill to surface them by context. It worked — until the model chose to skip its own enforcement step. Model discipline cannot enforce model memory.

Layer 3 moved the enforcement outside the model. Four hooks wired into the harness fire on specific events — session start, tool calls, subagent dispatch. The model does not have to remember. The pipeline remembers on its behalf. This is where invariant 4 is satisfied: the trigger is non-passive because it lives in the infrastructure, not in the prompt.
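A hedged sketch of that shape, using the event names from the paragraph; the registry itself is hypothetical and far simpler than a real harness:

```python
# Hypothetical event registry: hooks fire on harness lifecycle events, so
# enforcement lives in infrastructure rather than in the prompt.
from collections import defaultdict

hooks = defaultdict(list)

def on(event):
    """Register a function to run whenever the harness fires `event`."""
    def register(fn):
        hooks[event].append(fn)
        return fn
    return register

@on("session_start")
def load_scars(ctx):
    ctx["scars_loaded"] = True  # orientation happens whether or not the
                                # model would have thought to ask for it

@on("tool_call")
def audit_tool(ctx):
    ctx["tool_audits"] = ctx.get("tool_audits", 0) + 1

def fire(event, ctx):
    for fn in hooks[event]:  # the model cannot skip this loop
        fn(ctx)
    return ctx

ctx = fire("session_start", {})
fire("tool_call", ctx)
print(ctx)  # {'scars_loaded': True, 'tool_audits': 1}
```

The design choice is the point: `fire` belongs to the pipeline, so the hooks run even on a session where the model would have skipped its own enforcement step.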

The founding case: during the lab itself, the model declared a batch extraction complete while having silently skipped the densest file in the dataset — twenty-one findings, 14.8% of the total corpus. One question from the operator recovered all of them. The evidence that the system works is also the evidence that it needs to.

What this does not fix

Functional scars close the loop from correction to repeated error. They do not touch what lies outside the operator's reach.

The honest scope: this is an operator-side fix for operator-reachable failures. It does not propose a model-side solution. It argues that the corrective layer has to live where the operator can reach it, and shows one way to build that.


Read further

The full essay — five parts, the complete 11-proposal test, 14 documented meta-Lucy events where the syndrome operated inside the lab studying it, and the detailed methodology — is here:

The Lucy Syndrome and AI — Full Essay (~19,600 words)

The five scars and four hooks from Phase 2 are open-sourced:

github.com/VDP89/lucy-syndrome (Apache 2.0)

The citable preprint is on Zenodo:

doi:10.5281/zenodo.19555971 (CC BY 4.0)

Two companion pieces: Where this came from (the informal origin story) and Questions and answers (objections and responses).

If you run an LLM system in production and recognize the pattern, I want to hear what your persistence numbers look like.

