PRISM-Memory: Turn Conversations Into Durable, Searchable Memory
The Problem
Most long-chat systems do not actually have memory. They have transcript search. That works until someone asks a later question that depends on a hard constraint, a changed plan, a dated fact, or a contradiction that happened months ago.
PRISM-Memory focuses on the part of the stack that usually stays hidden: the step that decides what should become memory at all.
The release model is a 7B adapter that writes short proposition-level memory records from dialogue. Those records are then indexed by a hybrid retrieval stack and used later for recall.
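For orientation, using an adapter release like this typically means loading it on top of the base model. A minimal sketch, assuming the standard transformers + peft flow; the adapter id below is a placeholder, not a published artifact name:

```python
# Minimal loading sketch. "your-org/prism-memory-adapter" is a
# placeholder id, not a published artifact name.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-7B-Instruct"
ADAPTER = "your-org/prism-memory-adapter"  # hypothetical

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)
```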
What This Release Shows
The useful result is narrow and practical:
- a 7B open model can replace the GPT-4.1 extraction step in this memory pipeline
- it scores 0.4768 on LongMemEval versus 0.4650 for the GPT-4.1-based PropMem reference
- it scores 0.4981 on LoCoMo versus 0.5360 for that same reference
This is not a claim that a 7B model beats GPT-4.1 everywhere. It is a claim that a 7B model can take over the memory-writing step and stay competitive on the held-out evaluation surface.
Why That Matters
If the memory-writing step is weak, retrieval never gets a clean chance. Important details stay buried inside noisy chat turns.
PRISM-Memory is useful when later questions depend on things like:
- a hard operational limit: 20 GitHub Actions jobs
- a durable preference: aggregated Slack alerts instead of noisy ones
- a status distinction: mTLS is not live yet, it is planned for phase two
- a dated fact: Sam took up painting in May 2023
- a refusal case: the system should answer None instead of inventing a reason for an unsupported guitar story
Those are memory problems, not style problems.
How The System Works
The released system has three pieces.
- A learned extractor based on Qwen/Qwen2.5-7B-Instruct with LoRA.
- Post-processing that cleans speaker references and resolves relative time.
- Hybrid retrieval with BM25, dense retrieval, and reranking.
The extracted propositions are the important interface. They are the memory records the retriever indexes. That keeps the memory store inspectable instead of opaque.
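The exact record schema is not spelled out above, but the interface is easy to picture. A hypothetical shape, with field names that are illustrative rather than the release's actual ones:

```python
# Hypothetical shape of one proposition-level memory record.
# Field names are illustrative; the release may use different ones.
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    proposition: str   # one short, self-contained fact
    speaker: str       # resolved speaker name, not "I"/"you"
    anchor_date: str   # normalized ISO date, not "yesterday"
    source_turn: int   # dialogue turn the fact was written from

record = MemoryRecord(
    proposition="Sam took up painting in May 2023.",
    speaker="Sam",
    anchor_date="2023-05-01",
    source_turn=42,
)
```

Keeping the store as a flat list of records like this is what makes it inspectable: you can read, diff, and delete individual memories.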
What The Training Data Actually Was
The release data is synthetic:
- 2,329 synthetic training conversations
- 584 held-out synthetic conversations
- 100,427 supervised extraction examples derived from those conversations
- 20,000 supervised examples used for the released adapter
The conversations were designed to stress real memory behaviors:
- new facts introduced in one session and used later
- updated details that should overwrite stale ones
- deleted or invalidated facts that should stop influencing answers
- mixtures of personal details, project facts, preferences, dates, and plans
The labels were GPT-4.1-derived memory-writing targets. No real user chat logs are part of the public release.
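As a rough picture of what one such supervised example could look like, here is an illustrative format; this is not the release schema, and the `op` field is an assumption about how updates that supersede stale facts might be labeled:

```python
# Illustrative supervised extraction example: one dialogue turn in,
# proposition-level memory targets out. Not the release schema; the
# "op" field is an assumption about how updates are labeled.
example = {
    "input_turn": "User: Actually we raised the CI cap, it's 20 GitHub Actions jobs now.",
    "targets": [
        {
            "proposition": "The team's GitHub Actions job limit is 20.",
            "op": "update",  # supersedes an earlier stale limit
        },
    ],
}
```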
What Worked
1. The clean supervised base mattered more than clever add-ons
The release model came from a stable 20,000-example synthetic supervision
base. That base was more valuable than trying to patch the model later with
many narrow benchmark-specific additions.
2. Hybrid retrieval was part of the result
The release is not just a model story. It is a model-plus-retrieval story. Sparse retrieval kept lexical anchors, dense retrieval recovered semantically close memories, and reranking cleaned the shortlist.
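The release does not pin down the exact fusion, but a minimal version of the pattern looks like the sketch below. The models, weights, and min-max score fusion are stand-ins, not the released stack:

```python
# Minimal hybrid retrieval sketch: BM25 + dense scores fused, then a
# cross-encoder cleans the shortlist. Models and weights are stand-ins.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

memories = [
    "The team's GitHub Actions job limit is 20.",
    "Sam took up painting in May 2023.",
    "mTLS is planned for phase two, not live yet.",
]
query = "How many CI jobs can we run at once?"

# Sparse: lexical anchors via BM25.
bm25 = BM25Okapi([m.lower().split() for m in memories])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense: cosine similarity over normalized embeddings.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
dense = encoder.encode(memories, normalize_embeddings=True) @ \
        encoder.encode(query, normalize_embeddings=True)

def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

# Fuse the two signals, shortlist, then rerank the shortlist.
fused = 0.5 * norm(sparse) + 0.5 * norm(dense)
shortlist = np.argsort(-fused)[:2]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank = reranker.predict([(query, memories[i]) for i in shortlist])
print(memories[shortlist[int(np.argmax(rerank))]])
```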
3. Explicit time anchoring helped
The model improved when the memory records carried explicit dates and the system
resolved relative references like yesterday or last weekend into normalized
anchors.
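The exact normalization rules are not published here; a toy resolver showing the idea might look like this:

```python
# Toy relative-date resolver. The release's actual rules are not
# documented here; this only shows the normalization idea.
from datetime import date, timedelta

def resolve_relative(phrase: str, message_date: date) -> date:
    phrase = phrase.lower().strip()
    if phrase == "yesterday":
        return message_date - timedelta(days=1)
    if phrase == "last weekend":
        # Anchor to the most recent Saturday before the message.
        days_back = (message_date.weekday() - 5) % 7 or 7
        return message_date - timedelta(days=days_back)
    raise ValueError(f"unhandled phrase: {phrase}")

print(resolve_relative("yesterday", date(2023, 5, 15)))     # 2023-05-14
print(resolve_relative("last weekend", date(2023, 5, 15)))  # 2023-05-13
```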
4. Turn-local extraction was enough
Feeding long recent-context windows into the extractor made it worse. The stronger pattern was local extraction at write time and cross-turn composition later through retrieval.
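In code, the write path stays deliberately per-turn; `extract_propositions` below is a stub standing in for the adapter call, not a real API:

```python
# Write-time loop: extract locally per turn, compose across turns
# later via retrieval. extract_propositions is a stub standing in
# for the fine-tuned adapter call.
def extract_propositions(turn: str) -> list[str]:
    """Return short, standalone facts written from a single turn."""
    raise NotImplementedError("adapter inference goes here")

memory_store: list[str] = []

def on_new_turn(turn: str) -> None:
    # No long recent-context window: the extractor sees only this turn.
    memory_store.extend(extract_propositions(turn))
```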
5. Adversarial precision mattered
The release model kept the best adversarial behavior among the runs considered for public release. That mattered because a memory system that answers unsupported questions confidently is worse than one that refuses.
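One common way to get that refusal behavior at answer time is a score threshold on the reranked shortlist. A hypothetical sketch; the release's actual abstention mechanism and threshold are not documented here:

```python
# Hypothetical abstention rule: if even the best reranked memory is
# weak, answer None rather than guessing. Threshold is illustrative.
ABSTAIN_THRESHOLD = 0.35  # would be tuned on held-out adversarial questions

def answer(query: str, reranked: list[tuple[str, float]]) -> str | None:
    if not reranked:
        return None
    best_memory, best_score = max(reranked, key=lambda p: p[1])
    if best_score < ABSTAIN_THRESHOLD:
        return None  # unsupported question: refuse instead of inventing
    return best_memory
```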
What Did Not Work
1. Benchmark-style formatting tricks
Trying to train the model toward benchmark-style relative-date outputs hurt more than it helped. It optimized the look of answers instead of the quality of the stored memory.
2. Narrow LoCoMo-style add-ons
Adding targeted benchmark-domain data often bought a small gain in one slice of LoCoMo and then lost balance somewhere else.
3. More noisy supervision was not automatically better
Scaling up original noisy temporal supervision amplified the wrong lesson. The model became more specialized and less balanced.
4. Overtraining past the local optimum
Several follow-on variants nearly matched the final release on one metric, but they usually gave back LongMemEval performance, adversarial precision, or both.
Why Only One Public Model Ships
The project trained multiple follow-on variants. The nearest internal runner-up
nearly tied the released model on overall LoCoMo, yet disagreed with it on 152
of the 400 held-out LoCoMo questions, so the two models were genuinely
different rather than noise-level copies of each other.
But the public release decision is simpler than the internal ablation story. One model ships because it had the best overall release profile:
- strongest LongMemEval score
- strongest adversarial behavior
- best total balance across the held-out surface
That is a better public story than shipping several near-tied variants with internal names nobody else should care about.
What Ships
The public release surface is intentionally narrow:
- one released model: PRISM-Memory
- one extraction skill
- one Space demo
- one set of release docs and benchmark artifacts
The broader frontier_memory harness stays in the repo for ongoing research,
but the release story stays focused on the memory-writing component that proved
worth shipping.