Publish PRISM-Memory adapter bundle

- README.md +34 -24
- docs/release/datasets.md +95 -84
- docs/release/extraction-examples.md +7 -7
- docs/release/extraction-skill.md +29 -72
- docs/release/memory-scenarios.md +1 -1
- docs/release/release-results.md +29 -23
- docs/release/technical-blog.md +102 -188
- results/{scenario_comparisons.json → benchmark_cases.json} +10 -5
- results/{readme_extraction_examples.json → extraction_examples.json} +3 -2
- results/{confirmed_exp15_summary.json → release_summary.json} +6 -4
README.md
CHANGED

@@ -16,8 +16,14 @@ tags:

# PRISM-Memory

PRISM-Memory is a LoRA adapter that trains `Qwen/Qwen2.5-7B-Instruct` to write
proposition-level memory from dialogue. It is a memory-writing component, not a
general chat model.

## Released model

- Model name: `PRISM-Memory 7B Adapter`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: `LoRA`

## What this release shows

@@ -35,7 +41,7 @@ extractor, not a full end-to-end GPT-4.1 system.

- It supports dated recall and clean refusal on unsupported questions.

See [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md) for
compact end-to-end examples.

## Load the adapter
@@ -55,38 +61,42 @@ base_model = AutoModelForCausalLM.from_pretrained(

```
model = PeftModel.from_pretrained(base_model, adapter_id)
```

This repo contains adapter weights only. You still need the base model.
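For completeness, here is a minimal end-to-end extraction call built on the snippet above. The canonical prompt is quoted from [docs/release/extraction-skill.md](docs/release/extraction-skill.md); the `adapter_id` value and the decoding settings are placeholders for illustration, not a documented API.

```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "..."  # the PRISM-Memory adapter repo id (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Canonical extraction prompt from docs/release/extraction-skill.md.
SYSTEM = (
    "You are a memory extraction assistant. Given a conversation turn, extract "
    "0-5 atomic, standalone facts. Each fact must be a complete sentence about "
    "a specific person, event, preference, or property. Include dates/times "
    "when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON "
    'array of strings, e.g. ["fact1", "fact2"] or [].'
)

turn = "[18 May 2023] Sam: I'm thinking about trying painting as a new hobby."
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": turn},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, matching how the released extraction examples were regenerated.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
facts = json.loads(text)  # expected: a JSON array of proposition strings
print(facts)
```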
## Training data

PRISM-Memory was trained on **synthetic** multi-session memory conversations
with **GPT-4.1-derived** memory-writing labels. The public release does not use
real user chat logs.

| Item | Count | Notes |
|---|---:|---|
| synthetic training conversations | `2,329` | multi-session conversations with inserts, updates, and deletes |
| synthetic held-out conversations | `584` | evaluation split used for held-out examples |
| supervised extraction examples | `100,427` | memory-writing labels derived from the synthetic corpus |
| released training subset | `20,000` | supervised examples used for the public adapter |

### Example training item

**Synthetic scenario**

- Domain: cloud infrastructure performance optimization
- Persona: senior cloud systems engineer at a fintech startup

**Synthetic user turn**

> Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, and keep API latency under 50ms.

**Target memory records**

- Deploy microservices on AWS Fargate
- Orchestrate containers on a Kubernetes cluster (planned)
- Primary database: PostgreSQL 13
- Use Redis as an in-memory caching layer
- Latency target: API responses under 50ms

The release makes the dataset design, counts, and example records public. It
does not bundle the full raw corpus files.

## Confirmed results
@@ -149,9 +159,9 @@ More held-out examples live in

- [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md)
- [docs/release/release-results.md](docs/release/release-results.md)
- [docs/release/technical-blog.md](docs/release/technical-blog.md)
- [results/release_summary.json](results/release_summary.json)
- [results/extraction_examples.json](results/extraction_examples.json)
- [results/benchmark_cases.json](results/benchmark_cases.json)

## Demo
docs/release/datasets.md
CHANGED

Removed by this change (the old "Auxiliary LoCoMo Datasets" section):

These files were used in ablations and targeted probes. They matter for the
research story, but they are not the main public training recipe.

| File | Examples | Intended Use | Outcome |
|---|---|---|---|
| `locomo_qa_supervised_factual.jsonl` | `512` | factual QA supervision | neutral to small benefit |
| `locomo_qa_supervised_multihop.jsonl` | `625` | multihop QA supervision | neutral to small benefit |
| `locomo_qa_supervised_temporal.jsonl` | `248` | temporal QA supervision with absolute dates | neutral to small benefit |
| `locomo_qa_supervised_inferential.jsonl` | `133` | inferential QA supervision | too small, hurt balance |
| `locomo_qa_supervised_temporal_relformat.jsonl` | `248` | temporal QA with benchmark-style relative dates | hurt |
| `locomo_sft_extra.jsonl` | `2,645` | LoCoMo-domain SFT add-on | hurt |
| `locomo_sft_extra_relformat.jsonl` | `3,178` | relative-date LoCoMo SFT add-on | hurt |
@@ -1,125 +1,136 @@

# PRISM-Memory Training Data

The PRISM-Memory release is trained on **synthetic** multi-session
conversations with **GPT-4.1-derived** memory-writing labels. No real user chat
logs are part of the public release story.

## Dataset At A Glance

| Item | Count | What it means |
|---|---:|---|
| synthetic training conversations | `2,329` | multi-session conversations used to build the training label bank |
| synthetic held-out conversations | `584` | held-out conversations used for evaluation examples and reference labels |
| total generated conversations | `2,913` | train plus eval |
| supervised extraction examples | `100,427` | memory-writing examples derived from the synthetic conversations |
| released training subset | `20,000` | supervised examples used to train the public adapter |
| agent and task families | `6` | research, data analysis, QA, coding, planning, writing |

The synthetic conversation generator deliberately creates long-horizon memory
pressure:

- facts introduced early and queried later
- updated plans and corrected details
- deleted or invalidated information
- multi-session continuity
- mixtures of preferences, project state, dates, and operational facts

## How The Data Is Built

The training pipeline has two layers.

### 1. Synthetic conversation generation

The first layer creates multi-session conversations around realistic work and
assistant scenarios. Each conversation comes with scenario metadata, a persona,
multiple sessions, and explicit memory events such as inserts, updates, and
deletes.

Across the full corpus:

- `899` conversations are short
- `1,162` are medium
- `852` are long
- `897` are insert-only
- `937` include updates
- `435` include both updates and deletes

### 2. Supervised memory-writing labels

The second layer converts those conversations into supervised extraction
examples. Each example contains (see the sketch after this list):

- retrieved memories seen so far
- recent conversation context
- the current user turn
- target memory operations that should be written from that turn

The released model learns this memory-writing step.
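A schematic of one supervised record, written as a Python literal. The four components come from the list above; the exact field names and the operation vocabulary are illustrative guesses, since the raw JSONL schema is not published.

```python
# Illustrative shape of one supervised extraction example.
# Field names are assumed, not the published schema.
example = {
    "retrieved_memories": [
        "Primary database: PostgreSQL 13",
        "Deploy microservices on AWS Fargate",
    ],
    "recent_context": [
        "assistant: How is the Fargate rollout going?",
    ],
    "current_user_turn": (
        "[12 June 2024] We switched the cache tier to Redis and now target "
        "API responses under 50ms."
    ),
    "target_memory_operations": [
        {"op": "insert", "text": "Use Redis as an in-memory caching layer"},
        {"op": "insert", "text": "Latency target: API responses under 50ms"},
    ],
}
```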
## What A Training Example Looks Like

One real synthetic scenario in the corpus is about **cloud infrastructure
performance optimization** for a low-latency trading platform.

**Synthetic scenario**

- domain: cloud infrastructure performance optimization
- persona: senior cloud systems engineer at a fintech startup
- conversation shape: two sessions, ten chunks, five later questions

**Synthetic user turn**

> Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, keep API latency under 50ms, and redesign the system with a team of five engineers.

**Target memory records**

- Deploy microservices on AWS Fargate
- Orchestrate containers on a Kubernetes cluster (planned)
- Primary database: PostgreSQL 13
- Use Redis as an in-memory caching layer
- Latency target: API responses under 50ms

Later turns in the same conversation update that memory with new load targets,
TTL settings, and rollout constraints such as zero downtime.

## What Trained The Released Model

The public adapter was trained on `20,000` supervised extraction examples
sampled from the larger `100,427`-example label bank.

In plain terms, the model saw many examples of this pattern:

1. a conversation turn mentions several durable facts
2. the target output keeps only the memory-worthy facts
3. those facts are written as short standalone memory records

That is why the release behaves like a memory writer rather than a chat model.

## Evaluation Surfaces

The released model is evaluated on two held-out surfaces.

| Benchmark | Held-out surface | What it tests |
|---|---|---|
| `LoCoMo` | held-out conversations `conv-49` and `conv-50` | factual, temporal, inferential, multi-hop, and adversarial recall |
| `LongMemEval` | held-out items across six categories | knowledge updates, multi-session recall, single-session recall, and temporal reasoning |

Both the PRISM extractor and the GPT-4.1-based PropMem reference are scored
with the same QA layer, so the public comparison isolates the extraction step.

## What Is Public Today

Public now:

- the dataset design
- corpus counts
- example training records
- held-out extraction examples
- benchmark results and category breakdowns

Not public yet:

- the full raw synthetic conversation files
- the full supervised label bank
- the auxiliary ablation corpora used for follow-on experiments

## Practical Lessons From The Data

1. The strongest release model came from the stable `20,000`-example base, not
   from benchmark-specific add-ons.
2. Explicit date anchoring helped more than benchmark-style answer formatting.
3. More narrow benchmark data did not automatically improve generalization.
4. The supervision is most useful when it teaches durable facts, updates, and
   contradictions instead of stylistic imitation.

Related docs:
docs/release/extraction-examples.md
CHANGED
@@ -1,8 +1,8 @@

# PRISM-Memory Extraction Examples

Selected held-out examples from the synthetic evaluation split.
The `GPT-4.1 reference` rows come from the supervised target memory labels.
The `PRISM-Memory 7B Adapter` rows were regenerated with greedy decoding using
the same extraction prompt family used during evaluation.

These examples are illustrations, not the benchmark itself. Use
[release-results.md](release-results.md) for the aggregate numbers.

@@ -22,7 +22,7 @@

- No caching beyond basic Docker layer caching
- Jenkins nodes have limited capacity and experience queue delays during peak commits

**PRISM-Memory**

- No Docker caching beyond basic layer caching
- Jenkins nodes have limited capacity; peak commits cause queue delays

@@ -42,7 +42,7 @@

- GitHub Actions concurrency limit: 20 concurrent jobs
- Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts

**PRISM-Memory**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Snyk Slack notifications should be aggregated and concise

@@ -63,7 +63,7 @@

- mTLS planned in phase two
- Plan to use canary deployments, traffic splitting, and basic fault injection

**PRISM-Memory**

- Sidecar CPU limits set and monitored via Prometheus
- Istio mTLS planned for phase two

@@ -72,5 +72,5 @@

## Regeneration

```bash
conda run -n pytorch_p310 python scripts/release/generate_extraction_examples.py
```
docs/release/extraction-skill.md
CHANGED
@@ -2,20 +2,20 @@

**Hook:** Turn conversations into durable, searchable memory.

This is the single extraction skill the public release keeps.

- **Released model:** `PRISM-Memory 7B Adapter`
- **Base model:** `Qwen/Qwen2.5-7B-Instruct`
- **Role:** proposition extraction for long-term conversational memory
- **Why this one:** strongest confirmed overall release profile, strongest
  adversarial behavior, and best confirmed LongMemEval score among the release
  candidates

## Skill Definition

The extractor operates turn by turn and emits `0-5` atomic memory records per
turn. Each record should be a standalone fact about a person, event,
preference, plan, or property, with dates carried into the fact when available.

Canonical prompt:

@@ -23,17 +23,14 @@ Canonical prompt:

```
You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
```

## Inference Contract

1. Format the current turn with speaker and session date.
2. Extract `0-5` propositions as a JSON array.
3. Clean speaker references so generic labels become real names when possible.
4. Resolve relative temporal expressions against the session date.
5. Prefix each stored proposition with the normalized session date before indexing.
6. Pair the extractor with the hybrid retrieval stack, not with raw transcript search alone.
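A compact sketch of that six-step contract, assuming a `generate(turn)` callable that already wraps the canonical prompt above; the helper names and the tiny date resolver are illustrative, not the released post-processing code.

```python
import json
from datetime import date, timedelta

def resolve_relative_dates(fact: str, session_date: date) -> str:
    # Step 4 (sketch): a real resolver covers far more phrasings.
    table = {
        "yesterday": (session_date - timedelta(days=1)).strftime("%d %B %Y"),
        "today": session_date.strftime("%d %B %Y"),
    }
    for phrase, absolute in table.items():
        fact = fact.replace(phrase, absolute)
    return fact

def write_memories(generate, speaker, turn_text, session_date, name_map):
    # 1. Format the current turn with speaker and session date.
    turn = f"[{session_date.strftime('%d %B %Y')}] {speaker}: {turn_text}"
    # 2. Extract 0-5 propositions as a JSON array (canonical prompt above).
    facts = json.loads(generate(turn))
    records = []
    for fact in facts:
        # 3. Clean speaker references: generic labels become real names.
        for generic, real in name_map.items():
            fact = fact.replace(generic, real)
        # 4. Resolve relative temporal expressions against the session date.
        fact = resolve_relative_dates(fact, session_date)
        # 5. Prefix with the normalized session date before indexing.
        records.append(f"[{session_date.isoformat()}] {fact}")
    # 6. Hand the records to the hybrid retrieval index, not raw transcripts.
    return records
```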
## Retrieval Setup To Keep

@@ -48,69 +45,29 @@ Best confirmed retrieval settings:

- **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
  single-session-user `k=5`
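Expressed as a lookup sketch (the exact category strings used by the harness are not shown here, so these keys are assumed spellings):

```python
# Best confirmed LongMemEval top-k per category (sketch; keys are assumed).
LME_K = {"multi-session": 20, "single-session-user": 5}
LME_DEFAULT_K = 8  # all other LongMemEval categories

def retrieval_k(category: str) -> int:
    return LME_K.get(category, LME_DEFAULT_K)
```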
## What Held Up In The Repo

1. The stable `20,000`-example supervised base mattered more than aggressive
   benchmark-specific add-ons.
2. Four epochs was enough to reach the useful local optimum for this 7B line.
3. Explicit date anchoring helped. Benchmark-style relative-date imitation did not.
4. Post-processing mattered. Speaker cleanup and relative-date resolution made
   the extracted records usable.
5. Hybrid retrieval beat simpler sparse-only or dense-only retrieval.
6. Turn-local extraction worked better than feeding long recent-context windows
   into the extractor.

## What To Avoid

1. Benchmark-specific format hacks, especially relative-date answer imitation.
2. Narrow LoCoMo-style SFT add-ons that improve one slice and hurt balance.
3. Overtraining follow-on variants that trade adversarial precision for narrow gains.
4. Treating the extractor as a standalone answer model instead of a memory writer.

## Release Rule

Public surfaces should expose exactly one extraction behavior and one released
model. Other runs remain internal research artifacts.

Related docs:
docs/release/memory-scenarios.md
CHANGED

@@ -5,7 +5,7 @@ artifacts.

- The first two use the released held-out extraction examples.
- The last two use confirmed held-out benchmark cases from
  [../../results/benchmark_cases.json](../../results/benchmark_cases.json).

The point is not just that the extractor matches GPT-4.1-style labels. The
point is that a later system can ask a concrete question and get back a useful,
docs/release/release-results.md
CHANGED
@@ -1,11 +1,11 @@

# PRISM-Memory Release Results

This page summarizes the confirmed public release metrics and the internal
comparison evidence that informed the release choice.

## Released Model

- Model: `PRISM-Memory 7B Adapter`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: LoRA
- Confirmed LoCoMo mean: `0.4981204463`

@@ -13,18 +13,18 @@ artifacts that informed the public checkpoint choice.

- QA cache hits during confirmation: `460`
- QA cache misses during confirmation: `0`

## Public Comparison

PRISM-Memory fine-tunes `Qwen/Qwen2.5-7B-Instruct` for the memory extraction
step that the PropMem reference gets from GPT-4.1.

| Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference | Read |
|---|---:|---:|---|
| LongMemEval | `0.4768` | `0.4650` | PRISM wins |
| LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays competitive |

The QA layer is held constant. This is an extraction-step comparison, not an
end-to-end GPT-4.1 replacement claim.

## LoCoMo Breakdown

@@ -47,24 +47,30 @@ not an end-to-end GPT-4.1 replacement claim.

| single-session-user | `0.9133333333` |
| temporal-reasoning | `0.4316666667` |

## Why This Model Was Released

The closest internal runner-up nearly tied the released model on overall
LoCoMo, but it lost on the broader release profile:

- lower LongMemEval score: `0.4689`
- weaker adversarial precision
- less balanced behavior across the full evaluation surface

Question-level comparison on held-out LoCoMo:

- disagreements: `152 / 400`
- questions favoring PRISM-Memory: `56`
- questions favoring the runner-up: `52`

That is close enough to be a real internal comparison, but not close enough to
justify two public models.

## Artifact Files

- [../../results/release_summary.json](../../results/release_summary.json)
- [../../results/release_model.json](../../results/release_model.json)
- [../../results/benchmark_cases.json](../../results/benchmark_cases.json)
- [../../results/internal_locomo_pairwise_diffs.json](../../results/internal_locomo_pairwise_diffs.json)

Related docs:
docs/release/technical-blog.md
CHANGED

Removed by this change (the old version's closing section on known failure modes):

## Failure Modes Still Visible In The Release Model

The selected model is good enough to release, but its errors are clear:

- it can miss specific diagnoses while retaining the broader health frame
- it can overcommit to a salient retrieved clue in inferential questions
- it can remember a coarse book description but miss the exact title

Those are not packaging issues. They are the current limits of the extraction +
retrieval stack at this model size.
@@ -1,244 +1,158 @@

# PRISM-Memory: Turn Conversations Into Durable, Searchable Memory

## The Problem

Most long-chat systems do not actually have memory. They have transcript
search. That works until someone asks a later question that depends on a hard
constraint, a changed plan, a dated fact, or a contradiction that happened
months ago.

PRISM-Memory focuses on the part of the stack that usually stays hidden: the
step that decides what should become memory at all.

The release model is a 7B adapter that writes short proposition-level memory
records from dialogue. Those records are then indexed by a hybrid retrieval
stack and used later for recall.

## What This Release Shows

The useful result is narrow and practical:

- a 7B open model can replace the GPT-4.1 extraction step in this memory pipeline
- it scores `0.4768` on LongMemEval versus `0.4650` for the GPT-4.1-based PropMem reference
- it scores `0.4981` on LoCoMo versus `0.5360` for that same reference

This is not a claim that a 7B model beats GPT-4.1 everywhere. It is a claim
that a 7B model can take over the memory-writing step and stay competitive on
the held-out evaluation surface.

## Why That Matters

If the memory-writing step is weak, retrieval never gets a clean chance.
Important details stay buried inside noisy chat turns.

PRISM-Memory is useful when later questions depend on things like:

- a hard operational limit: `20` GitHub Actions jobs
- a durable preference: aggregated Slack alerts instead of noisy ones
- a status distinction: mTLS is not live yet, it is planned for phase two
- a dated fact: Sam took up painting in May 2023
- a refusal case: the system should answer `None` instead of inventing a reason for an unsupported guitar story

Those are memory problems, not style problems.

## How The System Works

The released system has three pieces.

1. A learned extractor based on `Qwen/Qwen2.5-7B-Instruct` with LoRA.
2. Post-processing that cleans speaker references and resolves relative time.
3. Hybrid retrieval with BM25, dense retrieval, and reranking.

The extracted propositions are the important interface. They are the memory
records the retriever indexes. That keeps the memory store inspectable instead
of opaque.
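A schematic of piece 3, using `rank_bm25` and `sentence-transformers` as stand-ins. The encoder and reranker model names, the equal-weight fusion, and the shortlist size are placeholders for illustration; the release does not document its exact retrieval configuration here.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

# The proposition records are the retrieval units (the interface noted above).
records = [
    "[2023-05-18] Sam is considering trying painting as a new hobby.",
    "[2023-08-27] Evan shared a road trip to the Rocky Mountains.",
]

bm25 = BM25Okapi([r.lower().split() for r in records])  # sparse lexical anchors
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder dense model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder
record_vecs = encoder.encode(records, normalize_embeddings=True)

def retrieve(query: str, k: int = 8, shortlist: int = 50) -> list[str]:
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    dense = record_vecs @ encoder.encode(query, normalize_embeddings=True)

    # Naive min-max normalization before fusing, since BM25 and cosine
    # scores live on different scales.
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else x * 0

    fused = 0.5 * norm(sparse) + 0.5 * norm(dense)  # assumed equal weights
    top = np.argsort(-fused)[:shortlist]
    # Reranking cleans up the final shortlist.
    scores = reranker.predict([(query, records[i]) for i in top])
    best = [int(i) for _, i in sorted(zip(scores, top), reverse=True)[:k]]
    return [records[i] for i in best]
```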
## What The Training Data Actually Was

The release data is synthetic.

- `2,329` synthetic training conversations
- `584` held-out synthetic conversations
- `100,427` supervised extraction examples derived from those conversations
- `20,000` supervised examples used for the released adapter

The conversations were designed to stress real memory behaviors:

- new facts introduced in one session and used later
- updated details that should overwrite stale ones
- deleted or invalidated facts that should stop influencing answers
- mixtures of personal details, project facts, preferences, dates, and plans

The labels were GPT-4.1-derived memory-writing targets. No real user chat logs
are part of the public release.

## What Worked

### 1. The clean supervised base mattered more than clever add-ons

The release model came from a stable `20,000`-example synthetic supervision
base. That base was more valuable than trying to patch the model later with
many narrow benchmark-specific additions.

### 2. Hybrid retrieval was part of the result

The release is not just a model story. It is a model-plus-retrieval story.
Sparse retrieval kept lexical anchors, dense retrieval recovered semantically
close memories, and reranking cleaned the shortlist.

### 3. Explicit time anchoring helped

The model improved when the memory records carried explicit dates and the
system resolved relative references like `yesterday` or `last weekend` into
normalized anchors.
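A sketch of that normalization step. The "Saturday before the session date" convention for `last weekend` is an assumption made for the example, not the documented resolver behavior.

```python
from datetime import date, timedelta

def normalize_dates(record: str, session_date: date) -> str:
    """Resolve relative references against the session date (sketch)."""
    # 'last weekend' -> the Saturday before the session date (one convention).
    days_back = (session_date.weekday() - 5) % 7 or 7
    last_saturday = session_date - timedelta(days=days_back)
    replacements = {
        "yesterday": (session_date - timedelta(days=1)).isoformat(),
        "last weekend": f"the weekend of {last_saturday.isoformat()}",
    }
    for phrase, anchor in replacements.items():
        record = record.replace(phrase, anchor)
    return record

print(normalize_dates("Sam went hiking last weekend", date(2023, 8, 30)))
# -> Sam went hiking the weekend of 2023-08-26
```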
### 4. Turn-local extraction was enough

Feeding long recent-context windows into the extractor made it worse. The
stronger pattern was local extraction at write time and cross-turn composition
later through retrieval.

### 5. Adversarial precision mattered

The release model kept the best adversarial behavior among the runs considered
for public release. That mattered because a memory system that answers
unsupported questions confidently is worse than one that refuses.

## What Did Not Work

### 1. Benchmark-style formatting tricks

Trying to train the model toward benchmark-style relative-date outputs hurt
more than it helped. It optimized the look of answers instead of the quality of
the stored memory.

### 2. Narrow LoCoMo-style add-ons

Adding targeted benchmark-domain data often bought a small gain in one slice of
LoCoMo and then lost balance somewhere else.

### 3. More noisy supervision was not automatically better

Scaling up original noisy temporal supervision amplified the wrong lesson. The
model became more specialized and less balanced.

### 4. Overtraining past the local optimum

Several follow-on variants nearly matched the final release on one metric, but
they usually gave back LongMemEval performance, adversarial precision, or both.

## Why Only One Public Model Ships

The repo tried multiple follow-on variants. The nearest internal runner-up
nearly tied the released model on overall LoCoMo and disagreed on `152` of the
`400` held-out LoCoMo questions, which means the comparison was real.

But the public release decision is simpler than the internal ablation story.
One model ships because it had the best overall release profile:

- strongest LongMemEval score
- strongest adversarial behavior
- best total balance across the held-out surface

That is a better public story than shipping several near-tied variants with
internal names nobody else should care about.

## What Ships

The public release surface is intentionally narrow:

1. `PRISM-Memory`
2. one released model
3. one extraction skill
4. one Space demo
5. one set of release docs and benchmark artifacts

The broader `frontier_memory` harness stays in the repo for ongoing research,
but the release story stays focused on the memory-writing component that proved
worth shipping.
results/{scenario_comparisons.json → benchmark_cases.json}
RENAMED
@@ -18,7 +18,8 @@

    "note": "The released model keeps the dated hobby proposition and answers correctly.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "painting",
        "top_retrieval": [
          "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",

@@ -44,7 +45,8 @@

    "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "None",
        "top_retrieval": [
          "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",

@@ -67,7 +69,8 @@

    "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "serious health risk",
        "top_retrieval": [
          "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",

@@ -93,7 +96,8 @@

    "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "mountains",
        "top_retrieval": [
          "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",

@@ -119,7 +123,8 @@

    "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "a new mystery novel",
        "top_retrieval": [
          "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
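A minimal consumer for the renamed file. The per-case fields match the hunks above; the top-level container is not shown in the diff, so the sketch scans for whichever list holds the cases.

```python
import json

with open("results/benchmark_cases.json") as f:
    data = json.load(f)

# Case objects carry "note" and "systems" per the hunks above; the top-level
# key is not visible in the diff, so accept either a bare list or a wrapper.
cases = data if isinstance(data, list) else next(
    v for v in data.values() if isinstance(v, list)
)
for case in cases:
    print(case["note"])
    for system in case["systems"]:
        print(" ", system["display_name"], "->", system["prediction"])
```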
results/{readme_extraction_examples.json → extraction_examples.json}
RENAMED
@@ -1,6 +1,7 @@

{
  "dataset_name": "Held-out synthetic evaluation split",
  "model_name": "PRISM-Memory 7B Adapter",
  "base_model": "Qwen/Qwen2.5-7B-Instruct",
  "output_examples": 3,
  "examples": [
    {
results/{confirmed_exp15_summary.json → release_summary.json}
RENAMED
@@ -1,9 +1,11 @@

{
  "generated_at_unix": 1776521335,
  "results": [
    {
      "model_key": "release_model",
      "model_name": "PRISM-Memory 7B Adapter",
      "base_model": "Qwen/Qwen2.5-7B-Instruct",
      "elapsed_min": 30.0,
      "args": {
        "n_lme": 10,
        "context_window": 0,

@@ -50,4 +52,4 @@

    }
  ],
  "failures": []
}