AsadIsmail committed on
Commit 419e63b · verified · 1 Parent(s): 28e0411

Publish PRISM-Memory adapter bundle

README.md CHANGED
@@ -16,8 +16,14 @@ tags:
16
  # PRISM-Memory
17
 
18
  PRISM-Memory is a LoRA adapter that trains `Qwen/Qwen2.5-7B-Instruct` to write
19
- proposition-level memory from dialogue. It is the released `exp15_sft_qwen7b_4ep`
20
- checkpoint from the original `better_memory` project.
21
 
22
  ## What this release shows
23
 
@@ -35,7 +41,7 @@ extractor, not a full end-to-end GPT-4.1 system.
35
  - It supports dated recall and clean refusal on unsupported questions.
36
 
37
  See [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md) for
38
- the compact end-to-end examples.
39
 
40
  ## Load the adapter
41
 
@@ -55,38 +61,42 @@ base_model = AutoModelForCausalLM.from_pretrained(
55
  model = PeftModel.from_pretrained(base_model, adapter_id)
56
  ```
57
 
58
- This repo contains the adapter weights only. You still need the base model.
59
 
60
  ## Training data
61
 
62
  PRISM-Memory was trained on **synthetic** multi-session memory conversations
63
- with **GPT-4.1-derived proposition labels**. The public release does not use
64
  real user chat logs.
65
 
66
- | File | Examples | Role |
67
  |---|---:|---|
68
- | `train.jsonl` | `2,329` conversations | raw synthetic conversation source |
69
- | `eval.jsonl` | `584` conversations | held-out synthetic conversation source |
70
- | `train_sft.jsonl` | `100,427` labels | primary SFT source |
71
- | `train_sft_clean_merged.jsonl` | `20,000` labels | cleaned follow-on base matching the best run |
72
 
73
- The released checkpoint uses a `20k` sample from `train_sft.jsonl`. See
74
- [docs/release/datasets.md](docs/release/datasets.md) for the full inventory,
75
- the evaluation surfaces, and the ablations that regressed.
76
 
77
- ### Example data item
78
 
79
- **Synthetic turn**
 
80
 
81
- > yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts, aggregated and concise.
82
 
83
- **Target propositions**
84
 
85
- - GitHub Actions concurrency limit: 20 concurrent jobs
86
- - Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts
87
 
88
- The current release makes the data recipe and examples public. The full raw
89
- training JSONLs are not bundled in this model repo.
90
 
91
  ## Confirmed results
92
 
@@ -149,9 +159,9 @@ More held-out examples live in
149
  - [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md)
150
  - [docs/release/release-results.md](docs/release/release-results.md)
151
  - [docs/release/technical-blog.md](docs/release/technical-blog.md)
152
- - [results/confirmed_exp15_summary.json](results/confirmed_exp15_summary.json)
153
- - [results/readme_extraction_examples.json](results/readme_extraction_examples.json)
154
- - [results/scenario_comparisons.json](results/scenario_comparisons.json)
155
 
156
  ## Demo
157
 
 
16
  # PRISM-Memory
17
 
18
  PRISM-Memory is a LoRA adapter that trains `Qwen/Qwen2.5-7B-Instruct` to write
19
+ proposition-level memory from dialogue. It is a memory-writing component, not a
20
+ general chat model.
21
+
22
+ ## Released model
23
+
24
+ - Model name: `PRISM-Memory 7B Adapter`
25
+ - Base model: `Qwen/Qwen2.5-7B-Instruct`
26
+ - Adapter type: `LoRA`
27
 
28
  ## What this release shows
29
 
 
41
  - It supports dated recall and clean refusal on unsupported questions.
42
 
43
  See [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md) for
44
+ compact end-to-end examples.
45
 
46
  ## Load the adapter
47
 
 
61
  model = PeftModel.from_pretrained(base_model, adapter_id)
62
  ```
63
 
64
+ This repo contains adapter weights only. You still need the base model.
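+
+ A minimal end-to-end sketch of one extraction call. Assumptions: greedy
+ decoding to match the release evaluation, the canonical prompt from
+ [docs/release/extraction-skill.md](docs/release/extraction-skill.md), and an
+ example turn taken from [docs/release/extraction-examples.md](docs/release/extraction-examples.md).
+
+ ```python
+ # Hedged usage sketch; assumes `model` (base + adapter) from the snippet above.
+ import json
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+
+ SYSTEM_PROMPT = (
+     "You are a memory extraction assistant. Given a conversation turn, extract "
+     "0-5 atomic, standalone facts. Each fact must be a complete sentence about "
+     "a specific person, event, preference, or property. Include dates/times "
+     "when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON "
+     'array of strings, e.g. ["fact1", "fact2"] or [].'
+ )
+ turn = "We have 20 concurrent jobs max on GitHub Actions currently."
+
+ inputs = tokenizer.apply_chat_template(
+     [{"role": "system", "content": SYSTEM_PROMPT},
+      {"role": "user", "content": turn}],
+     add_generation_prompt=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ out = model.generate(inputs, max_new_tokens=256, do_sample=False)
+ raw = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
+
+ try:
+     facts = json.loads(raw)  # e.g. ["GitHub Actions concurrency limit: 20 concurrent jobs"]
+ except json.JSONDecodeError:
+     facts = []               # treat non-JSON output as "nothing worth storing"
+ print(facts)
+ ```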
65
 
66
  ## Training data
67
 
68
  PRISM-Memory was trained on **synthetic** multi-session memory conversations
69
+ with **GPT-4.1-derived** memory-writing labels. The public release does not use
70
  real user chat logs.
71
 
72
+ | Item | Count | Notes |
73
  |---|---:|---|
74
+ | synthetic training conversations | `2,329` | multi-session conversations with inserts, updates, and deletes |
75
+ | synthetic held-out conversations | `584` | evaluation split used for held-out examples |
76
+ | supervised extraction examples | `100,427` | memory-writing labels derived from the synthetic corpus |
77
+ | released training subset | `20,000` | supervised examples used for the public adapter |
78
 
79
+ ### Example training item
 
 
80
 
81
+ **Synthetic scenario**
82
 
83
+ - Domain: cloud infrastructure performance optimization
84
+ - Persona: senior cloud systems engineer at a fintech startup
85
 
86
+ **Synthetic user turn**
87
 
88
+ > Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, and keep API latency under 50ms.
89
 
90
+ **Target memory records**
91
+
92
+ - Deploy microservices on AWS Fargate
93
+ - Orchestrate containers on a Kubernetes cluster (planned)
94
+ - Primary database: PostgreSQL 13
95
+ - Use Redis as an in-memory caching layer
96
+ - Latency target: API responses under 50ms
97
 
98
+ The release makes the dataset design, counts, and example records public. It
99
+ does not bundle the full raw corpus files.
100
 
101
  ## Confirmed results
102
 
 
159
  - [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md)
160
  - [docs/release/release-results.md](docs/release/release-results.md)
161
  - [docs/release/technical-blog.md](docs/release/technical-blog.md)
162
+ - [results/release_summary.json](results/release_summary.json)
163
+ - [results/extraction_examples.json](results/extraction_examples.json)
164
+ - [results/benchmark_cases.json](results/benchmark_cases.json)
165
 
166
  ## Demo
167
 
docs/release/datasets.md CHANGED
@@ -1,125 +1,136 @@
1
- # PRISM-Memory Datasets
2
 
3
- This file separates the data used by the public `PRISM-Memory` release from the
4
- auxiliary datasets that were only useful for ablations.
 
5
 
6
- ## Data Provenance
7
 
8
- The release training data is **synthetic**.
9
 
10
- - The conversation source was programmatically generated to stress long-horizon
11
- memory behavior such as inserts, updates, deletes, contradiction handling,
12
- and multi-session recall.
13
- - The SFT labels were then derived from those synthetic conversations with a
14
- GPT-4.1 proposition extractor.
15
- - No real end-user chat logs are part of this public release story.
16
 
17
- ## Released Training Recipe
18
 
19
- The released checkpoint is `exp15_sft_qwen7b_4ep`.
20
 
21
- The core recipe was:
22
 
23
- 1. Start from `Qwen/Qwen2.5-7B-Instruct`.
24
- 2. Fine-tune with LoRA on a `20k` sample from `train_sft.jsonl`.
25
- 3. Evaluate on held-out `LoCoMo` and held-out `LongMemEval`.
26
 
27
- ## Source Conversations
28
 
29
- The underlying synthetic conversation source lives in the upstream
30
- `better_memory/data/output/` directory.
31
 
32
- | File | Kind | Split | Notes |
33
- |---|---|---|---|
34
- | `train.jsonl` | raw conversations | train | `2,329` synthetic multi-session conversations |
35
- | `eval.jsonl` | raw conversations | eval | `584` held-out synthetic multi-session conversations |
36
- | `metadata.json` | split metadata | all | counts by tier, agent type, and update regime |
 
37
 
38
- The source generator was built to create long-horizon memory stress cases with
39
- inserts, updates, deletes, and multi-session recall.
40
 
41
- ## Example Training Item
 
42
 
43
- This is the shape of the data the model learned from: a synthetic dialogue turn
44
- paired with proposition-style extraction targets.
 
 
45
 
46
- **Synthetic turn**
47
 
48
- > yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts, aggregated and concise.
49
 
50
- **Target propositions**
 
51
 
52
- - GitHub Actions concurrency limit: 20 concurrent jobs
53
- - Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts
54
 
55
- This example is illustrative of the release data format. The exact public
56
- release checkpoint was trained on the larger `train_sft.jsonl` corpus, not on
57
- just this slice.
58
 
59
- ## Derived SFT Data
60
 
61
- These are GPT-4.1-derived proposition labels built on top of the raw
62
- conversations.
63
 
64
- | File | Examples | Role | Release Status |
65
- |---|---|---|---|
66
- | `train_sft.jsonl` | `100,427` | primary SFT data | core release data |
67
- | `train_sft_clean_merged.jsonl` | `20,000` | cleaned resume base matching `sft4` distribution | good follow-on base |
68
- | `train_sft_temporal_resolved.jsonl` | `2,643` | temporal-fix add-on set | useful for targeted research, not the public base |
69
- | `eval_sft.jsonl` | reference | GPT-4.1 PropMem extractions on eval conversations | evaluation reference only |
70
 
71
  ## Evaluation Surfaces
72
 
73
- The released model was evaluated on two held-out surfaces:
74
 
75
- | Benchmark | Held-out Surface | Notes |
76
  |---|---|---|
77
- | `LoCoMo` | conversations `conv-49` and `conv-50` | five categories: factual, temporal, inferential, multi-hop, adversarial |
78
- | `LongMemEval` | held-out items stratified by question type | six categories, including temporal reasoning and knowledge updates |
79
 
80
- Both the GPT-4.1 extraction baseline and the released 7B extractor were scored
81
- with the same GPT-4.1 QA evaluator and the same cache-backed answer surface.
82
 
83
- ## What Is Public Right Now
84
 
85
  Public now:
86
 
87
- - dataset description and counts
 
 
88
  - held-out extraction examples
89
- - release metrics and benchmark breakdowns
90
 
91
  Not public yet:
92
 
93
- - the raw `train.jsonl` and `eval.jsonl` conversation files
94
- - the full `train_sft.jsonl` and `train_sft_clean_merged.jsonl` label files
95
- - the auxiliary LoCoMo ablation JSONLs
96
-
97
- So the current release makes the **data recipe** public, but not the full raw
98
- training corpora.
99
-
100
- ## Auxiliary LoCoMo Datasets
101
-
102
- These files were used in ablations and targeted probes. They matter for the
103
- research story, but they are not the main public training recipe.
104
-
105
- | File | Examples | Intended Use | Outcome |
106
- |---|---|---|---|
107
- | `locomo_qa_supervised_factual.jsonl` | `512` | factual QA supervision | neutral to small benefit |
108
- | `locomo_qa_supervised_multihop.jsonl` | `625` | multihop QA supervision | neutral to small benefit |
109
- | `locomo_qa_supervised_temporal.jsonl` | `248` | temporal QA supervision with absolute dates | neutral to small benefit |
110
- | `locomo_qa_supervised_inferential.jsonl` | `133` | inferential QA supervision | too small, hurt balance |
111
- | `locomo_qa_supervised_temporal_relformat.jsonl` | `248` | temporal QA with benchmark-style relative dates | hurt |
112
- | `locomo_sft_extra.jsonl` | `2,645` | LoCoMo-domain SFT add-on | hurt |
113
- | `locomo_sft_extra_relformat.jsonl` | `3,178` | relative-date LoCoMo SFT add-on | hurt |
114
 
115
- ## Practical Takeaways
116
 
117
- 1. The best 7B model came from the stable `20k` `train_sft.jsonl` base, not
118
- from aggressive benchmark-specific add-ons.
119
- 2. Training on LoCoMo-domain conversations did not help generalization.
120
- 3. Relative-date output hacks made the extractor worse.
121
- 4. More original LME data was not automatically better because noisy temporal
122
- labels compounded the anchor-loss problem.
123
 
124
  Related docs:
125
 
 
1
+ # PRISM-Memory Training Data
2
 
3
+ The PRISM-Memory release is trained on **synthetic** multi-session
4
+ conversations with **GPT-4.1-derived** memory-writing labels. No real user chat
5
+ logs are part of the public release story.
6
 
7
+ ## Dataset At A Glance
8
 
9
+ | Item | Count | What it means |
10
+ |---|---:|---|
11
+ | synthetic training conversations | `2,329` | multi-session conversations used to build the training label bank |
12
+ | synthetic held-out conversations | `584` | held-out conversations used for evaluation examples and reference labels |
13
+ | total generated conversations | `2,913` | train plus eval |
14
+ | supervised extraction examples | `100,427` | memory-writing examples derived from the synthetic conversations |
15
+ | released training subset | `20,000` | supervised examples used to train the public adapter |
16
+ | agent and task families | `6` | research, data analysis, QA, coding, planning, writing |
17
 
18
+ The synthetic conversation generator deliberately creates long-horizon memory
19
+ pressure:
20
 
21
+ - facts introduced early and queried later
22
+ - updated plans and corrected details
23
+ - deleted or invalidated information
24
+ - multi-session continuity
25
+ - mixtures of preferences, project state, dates, and operational facts
26
 
27
+ ## How The Data Is Built
28
 
29
+ The training pipeline has two layers.
30
 
31
+ ### 1. Synthetic conversation generation
 
 
32
 
33
+ The first layer creates multi-session conversations around realistic work and
34
+ assistant scenarios. Each conversation comes with scenario metadata, a persona,
35
+ multiple sessions, and explicit memory events such as inserts, updates, and
36
+ deletes.
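+
+ Sketched conversation-level metadata (the field names and the delete example
+ are hypothetical; the real generator schema is not published):
+
+ ```python
+ # Illustrative conversation record from the synthetic generator.
+ conversation = {
+     "domain": "cloud infrastructure performance optimization",
+     "persona": "senior cloud systems engineer at a fintech startup",
+     "sessions": 2,
+     "memory_events": [
+         {"op": "insert", "fact": "Primary database: PostgreSQL 13"},
+         {"op": "update", "fact": "Latency target: API responses under 50ms"},
+         {"op": "delete", "fact": "Plan to self-host the database"},
+     ],
+ }
+ ```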
37
 
38
+ Across the full corpus:
 
39
 
40
+ - `899` conversations are short
41
+ - `1,162` are medium
42
+ - `852` are long
43
+ - `897` are insert-only
44
+ - `937` include updates
45
+ - `435` include both updates and deletes
46
 
47
+ ### 2. Supervised memory-writing labels
 
48
 
49
+ The second layer converts those conversations into supervised extraction
50
+ examples. Each example contains:
51
 
52
+ - retrieved memories seen so far
53
+ - recent conversation context
54
+ - the current user turn
55
+ - target memory operations that should be written from that turn
56
 
57
+ The released model learns this memory-writing step.
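+
+ A sketch of what one supervised record could look like. The field names here
+ are illustrative, not the released schema; the raw label files are not bundled
+ with this release.
+
+ ```python
+ # Illustrative shape of one supervised memory-writing example.
+ example = {
+     "retrieved_memories": [
+         "[2024-03-02] Primary database: PostgreSQL 13",
+     ],
+     "recent_context": [
+         "assistant: How do you want to handle caching and latency?",
+     ],
+     "user_turn": "Let's use Redis for caching and keep API latency under 50ms.",
+     "target_memories": [
+         "Use Redis as an in-memory caching layer",
+         "Latency target: API responses under 50ms",
+     ],
+ }
+ ```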
58
 
59
+ ## What A Training Example Looks Like
60
 
61
+ One real synthetic scenario in the corpus is about **cloud infrastructure
62
+ performance optimization** for a low-latency trading platform.
63
 
64
+ **Synthetic scenario**
 
65
 
66
+ - domain: cloud infrastructure performance optimization
67
+ - persona: senior cloud systems engineer at a fintech startup
68
+ - conversation shape: two sessions, ten chunks, five later questions
69
 
70
+ **Synthetic user turn**
71
 
72
+ > Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, keep API latency under 50ms, and redesign the system with a team of five engineers.
 
73
 
74
+ **Target memory records**
75
+
76
+ - Deploy microservices on AWS Fargate
77
+ - Orchestrate containers on a Kubernetes cluster (planned)
78
+ - Primary database: PostgreSQL 13
79
+ - Use Redis as an in-memory caching layer
80
+ - Latency target: API responses under 50ms
81
+
82
+ Later turns in the same conversation update that memory with new load targets,
83
+ TTL settings, and rollout constraints such as zero downtime.
84
+
85
+ ## What Trained The Released Model
86
+
87
+ The public adapter was trained on `20,000` supervised extraction examples
88
+ sampled from the larger `100,427`-example label bank.
89
+
90
+ In plain terms, the model saw many examples of this pattern:
91
+
92
+ 1. a conversation turn mentions several durable facts
93
+ 2. the target output keeps only the memory-worthy facts
94
+ 3. those facts are written as short standalone memory records
95
+
96
+ That is why the release behaves like a memory writer rather than a chat model.
97
 
98
  ## Evaluation Surfaces
99
 
100
+ The released model is evaluated on two held-out surfaces.
101
 
102
+ | Benchmark | Held-out surface | What it tests |
103
  |---|---|---|
104
+ | `LoCoMo` | held-out conversations `conv-49` and `conv-50` | factual, temporal, inferential, multi-hop, and adversarial recall |
105
+ | `LongMemEval` | held-out items across six categories | knowledge updates, multi-session recall, single-session recall, and temporal reasoning |
106
 
107
+ Both the PRISM extractor and the GPT-4.1-based PropMem reference are scored
108
+ with the same QA layer, so the public comparison isolates the extraction step.
109
 
110
+ ## What Is Public Today
111
 
112
  Public now:
113
 
114
+ - the dataset design
115
+ - corpus counts
116
+ - example training records
117
  - held-out extraction examples
118
+ - benchmark results and category breakdowns
119
 
120
  Not public yet:
121
 
122
+ - the full raw synthetic conversation files
123
+ - the full supervised label bank
124
+ - the auxiliary ablation corpora used for follow-on experiments
125
 
126
+ ## Practical Lessons From The Data
127
 
128
+ 1. The strongest release model came from the stable `20,000`-example base, not
129
+ from benchmark-specific add-ons.
130
+ 2. Explicit date anchoring helped more than benchmark-style answer formatting.
131
+ 3. More narrow benchmark data did not automatically improve generalization.
132
+ 4. The supervision is most useful when it teaches durable facts, updates, and
133
+ contradictions instead of stylistic imitation.
134
 
135
  Related docs:
136
 
docs/release/extraction-examples.md CHANGED
@@ -1,8 +1,8 @@
1
  # PRISM-Memory Extraction Examples
2
 
3
- Selected held-out examples from the original Exp15 `eval_sft.jsonl` corpus.
4
- The `GPT-4.1 reference` rows come from the original SFT target propositions.
5
- The `PRISM-Memory` rows were regenerated from `exp15_sft_qwen7b_4ep` with greedy decoding using the same extraction prompt family used during evaluation.
6
 
7
  These examples are illustrations, not the benchmark itself. Use
8
  [release-results.md](release-results.md) for the aggregate numbers.
@@ -22,7 +22,7 @@ These examples are illustrations, not the benchmark itself. Use
22
  - No caching beyond basic Docker layer caching
23
  - Jenkins nodes have limited capacity and experience queue delays during peak commits
24
 
25
- **PRISM-Memory `sft4`**
26
 
27
  - No Docker caching beyond basic layer caching
28
  - Jenkins nodes have limited capacity; peak commits cause queue delays
@@ -42,7 +42,7 @@ These examples are illustrations, not the benchmark itself. Use
42
  - GitHub Actions concurrency limit: 20 concurrent jobs
43
  - Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts
44
 
45
- **PRISM-Memory `sft4`**
46
 
47
  - GitHub Actions concurrency limit: 20 concurrent jobs
48
  - Snyk Slack notifications should be aggregated and concise
@@ -63,7 +63,7 @@ These examples are illustrations, not the benchmark itself. Use
63
  - mTLS planned in phase two
64
  - Plan to use canary deployments, traffic splitting, and basic fault injection
65
 
66
- **PRISM-Memory `sft4`**
67
 
68
  - Sidecar CPU limits set and monitored via Prometheus
69
  - Istio mTLS planned for phase two
@@ -72,5 +72,5 @@ These examples are illustrations, not the benchmark itself. Use
72
  ## Regeneration
73
 
74
  ```bash
75
- conda run -n pytorch_p310 python scripts/release/generate_readme_examples.py
76
  ```
 
1
  # PRISM-Memory Extraction Examples
2
 
3
+ These are selected held-out examples from the synthetic evaluation split.
4
+ The `GPT-4.1 reference` rows come from the supervised target memory labels.
5
+ The `PRISM-Memory 7B Adapter` rows were regenerated with greedy decoding, using the same extraction prompt family as the evaluation runs.
6
 
7
  These examples are illustrations, not the benchmark itself. Use
8
  [release-results.md](release-results.md) for the aggregate numbers.
 
22
  - No caching beyond basic Docker layer caching
23
  - Jenkins nodes have limited capacity and experience queue delays during peak commits
24
 
25
+ **PRISM-Memory**
26
 
27
  - No Docker caching beyond basic layer caching
28
  - Jenkins nodes have limited capacity; peak commits cause queue delays
 
42
  - GitHub Actions concurrency limit: 20 concurrent jobs
43
  - Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts
44
 
45
+ **PRISM-Memory**
46
 
47
  - GitHub Actions concurrency limit: 20 concurrent jobs
48
  - Snyk Slack notifications should be aggregated and concise
 
63
  - mTLS planned in phase two
64
  - Plan to use canary deployments, traffic splitting, and basic fault injection
65
 
66
+ **PRISM-Memory**
67
 
68
  - Sidecar CPU limits set and monitored via Prometheus
69
  - Istio mTLS planned for phase two
 
72
  ## Regeneration
73
 
74
  ```bash
75
+ conda run -n pytorch_p310 python scripts/release/generate_extraction_examples.py
76
  ```
docs/release/extraction-skill.md CHANGED
@@ -2,20 +2,20 @@
2
 
3
  **Hook:** Turn conversations into durable, searchable memory.
4
 
5
- This is the single extraction skill to keep from the `better_memory` work.
6
- Public release should point to one checkpoint and one extraction behavior:
7
 
8
- - **Model:** `exp15_sft_qwen7b_4ep`
9
  - **Base model:** `Qwen/Qwen2.5-7B-Instruct`
10
  - **Role:** proposition extraction for long-term conversational memory
11
- - **Why this one:** best confirmed total profile, best adversarial behavior, and
12
- best LongMemEval score
 
13
 
14
  ## Skill Definition
15
 
16
- The extractor operates turn by turn and emits `0-5` atomic propositions per
17
- turn. Each proposition should be a standalone fact about a person, event,
18
- preference, or property, with dates carried into the fact when available.
19
 
20
  Canonical prompt:
21
 
@@ -23,17 +23,14 @@ Canonical prompt:
23
  You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
24
  ```
25
 
26
- This prompt comes from `experiment15_learned_extraction.py` in the upstream
27
- `better_memory` workspace.
28
-
29
  ## Inference Contract
30
 
31
- 1. Format the turn with speaker and session date.
32
  2. Extract `0-5` propositions as a JSON array.
33
- 3. Clean speaker references so generic labels become real names.
34
  4. Resolve relative temporal expressions against the session date.
35
- 5. Prefix each proposition with the normalized session date before indexing.
36
- 6. Retrieve with the PRISM hybrid stack, not with the extractor alone.
37
 
38
  ## Retrieval Setup To Keep
39
 
@@ -48,69 +45,29 @@ Best confirmed retrieval settings:
48
  - **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
49
  single-session-user `k=5`
50
 
51
- ## What Worked
52
-
53
- 1. **The original 20k base mattered.**
54
- `sft4` came from the exact `train_sft_clean_merged.jsonl` base distribution.
55
- Runs that changed the base subset regressed.
56
-
57
- 2. **Four epochs was the sweet spot.**
58
- `sft4` is the local optimum the repo could actually reproduce.
59
-
60
- 3. **Absolute date anchoring helped.**
61
- Temporal repairs worked when the model saw explicit, normalized dates rather
62
- than benchmark-specific relative phrasing.
63
-
64
- 4. **Post-processing mattered.**
65
- Speaker cleanup plus relative-date resolution was necessary to turn raw
66
- outputs into stable memory records.
67
-
68
- 5. **Hybrid retrieval beat simpler retrieval.**
69
- BM25 + dense + reranking consistently outperformed BM25-only or dense-only
70
- approaches.
71
-
72
- 6. **Turn-local extraction was enough.**
73
- The model performed better without feeding long recent-context windows into
74
- the extractor.
75
-
76
- 7. **Multihop supervision preserved inferential behavior.**
77
- When temporal data was added, multihop QA was the only extra signal that
78
- reliably helped preserve inferential performance.
79
 
80
- ## What Did Not Work
81
 
82
- 1. **Relative-date training.**
83
- Training the extractor to emit benchmark-style relative dates hurt temporal
84
- performance instead of helping it.
85
 
86
- 2. **LoCoMo-domain SFT data.**
87
- Adding LoCoMo training conversations consistently regressed the model.
88
-
89
- 3. **More than 20k original LME examples.**
90
- Scaling the original noisy temporal labels to 50k amplified anchor loss and
91
- caused major regression.
92
-
93
- 4. **Small clean bases.**
94
- 5k-base follow-on runs forgot too much and collapsed inferential behavior.
95
-
96
- 5. **Heavy QA multipliers.**
97
- High temporal or QA multipliers damaged adversarial precision and LongMemEval.
98
-
99
- 6. **High learning rates on follow-on QA runs.**
100
- Aggressive fine-tuning degraded the traits that made `sft4` good.
101
-
102
- 7. **Trying to push past the local optimum.**
103
- Most post-`sft4` training traded away adversarial performance for narrower
104
- gains.
105
 
106
  ## Release Rule
107
 
108
- Release only this extraction skill and only this checkpoint publicly:
109
-
110
- - `exp15_sft_qwen7b_4ep`
111
-
112
- Treat all other checkpoints as internal ablations and learning artifacts, not as
113
- parallel public releases.
114
 
115
  Related docs:
116
 
 
2
 
3
  **Hook:** Turn conversations into durable, searchable memory.
4
 
5
+ This is the single extraction skill the public release keeps.
 
6
 
7
+ - **Released model:** `PRISM-Memory 7B Adapter`
8
  - **Base model:** `Qwen/Qwen2.5-7B-Instruct`
9
  - **Role:** proposition extraction for long-term conversational memory
10
+ - **Why this one:** strongest confirmed overall release profile, strongest
11
+ adversarial behavior, and best confirmed LongMemEval score among the release
12
+ candidates
13
 
14
  ## Skill Definition
15
 
16
+ The extractor operates turn by turn and emits `0-5` atomic memory records per
17
+ turn. Each record should be a standalone fact about a person, event,
18
+ preference, plan, or property, with dates carried into the fact when available.
19
 
20
  Canonical prompt:
21
 
 
23
  You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
24
  ```
25
 
 
 
 
26
  ## Inference Contract
27
 
28
+ 1. Format the current turn with speaker and session date.
29
  2. Extract `0-5` propositions as a JSON array.
30
+ 3. Clean speaker references so generic labels become real names when possible.
31
  4. Resolve relative temporal expressions against the session date.
32
+ 5. Prefix each stored proposition with the normalized session date before indexing.
33
+ 6. Pair the extractor with the hybrid retrieval stack, not with raw transcript search alone.
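+
+ A minimal sketch of steps 1, 2, 4, and 5 above (step 3's speaker cleanup and
+ step 6's retrieval stack are out of scope here; the helper names and the
+ two-phrase date table are illustrative, not the released implementation):
+
+ ```python
+ import json
+ import re
+ from datetime import date, timedelta
+
+ def format_turn(speaker: str, session_date: date, text: str) -> str:
+     # Step 1: the extractor sees the speaker and the session date.
+     return f"{speaker}: [{session_date.isoformat()}] {text}"
+
+ def parse_propositions(raw_output: str) -> list[str]:
+     # Step 2: the contract demands a JSON array; fall back to [] otherwise.
+     try:
+         parsed = json.loads(raw_output)
+         return parsed if isinstance(parsed, list) else []
+     except json.JSONDecodeError:
+         return []
+
+ def resolve_relative_dates(fact: str, session_date: date) -> str:
+     # Step 4 (simplified): resolve a few relative phrases against the
+     # session date. The released pipeline covers many more cases.
+     table = {
+         "yesterday": (session_date - timedelta(days=1)).isoformat(),
+         "today": session_date.isoformat(),
+     }
+     for phrase, absolute in table.items():
+         fact = re.sub(rf"\b{phrase}\b", absolute, fact, flags=re.IGNORECASE)
+     return fact
+
+ def to_memory_records(raw_output: str, session_date: date) -> list[str]:
+     # Step 5: prefix each stored proposition with the normalized session date.
+     return [
+         f"[{session_date.isoformat()}] {resolve_relative_dates(p, session_date)}"
+         for p in parse_propositions(raw_output)
+     ]
+ ```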
34
 
35
  ## Retrieval Setup To Keep
36
 
 
45
  - **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
46
  single-session-user `k=5`
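+
+ As a config sketch (the dictionary layout is illustrative; the `k` values for
+ LongMemEval are the confirmed settings above):
+
+ ```python
+ # Illustrative layout for the per-category retrieval depth.
+ LONGMEMEVAL_TOP_K = {
+     "multi-session": 20,
+     "single-session-user": 5,
+     "default": 8,  # all other LongMemEval categories
+ }
+
+ def top_k(category: str) -> int:
+     return LONGMEMEVAL_TOP_K.get(category, LONGMEMEVAL_TOP_K["default"])
+ ```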
47
 
48
+ ## What Held Up In The Repo
49
 
50
+ 1. The stable `20,000`-example supervised base mattered more than aggressive
51
+ benchmark-specific add-ons.
52
+ 2. Four epochs was enough to reach the useful local optimum for this 7B line.
53
+ 3. Explicit date anchoring helped. Benchmark-style relative-date imitation did not.
54
+ 4. Post-processing mattered. Speaker cleanup and relative-date resolution made
55
+ the extracted records usable.
56
+ 5. Hybrid retrieval beat simpler sparse-only or dense-only retrieval.
57
+ 6. Turn-local extraction worked better than feeding long recent-context windows
58
+ into the extractor.
59
 
60
+ ## What To Avoid
 
 
61
 
62
+ 1. Benchmark-specific format hacks, especially relative-date answer imitation.
63
+ 2. Narrow LoCoMo-style SFT add-ons that improve one slice and hurt balance.
64
+ 3. Overtraining follow-on variants that trade adversarial precision for narrow gains.
65
+ 4. Treating the extractor as a standalone answer model instead of a memory writer.
66
 
67
  ## Release Rule
68
 
69
+ Public surfaces should expose exactly one extraction behavior and one released
70
+ model. Other runs remain internal research artifacts.
71
 
72
  Related docs:
73
 
docs/release/memory-scenarios.md CHANGED
@@ -5,7 +5,7 @@ artifacts.
5
 
6
  - The first two use the released held-out extraction examples.
7
  - The last two use confirmed held-out benchmark cases from
8
- [../../results/scenario_comparisons.json](../../results/scenario_comparisons.json).
9
 
10
  The point is not just that the extractor matches GPT-4.1-style labels. The
11
  point is that a later system can ask a concrete question and get back a useful,
 
5
 
6
  - The first two use the released held-out extraction examples.
7
  - The last two use confirmed held-out benchmark cases from
8
+ [../../results/benchmark_cases.json](../../results/benchmark_cases.json).
9
 
10
  The point is not just that the extractor matches GPT-4.1-style labels. The
11
  point is that a later system can ask a concrete question and get back a useful,
docs/release/release-results.md CHANGED
@@ -1,11 +1,11 @@
1
  # PRISM-Memory Release Results
2
 
3
- This file summarizes the confirmed release metrics and the internal comparison
4
- artifacts that informed the public checkpoint choice.
5
 
6
- ## Released Checkpoint
7
 
8
- - Checkpoint: `exp15_sft_qwen7b_4ep`
9
  - Base model: `Qwen/Qwen2.5-7B-Instruct`
10
  - Adapter type: LoRA
11
  - Confirmed LoCoMo mean: `0.4981204463`
@@ -13,18 +13,18 @@ artifacts that informed the public checkpoint choice.
13
  - QA cache hits during confirmation: `460`
14
  - QA cache misses during confirmation: `0`
15
 
16
- ## Baseline Context
17
 
18
- `PRISM-Memory` fine-tunes `Qwen/Qwen2.5-7B-Instruct` for the proposition
19
- extraction step that PropMem normally gets from GPT-4.1. On the confirmed run:
20
 
21
- | Benchmark | PRISM-Memory `sft4` | GPT-4.1-based PropMem reference | Read |
22
  |---|---:|---:|---|
23
  | LongMemEval | `0.4768` | `0.4650` | PRISM wins |
24
- | LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays close |
25
 
26
- The QA layer is held constant. This is an extractor-vs-extractor comparison,
27
- not an end-to-end GPT-4.1 replacement claim.
28
 
29
  ## LoCoMo Breakdown
30
 
@@ -47,24 +47,30 @@ not an end-to-end GPT-4.1 replacement claim.
47
  | single-session-user | `0.9133333333` |
48
  | temporal-reasoning | `0.4316666667` |
49
 
50
- ## Internal Comparison That Informed The Release
51
 
52
- The closest runner-up was `inferential_from_temporal_heavy`.
 
53
 
54
- - Confirmed LoCoMo mean: `0.4975893989`
55
- - Confirmed LongMemEval mean: `0.4688992148`
56
- - Pairwise LoCoMo disagreements vs `sft4`: `152 / 400`
57
- - Question-level wins: `56` for `sft4`, `52` for the runner-up
58
 
59
- The release decision stayed with `sft4` because it preserved the strongest
60
- LongMemEval score and the strongest adversarial behavior.
61
 
62
  ## Artifact Files
63
 
64
- - [../../results/confirmed_exp15_summary.json](../../results/confirmed_exp15_summary.json)
65
- - [../../results/scenario_comparisons.json](../../results/scenario_comparisons.json)
66
- - [../../results/locomo_pairwise_question_diffs.json](../../results/locomo_pairwise_question_diffs.json)
67
- - [../../results/sft4.json](../../results/sft4.json)
68
 
69
  Related docs:
70
 
 
1
  # PRISM-Memory Release Results
2
 
3
+ This page summarizes the confirmed public release metrics and the internal
4
+ comparison evidence that informed the release choice.
5
 
6
+ ## Released Model
7
 
8
+ - Model: `PRISM-Memory 7B Adapter`
9
  - Base model: `Qwen/Qwen2.5-7B-Instruct`
10
  - Adapter type: LoRA
11
  - Confirmed LoCoMo mean: `0.4981204463`
 
13
  - QA cache hits during confirmation: `460`
14
  - QA cache misses during confirmation: `0`
15
 
16
+ ## Public Comparison
17
 
18
+ PRISM-Memory fine-tunes `Qwen/Qwen2.5-7B-Instruct` for the memory extraction
19
+ step that the PropMem reference gets from GPT-4.1.
20
 
21
+ | Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference | Read |
22
  |---|---:|---:|---|
23
  | LongMemEval | `0.4768` | `0.4650` | PRISM wins |
24
+ | LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays competitive |
25
 
26
+ The QA layer is held constant. This is an extraction-step comparison, not an
27
+ end-to-end GPT-4.1 replacement claim.
28
 
29
  ## LoCoMo Breakdown
30
 
 
47
  | single-session-user | `0.9133333333` |
48
  | temporal-reasoning | `0.4316666667` |
49
 
50
+ ## Why This Model Was Released
51
 
52
+ The closest internal runner-up nearly tied the released model on overall
53
+ LoCoMo, but it lost on the broader release profile:
54
 
55
+ - lower LongMemEval score: `0.4689`
56
+ - weaker adversarial precision
57
+ - less balanced behavior across the full evaluation surface
 
58
 
59
+ Question-level comparison on held-out LoCoMo:
60
+
61
+ - disagreements: `152 / 400`
62
+ - questions favoring PRISM-Memory: `56`
63
+ - questions favoring the runner-up: `52`
64
+
65
+ That is close enough to be a real internal comparison, but not close enough to
66
+ justify two public models.
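+
+ A sketch of how such a question-level comparison can be computed. The record
+ shape below is hypothetical; the released artifact is
+ [../../results/internal_locomo_pairwise_diffs.json](../../results/internal_locomo_pairwise_diffs.json),
+ and its exact schema is not documented here.
+
+ ```python
+ import json
+
+ # Assumed shape: [{"qid": ..., "a_answer": ..., "b_answer": ...,
+ #                  "a_correct": ..., "b_correct": ...}, ...]
+ with open("results/internal_locomo_pairwise_diffs.json") as f:
+     rows = json.load(f)
+
+ disagree = [r for r in rows if r["a_answer"] != r["b_answer"]]
+ a_wins = sum(1 for r in rows if r["a_correct"] and not r["b_correct"])
+ b_wins = sum(1 for r in rows if r["b_correct"] and not r["a_correct"])
+ print(f"answer-level disagreements: {len(disagree)} / {len(rows)}")
+ print(f"favoring the release model: {a_wins}, favoring the runner-up: {b_wins}")
+ ```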
67
 
68
  ## Artifact Files
69
 
70
+ - [../../results/release_summary.json](../../results/release_summary.json)
71
+ - [../../results/release_model.json](../../results/release_model.json)
72
+ - [../../results/benchmark_cases.json](../../results/benchmark_cases.json)
73
+ - [../../results/internal_locomo_pairwise_diffs.json](../../results/internal_locomo_pairwise_diffs.json)
74
 
75
  Related docs:
76
 
docs/release/technical-blog.md CHANGED
@@ -1,244 +1,158 @@
1
  # PRISM-Memory: Turn Conversations Into Durable, Searchable Memory
2
 
3
- ## Summary
4
 
5
- `PRISM-Memory` is a long-term conversational memory system that converts raw
6
- dialogue into proposition-level memory and retrieves it with an inspectable
7
- hybrid stack.
 
8
 
9
- The point is not that a 7B model chats well. The point is that a 7B open model
10
- can write memory records that another system can actually use later.
11
 
12
- This package now ships one public extraction skill and one public checkpoint:
 
 
13
 
14
- - **Checkpoint:** `exp15_sft_qwen7b_4ep`
15
- - **Confirmed LoCoMo mean:** `0.4981204463`
16
- - **Confirmed LongMemEval mean:** `0.4767574431`
17
- - **QA cache misses during confirmation:** `0`
18
 
19
- The public hook is simple:
20
 
21
- **PRISM-Memory turns conversations into durable, searchable memory.**
 
 
22
 
23
- ## Why This Is Useful In Practice
 
 
24
 
25
- A memory writer is only interesting if a later system can ask a pointed
26
- question and get back a useful answer without rereading the original chat. The
27
- public release artifacts already show that pattern.
28
 
29
- ### 1. Keep hard limits and preferences available for later work
 
30
 
31
- The extractor can turn a single conversational turn into stable memory like:
32
 
33
- - GitHub Actions concurrency limit: `20` concurrent jobs
34
- - Snyk Slack notifications should be aggregated and concise
 
 
 
35
 
36
- That means a later system can answer:
37
 
38
- > What is our GitHub Actions concurrency limit, and how should Snyk alerts look?
39
 
40
- with:
41
 
42
- > `20` concurrent jobs. Alerts should be aggregated and concise.
43
-
44
- That is a real product use case. Teams mention constraints and preferences once,
45
- then expect downstream tools and agents to remember them.
46
-
47
- ### 2. Keep current state separate from the roadmap
48
-
49
- The released extractor can also preserve the difference between what is true
50
- now and what is only planned:
51
-
52
- - sidecar CPU limits are already set and monitored
53
- - mTLS is planned for phase two
54
- - rollout strategy is canary deployments plus traffic splitting
55
-
56
- So a later question like:
57
-
58
- > Did we already enable mTLS, and what rollout strategy are we planning?
59
-
60
- can be answered without confusing the present state with the future plan.
61
-
62
- This is a core memory problem, not a style problem. Chat history tends to blur
63
- these states together.
64
-
65
- ### 3. Answer dated questions with dated evidence
66
-
67
- One confirmed held-out benchmark case asks:
68
-
69
- > Which hobby did Sam take up in May 2023?
70
-
71
- The retrieved memory contains explicit dated propositions about Sam trying
72
- painting in May 2023, and the released system answers:
73
-
74
- > painting
75
-
76
- That matters because the useful behavior is not “remember that hobbies were
77
- discussed.” The useful behavior is “recover the dated fact that actually
78
- answers the later question.”
79
-
80
- There is a fourth practical behavior that matters too: refusal. On the held-out
81
- adversarial guitar case, the released model returns `None` instead of inventing
82
- a reason for an unsupported premise. That is also part of being useful.
83
-
84
- For the compact scenario version of this story, see
85
- [memory-scenarios.md](memory-scenarios.md).
86
-
87
- ## What The Repo Actually Contributed
88
-
89
- The core contribution is not another opaque memory model. The repo showed that a
90
- 7B open model can replace GPT-4-class extraction with a transparent memory
91
- pipeline that is still competitive on long-horizon dialogue benchmarks.
92
-
93
- The released system has three pieces:
94
-
95
- 1. A learned proposition extractor (`Qwen2.5-7B-Instruct` + LoRA).
96
  2. Post-processing that cleans speaker references and resolves relative time.
97
- 3. Hybrid retrieval (`BM25 + dense retrieval + cross-encoder reranking`).
98
 
99
- The important part is the interface between them: extracted propositions are not
100
- just text snippets. They are the memory records that the retriever indexes.
 
101
 
102
- ## The Single Skill To Keep
103
 
104
- After reviewing the repo history, there should be one canonical extraction skill
105
- and one checkpoint publicly exposed:
106
 
107
- - **Skill:** proposition-level memory extraction
108
- - **Model:** `exp15_sft_qwen7b_4ep`
109
- - **Prompt contract:** extract `0-5` atomic standalone facts, include dates when
110
- present, skip filler and questions, output JSON only
111
 
112
- That skill is documented directly in
113
- [extraction-skill.md](extraction-skill.md).
114
 
115
- ## What Worked
116
-
117
- ### 1. The best model came from the stable 20k base, not from aggressive add-ons
118
-
119
- The repo repeatedly showed that `sft4` was the stable optimum for the 7B line.
120
- The same 20k clean base distribution was critical. Changing the base subset,
121
- shrinking it, or overextending it consistently hurt.
122
-
123
- Why that matters:
124
-
125
- - the model needed the exact data distribution that produced `sft4`
126
- - 4 epochs was enough to reach the useful local optimum
127
- - follow-on runs often traded away robustness for narrower gains
128
 
129
- ### 2. Proposition memory plus hybrid retrieval is the real winning combination
 
130
 
131
- The strongest system was not latent-only memory and not raw-turn retrieval. The
132
- best path was proposition extraction plus `PRISMv3Rerank`.
133
-
134
- That means:
135
-
136
- - sparse retrieval captured lexical anchors
137
- - dense retrieval recovered semantically close memories
138
- - reranking cleaned up the final shortlist
139
-
140
- This combination is what made the memory store usable.
141
-
142
- ### 3. Absolute date anchoring and temporal cleanup helped
143
 
144
- Temporal improvement came from making the memory records cleaner, not from
145
- teaching the model to imitate LoCoMo’s relative-answer style.
146
 
147
- What helped:
 
 
148
 
149
- - fixed temporal examples with explicit date resolution
150
- - normalizing session dates
151
- - post-processing relative references like `yesterday` or `last weekend`
152
 
153
- What did **not** help:
 
 
154
 
155
- - training the model to emit relative benchmark-style dates
156
 
157
- ### 4. Turn-local extraction was better than passing long context windows
 
 
158
 
159
- The repo tested extraction with added session context and it regressed. The
160
- model worked best when extracting from the current turn and letting the memory
161
- system handle cross-turn reasoning later.
162
 
163
- That is an important design lesson: keep extraction local, let retrieval do the
164
- composition.
 
165
 
166
- ### 5. Adversarial precision was the strongest reason to keep `sft4`
167
 
168
- Many later variants found small gains in temporal or inferential categories, but
169
- they usually damaged adversarial behavior. `sft4` held the best confirmed
170
- adversarial score and the best total LongMemEval score, which is why it is the
171
- only checkpoint worth releasing publicly.
172
 
173
  ## What Did Not Work
174
 
175
- ### 1. Benchmark-specific format hacks
176
 
177
- Relative-date training was a dead end. It optimized for the look of a benchmark
178
- answer rather than for general extraction quality.
 
179
 
180
- ### 2. LoCoMo-domain training data
181
 
182
- Adding LoCoMo training conversations consistently regressed performance. The
183
- best generalization signal remained the cleaned LME-style base data.
184
 
185
- ### 3. More original LME data was not better
186
 
187
- Scaling from 20k to 50k original LME examples amplified the temporal-anchor
188
- problem. More noisy temporal labels simply taught the wrong lesson more often.
189
 
190
- ### 4. Small follow-on bases and heavy QA multipliers
191
 
192
- Runs built on 5k clean bases or extreme QA multipliers tended to forget useful
193
- behavior. They often improved a narrow category while hurting adversarial
194
- precision, inferential balance, or LongMemEval.
195
 
196
- ### 5. Assuming the best checkpoint was easy to improve
197
 
198
- The repo’s most expensive lesson was that `sft4` was already a local optimum for
199
- the 7B line. Most additional training made the model more specialized and less
200
- balanced.
201
 
202
- ## Internal Comparisons That Informed The Release
 
203
 
204
- The internal ablation story still matters, even though the public package keeps
205
- only `sft4`.
 
206
 
207
- Confirmed internal facts:
208
-
209
- - `inferential_from_temporal_heavy` nearly tied `sft4` on overall LoCoMo
210
- - it recovered some inferential and temporal misses
211
- - it still lost on LongMemEval and adversarial precision
212
-
213
- Question-level comparison on held-out LoCoMo:
214
-
215
- - `400` questions replayed
216
- - `152` answer-level disagreements
217
- - `56` questions favored `sft4`
218
- - `52` questions favored the runner-up
219
-
220
- That is a useful research result, but not a reason to ship two public models.
221
- The right release decision is one clean skill, one clean checkpoint.
222
-
223
- ## Failure Modes Still Visible In The Release Model
224
-
225
- The selected model is good enough to release, but its errors are clear:
226
-
227
- - it can miss specific diagnoses while retaining the broader health frame
228
- - it can overcommit to a salient retrieved clue in inferential questions
229
- - it can remember a coarse book description but miss the exact title
230
-
231
- Those are not packaging issues. They are the current limits of the extraction +
232
- retrieval stack at this model size.
233
 
234
  ## What Ships
235
 
236
- Public release surface:
237
 
238
  1. `PRISM-Memory`
239
- 2. the single extraction skill in [extraction-skill.md](extraction-skill.md)
240
- 3. the best confirmed checkpoint `exp15_sft_qwen7b_4ep`
241
- 4. the best-only Space demo in [../../space/](../../space/)
242
-
243
- Internal analysis artifacts can stay for provenance, but they should not be
244
- positioned as parallel public releases.
 
 
 
1
  # PRISM-Memory: Turn Conversations Into Durable, Searchable Memory
2
 
3
+ ## The Problem
4
 
5
+ Most long-chat systems do not actually have memory. They have transcript
6
+ search. That works until someone asks a later question that depends on a hard
7
+ constraint, a changed plan, a dated fact, or a contradiction that happened
8
+ months ago.
9
 
10
+ PRISM-Memory focuses on the part of the stack that usually stays hidden: the
11
+ step that decides what should become memory at all.
12
 
13
+ The release model is a 7B adapter that writes short proposition-level memory
14
+ records from dialogue. Those records are then indexed by a hybrid retrieval
15
+ stack and used later for recall.
16
 
17
+ ## What This Release Shows
18
 
19
+ The useful result is narrow and practical:
20
 
21
+ - a 7B open model can replace the GPT-4.1 extraction step in this memory pipeline
22
+ - it scores `0.4768` on LongMemEval versus `0.4650` for the GPT-4.1-based PropMem reference
23
+ - it scores `0.4981` on LoCoMo versus `0.5360` for that same reference
24
 
25
+ This is not a claim that a 7B model beats GPT-4.1 everywhere. It is a claim
26
+ that a 7B model can take over the memory-writing step and stay competitive on
27
+ the held-out evaluation surface.
28
 
29
+ ## Why That Matters
 
 
30
 
31
+ If the memory-writing step is weak, retrieval never gets a clean chance.
32
+ Important details stay buried inside noisy chat turns.
33
 
34
+ PRISM-Memory is useful when later questions depend on things like:
35
 
36
+ - a hard operational limit: `20` GitHub Actions jobs
37
+ - a durable preference: aggregated Slack alerts instead of noisy ones
38
+ - a status distinction: mTLS is not live yet, it is planned for phase two
39
+ - a dated fact: Sam took up painting in May 2023
40
+ - a refusal case: the system should answer `None` instead of inventing a reason for an unsupported guitar story
41
 
42
+ Those are memory problems, not style problems.
43
 
44
+ ## How The System Works
45
 
46
+ The released system has three pieces.
47
 
48
+ 1. A learned extractor based on `Qwen/Qwen2.5-7B-Instruct` with LoRA.
49
  2. Post-processing that cleans speaker references and resolves relative time.
50
+ 3. Hybrid retrieval with BM25, dense retrieval, and reranking.
51
 
52
+ The extracted propositions are the important interface. They are the memory
53
+ records the retriever indexes. That keeps the memory store inspectable instead
54
+ of opaque.
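+
+ In code terms, the write path looks roughly like the sketch below. The
+ function and class names are illustrative, not the released API; the toy
+ `hybrid_search` stands in for the real BM25 + dense + rerank stack.
+
+ ```python
+ from dataclasses import dataclass, field
+ from typing import Callable
+
+ @dataclass
+ class MemoryStore:
+     """Toy stand-in for the hybrid index (piece 3)."""
+     records: list[str] = field(default_factory=list)
+
+     def index(self, new_records: list[str]) -> None:
+         self.records.extend(new_records)
+
+     def hybrid_search(self, query: str, k: int = 8) -> list[str]:
+         # Placeholder lexical overlap; the real stack fuses sparse,
+         # dense, and cross-encoder reranker scores.
+         terms = set(query.lower().split())
+         return sorted(
+             self.records,
+             key=lambda r: -len(terms & set(r.lower().split())),
+         )[:k]
+
+ def write_memory(
+     extract: Callable[[str, str], list[str]],  # piece 1: the 7B LoRA extractor
+     postprocess: Callable[[str, str], str],    # piece 2: speaker/date cleanup
+     turn: str,
+     session_date: str,
+     store: MemoryStore,
+ ) -> None:
+     propositions = extract(turn, session_date)
+     store.index([postprocess(p, session_date) for p in propositions])
+ ```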
55
 
56
+ ## What The Training Data Actually Was
57
 
58
+ The release data is synthetic.
 
59
 
60
+ - `2,329` synthetic training conversations
61
+ - `584` held-out synthetic conversations
62
+ - `100,427` supervised extraction examples derived from those conversations
63
+ - `20,000` supervised examples used for the released adapter
64
 
65
+ The conversations were designed to stress real memory behaviors:
 
66
 
67
+ - new facts introduced in one session and used later
68
+ - updated details that should overwrite stale ones
69
+ - deleted or invalidated facts that should stop influencing answers
70
+ - mixtures of personal details, project facts, preferences, dates, and plans
71
 
72
+ The labels were GPT-4.1-derived memory-writing targets. No real user chat logs
73
+ are part of the public release.
74
 
75
+ ## What Worked
76
 
77
+ ### 1. The clean supervised base mattered more than clever add-ons
 
78
 
79
+ The release model came from a stable `20,000`-example synthetic supervision
80
+ base. That base was more valuable than trying to patch the model later with
81
+ many narrow benchmark-specific additions.
82
 
83
+ ### 2. Hybrid retrieval was part of the result
 
 
84
 
85
+ The release is not just a model story. It is a model-plus-retrieval story.
86
+ Sparse retrieval kept lexical anchors, dense retrieval recovered semantically
87
+ close memories, and reranking cleaned the shortlist.
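+
+ A toy version of that fusion step (the released stack's exact weighting and
+ reranker are not published; `alpha` here is an arbitrary illustration):
+
+ ```python
+ def fuse_scores(
+     bm25: dict[str, float],
+     dense: dict[str, float],
+     alpha: float = 0.5,
+ ) -> dict[str, float]:
+     # Linear fusion of sparse and dense scores per candidate memory id.
+     ids = set(bm25) | set(dense)
+     return {i: alpha * bm25.get(i, 0.0) + (1 - alpha) * dense.get(i, 0.0) for i in ids}
+
+ def shortlist(fused: dict[str, float], rerank_score, query: str, n: int = 8) -> list[str]:
+     # Rerank only the fused top candidates with a cross-encoder style scorer.
+     top = sorted(fused, key=fused.get, reverse=True)[: 4 * n]
+     return sorted(top, key=lambda i: rerank_score(query, i), reverse=True)[:n]
+ ```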
88
 
89
+ ### 3. Explicit time anchoring helped
90
 
91
+ The model improved when the memory records carried explicit dates and the system
92
+ resolved relative references like `yesterday` or `last weekend` into normalized
93
+ anchors.
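+
+ For example, one reasonable convention for `last weekend` (the released
+ normalizer's exact rules are internal, so this is only a sketch):
+
+ ```python
+ from datetime import date, timedelta
+
+ def resolve_last_weekend(session_date: date) -> str:
+     # Map "last weekend" to the most recent Saturday strictly before
+     # the session date; Monday=0 ... Saturday=5 in Python's weekday().
+     days_back = (session_date.weekday() - 5) % 7 or 7
+     return (session_date - timedelta(days=days_back)).isoformat()
+
+ # resolve_last_weekend(date(2023, 8, 31)) -> "2023-08-26"
+ ```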
94
 
95
+ ### 4. Turn-local extraction was enough
 
 
96
 
97
+ Feeding long recent-context windows into the extractor made it worse. The
98
+ stronger pattern was local extraction at write time and cross-turn composition
99
+ later through retrieval.
100
 
101
+ ### 5. Adversarial precision mattered
102
 
103
+ The release model kept the best adversarial behavior among the runs considered
104
+ for public release. That mattered because a memory system that answers
105
+ unsupported questions confidently is worse than one that refuses.
 
106
 
107
  ## What Did Not Work
108
 
109
+ ### 1. Benchmark-style formatting tricks
110
 
111
+ Trying to train the model toward benchmark-style relative-date outputs hurt more
112
+ than it helped. It optimized the look of answers instead of the quality of the
113
+ stored memory.
114
 
115
+ ### 2. Narrow LoCoMo-style add-ons
116
 
117
+ Adding targeted benchmark-domain data often bought a small gain in one slice of
118
+ LoCoMo and then lost balance somewhere else.
119
 
120
+ ### 3. More noisy supervision was not automatically better
121
 
122
+ Scaling up original noisy temporal supervision amplified the wrong lesson. The
123
+ model became more specialized and less balanced.
124
 
125
+ ### 4. Overtraining past the local optimum
126
 
127
+ Several follow-on variants nearly matched the final release on one metric, but
128
+ they usually gave back LongMemEval performance, adversarial precision, or both.
 
129
 
130
+ ## Why Only One Public Model Ships
131
 
132
+ The repo tried multiple follow-on variants. The nearest internal runner-up
133
+ nearly tied the released model on overall LoCoMo and disagreed on `152` of the
134
+ `400` held-out LoCoMo questions, which means the comparison was real.
135
 
136
+ But the public release decision is simpler than the internal ablation story.
137
+ One model ships because it had the best overall release profile:
138
 
139
+ - strongest LongMemEval score
140
+ - strongest adversarial behavior
141
+ - best total balance across the held-out surface
142
 
143
+ That is a better public story than shipping several near-tied variants with
144
+ internal names nobody else should care about.
145
 
146
  ## What Ships
147
 
148
+ The public release surface is intentionally narrow:
149
 
150
  1. `PRISM-Memory`
151
+ 2. one released model
152
+ 3. one extraction skill
153
+ 4. one Space demo
154
+ 5. one set of release docs and benchmark artifacts
155
+
156
+ The broader `frontier_memory` harness stays in the repo for ongoing research,
157
+ but the release story stays focused on the memory-writing component that proved
158
+ worth shipping.
results/{scenario_comparisons.json → benchmark_cases.json} RENAMED
@@ -18,7 +18,8 @@
18
  "note": "The released model keeps the dated hobby proposition and answers correctly.",
19
  "systems": [
20
  {
21
- "name": "sft4",
 
22
  "prediction": "painting",
23
  "top_retrieval": [
24
  "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
@@ -44,7 +45,8 @@
44
  "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
45
  "systems": [
46
  {
47
- "name": "sft4",
 
48
  "prediction": "None",
49
  "top_retrieval": [
50
  "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
@@ -67,7 +69,8 @@
67
  "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
68
  "systems": [
69
  {
70
- "name": "sft4",
 
71
  "prediction": "serious health risk",
72
  "top_retrieval": [
73
  "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
@@ -93,7 +96,8 @@
93
  "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
94
  "systems": [
95
  {
96
- "name": "sft4",
 
97
  "prediction": "mountains",
98
  "top_retrieval": [
99
  "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
@@ -119,7 +123,8 @@
119
  "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
120
  "systems": [
121
  {
122
- "name": "sft4",
 
123
  "prediction": "a new mystery novel",
124
  "top_retrieval": [
125
  "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
 
18
  "note": "The released model keeps the dated hobby proposition and answers correctly.",
19
  "systems": [
20
  {
21
+ "name": "release_model",
22
+ "display_name": "PRISM-Memory 7B Adapter",
23
  "prediction": "painting",
24
  "top_retrieval": [
25
  "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
 
45
  "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
46
  "systems": [
47
  {
48
+ "name": "release_model",
49
+ "display_name": "PRISM-Memory 7B Adapter",
50
  "prediction": "None",
51
  "top_retrieval": [
52
  "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
 
69
  "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
70
  "systems": [
71
  {
72
+ "name": "release_model",
73
+ "display_name": "PRISM-Memory 7B Adapter",
74
  "prediction": "serious health risk",
75
  "top_retrieval": [
76
  "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
 
96
  "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
97
  "systems": [
98
  {
99
+ "name": "release_model",
100
+ "display_name": "PRISM-Memory 7B Adapter",
101
  "prediction": "mountains",
102
  "top_retrieval": [
103
  "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
 
123
  "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
124
  "systems": [
125
  {
126
+ "name": "release_model",
127
+ "display_name": "PRISM-Memory 7B Adapter",
128
  "prediction": "a new mystery novel",
129
  "top_retrieval": [
130
  "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
results/{readme_extraction_examples.json → extraction_examples.json} RENAMED
@@ -1,6 +1,7 @@
1
  {
2
- "source_dataset": "BETTER_MEMORY_ROOT/data/output/eval_sft.jsonl",
3
- "model_path": "BETTER_MEMORY_ROOT/exp15_sft_qwen7b_4ep",
 
4
  "output_examples": 3,
5
  "examples": [
6
  {
 
1
  {
2
+ "dataset_name": "Held-out synthetic evaluation split",
3
+ "model_name": "PRISM-Memory 7B Adapter",
4
+ "base_model": "Qwen/Qwen2.5-7B-Instruct",
5
  "output_examples": 3,
6
  "examples": [
7
  {
results/{confirmed_exp15_summary.json → release_summary.json} RENAMED
@@ -1,9 +1,11 @@
1
  {
 
2
  "results": [
3
  {
4
- "alias": "sft4",
5
- "checkpoint": "exp15_sft_qwen7b_4ep",
6
- "elapsed_min": 28.93,
 
7
  "args": {
8
  "n_lme": 10,
9
  "context_window": 0,
@@ -50,4 +52,4 @@
50
  }
51
  ],
52
  "failures": []
53
- }
 
1
  {
2
+ "generated_at_unix": 1776521335,
3
  "results": [
4
  {
5
+ "model_key": "release_model",
6
+ "model_name": "PRISM-Memory 7B Adapter",
7
+ "base_model": "Qwen/Qwen2.5-7B-Instruct",
8
+ "elapsed_min": 30.0,
9
  "args": {
10
  "n_lme": 10,
11
  "context_window": 0,
 
52
  }
53
  ],
54
  "failures": []
55
+ }