AsadIsmail committed
Commit 047d480 · verified · 1 Parent(s): d410dbb

Publish PRISM-Memory adapter bundle
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2026 Asad Ismail

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
README.md ADDED
@@ -0,0 +1,141 @@
---
base_model: Qwen/Qwen2.5-7B-Instruct
base_model_relation: adapter
license: apache-2.0
library_name: peft
pipeline_tag: text-generation
tags:
- conversational-memory
- information-extraction
- long-context
- peft
- lora
- qwen2.5
---

# PRISM-Memory

PRISM-Memory is a LoRA adapter that trains `Qwen/Qwen2.5-7B-Instruct` to write
proposition-level memory from dialogue. It is the released `exp15_sft_qwen7b_4ep`
checkpoint from the original `better_memory` project.

## What this release shows

- A 7B open model can replace GPT-4.1 for the extraction step in this memory pipeline.
- On the confirmed release surface, PRISM-Memory scores `0.4768` on LongMemEval and `0.4981` on LoCoMo.
- The GPT-4.1-based PropMem reference scores `0.4650` on LongMemEval and `0.5360` on LoCoMo.

This comparison holds the QA layer constant. It compares extractor against
extractor, not a full end-to-end GPT-4.1 system.

## Load the adapter

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "AsadIsmail/prism-memory"

tokenizer = AutoTokenizer.from_pretrained(adapter_id, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)
```

This repo contains the adapter weights only. You still need the base model.

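For a quick end-to-end check, the sketch below runs one extraction with the
canonical prompt from [docs/release/extraction-skill.md](docs/release/extraction-skill.md).
The turn formatting and generation settings here are illustrative assumptions,
not the exact evaluation harness.

```python
import json

# Canonical extraction prompt from docs/release/extraction-skill.md.
SYSTEM_PROMPT = (
    "You are a memory extraction assistant. Given a conversation turn, "
    "extract 0-5 atomic, standalone facts. Each fact must be a complete "
    "sentence about a specific person, event, preference, or property. "
    "Include dates/times when mentioned. Skip greetings, filler, and "
    "questions. Output ONLY a JSON array of strings, e.g. "
    '["fact1", "fact2"] or [].'
)

# Hypothetical turn; the speaker/date framing is an assumption.
turn = "[2025-01-04 15:34:00] Sam: Jenkins nodes have limited capacity."

input_ids = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": turn},
    ],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding, matching the regeneration setup described in the docs.
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
raw = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(json.loads(raw))  # e.g. ["Jenkins nodes have limited capacity."]
```
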
## Training data

PRISM-Memory was trained on GPT-4.1-derived proposition labels over synthetic
multi-session memory conversations.

| File | Examples | Role |
|---|---:|---|
| `train.jsonl` | `2,329` conversations | raw synthetic conversation source |
| `eval.jsonl` | `584` conversations | held-out synthetic conversation source |
| `train_sft.jsonl` | `100,427` labels | primary SFT source |
| `train_sft_clean_merged.jsonl` | `20,000` labels | cleaned follow-on base matching the best run |

The released checkpoint uses a `20k` sample from `train_sft.jsonl`. See
[docs/release/datasets.md](docs/release/datasets.md) for the full inventory,
the evaluation surfaces, and the ablations that regressed.

## Confirmed results

| Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference |
|---|---:|---:|
| LongMemEval | `0.4768` | `0.4650` |
| LoCoMo | `0.4981` | `0.5360` |

The reproduced evaluation hit the cached QA surface exactly: `460` hits,
`0` misses.

## Extraction examples

### Infrastructure bottlenecks stay structured

- Session date: `2025-01-04 15:34:00`
- Overlap score: `0.909`
- Note: Near-exact match on two operational facts from a single held-out turn.

**Turn**

> yeah, no real caching beyond basic Docker layer caching. Jenkins nodes have limited capacity, and we sometimes hit queue delays during peak commits.

**GPT-4.1 reference**

- No caching beyond basic Docker layer caching
- Jenkins nodes have limited capacity and experience queue delays during peak commits

**PRISM-Memory**

- No Docker caching beyond basic layer caching
- Jenkins nodes have limited capacity; peak commits cause queue delays

### Numeric constraints and preferences survive extraction

- Session date: `2025-03-01 15:07:00`
- Overlap score: `0.800`
- Note: The trained model keeps both the hard concurrency cap and the desired notification style.

**Turn**

> yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts—aggregated and concise. Can you help draft the workflow?

**GPT-4.1 reference**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts

**PRISM-Memory**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Snyk Slack notifications should be aggregated and concise

More held-out examples live in
[docs/release/extraction-examples.md](docs/release/extraction-examples.md).

## Bundled docs and artifacts

- [docs/release/datasets.md](docs/release/datasets.md)
- [docs/release/extraction-examples.md](docs/release/extraction-examples.md)
- [docs/release/extraction-skill.md](docs/release/extraction-skill.md)
- [docs/release/release-results.md](docs/release/release-results.md)
- [docs/release/technical-blog.md](docs/release/technical-blog.md)
- [results/confirmed_exp15_summary.json](results/confirmed_exp15_summary.json)
- [results/readme_extraction_examples.json](results/readme_extraction_examples.json)
- [results/scenario_comparisons.json](results/scenario_comparisons.json)

## Demo

The companion Space is live at
[https://huggingface.co/spaces/AsadIsmail/prism-memory](https://huggingface.co/spaces/AsadIsmail/prism-memory).

## Limitations

- This is a memory-writing component, not a general chat model.
- It is a LoRA adapter, not a standalone full checkpoint.
- The evaluation pipeline still uses a separate QA model to score retrieved memory.
- Temporal and inferential categories still trail stronger larger-model baselines.
adapter_config.json ADDED
@@ -0,0 +1,46 @@
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "k_proj",
    "v_proj",
    "gate_proj",
    "o_proj",
    "up_proj",
    "down_proj",
    "q_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
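A rough size cross-check on this config, using the published Qwen2.5-7B shapes
as assumptions (28 layers, hidden size 3584, intermediate size 18944, KV
projection width 512; none of these come from this repo): `r=16` on the seven
listed projections gives about 40.4M trainable parameters, which lines up with
the `80,792,880`-byte `adapter_model.safetensors` at two bytes per weight.

```python
# Assumed Qwen2.5-7B-Instruct shapes (from its published config,
# not from this repo).
hidden, inter, layers, kv, r = 3584, 18944, 28, 512, 16

# Each LoRA pair adds r * (d_in + d_out) parameters per projection.
per_layer = sum(
    r * (d_in + d_out)
    for d_in, d_out in [
        (hidden, hidden),  # q_proj
        (hidden, kv),      # k_proj
        (hidden, kv),      # v_proj
        (hidden, hidden),  # o_proj
        (hidden, inter),   # gate_proj
        (hidden, inter),   # up_proj
        (inter, hidden),   # down_proj
    ]
)
total = per_layer * layers
print(total)      # 40370176 parameters
print(total * 2)  # ~80.7 MB at 16-bit, close to the 80,792,880-byte file
```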
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7adeae07f8b6ee3afcba01d3c3c586306b37b1693280a49f1c09653f76ee8b6d
size 80792880
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
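This is the stock Qwen2.5 chat template. A quick way to see the string it
renders, using the standard `transformers` API with this repo's tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AsadIsmail/prism-memory")
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)
# <|im_start|>system
# You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
# <|im_start|>user
# hi<|im_end|>
# <|im_start|>assistant
```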
docs/release/datasets.md ADDED
@@ -0,0 +1,82 @@
# PRISM-Memory Datasets

This file separates the data used by the public `PRISM-Memory` release from the
auxiliary datasets that were only useful for ablations.

## Released Training Recipe

The released checkpoint is `exp15_sft_qwen7b_4ep`.

The core recipe was:

1. Start from `Qwen/Qwen2.5-7B-Instruct`.
2. Fine-tune with LoRA on a `20k` sample from `train_sft.jsonl` (a seeded
   sampling sketch follows below).
3. Evaluate on held-out `LoCoMo` and held-out `LongMemEval`.

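The exact subset construction is not pinned down in this bundle, so the
following is illustrative only: a seeded uniform draw over the full label file.

```python
import json
import random

# Illustrative only: the release docs do not specify the sampling
# procedure or seed, so a seeded uniform draw is assumed here.
with open("train_sft.jsonl") as f:
    records = [json.loads(line) for line in f]  # 100,427 labels

random.seed(0)
subset = random.sample(records, 20_000)

with open("train_sft_20k.jsonl", "w") as out:
    for rec in subset:
        out.write(json.dumps(rec) + "\n")
```
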
## Source Conversations

The underlying synthetic conversation source lives in the upstream
`better_memory/data/output/` directory.

| File | Kind | Split | Notes |
|---|---|---|---|
| `train.jsonl` | raw conversations | train | `2,329` synthetic multi-session conversations |
| `eval.jsonl` | raw conversations | eval | `584` held-out synthetic multi-session conversations |
| `metadata.json` | split metadata | all | counts by tier, agent type, and update regime |

The source generator was built to create long-horizon memory stress cases with
inserts, updates, deletes, and multi-session recall.

## Derived SFT Data

These are GPT-4.1-derived proposition labels built on top of the raw
conversations.

| File | Examples | Role | Release Status |
|---|---|---|---|
| `train_sft.jsonl` | `100,427` | primary SFT data | core release data |
| `train_sft_clean_merged.jsonl` | `20,000` | cleaned resume base matching the `sft4` distribution | good follow-on base |
| `train_sft_temporal_resolved.jsonl` | `2,643` | temporal-fix add-on set | useful for targeted research, not the public base |
| `eval_sft.jsonl` | reference | GPT-4.1 PropMem extractions on eval conversations | evaluation reference only |

## Evaluation Surfaces

The released model was evaluated on two held-out surfaces:

| Benchmark | Held-out Surface | Notes |
|---|---|---|
| `LoCoMo` | conversations `conv-49` and `conv-50` | five categories: factual, temporal, inferential, multi-hop, adversarial |
| `LongMemEval` | held-out items stratified by question type | six categories, including temporal reasoning and knowledge updates |

Both the GPT-4.1 extraction baseline and the released 7B extractor were scored
with the same GPT-4.1 QA evaluator and the same cache-backed answer surface.

## Auxiliary LoCoMo Datasets

These files were used in ablations and targeted probes. They matter for the
research story, but they are not the main public training recipe.

| File | Examples | Intended Use | Outcome |
|---|---|---|---|
| `locomo_qa_supervised_factual.jsonl` | `512` | factual QA supervision | neutral to small benefit |
| `locomo_qa_supervised_multihop.jsonl` | `625` | multihop QA supervision | neutral to small benefit |
| `locomo_qa_supervised_temporal.jsonl` | `248` | temporal QA supervision with absolute dates | neutral to small benefit |
| `locomo_qa_supervised_inferential.jsonl` | `133` | inferential QA supervision | too small, hurt balance |
| `locomo_qa_supervised_temporal_relformat.jsonl` | `248` | temporal QA with benchmark-style relative dates | hurt |
| `locomo_sft_extra.jsonl` | `2,645` | LoCoMo-domain SFT add-on | hurt |
| `locomo_sft_extra_relformat.jsonl` | `3,178` | relative-date LoCoMo SFT add-on | hurt |

## Practical Takeaways

1. The best 7B model came from the stable `20k` `train_sft.jsonl` base, not
   from aggressive benchmark-specific add-ons.
2. Training on LoCoMo-domain conversations did not help generalization.
3. Relative-date output hacks made the extractor worse.
4. More original LME data was not automatically better, because noisy temporal
   labels compounded the anchor-loss problem.

Related docs:

- [extraction-skill.md](extraction-skill.md)
- [release-results.md](release-results.md)
- [technical-blog.md](technical-blog.md)
docs/release/extraction-examples.md ADDED
@@ -0,0 +1,76 @@
# PRISM-Memory Extraction Examples

Selected held-out examples from the original Exp15 `eval_sft.jsonl` corpus.
The `GPT-4.1 reference` rows come from the original SFT target propositions.
The `PRISM-Memory` rows were regenerated from `exp15_sft_qwen7b_4ep` with
greedy decoding, using the same extraction prompt family used during
evaluation.

These examples are illustrations, not the benchmark itself. Use
[release-results.md](release-results.md) for the aggregate numbers.

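Each example reports an overlap score against the GPT-4.1 reference. The exact
metric is not documented in this bundle; a plausible stand-in, shown only for
orientation and not guaranteed to reproduce the reported values, is a
bag-of-words F1 over the joined proposition sets:

```python
from collections import Counter

def overlap_f1(reference: list[str], predicted: list[str]) -> float:
    """Token-level F1 between two proposition sets (assumed metric)."""
    ref = Counter(" ".join(reference).lower().split())
    pred = Counter(" ".join(predicted).lower().split())
    common = sum((ref & pred).values())  # overlapping token count
    if common == 0:
        return 0.0
    precision = common / sum(pred.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```
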
## Infrastructure bottlenecks stay structured

- Overlap score: `0.909`
- Session date: `2025-01-04 15:34:00`
- Note: Near-exact match on two operational facts from a single held-out turn.

**Turn**

> yeah, no real caching beyond basic Docker layer caching. Jenkins nodes have limited capacity, and we sometimes hit queue delays during peak commits.

**GPT-4.1 reference**

- No caching beyond basic Docker layer caching
- Jenkins nodes have limited capacity and experience queue delays during peak commits

**PRISM-Memory `sft4`**

- No Docker caching beyond basic layer caching
- Jenkins nodes have limited capacity; peak commits cause queue delays

## Numeric constraints and preferences survive extraction

- Overlap score: `0.800`
- Session date: `2025-03-01 15:07:00`
- Note: The trained model keeps both the hard concurrency cap and the desired notification style.

**Turn**

> yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts—aggregated and concise. Can you help draft the workflow?

**GPT-4.1 reference**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts

**PRISM-Memory `sft4`**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Snyk Slack notifications should be aggregated and concise

## Current state and next-step plans stay grouped

- Overlap score: `0.799`
- Session date: `2025-09-09 15:26:00`
- Note: The extracted memory keeps the current configuration, the planned security step, and the rollout strategy in one compact bundle.

**Turn**

> yeah good point about resource overhead, we set CPU limits for all sidecars and monitor with Prometheus now. no mTLS yet, but it’s on the roadmap for phase two. as for routing, we want to start with canary deployments and traffic splitting, maybe some basic fault injection for testing.

**GPT-4.1 reference**

- Istio sidecar CPU limits set and monitored via Prometheus
- mTLS planned in phase two
- Plan to use canary deployments, traffic splitting, and basic fault injection

**PRISM-Memory `sft4`**

- Sidecar CPU limits set and monitored via Prometheus
- Istio mTLS planned for phase two
- Routing strategy: canary deployments and traffic splitting; basic fault injection planned

## Regeneration

```bash
conda run -n pytorch_p310 python scripts/release/generate_readme_examples.py
```
docs/release/extraction-skill.md ADDED
@@ -0,0 +1,119 @@
# PRISM-Memory Extraction Skill

**Hook:** Turn conversations into durable, searchable memory.

This is the single extraction skill to keep from the `better_memory` work.
The public release should point to one checkpoint and one extraction behavior:

- **Model:** `exp15_sft_qwen7b_4ep`
- **Base model:** `Qwen/Qwen2.5-7B-Instruct`
- **Role:** proposition extraction for long-term conversational memory
- **Why this one:** best confirmed total profile, best adversarial behavior, and
  best LongMemEval score

## Skill Definition

The extractor operates turn by turn and emits `0-5` atomic propositions per
turn. Each proposition should be a standalone fact about a person, event,
preference, or property, with dates carried into the fact when available.

Canonical prompt:

```text
You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
```

This prompt comes from `experiment15_learned_extraction.py` in the upstream
`better_memory` workspace.

## Inference Contract

1. Format the turn with speaker and session date.
2. Extract `0-5` propositions as a JSON array.
3. Clean speaker references so generic labels become real names.
4. Resolve relative temporal expressions against the session date.
5. Prefix each proposition with the normalized session date before indexing.
6. Retrieve with the PRISM hybrid stack, not with the extractor alone (a
   sketch of steps 1-5 follows below).

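A minimal sketch of steps 1-5 under stated assumptions: `extract_fn` is a
hypothetical stand-in for the model call, and the speaker cleanup and record
format are illustrative (the real harness lives in the upstream
`better_memory` scripts).

```python
import json
from datetime import datetime

def write_memory(turn: str, speaker: str, session_date: str,
                 extract_fn) -> list[str]:
    """Steps 1-5 of the inference contract; `extract_fn` wraps the model."""
    # 1. Format the turn with speaker and session date.
    prompt = f"[{session_date}] {speaker}: {turn}"
    # 2. Extract 0-5 propositions as a JSON array.
    propositions = json.loads(extract_fn(prompt))
    records = []
    for p in propositions:
        # 3. Clean speaker references (assumed simple substitution).
        p = p.replace("The user", speaker).replace("the user", speaker)
        # 4. Relative-date resolution would go here (see technical-blog.md).
        # 5. Prefix with the normalized session date before indexing.
        day = datetime.fromisoformat(session_date).date().isoformat()
        records.append(f"{speaker}: [{day}] {p}")
    return records
```
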
## Retrieval Setup To Keep

- **Retriever:** `PRISMv3Rerank`
- **Sparse retrieval:** BM25
- **Dense retrieval:** `all-MiniLM-L6-v2`
- **Reranker:** `cross-encoder/ms-marco-MiniLM-L-6-v2`

Best confirmed retrieval settings:

- **LoCoMo:** adversarial `k=5`, multi-hop `k=10`, all other categories `k=8`
- **LongMemEval:** multi-session `k=20`, all other categories `k=8`, except
  single-session-user `k=5`

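A minimal sketch of that stack, assuming `rank_bm25` and `sentence-transformers`
as the underlying libraries; the candidate pooling and score fusion below are
illustrative choices, since `PRISMv3Rerank` itself is not part of this bundle.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer, util

memories = [
    "Sam: [2023-05-18] Sam is considering trying painting as a new hobby.",
    "Sam: [2023-08-15] Sam attended a cooking class.",
]
query = "Which hobby did Sam take up in May 2023?"

# Sparse retrieval: BM25 over whitespace-tokenized memory records.
bm25 = BM25Okapi([m.lower().split() for m in memories])
sparse = bm25.get_scores(query.lower().split())

# Dense retrieval: cosine similarity with all-MiniLM-L6-v2 embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(encoder.encode(query), encoder.encode(memories))[0]

# Candidate pool from both channels (fusion rule is an assumption).
pool = sorted(
    range(len(memories)),
    key=lambda i: max(float(sparse[i]), float(dense[i])),
    reverse=True,
)[:50]

# Cross-encoder rerank produces the final per-category top-k.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, memories[i]) for i in pool])
k = 8  # per-category k from the confirmed settings above
top_k = [memories[i] for _, i in
         sorted(zip(scores, pool), reverse=True)[:k]]
```
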
## What Worked

1. **The original 20k base mattered.**
   `sft4` came from the exact `train_sft_clean_merged.jsonl` base distribution.
   Runs that changed the base subset regressed.

2. **Four epochs was the sweet spot.**
   `sft4` is the local optimum the repo could actually reproduce.

3. **Absolute date anchoring helped.**
   Temporal repairs worked when the model saw explicit, normalized dates rather
   than benchmark-specific relative phrasing.

4. **Post-processing mattered.**
   Speaker cleanup plus relative-date resolution was necessary to turn raw
   outputs into stable memory records.

5. **Hybrid retrieval beat simpler retrieval.**
   BM25 + dense + reranking consistently outperformed BM25-only or dense-only
   approaches.

6. **Turn-local extraction was enough.**
   The model performed better without feeding long recent-context windows into
   the extractor.

7. **Multihop supervision preserved inferential behavior.**
   When temporal data was added, multihop QA was the only extra signal that
   reliably helped preserve inferential performance.

## What Did Not Work

1. **Relative-date training.**
   Training the extractor to emit benchmark-style relative dates hurt temporal
   performance instead of helping it.

2. **LoCoMo-domain SFT data.**
   Adding LoCoMo training conversations consistently regressed the model.

3. **More than 20k original LME examples.**
   Scaling the original noisy temporal labels to 50k amplified anchor loss and
   caused major regression.

4. **Small clean bases.**
   5k-base follow-on runs forgot too much and collapsed inferential behavior.

5. **Heavy QA multipliers.**
   High temporal or QA multipliers damaged adversarial precision and LongMemEval.

6. **High learning rates on follow-on QA runs.**
   Aggressive fine-tuning degraded the traits that made `sft4` good.

7. **Trying to push past the local optimum.**
   Most post-`sft4` training traded away adversarial performance for narrower
   gains.

## Release Rule

Release only this extraction skill and only this checkpoint publicly:

- `exp15_sft_qwen7b_4ep`

Treat all other checkpoints as internal ablations and learning artifacts, not as
parallel public releases.

Related docs:

- [datasets.md](datasets.md)
- [release-results.md](release-results.md)
- [technical-blog.md](technical-blog.md)
docs/release/release-results.md ADDED
@@ -0,0 +1,74 @@
# PRISM-Memory Release Results

This file summarizes the confirmed release metrics and the internal comparison
artifacts that informed the public checkpoint choice.

## Released Checkpoint

- Checkpoint: `exp15_sft_qwen7b_4ep`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: LoRA
- Confirmed LoCoMo mean: `0.4981204463`
- Confirmed LongMemEval mean: `0.4767574431`
- QA cache hits during confirmation: `460`
- QA cache misses during confirmation: `0`

## Baseline Context

`PRISM-Memory` fine-tunes `Qwen/Qwen2.5-7B-Instruct` for the proposition
extraction step that PropMem normally gets from GPT-4.1. On the confirmed run:

| Benchmark | PRISM-Memory `sft4` | GPT-4.1-based PropMem reference | Read |
|---|---:|---:|---|
| LongMemEval | `0.4768` | `0.4650` | PRISM wins |
| LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays close |

The QA layer is held constant. This is an extractor-vs-extractor comparison,
not an end-to-end GPT-4.1 replacement claim.

## LoCoMo Breakdown

| Category | Score |
|---|---:|
| factual | `0.3339551926` |
| temporal | `0.4978785870` |
| inferential | `0.2605997475` |
| multi-hop | `0.5144477744` |
| adversarial | `0.8837209302` |

## LongMemEval Breakdown

| Category | Score |
|---|---:|
| knowledge-update | `0.5588405797` |
| multi-session | `0.1390977444` |
| single-session-assistant | `0.7656395892` |
| single-session-preference | `0.0519667456` |
| single-session-user | `0.9133333333` |
| temporal-reasoning | `0.4316666667` |

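As a consistency check on the two headline numbers, both means are the plain
unweighted averages of their category scores, matching
[../../results/confirmed_exp15_summary.json](../../results/confirmed_exp15_summary.json):

```python
locomo = [0.3339551926, 0.4978785870, 0.2605997475,
          0.5144477744, 0.8837209302]
lme = [0.5588405797, 0.1390977444, 0.7656395892,
       0.0519667456, 0.9133333333, 0.4316666667]

print(sum(locomo) / len(locomo))  # 0.4981204463...
print(sum(lme) / len(lme))        # 0.4767574431...
```
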
## Internal Comparison That Informed The Release

The closest runner-up was `inferential_from_temporal_heavy`.

- Confirmed LoCoMo mean: `0.4975893989`
- Confirmed LongMemEval mean: `0.4688992148`
- Pairwise LoCoMo disagreements vs `sft4`: `152 / 400`
- Question-level wins: `56` for `sft4`, `52` for the runner-up

The release decision stayed with `sft4` because it preserved the strongest
LongMemEval score and the strongest adversarial behavior.

## Artifact Files

- [../../results/confirmed_exp15_summary.json](../../results/confirmed_exp15_summary.json)
- [../../results/scenario_comparisons.json](../../results/scenario_comparisons.json)
- [../../results/locomo_pairwise_question_diffs.json](../../results/locomo_pairwise_question_diffs.json)
- [../../results/sft4.json](../../results/sft4.json)

Related docs:

- [extraction-skill.md](extraction-skill.md)
- [extraction-examples.md](extraction-examples.md)
- [datasets.md](datasets.md)
- [model-card.md](model-card.md)
docs/release/technical-blog.md ADDED
@@ -0,0 +1,177 @@
# PRISM-Memory: Turn Conversations Into Durable, Searchable Memory

## Summary

`PRISM-Memory` is a long-term conversational memory system that converts raw
dialogue into proposition-level memory and retrieves it with an inspectable
hybrid stack.

This package now ships one public extraction skill and one public checkpoint:

- **Checkpoint:** `exp15_sft_qwen7b_4ep`
- **Confirmed LoCoMo mean:** `0.4981204463`
- **Confirmed LongMemEval mean:** `0.4767574431`
- **QA cache misses during confirmation:** `0`

The public hook is simple:

**PRISM-Memory turns conversations into durable, searchable memory.**

## What The Repo Actually Contributed

The core contribution is not another opaque memory model. The repo showed that a
7B open model can replace GPT-4-class extraction within a transparent memory
pipeline that is still competitive on long-horizon dialogue benchmarks.

The released system has three pieces:

1. A learned proposition extractor (`Qwen2.5-7B-Instruct` + LoRA).
2. Post-processing that cleans speaker references and resolves relative time.
3. Hybrid retrieval (`BM25 + dense retrieval + cross-encoder reranking`).

The important part is the interface between them: extracted propositions are not
just text snippets. They are the memory records that the retriever indexes.

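To make that interface concrete, a memory record can be pictured as a small
structured unit rather than free text. The field names below are illustrative,
not the repo's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryRecord:
    """One indexed proposition (illustrative schema, not the repo's)."""
    speaker: str
    session_date: str   # normalized absolute date, e.g. "2023-05-18"
    proposition: str    # one atomic, standalone fact

    def index_text(self) -> str:
        # The date-prefixed form the retriever indexes, similar to the
        # records visible in results/scenario_comparisons.json.
        return f"{self.speaker}: [{self.session_date}] {self.proposition}"
```
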
## The Single Skill To Keep

After reviewing the repo history, there should be one canonical extraction skill
and one checkpoint publicly exposed:

- **Skill:** proposition-level memory extraction
- **Model:** `exp15_sft_qwen7b_4ep`
- **Prompt contract:** extract `0-5` atomic standalone facts, include dates when
  present, skip filler and questions, output JSON only

That skill is documented directly in
[extraction-skill.md](extraction-skill.md).

## What Worked

### 1. The best model came from the stable 20k base, not from aggressive add-ons

The repo repeatedly showed that `sft4` was the stable optimum for the 7B line.
The same 20k clean base distribution was critical. Changing the base subset,
shrinking it, or overextending it consistently hurt.

Why that matters:

- the model needed the exact data distribution that produced `sft4`
- 4 epochs was enough to reach the useful local optimum
- follow-on runs often traded away robustness for narrower gains

### 2. Proposition memory plus hybrid retrieval is the real winning combination

The strongest system was not latent-only memory and not raw-turn retrieval. The
best path was proposition extraction plus `PRISMv3Rerank`.

That means:

- sparse retrieval captured lexical anchors
- dense retrieval recovered semantically close memories
- reranking cleaned up the final shortlist

This combination is what made the memory store usable.

### 3. Absolute date anchoring and temporal cleanup helped

Temporal improvement came from making the memory records cleaner, not from
teaching the model to imitate LoCoMo’s relative-answer style.

What helped:

- fixed temporal examples with explicit date resolution
- normalizing session dates
- post-processing relative references like `yesterday` or `last weekend`
  (sketched below)

What did **not** help:

- training the model to emit relative benchmark-style dates

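A minimal sketch of that post-processing step, assuming a tiny rule table for
the two phrases mentioned above; the real resolver presumably covers far more
expressions:

```python
from datetime import datetime, timedelta

def resolve_relative_dates(text: str, session_date: str) -> str:
    """Rewrite a few relative phrases against the session date (sketch)."""
    anchor = datetime.fromisoformat(session_date)
    yesterday = (anchor - timedelta(days=1)).date().isoformat()
    # "last weekend": assumed to mean the most recent past Saturday.
    days_back = (anchor.weekday() - 5) % 7 or 7
    last_weekend = (anchor - timedelta(days=days_back)).date().isoformat()
    return (text.replace("yesterday", yesterday)
                .replace("last weekend", last_weekend))

# resolve_relative_dates("went hiking yesterday", "2023-05-18 13:47:00")
# -> "went hiking 2023-05-17"
```
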
### 4. Turn-local extraction was better than passing long context windows

The repo tested extraction with added session context and it regressed. The
model worked best when extracting from the current turn and letting the memory
system handle cross-turn reasoning later.

That is an important design lesson: keep extraction local, let retrieval do the
composition.

### 5. Adversarial precision was the strongest reason to keep `sft4`

Many later variants found small gains in temporal or inferential categories, but
they usually damaged adversarial behavior. `sft4` held the best confirmed
adversarial score and the best total LongMemEval score, which is why it is the
only checkpoint worth releasing publicly.

## What Did Not Work

### 1. Benchmark-specific format hacks

Relative-date training was a dead end. It optimized for the look of a benchmark
answer rather than for general extraction quality.

### 2. LoCoMo-domain training data

Adding LoCoMo training conversations consistently regressed performance. The
best generalization signal remained the cleaned LME-style base data.

### 3. More original LME data was not better

Scaling from 20k to 50k original LME examples amplified the temporal-anchor
problem. More noisy temporal labels simply taught the wrong lesson more often.

### 4. Small follow-on bases and heavy QA multipliers

Runs built on 5k clean bases or extreme QA multipliers tended to forget useful
behavior. They often improved a narrow category while hurting adversarial
precision, inferential balance, or LongMemEval.

### 5. Assuming the best checkpoint was easy to improve

The repo’s most expensive lesson was that `sft4` was already a local optimum for
the 7B line. Most additional training made the model more specialized and less
balanced.

## Internal Comparisons That Informed The Release

The internal ablation story still matters, even though the public package keeps
only `sft4`.

Confirmed internal facts:

- `inferential_from_temporal_heavy` nearly tied `sft4` on overall LoCoMo
- it recovered some inferential and temporal misses
- it still lost on LongMemEval and adversarial precision

Question-level comparison on held-out LoCoMo:

- `400` questions replayed
- `152` answer-level disagreements
- `56` questions favored `sft4`
- `52` questions favored the runner-up

That is a useful research result, but not a reason to ship two public models.
The right release decision is one clean skill, one clean checkpoint.

## Failure Modes Still Visible In The Release Model

The selected model is good enough to release, but its errors are clear:

- it can miss specific diagnoses while retaining the broader health frame
- it can overcommit to a salient retrieved clue in inferential questions
- it can remember a coarse book description but miss the exact title

Those are not packaging issues. They are the current limits of the extraction +
retrieval stack at this model size.

## What Ships

Public release surface:

1. `PRISM-Memory`
2. the single extraction skill in [extraction-skill.md](extraction-skill.md)
3. the best confirmed checkpoint `exp15_sft_qwen7b_4ep`
4. the best-only Space demo in [../../space/](../../space/)

Internal analysis artifacts can stay for provenance, but they should not be
positioned as parallel public releases.
results/confirmed_exp15_summary.json ADDED
@@ -0,0 +1,53 @@
{
  "results": [
    {
      "alias": "sft4",
      "checkpoint": "exp15_sft_qwen7b_4ep",
      "elapsed_min": 28.93,
      "args": {
        "n_lme": 10,
        "context_window": 0,
        "locomo_temporal_k": 8,
        "locomo_adversarial_k": 5,
        "lme_multisess_k": 20,
        "use_temporal_prompt": false,
        "strict_cache": true
      },
      "qa_cache": {
        "model": "gpt-4.1",
        "cache_size": 16969,
        "hits": 460,
        "misses": 0,
        "missing_examples": []
      },
      "locomo": {
        "categories": {
          "1": 0.3339551926061944,
          "2": 0.4978785869736096,
          "3": 0.26059974747474746,
          "4": 0.514447774438597,
          "5": 0.8837209302325582
        },
        "mean": 0.49812044634514124
      },
      "lme": {
        "categories": {
          "knowledge-update": 0.558840579710145,
          "multi-session": 0.13909774436090225,
          "single-session-assistant": 0.765639589169001,
          "single-session-preference": 0.05196674560130369,
          "single-session-user": 0.9133333333333333,
          "temporal-reasoning": 0.43166666666666664
        },
        "mean": 0.47675744314022533
      },
      "logged_comparison": {
        "logged_locomo_mean": 0.498,
        "logged_lme_mean": 0.477,
        "locomo_delta": 0.00012044634514124519,
        "lme_delta": -0.00024255685977464525
      }
    }
  ],
  "failures": []
}
results/readme_extraction_examples.json ADDED
@@ -0,0 +1,57 @@
{
  "source_dataset": "BETTER_MEMORY_ROOT/data/output/eval_sft.jsonl",
  "model_path": "BETTER_MEMORY_ROOT/exp15_sft_qwen7b_4ep",
  "output_examples": 3,
  "examples": [
    {
      "id": "infra_queue_delays",
      "title": "Infrastructure bottlenecks stay structured",
      "note": "Near-exact match on two operational facts from a single held-out turn.",
      "session_date": "2025-01-04 15:34:00",
      "user_message": "yeah, no real caching beyond basic Docker layer caching. Jenkins nodes have limited capacity, and we sometimes hit queue delays during peak commits.",
      "gpt41_reference": [
        "No caching beyond basic Docker layer caching",
        "Jenkins nodes have limited capacity and experience queue delays during peak commits"
      ],
      "prism_memory": [
        "No Docker caching beyond basic layer caching",
        "Jenkins nodes have limited capacity; peak commits cause queue delays"
      ],
      "overlap_score": 0.9090909090909092
    },
    {
      "id": "github_actions_notifications",
      "title": "Numeric constraints and preferences survive extraction",
      "note": "The trained model keeps both the hard concurrency cap and the desired notification style.",
      "session_date": "2025-03-01 15:07:00",
      "user_message": "yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts\u2014aggregated and concise. Can you help draft the workflow?",
      "gpt41_reference": [
        "GitHub Actions concurrency limit: 20 concurrent jobs",
        "Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts"
      ],
      "prism_memory": [
        "GitHub Actions concurrency limit: 20 concurrent jobs",
        "Snyk Slack notifications should be aggregated and concise"
      ],
      "overlap_score": 0.8000000000000002
    },
    {
      "id": "sidecar_limits_and_rollout",
      "title": "Current state and next-step plans stay grouped",
      "note": "The extracted memory keeps the current configuration, the planned security step, and the rollout strategy in one compact bundle.",
      "session_date": "2025-09-09 15:26:00",
      "user_message": "yeah good point about resource overhead, we set CPU limits for all sidecars and monitor with Prometheus now. no mTLS yet, but it\u2019s on the roadmap for phase two. as for routing, we want to start with canary deployments and traffic splitting, maybe some basic fault injection for testing.",
      "gpt41_reference": [
        "Istio sidecar CPU limits set and monitored via Prometheus",
        "mTLS planned in phase two",
        "Plan to use canary deployments, traffic splitting, and basic fault injection"
      ],
      "prism_memory": [
        "Sidecar CPU limits set and monitored via Prometheus",
        "Istio mTLS planned for phase two",
        "Routing strategy: canary deployments and traffic splitting; basic fault injection planned"
      ],
      "overlap_score": 0.7985739750445632
    }
  ]
}
results/scenario_comparisons.json ADDED
@@ -0,0 +1,140 @@
{
  "qa_cache": {
    "model": "gpt-4.1",
    "cache_size": 16969,
    "hits": 5,
    "misses": 0,
    "missing_examples": []
  },
  "scenarios": [
    {
      "id": "temporal_anchor_hobby",
      "title": "Temporal Anchor",
      "source_id": "conv-49",
      "category": 2,
      "question": "Which hobby did Sam take up in May 2023?",
      "gold_answer": "painting",
      "kind": "strength",
      "note": "The released model keeps the dated hobby proposition and answers correctly.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "painting",
          "top_retrieval": [
            "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
            "Sam: [24 May 2023] Sam has been considering trying painting as a new hobby.",
            "Sam: [6 October 2023] Sam asked Evan if he has explored any fun indoor activities or hobbies.",
            "Sam: [18 May 2023] Sam is excited to try new things.",
            "Sam: [24 May 2023] Sam is trying to break old habits.",
            "Sam: [15 August 2023] Sam attended a cooking class.",
            "Sam: [17 December 2023] Sam used to love hiking.",
            "[1:47 pm on 18 May, 2023] Sam: We hiked a good distance - quite a feat for me back then. It's definitely a great memory."
          ]
        }
      ]
    },
    {
      "id": "adversarial_refusal_guitar",
      "title": "Adversarial Refusal",
      "source_id": "conv-50",
      "category": 5,
      "question": "Why did Dave get his guitar customized with a shiny finish?",
      "gold_answer": "",
      "kind": "strength",
      "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "None",
          "top_retrieval": [
            "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
            "Dave: [2 November 2023] The guitar was in bad condition when Dave found it.",
            "[2:55 pm on 31 August, 2023] Dave: Good pick! The customized purple glow gives it a unique look that really stands out.",
            "[2:55 pm on 31 August, 2023] Dave: That's a great guitar, Calvin! Love the design, it's so unique and special.",
            "Dave: [16 May 2023] Calvin lost his guitar and amp but managed to save his music gear and microphone."
          ]
        }
      ]
    },
    {
      "id": "diagnosis_specificity",
      "title": "Diagnosis Specificity",
      "source_id": "conv-49",
      "category": 1,
      "question": "Which ailment does Sam have to face due to his weight?",
      "gold_answer": "gastritis",
      "kind": "failure",
      "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "serious health risk",
          "top_retrieval": [
            "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
            "Sam: [24 May 2023] The doctor's check-up revealed that Sam's weight was not good.",
            "Sam: [13 August 2023] Sam is currently experiencing challenges affecting his health.",
            "[6:48 pm on 17 December, 2023] Sam: Yeah, I'm struggling with my weight and it's affecting my confidence. I feel like I can't overcome all the challenges with my weight, I keep lacking motivation.",
            "Sam: [21 November 2023] Sam has been trying to make dietary changes to address his discomfort.",
            "Sam: [9 November 2023] Sam is a Weight Watchers coach in his group.",
            "Sam: [7 August 2023] Sam has been prioritizing his health for some time.",
            "Sam: [15 August 2023] Sam is concerned about his health."
          ]
        }
      ]
    },
    {
      "id": "location_inference",
      "title": "Location Inference",
      "source_id": "conv-49",
      "category": 3,
      "question": "Does Evan live close to a beach or mountains?",
      "gold_answer": "beach",
      "kind": "failure",
      "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "mountains",
          "top_retrieval": [
            "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
            "Evan: [27 August 2023] Evan lives within a two-hour drive of a place with incredible views and a peaceful atmosphere.",
            "Evan: [9 November 2023] They also discussed enjoying a sunset together at Evan's favorite spot by the beach, planning to visit it soon to de-stress.",
            "Evan: [10 January 2024] Evan enjoys going on beach sunsets as a low-impact exercise.",
            "[7:11 pm on 24 May, 2023] Evan: Hey Sam, thanks for asking! It was great - fresh air, peacefulness and a cozy cabin surrounded by mountains and forests made it feel like a real retreat.",
            "Evan: [31 December 2023] Sam shared about a recent hiking trip, while Evan mentioned a mountain drive that ended in a minor accident.",
            "Evan: [27 August 2023] Evan recommended a nearby lake for hiking and nature exploration.",
            "Evan: [27 August 2023] Evan enjoys road trips and exploring nature."
          ]
        }
      ]
    },
    {
      "id": "reading_detail",
      "title": "Reading Detail",
      "source_id": "conv-49",
      "category": 4,
      "question": "What novel is Evan reading that he finds gripping?",
      "gold_answer": "The Great Gatsby",
      "kind": "failure",
      "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "a new mystery novel",
          "top_retrieval": [
            "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
            "Evan: [27 July 2023] Evan is currently reading a new mystery novel.",
            "Evan: [27 July 2023] Evan is reading 'The Great Gatsby'.",
            "Evan: [26 December 2023] Evan finds that art helps him recognize and handle his own feelings.",
            "Evan: [27 August 2023] Evan expressed interest in a book and discussed potential physical therapy for his knee.",
            "Evan: [10 January 2024] Evan concluded that he needs to be more careful next time.",
            "Evan: [6 October 2023] Evan thinks writing is a great way to express oneself.",
            "Evan: [13 August 2023] Evan suggested checking out a dream interpretation book to help interpret Sam's dream.",
            "Evan: [6 October 2023] Evan believes that writing can be super therapeutic.",
            "Evan: [6 October 2023] Evan usually paints what is on his mind or something he is feeling."
          ]
        }
      ]
    }
  ]
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
size 11421892
tokenizer_config.json ADDED
@@ -0,0 +1,29 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "is_local": true,
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ab50b36dda4f8579bc40c94cd34d87dc76941d8407e8046b68e59716a94b0f61
size 5304