Publish PRISM-Memory adapter bundle

Files changed:

- .gitattributes (+1, -0)
- LICENSE (+202, -0)
- README.md (+141, -0)
- adapter_config.json (+46, -0)
- adapter_model.safetensors (+3, -0)
- chat_template.jinja (+54, -0)
- docs/release/datasets.md (+82, -0)
- docs/release/extraction-examples.md (+76, -0)
- docs/release/extraction-skill.md (+119, -0)
- docs/release/release-results.md (+74, -0)
- docs/release/technical-blog.md (+177, -0)
- results/confirmed_exp15_summary.json (+53, -0)
- results/readme_extraction_examples.json (+57, -0)
- results/scenario_comparisons.json (+140, -0)
- tokenizer.json (+3, -0)
- tokenizer_config.json (+29, -0)
- training_args.bin (+3, -0)
.gitattributes
CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
LICENSE
ADDED

@@ -0,0 +1,202 @@

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright 2026 Asad Ismail

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
README.md
ADDED

@@ -0,0 +1,141 @@

---
base_model: Qwen/Qwen2.5-7B-Instruct
base_model_relation: adapter
license: apache-2.0
library_name: peft
pipeline_tag: text-generation
tags:
- conversational-memory
- information-extraction
- long-context
- peft
- lora
- qwen2.5
---

# PRISM-Memory

PRISM-Memory is a LoRA adapter that trains `Qwen/Qwen2.5-7B-Instruct` to write proposition-level memory from dialogue. It is the released `exp15_sft_qwen7b_4ep` checkpoint from the original `better_memory` project.

## What this release shows

- A 7B open model can replace GPT-4.1 for the extraction step in this memory pipeline.
- On the confirmed release surface, PRISM-Memory scores `0.4768` on LongMemEval and `0.4981` on LoCoMo.
- The GPT-4.1-based PropMem reference scores `0.4650` on LongMemEval and `0.5360` on LoCoMo.

This comparison holds the QA layer constant: it compares extractor against extractor, not a full end-to-end GPT-4.1 system.

## Load the adapter

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "AsadIsmail/prism-memory"

tokenizer = AutoTokenizer.from_pretrained(adapter_id, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)
```

This repo contains the adapter weights only; you still need the base model.
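Once the model produces an extraction, its bullet-list completion can be turned into structured memory records with a small parser. The sketch below is illustrative and not part of the released code: the `parse_propositions` helper is mine, and it assumes the extractor emits one `- `-prefixed proposition per line, as in the extraction examples later in this README.

```python
def parse_propositions(text: str) -> list[str]:
    """Split a model completion into individual proposition strings.

    Assumes one proposition per line, prefixed with "- " (matching
    the extraction examples shown in this README); other lines are
    treated as noise and dropped.
    """
    props = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            props.append(line[2:].strip())
    return props


completion = (
    "- No Docker caching beyond basic layer caching\n"
    "- Jenkins nodes have limited capacity; peak commits cause queue delays\n"
)
print(parse_propositions(completion))  # two proposition strings
```

If your downstream memory store keys propositions individually, a parser like this is the seam between the adapter's raw text output and that store.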
## Training data

PRISM-Memory was trained on GPT-4.1-derived proposition labels over synthetic multi-session memory conversations.

| File | Examples | Role |
|---|---:|---|
| `train.jsonl` | `2,329` conversations | raw synthetic conversation source |
| `eval.jsonl` | `584` conversations | held-out synthetic conversation source |
| `train_sft.jsonl` | `100,427` labels | primary SFT source |
| `train_sft_clean_merged.jsonl` | `20,000` labels | cleaned follow-on base matching the best run |

The released checkpoint uses a `20k` sample from `train_sft.jsonl`. See [docs/release/datasets.md](docs/release/datasets.md) for the full inventory, the evaluation surfaces, and the ablations that regressed.

## Confirmed results

| Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference |
|---|---:|---:|
| LongMemEval | `0.4768` | `0.4650` |
| LoCoMo | `0.4981` | `0.5360` |

The reproduced evaluation hit the cached QA surface exactly: `460` hits, `0` misses.

## Extraction examples

### Infrastructure bottlenecks stay structured

- Session date: `2025-01-04 15:34:00`
- Overlap score: `0.909`
- Note: Near-exact match on two operational facts from a single held-out turn.

**Turn**

> yeah, no real caching beyond basic Docker layer caching. Jenkins nodes have limited capacity, and we sometimes hit queue delays during peak commits.

**GPT-4.1 reference**

- No caching beyond basic Docker layer caching
- Jenkins nodes have limited capacity and experience queue delays during peak commits

**PRISM-Memory**

- No Docker caching beyond basic layer caching
- Jenkins nodes have limited capacity; peak commits cause queue delays

### Numeric constraints and preferences survive extraction

- Session date: `2025-03-01 15:07:00`
- Overlap score: `0.800`
- Note: The trained model keeps both the hard concurrency cap and the desired notification style.

**Turn**

> yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts—aggregated and concise. Can you help draft the workflow?

**GPT-4.1 reference**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts

**PRISM-Memory**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Snyk Slack notifications should be aggregated and concise
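The per-example overlap scores above come from the release artifacts; the exact scorer is not reproduced in this README. As a rough illustration of what a token-overlap metric of this kind can look like, here is a generic bag-of-tokens F1 between a reference proposition and a model proposition. This is my own stand-in sketch, not the project's scorer, and its numbers should not be expected to match the reported `0.909` / `0.800` values.

```python
import re


def token_f1(reference: str, candidate: str) -> float:
    """Bag-of-tokens F1 between two proposition strings.

    Illustrative only: the release's actual overlap metric is not
    specified here, so this is a generic stand-in based on unique
    lowercase alphanumeric tokens.
    """
    ref = set(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = set(re.findall(r"[a-z0-9]+", candidate.lower()))
    if not ref or not cand:
        return 0.0
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


ref = "GitHub Actions concurrency limit: 20 concurrent jobs"
cand = "GitHub Actions concurrency limit: 20 concurrent jobs"
print(token_f1(ref, cand))  # identical strings score 1.0
```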
More held-out examples live in [docs/release/extraction-examples.md](docs/release/extraction-examples.md).

## Bundled docs and artifacts

- [docs/release/datasets.md](docs/release/datasets.md)
- [docs/release/extraction-examples.md](docs/release/extraction-examples.md)
- [docs/release/extraction-skill.md](docs/release/extraction-skill.md)
- [docs/release/release-results.md](docs/release/release-results.md)
- [docs/release/technical-blog.md](docs/release/technical-blog.md)
- [results/confirmed_exp15_summary.json](results/confirmed_exp15_summary.json)
- [results/readme_extraction_examples.json](results/readme_extraction_examples.json)
- [results/scenario_comparisons.json](results/scenario_comparisons.json)

## Demo

The companion Space is live at `https://huggingface.co/spaces/AsadIsmail/prism-memory`.

## Limitations

- This is a memory-writing component, not a general chat model.
- It is a LoRA adapter, not a standalone full checkpoint.
- The evaluation pipeline still uses a separate QA model to score retrieved memory.
- Temporal and inferential categories still trail stronger larger-model baselines.
adapter_config.json
ADDED

@@ -0,0 +1,46 @@

{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "k_proj",
    "v_proj",
    "gate_proj",
    "o_proj",
    "up_proj",
    "down_proj",
    "q_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors
ADDED

@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:7adeae07f8b6ee3afcba01d3c3c586306b37b1693280a49f1c09653f76ee8b6d
size 80792880
chat_template.jinja
ADDED

@@ -0,0 +1,54 @@

{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
docs/release/datasets.md
ADDED

@@ -0,0 +1,82 @@

# PRISM-Memory Datasets

This file separates the data used by the public `PRISM-Memory` release from the auxiliary datasets that were only useful for ablations.

## Released Training Recipe

The released checkpoint is `exp15_sft_qwen7b_4ep`.

The core recipe was:

1. Start from `Qwen/Qwen2.5-7B-Instruct`.
2. Fine-tune with LoRA on a `20k` sample from `train_sft.jsonl`.
3. Evaluate on held-out `LoCoMo` and held-out `LongMemEval`.
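Step 2 of the recipe depends on drawing a fixed-size sample from a large JSONL file. A minimal, reproducible way to do that is sketched below; this is my own sketch, since the project's actual sampling script and seed are not part of this bundle.

```python
import json
import random


def sample_jsonl(path: str, k: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible k-record sample from a JSONL file.

    Illustrative sketch only: the release's real sampling code and
    seed are not published here. A fixed seed makes reruns identical.
    """
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```

For example, `sample_jsonl("train_sft.jsonl", 20_000)` would produce a `20k` subset of the kind the recipe describes, assuming one label per JSONL line.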
## Source Conversations

The underlying synthetic conversation source lives in the upstream `better_memory/data/output/` directory.

| File | Kind | Split | Notes |
|---|---|---|---|
| `train.jsonl` | raw conversations | train | `2,329` synthetic multi-session conversations |
| `eval.jsonl` | raw conversations | eval | `584` held-out synthetic multi-session conversations |
| `metadata.json` | split metadata | all | counts by tier, agent type, and update regime |

The source generator was built to create long-horizon memory stress cases with inserts, updates, deletes, and multi-session recall.

## Derived SFT Data

These are GPT-4.1-derived proposition labels built on top of the raw conversations.

| File | Examples | Role | Release Status |
|---|---|---|---|
| `train_sft.jsonl` | `100,427` | primary SFT data | core release data |
| `train_sft_clean_merged.jsonl` | `20,000` | cleaned resume base matching `sft4` distribution | good follow-on base |
| `train_sft_temporal_resolved.jsonl` | `2,643` | temporal-fix add-on set | useful for targeted research, not the public base |
| `eval_sft.jsonl` | reference | GPT-4.1 PropMem extractions on eval conversations | evaluation reference only |

## Evaluation Surfaces

The released model was evaluated on two held-out surfaces:

| Benchmark | Held-out Surface | Notes |
|---|---|---|
| `LoCoMo` | conversations `conv-49` and `conv-50` | five categories: factual, temporal, inferential, multi-hop, adversarial |
| `LongMemEval` | held-out items stratified by question type | six categories, including temporal reasoning and knowledge updates |

Both the GPT-4.1 extraction baseline and the released 7B extractor were scored with the same GPT-4.1 QA evaluator and the same cache-backed answer surface.

## Auxiliary LoCoMo Datasets

These files were used in ablations and targeted probes. They matter for the research story, but they are not part of the main public training recipe.

| File | Examples | Intended Use | Outcome |
|---|---|---|---|
| `locomo_qa_supervised_factual.jsonl` | `512` | factual QA supervision | neutral to small benefit |
| `locomo_qa_supervised_multihop.jsonl` | `625` | multihop QA supervision | neutral to small benefit |
| `locomo_qa_supervised_temporal.jsonl` | `248` | temporal QA supervision with absolute dates | neutral to small benefit |
| `locomo_qa_supervised_inferential.jsonl` | `133` | inferential QA supervision | too small, hurt balance |
| `locomo_qa_supervised_temporal_relformat.jsonl` | `248` | temporal QA with benchmark-style relative dates | hurt |
| `locomo_sft_extra.jsonl` | `2,645` | LoCoMo-domain SFT add-on | hurt |
| `locomo_sft_extra_relformat.jsonl` | `3,178` | relative-date LoCoMo SFT add-on | hurt |

## Practical Takeaways

1. The best 7B model came from the stable `20k` `train_sft.jsonl` base, not from aggressive benchmark-specific add-ons.
2. Training on LoCoMo-domain conversations did not help generalization.
3. Relative-date output hacks made the extractor worse.
4. More original LME data was not automatically better, because noisy temporal labels compounded the anchor-loss problem.

Related docs:

- [extraction-skill.md](extraction-skill.md)
- [release-results.md](release-results.md)
- [technical-blog.md](technical-blog.md)
docs/release/extraction-examples.md
ADDED
# PRISM-Memory Extraction Examples

Selected held-out examples from the original Exp15 `eval_sft.jsonl` corpus.
The `GPT-4.1 reference` rows come from the original SFT target propositions.
The `PRISM-Memory` rows were regenerated from `exp15_sft_qwen7b_4ep` with greedy decoding, using the same extraction prompt family used during evaluation.

These examples are illustrations, not the benchmark itself. Use
[release-results.md](release-results.md) for the aggregate numbers.
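Each example reports an overlap score between the reference and regenerated proposition sets. The exact scorer lives in the release scripts; as a rough sketch of the idea only, a token-level F1 overlap could look like this (the metric below is an illustrative assumption, not the released implementation):

```python
# Hypothetical sketch of an overlap score between two proposition lists.
# NOT the released scorer; it only illustrates the general shape of the metric.

def token_f1(reference: list[str], predicted: list[str]) -> float:
    """Token-level F1 between the concatenated proposition sets."""
    ref_tokens = set(" ".join(reference).lower().split())
    pred_tokens = set(" ".join(predicted).lower().split())
    common = ref_tokens & pred_tokens
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```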
## Infrastructure bottlenecks stay structured

- Overlap score: `0.909`
- Session date: `2025-01-04 15:34:00`
- Note: Near-exact match on two operational facts from a single held-out turn.

**Turn**

> yeah, no real caching beyond basic Docker layer caching. Jenkins nodes have limited capacity, and we sometimes hit queue delays during peak commits.

**GPT-4.1 reference**

- No caching beyond basic Docker layer caching
- Jenkins nodes have limited capacity and experience queue delays during peak commits

**PRISM-Memory `sft4`**

- No Docker caching beyond basic layer caching
- Jenkins nodes have limited capacity; peak commits cause queue delays

## Numeric constraints and preferences survive extraction

- Overlap score: `0.800`
- Session date: `2025-03-01 15:07:00`
- Note: The trained model keeps both the hard concurrency cap and the desired notification style.

**Turn**

> yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts—aggregated and concise. Can you help draft the workflow?

**GPT-4.1 reference**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts

**PRISM-Memory `sft4`**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Snyk Slack notifications should be aggregated and concise

## Current state and next-step plans stay grouped

- Overlap score: `0.799`
- Session date: `2025-09-09 15:26:00`
- Note: The extracted memory keeps the current configuration, the planned security step, and the rollout strategy in one compact bundle.

**Turn**

> yeah good point about resource overhead, we set CPU limits for all sidecars and monitor with Prometheus now. no mTLS yet, but it’s on the roadmap for phase two. as for routing, we want to start with canary deployments and traffic splitting, maybe some basic fault injection for testing.

**GPT-4.1 reference**

- Istio sidecar CPU limits set and monitored via Prometheus
- mTLS planned in phase two
- Plan to use canary deployments, traffic splitting, and basic fault injection

**PRISM-Memory `sft4`**

- Sidecar CPU limits set and monitored via Prometheus
- Istio mTLS planned for phase two
- Routing strategy: canary deployments and traffic splitting; basic fault injection planned

## Regeneration

```bash
conda run -n pytorch_p310 python scripts/release/generate_readme_examples.py
```
docs/release/extraction-skill.md
ADDED
# PRISM-Memory Extraction Skill

**Hook:** Turn conversations into durable, searchable memory.

This is the single extraction skill to keep from the `better_memory` work.
The public release should point to one checkpoint and one extraction behavior:

- **Model:** `exp15_sft_qwen7b_4ep`
- **Base model:** `Qwen/Qwen2.5-7B-Instruct`
- **Role:** proposition extraction for long-term conversational memory
- **Why this one:** best confirmed total profile, best adversarial behavior, and best LongMemEval score

## Skill Definition

The extractor operates turn by turn and emits `0-5` atomic propositions per
turn. Each proposition should be a standalone fact about a person, event,
preference, or property, with dates carried into the fact when available.

Canonical prompt:

```text
You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
```

This prompt comes from `experiment15_learned_extraction.py` in the upstream
`better_memory` workspace.

## Inference Contract

1. Format the turn with speaker and session date.
2. Extract `0-5` propositions as a JSON array.
3. Clean speaker references so generic labels become real names.
4. Resolve relative temporal expressions against the session date.
5. Prefix each proposition with the normalized session date before indexing.
6. Retrieve with the PRISM hybrid stack, not with the extractor alone.
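The contract above can be sketched as a small wrapper, with the actual model call stubbed out. Function names and the exact turn format below are illustrative, not the released scripts; steps 3-4 (speaker cleanup, relative-date resolution) are omitted for brevity:

```python
# Minimal sketch of the inference contract (steps 1, 2, and 5).
# The LLM call itself is assumed to happen between format_turn and
# parse_propositions; it is not shown here.
import json
from datetime import datetime


def format_turn(speaker: str, session_date: str, text: str) -> str:
    # Step 1: the extractor sees the speaker and session date with each turn.
    return f"[{session_date}] {speaker}: {text}"


def parse_propositions(raw_output: str) -> list[str]:
    # Step 2: the model must emit a JSON array of strings; anything else is
    # treated as "no propositions" rather than crashing the pipeline.
    try:
        facts = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(facts, list):
        return []
    return [f for f in facts if isinstance(f, str)]


def to_memory_records(raw_output: str, session_date: str) -> list[str]:
    # Step 5: prefix each proposition with the normalized session date
    # before handing it to the indexer.
    day = datetime.strptime(session_date, "%Y-%m-%d %H:%M:%S").date().isoformat()
    return [f"[{day}] {p}" for p in parse_propositions(raw_output)]
```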
## Retrieval Setup To Keep

- **Retriever:** `PRISMv3Rerank`
- **Sparse retrieval:** BM25
- **Dense retrieval:** `all-MiniLM-L6-v2`
- **Reranker:** `cross-encoder/ms-marco-MiniLM-L-6-v2`

Best confirmed retrieval settings:

- **LoCoMo:** adversarial `k=5`, multi-hop `k=10`, all other categories `k=8`
- **LongMemEval:** multi-session `k=20`, single-session-user `k=5`, all other categories `k=8`
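The retrieval stack composes three stages: sparse candidate generation, dense candidate generation, and reranking of the merged shortlist. A structural sketch of that flow is below; the scoring functions are toy placeholders standing in for BM25, `all-MiniLM-L6-v2`, and the cross-encoder, so only the pipeline shape should be read as the actual design:

```python
# Structural sketch of the hybrid flow. The three scorers are toy
# placeholders, not the released models; only the staging is the point.

def sparse_score(query: str, doc: str) -> float:
    # Placeholder for BM25: plain term overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


def dense_score(query: str, doc: str) -> float:
    # Placeholder for embedding similarity: Jaccard overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)


def rerank_score(query: str, doc: str) -> float:
    # Placeholder for the cross-encoder, which scores the pair jointly.
    return sparse_score(query, doc) + dense_score(query, doc)


def hybrid_retrieve(query: str, memory: list[str], k: int = 8) -> list[str]:
    # Stages 1 + 2: top candidates from the sparse and dense retrievers.
    by_sparse = sorted(memory, key=lambda m: sparse_score(query, m), reverse=True)[:k]
    by_dense = sorted(memory, key=lambda m: dense_score(query, m), reverse=True)[:k]
    candidates = list(dict.fromkeys(by_sparse + by_dense))  # de-duplicated union
    # Stage 3: rerank the merged shortlist and keep the top k.
    return sorted(candidates, key=lambda m: rerank_score(query, m), reverse=True)[:k]
```

The per-category `k` values above plug in as the final `k` of this pipeline.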
## What Worked

1. **The original 20k base mattered.**
   `sft4` came from the exact `train_sft_clean_merged.jsonl` base distribution. Runs that changed the base subset regressed.

2. **Four epochs was the sweet spot.**
   `sft4` is the local optimum the repo could actually reproduce.

3. **Absolute date anchoring helped.**
   Temporal repairs worked when the model saw explicit, normalized dates rather than benchmark-specific relative phrasing.

4. **Post-processing mattered.**
   Speaker cleanup plus relative-date resolution was necessary to turn raw outputs into stable memory records.

5. **Hybrid retrieval beat simpler retrieval.**
   BM25 + dense + reranking consistently outperformed BM25-only or dense-only approaches.

6. **Turn-local extraction was enough.**
   The model performed better without feeding long recent-context windows into the extractor.

7. **Multihop supervision preserved inferential behavior.**
   When temporal data was added, multihop QA was the only extra signal that reliably helped preserve inferential performance.

## What Did Not Work

1. **Relative-date training.**
   Training the extractor to emit benchmark-style relative dates hurt temporal performance instead of helping it.

2. **LoCoMo-domain SFT data.**
   Adding LoCoMo training conversations consistently regressed the model.

3. **More than 20k original LME examples.**
   Scaling the original noisy temporal labels to 50k amplified anchor loss and caused major regression.

4. **Small clean bases.**
   5k-base follow-on runs forgot too much and collapsed inferential behavior.

5. **Heavy QA multipliers.**
   High temporal or QA multipliers damaged adversarial precision and LongMemEval.

6. **High learning rates on follow-on QA runs.**
   Aggressive fine-tuning degraded the traits that made `sft4` good.

7. **Trying to push past the local optimum.**
   Most post-`sft4` training traded away adversarial performance for narrower gains.

## Release Rule

Release only this extraction skill and only this checkpoint publicly:

- `exp15_sft_qwen7b_4ep`

Treat all other checkpoints as internal ablations and learning artifacts, not as parallel public releases.

Related docs:

- [datasets.md](datasets.md)
- [release-results.md](release-results.md)
- [technical-blog.md](technical-blog.md)
docs/release/release-results.md
ADDED
# PRISM-Memory Release Results

This file summarizes the confirmed release metrics and the internal comparison
artifacts that informed the public checkpoint choice.

## Released Checkpoint

- Checkpoint: `exp15_sft_qwen7b_4ep`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: LoRA
- Confirmed LoCoMo mean: `0.4981204463`
- Confirmed LongMemEval mean: `0.4767574431`
- QA cache hits during confirmation: `460`
- QA cache misses during confirmation: `0`

## Baseline Context

`PRISM-Memory` fine-tunes `Qwen/Qwen2.5-7B-Instruct` for the proposition
extraction step that PropMem normally gets from GPT-4.1. On the confirmed run:

| Benchmark | PRISM-Memory `sft4` | GPT-4.1-based PropMem reference | Read |
|---|---:|---:|---|
| LongMemEval | `0.4768` | `0.4650` | PRISM wins |
| LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays close |

The QA layer is held constant. This is an extractor-vs-extractor comparison,
not an end-to-end GPT-4.1 replacement claim.

## LoCoMo Breakdown

| Category | Score |
|---|---:|
| factual | `0.3339551926` |
| temporal | `0.4978785870` |
| inferential | `0.2605997475` |
| multi-hop | `0.5144477744` |
| adversarial | `0.8837209302` |

## LongMemEval Breakdown

| Category | Score |
|---|---:|
| knowledge-update | `0.5588405797` |
| multi-session | `0.1390977444` |
| single-session-assistant | `0.7656395892` |
| single-session-preference | `0.0519667456` |
| single-session-user | `0.9133333333` |
| temporal-reasoning | `0.4316666667` |

## Internal Comparison That Informed The Release

The closest runner-up was `inferential_from_temporal_heavy`.

- Confirmed LoCoMo mean: `0.4975893989`
- Confirmed LongMemEval mean: `0.4688992148`
- Pairwise LoCoMo disagreements vs `sft4`: `152 / 400`
- Question-level wins: `56` for `sft4`, `52` for the runner-up

The release decision stayed with `sft4` because it preserved the strongest
LongMemEval score and the strongest adversarial behavior.

## Artifact Files

- [../../results/confirmed_exp15_summary.json](../../results/confirmed_exp15_summary.json)
- [../../results/scenario_comparisons.json](../../results/scenario_comparisons.json)
- [../../results/locomo_pairwise_question_diffs.json](../../results/locomo_pairwise_question_diffs.json)
- [../../results/sft4.json](../../results/sft4.json)

Related docs:

- [extraction-skill.md](extraction-skill.md)
- [extraction-examples.md](extraction-examples.md)
- [datasets.md](datasets.md)
- [model-card.md](model-card.md)
docs/release/technical-blog.md
ADDED
# PRISM-Memory: Turn Conversations Into Durable, Searchable Memory

## Summary

`PRISM-Memory` is a long-term conversational memory system that converts raw
dialogue into proposition-level memory and retrieves it with an inspectable
hybrid stack.

This package now ships one public extraction skill and one public checkpoint:

- **Checkpoint:** `exp15_sft_qwen7b_4ep`
- **Confirmed LoCoMo mean:** `0.4981204463`
- **Confirmed LongMemEval mean:** `0.4767574431`
- **QA cache misses during confirmation:** `0`

The public hook is simple:

**PRISM-Memory turns conversations into durable, searchable memory.**

## What The Repo Actually Contributed

The core contribution is not another opaque memory model. The repo showed that a
7B open model can replace GPT-4-class extraction with a transparent memory
pipeline that is still competitive on long-horizon dialogue benchmarks.

The released system has three pieces:

1. A learned proposition extractor (`Qwen2.5-7B-Instruct` + LoRA).
2. Post-processing that cleans speaker references and resolves relative time.
3. Hybrid retrieval (`BM25 + dense retrieval + cross-encoder reranking`).

The important part is the interface between them: extracted propositions are not
just text snippets. They are the memory records that the retriever indexes.

## The Single Skill To Keep

After reviewing the repo history, there should be one canonical extraction skill
and one checkpoint publicly exposed:

- **Skill:** proposition-level memory extraction
- **Model:** `exp15_sft_qwen7b_4ep`
- **Prompt contract:** extract `0-5` atomic standalone facts, include dates when present, skip filler and questions, output JSON only

That skill is documented directly in
[extraction-skill.md](extraction-skill.md).

## What Worked

### 1. The best model came from the stable 20k base, not from aggressive add-ons

The repo repeatedly showed that `sft4` was the stable optimum for the 7B line.
The same 20k clean base distribution was critical. Changing the base subset,
shrinking it, or overextending it consistently hurt.

Why that matters:

- the model needed the exact data distribution that produced `sft4`
- 4 epochs was enough to reach the useful local optimum
- follow-on runs often traded away robustness for narrower gains

### 2. Proposition memory plus hybrid retrieval is the real winning combination

The strongest system was not latent-only memory and not raw-turn retrieval. The
best path was proposition extraction plus `PRISMv3Rerank`.

That means:

- sparse retrieval captured lexical anchors
- dense retrieval recovered semantically close memories
- reranking cleaned up the final shortlist

This combination is what made the memory store usable.

### 3. Absolute date anchoring and temporal cleanup helped

Temporal improvement came from making the memory records cleaner, not from
teaching the model to imitate LoCoMo’s relative-answer style.

What helped:

- fixed temporal examples with explicit date resolution
- normalizing session dates
- post-processing relative references like `yesterday` or `last weekend`

What did **not** help:

- training the model to emit relative benchmark-style dates
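The relative-reference cleanup can be sketched as a small resolution step anchored on the session date. The patterns below are an illustrative subset, not the released post-processing:

```python
# Hypothetical sketch of relative-date resolution against a session date.
# The released pipeline handles more expressions; this shows the idea only.
import re
from datetime import date, timedelta


def resolve_relative_dates(text: str, session_date: date) -> str:
    # Replace a few common relative expressions with absolute ISO dates.
    replacements = {
        "today": session_date,
        "yesterday": session_date - timedelta(days=1),
        "tomorrow": session_date + timedelta(days=1),
        "last week": session_date - timedelta(days=7),
    }
    for phrase, resolved in replacements.items():
        text = re.sub(rf"\b{phrase}\b", resolved.isoformat(), text, flags=re.IGNORECASE)
    return text
```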
### 4. Turn-local extraction was better than passing long context windows

The repo tested extraction with added session context, and it regressed. The
model worked best when extracting from the current turn and letting the memory
system handle cross-turn reasoning later.

That is an important design lesson: keep extraction local, let retrieval do the
composition.

### 5. Adversarial precision was the strongest reason to keep `sft4`

Many later variants found small gains in temporal or inferential categories, but
they usually damaged adversarial behavior. `sft4` held the best confirmed
adversarial score and the best total LongMemEval score, which is why it is the
only checkpoint worth releasing publicly.

## What Did Not Work

### 1. Benchmark-specific format hacks

Relative-date training was a dead end. It optimized for the look of a benchmark
answer rather than for general extraction quality.

### 2. LoCoMo-domain training data

Adding LoCoMo training conversations consistently regressed performance. The
best generalization signal remained the cleaned LME-style base data.

### 3. More original LME data was not better

Scaling from 20k to 50k original LME examples amplified the temporal-anchor
problem. More noisy temporal labels simply taught the wrong lesson more often.

### 4. Small follow-on bases and heavy QA multipliers

Runs built on 5k clean bases or extreme QA multipliers tended to forget useful
behavior. They often improved a narrow category while hurting adversarial
precision, inferential balance, or LongMemEval.

### 5. Assuming the best checkpoint was easy to improve

The repo’s most expensive lesson was that `sft4` was already a local optimum for
the 7B line. Most additional training made the model more specialized and less
balanced.

## Internal Comparisons That Informed The Release

The internal ablation story still matters, even though the public package keeps
only `sft4`.

Confirmed internal facts:

- `inferential_from_temporal_heavy` nearly tied `sft4` on overall LoCoMo
- it recovered some inferential and temporal misses
- it still lost on LongMemEval and adversarial precision

Question-level comparison on held-out LoCoMo:

- `400` questions replayed
- `152` answer-level disagreements
- `56` questions favored `sft4`
- `52` questions favored the runner-up

That is a useful research result, but not a reason to ship two public models.
The right release decision is one clean skill, one clean checkpoint.

## Failure Modes Still Visible In The Release Model

The selected model is good enough to release, but its errors are clear:

- it can miss specific diagnoses while retaining the broader health frame
- it can overcommit to a salient retrieved clue in inferential questions
- it can remember a coarse book description but miss the exact title

Those are not packaging issues. They are the current limits of the extraction +
retrieval stack at this model size.

## What Ships

Public release surface:

1. `PRISM-Memory`
2. the single extraction skill in [extraction-skill.md](extraction-skill.md)
3. the best confirmed checkpoint `exp15_sft_qwen7b_4ep`
4. the best-only Space demo in [../../space/](../../space/)

Internal analysis artifacts can stay for provenance, but they should not be
positioned as parallel public releases.
results/confirmed_exp15_summary.json
ADDED
{
  "results": [
    {
      "alias": "sft4",
      "checkpoint": "exp15_sft_qwen7b_4ep",
      "elapsed_min": 28.93,
      "args": {
        "n_lme": 10,
        "context_window": 0,
        "locomo_temporal_k": 8,
        "locomo_adversarial_k": 5,
        "lme_multisess_k": 20,
        "use_temporal_prompt": false,
        "strict_cache": true
      },
      "qa_cache": {
        "model": "gpt-4.1",
        "cache_size": 16969,
        "hits": 460,
        "misses": 0,
        "missing_examples": []
      },
      "locomo": {
        "categories": {
          "1": 0.3339551926061944,
          "2": 0.4978785869736096,
          "3": 0.26059974747474746,
          "4": 0.514447774438597,
          "5": 0.8837209302325582
        },
        "mean": 0.49812044634514124
      },
      "lme": {
        "categories": {
          "knowledge-update": 0.558840579710145,
          "multi-session": 0.13909774436090225,
          "single-session-assistant": 0.765639589169001,
          "single-session-preference": 0.05196674560130369,
          "single-session-user": 0.9133333333333333,
          "temporal-reasoning": 0.43166666666666664
        },
        "mean": 0.47675744314022533
      },
      "logged_comparison": {
        "logged_locomo_mean": 0.498,
        "logged_lme_mean": 0.477,
        "locomo_delta": 0.00012044634514124519,
        "lme_delta": -0.00024255685977464525
      }
    }
  ],
  "failures": []
}
results/readme_extraction_examples.json
ADDED
{
  "source_dataset": "BETTER_MEMORY_ROOT/data/output/eval_sft.jsonl",
  "model_path": "BETTER_MEMORY_ROOT/exp15_sft_qwen7b_4ep",
  "output_examples": 3,
  "examples": [
    {
      "id": "infra_queue_delays",
      "title": "Infrastructure bottlenecks stay structured",
      "note": "Near-exact match on two operational facts from a single held-out turn.",
      "session_date": "2025-01-04 15:34:00",
      "user_message": "yeah, no real caching beyond basic Docker layer caching. Jenkins nodes have limited capacity, and we sometimes hit queue delays during peak commits.",
      "gpt41_reference": [
        "No caching beyond basic Docker layer caching",
        "Jenkins nodes have limited capacity and experience queue delays during peak commits"
      ],
      "prism_memory": [
        "No Docker caching beyond basic layer caching",
        "Jenkins nodes have limited capacity; peak commits cause queue delays"
      ],
      "overlap_score": 0.9090909090909092
    },
    {
      "id": "github_actions_notifications",
      "title": "Numeric constraints and preferences survive extraction",
      "note": "The trained model keeps both the hard concurrency cap and the desired notification style.",
      "session_date": "2025-03-01 15:07:00",
      "user_message": "yeah, I think starting with incremental scans and parallel matrix jobs makes sense. We have 20 concurrent jobs max on GitHub Actions currently. Also want to keep Slack notifications from Snyk consistent with other pipeline alerts\u2014aggregated and concise. Can you help draft the workflow?",
      "gpt41_reference": [
        "GitHub Actions concurrency limit: 20 concurrent jobs",
        "Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts"
      ],
      "prism_memory": [
        "GitHub Actions concurrency limit: 20 concurrent jobs",
        "Snyk Slack notifications should be aggregated and concise"
      ],
      "overlap_score": 0.8000000000000002
    },
    {
      "id": "sidecar_limits_and_rollout",
      "title": "Current state and next-step plans stay grouped",
      "note": "The extracted memory keeps the current configuration, the planned security step, and the rollout strategy in one compact bundle.",
      "session_date": "2025-09-09 15:26:00",
      "user_message": "yeah good point about resource overhead, we set CPU limits for all sidecars and monitor with Prometheus now. no mTLS yet, but it\u2019s on the roadmap for phase two. as for routing, we want to start with canary deployments and traffic splitting, maybe some basic fault injection for testing.",
      "gpt41_reference": [
        "Istio sidecar CPU limits set and monitored via Prometheus",
        "mTLS planned in phase two",
        "Plan to use canary deployments, traffic splitting, and basic fault injection"
      ],
      "prism_memory": [
        "Sidecar CPU limits set and monitored via Prometheus",
        "Istio mTLS planned for phase two",
        "Routing strategy: canary deployments and traffic splitting; basic fault injection planned"
      ],
      "overlap_score": 0.7985739750445632
    }
  ]
}
results/scenario_comparisons.json
ADDED
{
  "qa_cache": {
    "model": "gpt-4.1",
    "cache_size": 16969,
    "hits": 5,
    "misses": 0,
    "missing_examples": []
  },
  "scenarios": [
    {
      "id": "temporal_anchor_hobby",
      "title": "Temporal Anchor",
      "source_id": "conv-49",
      "category": 2,
      "question": "Which hobby did Sam take up in May 2023?",
      "gold_answer": "painting",
      "kind": "strength",
      "note": "The released model keeps the dated hobby proposition and answers correctly.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "painting",
          "top_retrieval": [
            "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",
            "Sam: [24 May 2023] Sam has been considering trying painting as a new hobby.",
            "Sam: [6 October 2023] Sam asked Evan if he has explored any fun indoor activities or hobbies.",
            "Sam: [18 May 2023] Sam is excited to try new things.",
            "Sam: [24 May 2023] Sam is trying to break old habits.",
            "Sam: [15 August 2023] Sam attended a cooking class.",
            "Sam: [17 December 2023] Sam used to love hiking.",
            "[1:47 pm on 18 May, 2023] Sam: We hiked a good distance - quite a feat for me back then. It's definitely a great memory."
          ]
        }
      ]
    },
    {
      "id": "adversarial_refusal_guitar",
      "title": "Adversarial Refusal",
      "source_id": "conv-50",
      "category": 5,
      "question": "Why did Dave get his guitar customized with a shiny finish?",
      "gold_answer": "",
      "kind": "strength",
      "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
      "systems": [
        {
          "name": "sft4",
          "prediction": "None",
          "top_retrieval": [
            "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",
            "Dave: [2 November 2023] The guitar was in bad condition when Dave found it.",
            "[2:55 pm on 31 August, 2023] Dave: Good pick! The customized purple glow gives it a unique look that really stands out.",
            "[2:55 pm on 31 August, 2023] Dave: That's a great guitar, Calvin! Love the design, it's so unique and special.",
            "Dave: [16 May 2023] Calvin lost his guitar and amp but managed to save his music gear and microphone."
          ]
        }
      ]
    },
    {
      "id": "diagnosis_specificity",
|
| 61 |
+
"title": "Diagnosis Specificity",
|
| 62 |
+
"source_id": "conv-49",
|
| 63 |
+
"category": 1,
|
| 64 |
+
"question": "Which ailment does Sam have to face due to his weight?",
|
| 65 |
+
"gold_answer": "gastritis",
|
| 66 |
+
"kind": "failure",
|
| 67 |
+
"note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
|
| 68 |
+
"systems": [
|
| 69 |
+
{
|
| 70 |
+
"name": "sft4",
|
| 71 |
+
"prediction": "serious health risk",
|
| 72 |
+
"top_retrieval": [
|
| 73 |
+
"Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",
|
| 74 |
+
"Sam: [24 May 2023] The doctor's check-up revealed that Sam's weight was not good.",
|
| 75 |
+
"Sam: [13 August 2023] Sam is currently experiencing challenges affecting his health.",
|
| 76 |
+
"[6:48 pm on 17 December, 2023] Sam: Yeah, I'm struggling with my weight and it's affecting my confidence. I feel like I can't overcome all the challenges with my weight, I keep lacking motivation.",
|
| 77 |
+
"Sam: [21 November 2023] Sam has been trying to make dietary changes to address his discomfort.",
|
| 78 |
+
"Sam: [9 November 2023] Sam is a Weight Watchers coach in his group.",
|
| 79 |
+
"Sam: [7 August 2023] Sam has been prioritizing his health for some time.",
|
| 80 |
+
"Sam: [15 August 2023] Sam is concerned about his health."
|
| 81 |
+
]
|
| 82 |
+
}
|
| 83 |
+
]
|
| 84 |
+
},
|
| 85 |
+
{
|
| 86 |
+
"id": "location_inference",
|
| 87 |
+
"title": "Location Inference",
|
| 88 |
+
"source_id": "conv-49",
|
| 89 |
+
"category": 3,
|
| 90 |
+
"question": "Does Evan live close to a beach or mountains?",
|
| 91 |
+
"gold_answer": "beach",
|
| 92 |
+
"kind": "failure",
|
| 93 |
+
"note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
|
| 94 |
+
"systems": [
|
| 95 |
+
{
|
| 96 |
+
"name": "sft4",
|
| 97 |
+
"prediction": "mountains",
|
| 98 |
+
"top_retrieval": [
|
| 99 |
+
"Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",
|
| 100 |
+
"Evan: [27 August 2023] Evan lives within a two-hour drive of a place with incredible views and a peaceful atmosphere.",
|
| 101 |
+
"Evan: [9 November 2023] They also discussed enjoying a sunset together at Evan's favorite spot by the beach, planning to visit it soon to de-stress.",
|
| 102 |
+
"Evan: [10 January 2024] Evan enjoys going on beach sunsets as a low-impact exercise.",
|
| 103 |
+
"[7:11 pm on 24 May, 2023] Evan: Hey Sam, thanks for asking! It was great - fresh air, peacefulness and a cozy cabin surrounded by mountains and forests made it feel like a real retreat.",
|
| 104 |
+
"Evan: [31 December 2023] Sam shared about a recent hiking trip, while Evan mentioned a mountain drive that ended in a minor accident.",
|
| 105 |
+
"Evan: [27 August 2023] Evan recommended a nearby lake for hiking and nature exploration.",
|
| 106 |
+
"Evan: [27 August 2023] Evan enjoys road trips and exploring nature."
|
| 107 |
+
]
|
| 108 |
+
}
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"id": "reading_detail",
|
| 113 |
+
"title": "Reading Detail",
|
| 114 |
+
"source_id": "conv-49",
|
| 115 |
+
"category": 4,
|
| 116 |
+
"question": "What novel is Evan reading that he finds gripping?",
|
| 117 |
+
"gold_answer": "The Great Gatsby",
|
| 118 |
+
"kind": "failure",
|
| 119 |
+
"note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
|
| 120 |
+
"systems": [
|
| 121 |
+
{
|
| 122 |
+
"name": "sft4",
|
| 123 |
+
"prediction": "a new mystery novel",
|
| 124 |
+
"top_retrieval": [
|
| 125 |
+
"Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
|
| 126 |
+
"Evan: [27 July 2023] Evan is currently reading a new mystery novel.",
|
| 127 |
+
"Evan: [27 July 2023] Evan is reading 'The Great Gatsby'.",
|
| 128 |
+
"Evan: [26 December 2023] Evan finds that art helps him recognize and handle his own feelings.",
|
| 129 |
+
"Evan: [27 August 2023] Evan expressed interest in a book and discussed potential physical therapy for his knee.",
|
| 130 |
+
"Evan: [10 January 2024] Evan concluded that he needs to be more careful next time.",
|
| 131 |
+
"Evan: [6 October 2023] Evan thinks writing is a great way to express oneself.",
|
| 132 |
+
"Evan: [13 August 2023] Evan suggested checking out a dream interpretation book to help interpret Sam's dream.",
|
| 133 |
+
"Evan: [6 October 2023] Evan believes that writing can be super therapeutic.",
|
| 134 |
+
"Evan: [6 October 2023] Evan usually paints what is on his mind or something he is feeling."
|
| 135 |
+
]
|
| 136 |
+
}
|
| 137 |
+
]
|
| 138 |
+
}
|
| 139 |
+
]
|
| 140 |
+
}
|
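The scenario records above pair a `gold_answer` with each system's `prediction` and a declared `kind` of `strength` or `failure`. A minimal sketch (not part of the release tooling; the trimmed records below are hypothetical copies of two entries, with field names matching the file) of tallying exact matches per kind:

```python
import json

# Trimmed, hypothetical copies of two scenario entries from
# results/scenario_comparisons.json; only the fields used here are kept.
comparisons = json.loads("""
{
  "scenarios": [
    {"id": "temporal_anchor_hobby", "kind": "strength",
     "gold_answer": "painting",
     "systems": [{"name": "sft4", "prediction": "painting"}]},
    {"id": "diagnosis_specificity", "kind": "failure",
     "gold_answer": "gastritis",
     "systems": [{"name": "sft4", "prediction": "serious health risk"}]}
  ]
}
""")

def tally(scenarios):
    """Count case-insensitive exact-match predictions per declared kind."""
    counts = {"strength": [0, 0], "failure": [0, 0]}  # [matches, total]
    for s in scenarios:
        pred = s["systems"][0]["prediction"].strip().lower()
        gold = s["gold_answer"].strip().lower()
        counts[s["kind"]][1] += 1
        if pred == gold:
            counts[s["kind"]][0] += 1
    return counts

print(tally(comparisons["scenarios"]))  # {'strength': [1, 1], 'failure': [0, 1]}
```

Exact match is only one possible scoring rule; the adversarial-refusal entry (empty `gold_answer`, prediction `"None"`) would need its own normalization.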
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892
tokenizer_config.json
ADDED
@@ -0,0 +1,29 @@
+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": true,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
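A hedged sketch of sanity-checking the tokenizer config above before loading; the snippet embeds only the fields it inspects, so it stands in for reading the file from a local checkout:

```python
import json

# Subset of tokenizer_config.json fields from the diff above.
config = json.loads("""
{
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>",
  "model_max_length": 131072,
  "tokenizer_class": "Qwen2Tokenizer"
}
""")

# The EOS and pad tokens are distinct here, so padding positions cannot be
# mistaken for end-of-turn markers during generation.
assert config["tokenizer_class"] == "Qwen2Tokenizer"
assert config["eos_token"] != config["pad_token"]
print(config["model_max_length"])  # 131072
```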
training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ab50b36dda4f8579bc40c94cd34d87dc76941d8407e8046b68e59716a94b0f61
+size 5304