Publish PRISM-Memory adapter bundle

- README.md +34 -24
- docs/release/datasets.md +95 -84
- docs/release/extraction-examples.md +7 -7
- docs/release/extraction-skill.md +29 -72
- docs/release/memory-scenarios.md +1 -1
- docs/release/release-results.md +29 -23
- docs/release/technical-blog.md +102 -188
- results/{scenario_comparisons.json → benchmark_cases.json} +10 -5
- results/{readme_extraction_examples.json → extraction_examples.json} +3 -2
- results/{confirmed_exp15_summary.json → release_summary.json} +6 -4
README.md
CHANGED

@@ -16,8 +16,14 @@ tags:

# PRISM-Memory

PRISM-Memory is a LoRA adapter that trains `Qwen/Qwen2.5-7B-Instruct` to write
proposition-level memory from dialogue. It is a memory-writing component, not a
general chat model.

## Released model

- Model name: `PRISM-Memory 7B Adapter`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: `LoRA`

## What this release shows

@@ -35,7 +41,7 @@ extractor, not a full end-to-end GPT-4.1 system.

- It supports dated recall and clean refusal on unsupported questions.

See [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md) for
compact end-to-end examples.

## Load the adapter
@@ -55,38 +61,42 @@ base_model = AutoModelForCausalLM.from_pretrained(

```
model = PeftModel.from_pretrained(base_model, adapter_id)
```

This repo contains adapter weights only. You still need the base model.
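For completeness, here is a minimal end-to-end extraction call built on the snippet above. The canonical prompt is quoted from [docs/release/extraction-skill.md](docs/release/extraction-skill.md); the `adapter_id` value and the decoding settings are placeholders for illustration, not a documented API.

```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "..."  # the PRISM-Memory adapter repo id (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Canonical extraction prompt from docs/release/extraction-skill.md.
SYSTEM = (
    "You are a memory extraction assistant. Given a conversation turn, extract "
    "0-5 atomic, standalone facts. Each fact must be a complete sentence about "
    "a specific person, event, preference, or property. Include dates/times "
    "when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON "
    'array of strings, e.g. ["fact1", "fact2"] or [].'
)

turn = "[18 May 2023] Sam: I'm thinking about trying painting as a new hobby."
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": turn},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, matching how the released extraction examples were regenerated.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
facts = json.loads(text)  # expected: a JSON array of proposition strings
print(facts)
```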
## Training data

PRISM-Memory was trained on **synthetic** multi-session memory conversations
with **GPT-4.1-derived** memory-writing labels. The public release does not use
real user chat logs.

| Item | Count | Notes |
|---|---:|---|
| synthetic training conversations | `2,329` | multi-session conversations with inserts, updates, and deletes |
| synthetic held-out conversations | `584` | evaluation split used for held-out examples |
| supervised extraction examples | `100,427` | memory-writing labels derived from the synthetic corpus |
| released training subset | `20,000` | supervised examples used for the public adapter |

### Example training item

**Synthetic scenario**

- Domain: cloud infrastructure performance optimization
- Persona: senior cloud systems engineer at a fintech startup

**Synthetic user turn**

> Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, and keep API latency under 50ms.

**Target memory records**

- Deploy microservices on AWS Fargate
- Orchestrate containers on a Kubernetes cluster (planned)
- Primary database: PostgreSQL 13
- Use Redis as an in-memory caching layer
- Latency target: API responses under 50ms

The release makes the dataset design, counts, and example records public. It
does not bundle the full raw corpus files.

## Confirmed results
@@ -149,9 +159,9 @@ More held-out examples live in

- [docs/release/memory-scenarios.md](docs/release/memory-scenarios.md)
- [docs/release/release-results.md](docs/release/release-results.md)
- [docs/release/technical-blog.md](docs/release/technical-blog.md)
- [results/release_summary.json](results/release_summary.json)
- [results/extraction_examples.json](results/extraction_examples.json)
- [results/benchmark_cases.json](results/benchmark_cases.json)

## Demo
docs/release/datasets.md
CHANGED

Removed by this change (the old "Auxiliary LoCoMo Datasets" section):

These files were used in ablations and targeted probes. They matter for the
research story, but they are not the main public training recipe.

| File | Examples | Intended Use | Outcome |
|---|---|---|---|
| `locomo_qa_supervised_factual.jsonl` | `512` | factual QA supervision | neutral to small benefit |
| `locomo_qa_supervised_multihop.jsonl` | `625` | multihop QA supervision | neutral to small benefit |
| `locomo_qa_supervised_temporal.jsonl` | `248` | temporal QA supervision with absolute dates | neutral to small benefit |
| `locomo_qa_supervised_inferential.jsonl` | `133` | inferential QA supervision | too small, hurt balance |
| `locomo_qa_supervised_temporal_relformat.jsonl` | `248` | temporal QA with benchmark-style relative dates | hurt |
| `locomo_sft_extra.jsonl` | `2,645` | LoCoMo-domain SFT add-on | hurt |
| `locomo_sft_extra_relformat.jsonl` | `3,178` | relative-date LoCoMo SFT add-on | hurt |
@@ -1,125 +1,136 @@

# PRISM-Memory Training Data

The PRISM-Memory release is trained on **synthetic** multi-session
conversations with **GPT-4.1-derived** memory-writing labels. No real user chat
logs are part of the public release story.

## Dataset At A Glance

| Item | Count | What it means |
|---|---:|---|
| synthetic training conversations | `2,329` | multi-session conversations used to build the training label bank |
| synthetic held-out conversations | `584` | held-out conversations used for evaluation examples and reference labels |
| total generated conversations | `2,913` | train plus eval |
| supervised extraction examples | `100,427` | memory-writing examples derived from the synthetic conversations |
| released training subset | `20,000` | supervised examples used to train the public adapter |
| agent and task families | `6` | research, data analysis, QA, coding, planning, writing |

The synthetic conversation generator deliberately creates long-horizon memory
pressure:

- facts introduced early and queried later
- updated plans and corrected details
- deleted or invalidated information
- multi-session continuity
- mixtures of preferences, project state, dates, and operational facts

## How The Data Is Built

The training pipeline has two layers.

### 1. Synthetic conversation generation

The first layer creates multi-session conversations around realistic work and
assistant scenarios. Each conversation comes with scenario metadata, a persona,
multiple sessions, and explicit memory events such as inserts, updates, and
deletes.

Across the full corpus:

- `899` conversations are short
- `1,162` are medium
- `852` are long
- `897` are insert-only
- `937` include updates
- `435` include both updates and deletes

### 2. Supervised memory-writing labels

The second layer converts those conversations into supervised extraction
examples. Each example contains (see the sketch after this list):

- retrieved memories seen so far
- recent conversation context
- the current user turn
- target memory operations that should be written from that turn

The released model learns this memory-writing step.
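A schematic of one supervised record, written as a Python literal. The four components come from the list above; the exact field names and the operation vocabulary are illustrative guesses, since the raw JSONL schema is not published.

```python
# Illustrative shape of one supervised extraction example.
# Field names are assumed, not the published schema.
example = {
    "retrieved_memories": [
        "Primary database: PostgreSQL 13",
        "Deploy microservices on AWS Fargate",
    ],
    "recent_context": [
        "assistant: How is the Fargate rollout going?",
    ],
    "current_user_turn": (
        "[12 June 2024] We switched the cache tier to Redis and now target "
        "API responses under 50ms."
    ),
    "target_memory_operations": [
        {"op": "insert", "text": "Use Redis as an in-memory caching layer"},
        {"op": "insert", "text": "Latency target: API responses under 50ms"},
    ],
}
```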
## What A Training Example Looks Like

One real synthetic scenario in the corpus is about **cloud infrastructure
performance optimization** for a low-latency trading platform.

**Synthetic scenario**

- domain: cloud infrastructure performance optimization
- persona: senior cloud systems engineer at a fintech startup
- conversation shape: two sessions, ten chunks, five later questions

**Synthetic user turn**

> Here’s the initial architecture outline: deploy microservices on AWS Fargate, use PostgreSQL 13 as the primary database, plan Kubernetes orchestration, use Redis for caching, keep API latency under 50ms, and redesign the system with a team of five engineers.

**Target memory records**

- Deploy microservices on AWS Fargate
- Orchestrate containers on a Kubernetes cluster (planned)
- Primary database: PostgreSQL 13
- Use Redis as an in-memory caching layer
- Latency target: API responses under 50ms

Later turns in the same conversation update that memory with new load targets,
TTL settings, and rollout constraints such as zero downtime.

## What Trained The Released Model

The public adapter was trained on `20,000` supervised extraction examples
sampled from the larger `100,427`-example label bank.

In plain terms, the model saw many examples of this pattern:

1. a conversation turn mentions several durable facts
2. the target output keeps only the memory-worthy facts
3. those facts are written as short standalone memory records

That is why the release behaves like a memory writer rather than a chat model.

## Evaluation Surfaces

The released model is evaluated on two held-out surfaces.

| Benchmark | Held-out surface | What it tests |
|---|---|---|
| `LoCoMo` | held-out conversations `conv-49` and `conv-50` | factual, temporal, inferential, multi-hop, and adversarial recall |
| `LongMemEval` | held-out items across six categories | knowledge updates, multi-session recall, single-session recall, and temporal reasoning |

Both the PRISM extractor and the GPT-4.1-based PropMem reference are scored
with the same QA layer, so the public comparison isolates the extraction step.

## What Is Public Today

Public now:

- the dataset design
- corpus counts
- example training records
- held-out extraction examples
- benchmark results and category breakdowns

Not public yet:

- the full raw synthetic conversation files
- the full supervised label bank
- the auxiliary ablation corpora used for follow-on experiments

## Practical Lessons From The Data

1. The strongest release model came from the stable `20,000`-example base, not
   from benchmark-specific add-ons.
2. Explicit date anchoring helped more than benchmark-style answer formatting.
3. More narrow benchmark data did not automatically improve generalization.
4. The supervision is most useful when it teaches durable facts, updates, and
   contradictions instead of stylistic imitation.

Related docs:
docs/release/extraction-examples.md
CHANGED
@@ -1,8 +1,8 @@

# PRISM-Memory Extraction Examples

Selected held-out examples from the synthetic evaluation split.
The `GPT-4.1 reference` rows come from the supervised target memory labels.
The `PRISM-Memory 7B Adapter` rows were regenerated with greedy decoding using
the same extraction prompt family used during evaluation.

These examples are illustrations, not the benchmark itself. Use
[release-results.md](release-results.md) for the aggregate numbers.

@@ -22,7 +22,7 @@

- No caching beyond basic Docker layer caching
- Jenkins nodes have limited capacity and experience queue delays during peak commits

**PRISM-Memory**

- No Docker caching beyond basic layer caching
- Jenkins nodes have limited capacity; peak commits cause queue delays

@@ -42,7 +42,7 @@

- GitHub Actions concurrency limit: 20 concurrent jobs
- Wants Snyk Slack notifications aggregated and concise, consistent with other pipeline alerts

**PRISM-Memory**

- GitHub Actions concurrency limit: 20 concurrent jobs
- Snyk Slack notifications should be aggregated and concise

@@ -63,7 +63,7 @@

- mTLS planned in phase two
- Plan to use canary deployments, traffic splitting, and basic fault injection

**PRISM-Memory**

- Sidecar CPU limits set and monitored via Prometheus
- Istio mTLS planned for phase two

@@ -72,5 +72,5 @@

## Regeneration

```bash
conda run -n pytorch_p310 python scripts/release/generate_extraction_examples.py
```
docs/release/extraction-skill.md
CHANGED
@@ -2,20 +2,20 @@

**Hook:** Turn conversations into durable, searchable memory.

This is the single extraction skill the public release keeps.

- **Released model:** `PRISM-Memory 7B Adapter`
- **Base model:** `Qwen/Qwen2.5-7B-Instruct`
- **Role:** proposition extraction for long-term conversational memory
- **Why this one:** strongest confirmed overall release profile, strongest
  adversarial behavior, and best confirmed LongMemEval score among the release
  candidates

## Skill Definition

The extractor operates turn by turn and emits `0-5` atomic memory records per
turn. Each record should be a standalone fact about a person, event,
preference, plan, or property, with dates carried into the fact when available.

Canonical prompt:

@@ -23,17 +23,14 @@ Canonical prompt:

```
You are a memory extraction assistant. Given a conversation turn, extract 0-5 atomic, standalone facts. Each fact must be a complete sentence about a specific person, event, preference, or property. Include dates/times when mentioned. Skip greetings, filler, and questions. Output ONLY a JSON array of strings, e.g. ["fact1", "fact2"] or [].
```

## Inference Contract

1. Format the current turn with speaker and session date.
2. Extract `0-5` propositions as a JSON array.
3. Clean speaker references so generic labels become real names when possible.
4. Resolve relative temporal expressions against the session date.
5. Prefix each stored proposition with the normalized session date before indexing.
6. Pair the extractor with the hybrid retrieval stack, not with raw transcript search alone.
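A compact sketch of that six-step contract, assuming a `generate(turn)` callable that already wraps the canonical prompt above; the helper names and the tiny date resolver are illustrative, not the released post-processing code.

```python
import json
from datetime import date, timedelta

def resolve_relative_dates(fact: str, session_date: date) -> str:
    # Step 4 (sketch): a real resolver covers far more phrasings.
    table = {
        "yesterday": (session_date - timedelta(days=1)).strftime("%d %B %Y"),
        "today": session_date.strftime("%d %B %Y"),
    }
    for phrase, absolute in table.items():
        fact = fact.replace(phrase, absolute)
    return fact

def write_memories(generate, speaker, turn_text, session_date, name_map):
    # 1. Format the current turn with speaker and session date.
    turn = f"[{session_date.strftime('%d %B %Y')}] {speaker}: {turn_text}"
    # 2. Extract 0-5 propositions as a JSON array (canonical prompt above).
    facts = json.loads(generate(turn))
    records = []
    for fact in facts:
        # 3. Clean speaker references: generic labels become real names.
        for generic, real in name_map.items():
            fact = fact.replace(generic, real)
        # 4. Resolve relative temporal expressions against the session date.
        fact = resolve_relative_dates(fact, session_date)
        # 5. Prefix with the normalized session date before indexing.
        records.append(f"[{session_date.isoformat()}] {fact}")
    # 6. Hand the records to the hybrid retrieval index, not raw transcripts.
    return records
```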
## Retrieval Setup To Keep

@@ -48,69 +45,29 @@ Best confirmed retrieval settings:

- **LongMemEval:** multi-session `k=20`, all other categories `k=8` except
  single-session-user `k=5`
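Expressed as a lookup sketch (the exact category strings used by the harness are not shown here, so these keys are assumed spellings):

```python
# Best confirmed LongMemEval top-k per category (sketch; keys are assumed).
LME_K = {"multi-session": 20, "single-session-user": 5}
LME_DEFAULT_K = 8  # all other LongMemEval categories

def retrieval_k(category: str) -> int:
    return LME_K.get(category, LME_DEFAULT_K)
```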
## What Held Up In The Repo

1. The stable `20,000`-example supervised base mattered more than aggressive
   benchmark-specific add-ons.
2. Four epochs was enough to reach the useful local optimum for this 7B line.
3. Explicit date anchoring helped. Benchmark-style relative-date imitation did not.
4. Post-processing mattered. Speaker cleanup and relative-date resolution made
   the extracted records usable.
5. Hybrid retrieval beat simpler sparse-only or dense-only retrieval.
6. Turn-local extraction worked better than feeding long recent-context windows
   into the extractor.

## What To Avoid

1. Benchmark-specific format hacks, especially relative-date answer imitation.
2. Narrow LoCoMo-style SFT add-ons that improve one slice and hurt balance.
3. Overtraining follow-on variants that trade adversarial precision for narrow gains.
4. Treating the extractor as a standalone answer model instead of a memory writer.

## Release Rule

Public surfaces should expose exactly one extraction behavior and one released
model. Other runs remain internal research artifacts.

Related docs:
docs/release/memory-scenarios.md
CHANGED

@@ -5,7 +5,7 @@ artifacts.

- The first two use the released held-out extraction examples.
- The last two use confirmed held-out benchmark cases from
  [../../results/benchmark_cases.json](../../results/benchmark_cases.json).

The point is not just that the extractor matches GPT-4.1-style labels. The
point is that a later system can ask a concrete question and get back a useful,
docs/release/release-results.md
CHANGED
@@ -1,11 +1,11 @@

# PRISM-Memory Release Results

This page summarizes the confirmed public release metrics and the internal
comparison evidence that informed the release choice.

## Released Model

- Model: `PRISM-Memory 7B Adapter`
- Base model: `Qwen/Qwen2.5-7B-Instruct`
- Adapter type: LoRA
- Confirmed LoCoMo mean: `0.4981204463`

@@ -13,18 +13,18 @@ artifacts that informed the public checkpoint choice.

- QA cache hits during confirmation: `460`
- QA cache misses during confirmation: `0`

## Public Comparison

PRISM-Memory fine-tunes `Qwen/Qwen2.5-7B-Instruct` for the memory extraction
step that the PropMem reference gets from GPT-4.1.

| Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference | Read |
|---|---:|---:|---|
| LongMemEval | `0.4768` | `0.4650` | PRISM wins |
| LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays competitive |

The QA layer is held constant. This is an extraction-step comparison, not an
end-to-end GPT-4.1 replacement claim.

## LoCoMo Breakdown

@@ -47,24 +47,30 @@ not an end-to-end GPT-4.1 replacement claim.

| single-session-user | `0.9133333333` |
| temporal-reasoning | `0.4316666667` |

## Why This Model Was Released

The closest internal runner-up nearly tied the released model on overall
LoCoMo, but it lost on the broader release profile:

- lower LongMemEval score: `0.4689`
- weaker adversarial precision
- less balanced behavior across the full evaluation surface

Question-level comparison on held-out LoCoMo:

- disagreements: `152 / 400`
- questions favoring PRISM-Memory: `56`
- questions favoring the runner-up: `52`

That is close enough to be a real internal comparison, but not close enough to
justify two public models.

## Artifact Files

- [../../results/release_summary.json](../../results/release_summary.json)
- [../../results/release_model.json](../../results/release_model.json)
- [../../results/benchmark_cases.json](../../results/benchmark_cases.json)
- [../../results/internal_locomo_pairwise_diffs.json](../../results/internal_locomo_pairwise_diffs.json)

Related docs:
docs/release/technical-blog.md
CHANGED

Removed by this change (the old version's closing section on known failure modes):

## Failure Modes Still Visible In The Release Model

The selected model is good enough to release, but its errors are clear:

- it can miss specific diagnoses while retaining the broader health frame
- it can overcommit to a salient retrieved clue in inferential questions
- it can remember a coarse book description but miss the exact title

Those are not packaging issues. They are the current limits of the extraction +
retrieval stack at this model size.
@@ -1,244 +1,158 @@

# PRISM-Memory: Turn Conversations Into Durable, Searchable Memory

## The Problem

Most long-chat systems do not actually have memory. They have transcript
search. That works until someone asks a later question that depends on a hard
constraint, a changed plan, a dated fact, or a contradiction that happened
months ago.

PRISM-Memory focuses on the part of the stack that usually stays hidden: the
step that decides what should become memory at all.

The release model is a 7B adapter that writes short proposition-level memory
records from dialogue. Those records are then indexed by a hybrid retrieval
stack and used later for recall.

## What This Release Shows

The useful result is narrow and practical:

- a 7B open model can replace the GPT-4.1 extraction step in this memory pipeline
- it scores `0.4768` on LongMemEval versus `0.4650` for the GPT-4.1-based PropMem reference
- it scores `0.4981` on LoCoMo versus `0.5360` for that same reference

This is not a claim that a 7B model beats GPT-4.1 everywhere. It is a claim
that a 7B model can take over the memory-writing step and stay competitive on
the held-out evaluation surface.

## Why That Matters

If the memory-writing step is weak, retrieval never gets a clean chance.
Important details stay buried inside noisy chat turns.

PRISM-Memory is useful when later questions depend on things like:

- a hard operational limit: `20` GitHub Actions jobs
- a durable preference: aggregated Slack alerts instead of noisy ones
- a status distinction: mTLS is not live yet, it is planned for phase two
- a dated fact: Sam took up painting in May 2023
- a refusal case: the system should answer `None` instead of inventing a reason for an unsupported guitar story

Those are memory problems, not style problems.

## How The System Works

The released system has three pieces.

1. A learned extractor based on `Qwen/Qwen2.5-7B-Instruct` with LoRA.
2. Post-processing that cleans speaker references and resolves relative time.
3. Hybrid retrieval with BM25, dense retrieval, and reranking.

The extracted propositions are the important interface. They are the memory
records the retriever indexes. That keeps the memory store inspectable instead
of opaque.
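A schematic of piece 3, using `rank_bm25` and `sentence-transformers` as stand-ins. The encoder and reranker model names, the equal-weight fusion, and the shortlist size are placeholders for illustration; the release does not document its exact retrieval configuration here.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

# The proposition records are the retrieval units (the interface noted above).
records = [
    "[2023-05-18] Sam is considering trying painting as a new hobby.",
    "[2023-08-27] Evan shared a road trip to the Rocky Mountains.",
]

bm25 = BM25Okapi([r.lower().split() for r in records])  # sparse lexical anchors
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder dense model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder
record_vecs = encoder.encode(records, normalize_embeddings=True)

def retrieve(query: str, k: int = 8, shortlist: int = 50) -> list[str]:
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    dense = record_vecs @ encoder.encode(query, normalize_embeddings=True)

    # Naive min-max normalization before fusing, since BM25 and cosine
    # scores live on different scales.
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else x * 0

    fused = 0.5 * norm(sparse) + 0.5 * norm(dense)  # assumed equal weights
    top = np.argsort(-fused)[:shortlist]
    # Reranking cleans up the final shortlist.
    scores = reranker.predict([(query, records[i]) for i in top])
    best = [int(i) for _, i in sorted(zip(scores, top), reverse=True)[:k]]
    return [records[i] for i in best]
```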
## What The Training Data Actually Was

The release data is synthetic.

- `2,329` synthetic training conversations
- `584` held-out synthetic conversations
- `100,427` supervised extraction examples derived from those conversations
- `20,000` supervised examples used for the released adapter

The conversations were designed to stress real memory behaviors:

- new facts introduced in one session and used later
- updated details that should overwrite stale ones
- deleted or invalidated facts that should stop influencing answers
- mixtures of personal details, project facts, preferences, dates, and plans

The labels were GPT-4.1-derived memory-writing targets. No real user chat logs
are part of the public release.

## What Worked

### 1. The clean supervised base mattered more than clever add-ons

The release model came from a stable `20,000`-example synthetic supervision
base. That base was more valuable than trying to patch the model later with
many narrow benchmark-specific additions.

### 2. Hybrid retrieval was part of the result

The release is not just a model story. It is a model-plus-retrieval story.
Sparse retrieval kept lexical anchors, dense retrieval recovered semantically
close memories, and reranking cleaned the shortlist.

### 3. Explicit time anchoring helped

The model improved when the memory records carried explicit dates and the
system resolved relative references like `yesterday` or `last weekend` into
normalized anchors.
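A sketch of that normalization step. The "Saturday before the session date" convention for `last weekend` is an assumption made for the example, not the documented resolver behavior.

```python
from datetime import date, timedelta

def normalize_dates(record: str, session_date: date) -> str:
    """Resolve relative references against the session date (sketch)."""
    # 'last weekend' -> the Saturday before the session date (one convention).
    days_back = (session_date.weekday() - 5) % 7 or 7
    last_saturday = session_date - timedelta(days=days_back)
    replacements = {
        "yesterday": (session_date - timedelta(days=1)).isoformat(),
        "last weekend": f"the weekend of {last_saturday.isoformat()}",
    }
    for phrase, anchor in replacements.items():
        record = record.replace(phrase, anchor)
    return record

print(normalize_dates("Sam went hiking last weekend", date(2023, 8, 30)))
# -> Sam went hiking the weekend of 2023-08-26
```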
### 4. Turn-local extraction was enough

Feeding long recent-context windows into the extractor made it worse. The
stronger pattern was local extraction at write time and cross-turn composition
later through retrieval.

### 5. Adversarial precision mattered

The release model kept the best adversarial behavior among the runs considered
for public release. That mattered because a memory system that answers
unsupported questions confidently is worse than one that refuses.

## What Did Not Work

### 1. Benchmark-style formatting tricks

Trying to train the model toward benchmark-style relative-date outputs hurt
more than it helped. It optimized the look of answers instead of the quality of
the stored memory.

### 2. Narrow LoCoMo-style add-ons

Adding targeted benchmark-domain data often bought a small gain in one slice of
LoCoMo and then lost balance somewhere else.

### 3. More noisy supervision was not automatically better

Scaling up original noisy temporal supervision amplified the wrong lesson. The
model became more specialized and less balanced.

### 4. Overtraining past the local optimum

Several follow-on variants nearly matched the final release on one metric, but
they usually gave back LongMemEval performance, adversarial precision, or both.

## Why Only One Public Model Ships

The repo tried multiple follow-on variants. The nearest internal runner-up
nearly tied the released model on overall LoCoMo and disagreed on `152` of the
`400` held-out LoCoMo questions, which means the comparison was real.

But the public release decision is simpler than the internal ablation story.
One model ships because it had the best overall release profile:

- strongest LongMemEval score
- strongest adversarial behavior
- best total balance across the held-out surface

That is a better public story than shipping several near-tied variants with
internal names nobody else should care about.

## What Ships

The public release surface is intentionally narrow:

1. `PRISM-Memory`
2. one released model
3. one extraction skill
4. one Space demo
5. one set of release docs and benchmark artifacts

The broader `frontier_memory` harness stays in the repo for ongoing research,
but the release story stays focused on the memory-writing component that proved
worth shipping.
results/{scenario_comparisons.json → benchmark_cases.json}
RENAMED
@@ -18,7 +18,8 @@

    "note": "The released model keeps the dated hobby proposition and answers correctly.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "painting",
        "top_retrieval": [
          "Sam: [18 May 2023] Sam is considering trying painting as a new hobby.",

@@ -44,7 +45,8 @@

    "note": "This tests whether the system refuses to invent an answer when the premise is unsupported.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "None",
        "top_retrieval": [
          "[2:55 pm on 31 August, 2023] Dave: That guitar has a gorgeous purple hue. Why did you make it so shiny?",

@@ -67,7 +69,8 @@

    "note": "A representative factual miss: the model retrieves the health-risk frame but not the specific diagnosis.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "serious health risk",
        "top_retrieval": [
          "Sam: [8 October 2023] The doctor told Sam that his weight is a serious health risk.",

@@ -93,7 +96,8 @@

    "note": "A representative inferential miss: retrieval includes both clues, but the model overcommits to the mountain mention.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "mountains",
        "top_retrieval": [
          "Evan: [27 August 2023] Evan also shared his recent road trip to the Rocky Mountains and love for hiking.",

@@ -119,7 +123,8 @@

    "note": "A representative multi-hop miss: the model retains the coarse book description but misses the specific title.",
    "systems": [
      {
        "name": "release_model",
        "display_name": "PRISM-Memory 7B Adapter",
        "prediction": "a new mystery novel",
        "top_retrieval": [
          "Evan: [27 August 2023] Evan is reading a book that he finds increasingly compelling.",
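A minimal consumer for the renamed file. The per-case fields match the hunks above; the top-level container is not shown in the diff, so the sketch scans for whichever list holds the cases.

```python
import json

with open("results/benchmark_cases.json") as f:
    data = json.load(f)

# Case objects carry "note" and "systems" per the hunks above; the top-level
# key is not visible in the diff, so accept either a bare list or a wrapper.
cases = data if isinstance(data, list) else next(
    v for v in data.values() if isinstance(v, list)
)
for case in cases:
    print(case["note"])
    for system in case["systems"]:
        print(" ", system["display_name"], "->", system["prediction"])
```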
results/{readme_extraction_examples.json → extraction_examples.json}
RENAMED
@@ -1,6 +1,7 @@

{
  "dataset_name": "Held-out synthetic evaluation split",
  "model_name": "PRISM-Memory 7B Adapter",
  "base_model": "Qwen/Qwen2.5-7B-Instruct",
  "output_examples": 3,
  "examples": [
    {
results/{confirmed_exp15_summary.json → release_summary.json}
RENAMED
@@ -1,9 +1,11 @@

{
  "generated_at_unix": 1776521335,
  "results": [
    {
      "model_key": "release_model",
      "model_name": "PRISM-Memory 7B Adapter",
      "base_model": "Qwen/Qwen2.5-7B-Instruct",
      "elapsed_min": 30.0,
      "args": {
        "n_lme": 10,
        "context_window": 0,

@@ -50,4 +52,4 @@

    }
  ],
  "failures": []
}