AsadIsmail/prism-memory

PRISM-Memory Release Results

This page summarizes the confirmed public release metrics and the internal comparison evidence that informed the release choice.

Released Model

Model: PRISM-Memory 7B Adapter
Base model: Qwen/Qwen2.5-7B-Instruct
Adapter type: LoRA
Confirmed LoCoMo mean: 0.4981204463
Confirmed LongMemEval mean: 0.4767574431
QA cache hits during confirmation: 460
QA cache misses during confirmation: 0

Public Comparison

PRISM-Memory fine-tunes Qwen/Qwen2.5-7B-Instruct to perform the memory extraction step that the PropMem reference delegates to GPT-4.1.

| Benchmark | PRISM-Memory | GPT-4.1-based PropMem reference | Read |
| --- | --- | --- | --- |
| LongMemEval | `0.4768` | `0.4650` | PRISM wins |
| LoCoMo | `0.4981` | `0.5360` | PRISM trails, but stays competitive |

The QA layer is held constant. This is an extraction-step comparison, not an end-to-end GPT-4.1 replacement claim.
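To make "wins" and "trails" concrete, the gaps in the comparison table can be computed directly. The following is a minimal sketch that uses only the rounded scores from the table; the `scores` dictionary and the `score_gaps` helper are illustrative names, not part of the release.

```python
# Rounded scores copied from the public comparison table.
scores = {
    "LongMemEval": {"prism": 0.4768, "reference": 0.4650},
    "LoCoMo": {"prism": 0.4981, "reference": 0.5360},
}


def score_gaps(table):
    """Return the absolute and relative gap (PRISM minus reference) per benchmark."""
    gaps = {}
    for name, row in table.items():
        absolute = row["prism"] - row["reference"]
        gaps[name] = {
            "absolute": absolute,
            # Relative gap is measured against the reference score.
            "relative": absolute / row["reference"],
        }
    return gaps


gaps = score_gaps(scores)
for name, gap in gaps.items():
    print(f"{name}: {gap['absolute']:+.4f} ({gap['relative']:+.1%})")
```

On these rounded figures, PRISM-Memory leads by roughly 0.012 on LongMemEval and trails by roughly 0.038 on LoCoMo.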

LoCoMo Breakdown

| Category | Score |
| --- | --- |
| factual | `0.3339551926` |
| temporal | `0.4978785870` |
| inferential | `0.2605997475` |
| multi-hop | `0.5144477744` |
| adversarial | `0.8837209302` |

LongMemEval Breakdown

| Category | Score |
| --- | --- |
| knowledge-update | `0.5588405797` |
| multi-session | `0.1390977444` |
| single-session-assistant | `0.7656395892` |
| single-session-preference | `0.0519667456` |
| single-session-user | `0.9133333333` |
| temporal-reasoning | `0.4316666667` |
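The confirmed means reported under Released Model appear to be unweighted (macro) averages of the per-category scores in the two breakdowns. This page does not state the aggregation rule, so treat that as an observation from the numbers rather than a documented definition. A minimal sketch to check it, with `macro_average` as an illustrative helper name:

```python
from statistics import fmean

# Per-category scores copied from the two breakdown tables.
locomo = {
    "factual": 0.3339551926,
    "temporal": 0.4978785870,
    "inferential": 0.2605997475,
    "multi-hop": 0.5144477744,
    "adversarial": 0.8837209302,
}
longmemeval = {
    "knowledge-update": 0.5588405797,
    "multi-session": 0.1390977444,
    "single-session-assistant": 0.7656395892,
    "single-session-preference": 0.0519667456,
    "single-session-user": 0.9133333333,
    "temporal-reasoning": 0.4316666667,
}

# Confirmed means as reported in the Released Model section.
reported = {"LoCoMo": 0.4981204463, "LongMemEval": 0.4767574431}


def macro_average(category_scores):
    """Unweighted mean over categories; each category counts equally."""
    return fmean(category_scores.values())


computed = {
    "LoCoMo": macro_average(locomo),
    "LongMemEval": macro_average(longmemeval),
}
for name, value in computed.items():
    matches = abs(value - reported[name]) < 1e-9
    print(f"{name}: computed {value:.10f}, matches reported mean: {matches}")
```

Because a macro average weights every category equally, low-scoring categories such as single-session-preference pull the LongMemEval mean down regardless of how many questions they contain.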

Why This Model Was Released

The closest internal runner-up nearly tied the released model on overall LoCoMo, but it lost on the broader release profile:

lower LongMemEval score: 0.4689
weaker adversarial precision
less balanced behavior across the full evaluation surface

Question-level comparison on held-out LoCoMo:

disagreements: 152 / 400
questions favoring PRISM-Memory: 56
questions favoring the runner-up: 52

The near-even split shows that the runner-up was a genuine contender, but the two models do not differ enough to justify publishing both.
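One way to quantify how even the 56 versus 52 split is would be an exact two-sided sign test on the questions that favor one model. This test is not described in the release notes; it is a swapped-in illustration, and `sign_test_p_value` is a hypothetical helper. The sketch also surfaces that 56 + 52 accounts for only 108 of the 152 disagreements; the page does not say how the remainder were classified.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test under the null that either model is equally likely to win."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)


# Counts copied from the question-level comparison.
total_questions = 400
disagreements = 152
favoring_prism = 56
favoring_runner_up = 52

disagreement_rate = disagreements / total_questions
# Disagreements not attributed to either model in the source.
unattributed = disagreements - (favoring_prism + favoring_runner_up)
p_value = sign_test_p_value(favoring_prism, favoring_runner_up)

print(f"disagreement rate: {disagreement_rate:.2f}")
print(f"unattributed disagreements: {unattributed}")
print(f"two-sided sign test p-value: {p_value:.3f}")
```

A large p-value here would mean the question-level split alone cannot distinguish the two models, which is consistent with the release decision resting on the broader profile instead.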

Artifact Files

Related docs: