0xSero/GLM-5.2-504B · Knowledge degradation across the 504B REAP variants

Knowledge degradation across the 504B REAP variants — empirical findings and which cut to pick

by rene98c - opened 3 days ago

0xSero, good stuff! The model is an absolute blast, love it. I did run into a few issues with it, though. Not a request, just an observation.

The following is Sonnet's summary of my findings:

I ran all three public REAP variants through a knowledge-recall probe and want to share the results. 0xSero's model cards document the trade-offs honestly, and this post is meant to corroborate those findings with concrete numbers, help people pick the right variant, and flag one unexpected result (FullKD).

The test

Movie identification in two directions:

Reverse recall: given a transcription of distinctive dialogue, identify the film. (Tests whether the model can access knowledge from weak surface cues.)
Forward recall: given the film title, recall the director and character names.

These two together cleanly separate "the knowledge is there" from "the model can confabulate plausibly." All runs: temperature ~0.3, thinking off (unless noted), no tools.
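The post does not say how answers were graded, so the following is only a minimal sketch of one way to automate the forward-recall check, assuming simple substring matching against a list of accepted answers. The function names and the `expected` dictionary are hypothetical, not part of the original probe.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Chloë' matches 'chloe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def score_recall(response: str, expected: dict) -> dict:
    """For each field, report True if any accepted answer appears in the response."""
    haystack = normalize(response)
    return {
        field: any(normalize(answer) in haystack for answer in answers)
        for field, answers in expected.items()
    }


# Hypothetical answer key for the film used in this thread.
expected = {
    "title": ["The Equalizer"],
    "lead_actor": ["Denzel Washington"],
    "director": ["Antoine Fuqua"],
}

print(score_recall("The Equalizer, starring Denzel Washington, directed by Ovenden", expected))
```

Substring matching is deliberately strict: a corrupted title such as "Equalizer 2000" fails the title check, which mirrors how the tables below mark it.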

Results — forward recall (title given, recall details)

Model	Title	Lead actor	Director	Character names	Verdict
base 504B	"Equalizer 2000" ❌	"Denzel"	—	"Victor"	❌ corrupted
504B-K	"The Equalizer" ✅	Denzel Washington ✅ (consistent ×2)	"Ovenden"/"Golab" ❌	"Vladimir" ❌	◑ partial
504B-FullKD	"Equalizer 2015" ❌	"Denzel Hughes" → thrash	loop spiral ❌	confabulated	❌ + looping
official GLM-5.2	✅	Denzel Washington ✅	Antoine Fuqua ✅	Slavi / Teri ✅	✅ full

The base cut's degradation is deeper than "knowledge is hard to reach" — it corrupts the title itself when the title is handed to it. The -K variant is a real, measurable recovery on the core facts, though director and character names remain confabulated.

Results — reverse recall (dialogue → title, no title given)

Model	Answer	Verdict
base 504B (×4 tries)	"Babec", "Frozen Love", "Boruto: The Movie", "bootleg .srt"	❌ all invented
official GLM-5.2, thinking off	"The Equalizer (2014), Denzel Washington, Alina = Chloë Grace Moretz"	✅
official GLM-5.2, thinking on, low budget	truncated mid-reasoning (`finish_reason: length`)	artifact
official GLM-5.2, thinking on, max budget	correct	✅

Note on the finish_reason: length result: this is not abstention — with z.ai routing, CoT goes into reasoning_content and the model can spend its full budget there, leaving no tokens for the final answer. Budget allocation artifact, not a capability issue.
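To keep this artifact from being scored as a wrong answer or an abstention, a probe harness could label it separately. This is a rough sketch, assuming an OpenAI-style chat completion choice in dictionary form: `finish_reason` and `reasoning_content` come from the note above, while `message.content` and the label strings are assumptions for illustration.

```python
def classify_completion(choice: dict) -> str:
    """Label one completion choice so budget artifacts are not scored as failures."""
    message = choice.get("message", {})
    answer = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    if choice.get("finish_reason") == "length" and not answer:
        # The token budget ran out before any final answer was written.
        # If reasoning text exists, the budget was spent on chain-of-thought.
        return "budget_artifact" if reasoning else "truncated_empty"
    return "answered" if answer else "empty"


# Hypothetical response shaped like the low-budget thinking run.
choice = {
    "finish_reason": "length",
    "message": {"content": "", "reasoning_content": "The dialogue mentions..."},
}
print(classify_completion(choice))
```

Runs labelled `budget_artifact` are best re-run with a larger budget rather than counted against the model.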

The FullKD surprise

FullKD is tuned on coding-agent traces (codex/opencode/cursor/claude-code) with full KD data and uniform weighting. Despite being the "quality-maximized" sibling by the metrics on the card, it performs worse than the base cut on knowledge tasks:

Mangles the lead actor's name to "Denzel Hughes" and then thrashes on it
Director recall degenerates into a visible self-correction loop: "Anthony Baker → Anthony Alexander → Avi Anthony Alexander → …" repeating until truncation

This is the attractor behaviour documented on the card — our test hit it live. If you're running a knowledge-heavy application, FullKD is not the variant to reach for.

Looping / stability

Base 504B with thinking on and a large budget degenerated into:

"Let's think about the movie The Red Room... No." ×~30, then truncation.

This matches 0xSero's documented loop rate doubling (3.6% → 7.2%). One serving note: the b12x image emits chain-of-thought in a reasoning field rather than reasoning_content — account for that if you're parsing responses.
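For anyone parsing responses from both setups, here is a minimal sketch that reads the chain-of-thought from whichever field is present and flags the kind of verbatim repetition quoted above. It assumes the message is available as a plain dictionary; the helper names and the threshold of 5 repeats are arbitrary choices for illustration, not something the model card specifies.

```python
import re
from collections import Counter


def get_reasoning(message: dict) -> str:
    """Return chain-of-thought text from whichever field the server used."""
    return message.get("reasoning_content") or message.get("reasoning") or ""


def max_sentence_repeats(text: str) -> int:
    """Count how many times the most frequent sentence occurs in the text."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [part.strip().lower() for part in parts if part.strip()]
    return max(Counter(sentences).values(), default=0)


def looks_like_loop(message: dict, threshold: int = 5) -> bool:
    """Flag a response whose reasoning repeats one sentence at least `threshold` times."""
    return max_sentence_repeats(get_reasoning(message)) >= threshold


# Hypothetical message using the `reasoning` field, with the loop quoted above.
looping = {"reasoning": "Let's think about the movie The Red Room... No. " * 30}
print(looks_like_loop(looping))
```

Exact sentence matching catches the verbatim loop quoted above, but it would miss drifting loops like the FullKD director spiral, where each iteration differs slightly.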

Root cause (per 0xSero's cards, corroborated)

Saliency for the base cut was computed from coding traces. Experts that fire primarily on reasoning and world knowledge are under-weighted by coding saliency, so they got pruned: the information was in the training data, but the experts that access it were removed. The -K variant re-includes the 8 highest-priority knowledge-exclusive experts per layer that coding saliency drops, with Router-KD on 6× the calibration data. It helps meaningfully but doesn't close the gap fully; the card says exactly that, and our test agrees.
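The mechanism described above can be illustrated with a toy example. This is not 0xSero's pipeline or the actual REAP criterion; it is a sketch under the assumption that each expert in a layer has one coding-saliency score and one knowledge-saliency score, with made-up numbers and a hypothetical `select_experts` helper.

```python
def select_experts(coding_sal, knowledge_sal, keep, reinclude):
    """Keep the top `keep` experts by coding saliency, then rescue the
    `reinclude` dropped experts with the highest knowledge saliency."""
    n = len(coding_sal)
    by_coding = sorted(range(n), key=lambda e: coding_sal[e], reverse=True)
    kept = set(by_coding[:keep])
    dropped = [e for e in range(n) if e not in kept]
    rescued = sorted(dropped, key=lambda e: knowledge_sal[e], reverse=True)[:reinclude]
    return sorted(kept), sorted(rescued)


# Toy layer with 6 experts; experts 2 and 4 matter mostly for knowledge.
coding = [0.9, 0.8, 0.1, 0.7, 0.05, 0.2]
knowledge = [0.1, 0.2, 0.9, 0.1, 0.8, 0.3]

kept, rescued = select_experts(coding, knowledge, keep=3, reinclude=1)
print(kept, rescued)
```

With coding saliency alone, both knowledge-heavy experts are dropped. Rescuing one of them restores part of the knowledge path but not all of it, which is consistent with the partial recovery seen for -K.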

billob01

1 day ago

thanks for sharing the results!
