Knowledge degradation across the 504B REAP variants: empirical findings and which cut to pick

#2
by rene98c - opened

0xSero, good stuff! The model is an absolute blast, love it. But I ran into a few issues with it. Not a request, just an observation.

The following is a summary of my findings, written by Sonnet:

I ran all three public REAP variants through a knowledge-recall probe and want to share the results. 0xSero's model cards document the trade-offs honestly, and this post is meant to corroborate those findings with concrete numbers, help people pick the right variant, and flag one unexpected result (FullKD).

The test

Movie identification in two directions:

  1. Reverse recall: given a transcription of distinctive dialogue, identify the film. (Tests whether the model can access knowledge from weak surface cues.)
  2. Forward recall: given the film title, recall the director and character names.

These two together cleanly separate "the knowledge is there" from "the model can confabulate plausibly." All runs: temperature ~0.3, thinking off (unless noted), no tools.
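
For anyone who wants to automate a first pass over forward-recall answers, here is a minimal sketch. The `score_forward_recall` helper and the `expected` dictionary are my own illustration, not the harness used for the numbers in this post. It only does case-insensitive substring matching, so a human still has to read the answers to catch near-misses and confabulations.

```python
def score_forward_recall(response: str, expected: dict[str, list[str]]) -> dict[str, bool]:
    """Mark each field True if any accepted answer appears in the response.

    Hypothetical helper: plain case-insensitive substring matching.
    """
    text = response.lower()
    return {
        field: any(answer.lower() in text for answer in answers)
        for field, answers in expected.items()
    }


# Accepted answers for The Equalizer (2014); the field names are illustrative.
expected = {
    "lead_actor": ["Denzel Washington"],
    "director": ["Antoine Fuqua"],
}

print(score_forward_recall("Stars Denzel Washington, directed by Anthony Baker.", expected))
# → {'lead_actor': True, 'director': False}
```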


Results: forward recall (title given, recall details)

| Model | Title | Lead actor | Director | Character names | Verdict |
|---|---|---|---|---|---|
| base 504B | "Equalizer 2000" ❌ | "Denzel" | - | "Victor" | ❌ corrupted |
| 504B-K | "The Equalizer" ✅ | Denzel Washington ✅ (consistent ×2) | "Ovenden"/"Golab" ❌ | "Vladimir" ❌ | ◑ partial |
| 504B-FullKD | "Equalizer 2015" ❌ | "Denzel Hughes" → thrash | loop spiral ❌ | confabulated | ❌ + looping |
| official GLM-5.2 | ✅ | Denzel Washington ✅ | Antoine Fuqua ✅ | Slavi / Teri ✅ | ✅ full |

The base cut's degradation is deeper than "knowledge is hard to reach": it corrupts the title itself even when the title is handed to it. The -K variant is a real, measurable recovery on the core facts, though director and character names remain confabulated.


Results: reverse recall (dialogue → title, no title given)

| Model | Answer | Verdict |
|---|---|---|
| base 504B (×4 tries) | "Babec", "Frozen Love", "Boruto: The Movie", "bootleg .srt" | ❌ all invented |
| official GLM-5.2, thinking off | "The Equalizer (2014), Denzel Washington, Alina = Chloë Grace Moretz" | ✅ |
| official GLM-5.2, thinking on, low budget | truncated mid-reasoning (finish_reason: length) | artifact |
| official GLM-5.2, thinking on, max budget | correct | ✅ |

Note on the finish_reason: length result: this is not abstention. With z.ai routing, CoT goes into reasoning_content and the model can spend its full budget there, leaving no tokens for the final answer. It is a budget allocation artifact, not a capability issue.
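
A minimal sketch of how one might tell this artifact apart from a real abstention when parsing responses. It assumes an OpenAI-style chat completion dict with `choices[0].finish_reason` and a `message` holding `content` and `reasoning_content`. The `classify_response` name and its three labels are hypothetical, chosen only for illustration.

```python
def classify_response(completion: dict) -> str:
    """Label a chat completion as 'answered', 'budget_exhausted', or 'empty'.

    Assumes an OpenAI-style response dict (an assumption, check your server).
    """
    choice = completion["choices"][0]
    message = choice.get("message", {})
    answer = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    if answer:
        return "answered"
    # No final answer, reasoning present, and generation hit the token cap:
    # the budget went to chain-of-thought, which is not a refusal to answer.
    if choice.get("finish_reason") == "length" and reasoning:
        return "budget_exhausted"
    return "empty"
```

Runs labelled `budget_exhausted` are worth retrying with a larger token budget before counting them as failures.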


The FullKD surprise

FullKD is tuned on coding-agent traces (codex/opencode/cursor/claude-code) with full KD data and uniform weighting. Despite being the "quality-maximized" sibling by the metrics on the card, it performs worse than the base cut on knowledge tasks:

  • Mangles the lead actor's name to "Denzel Hughes" and then thrashes on it
  • Director recall degenerates into a visible self-correction loop: "Anthony Baker → Anthony Alexander → Avi Anthony Alexander → …" repeating until truncation

This is the attractor behaviour documented on the card; our test hit it live. If you're running a knowledge-heavy application, FullKD is not the variant to reach for.


Looping / stability

Base 504B with thinking on and a large budget degenerated into:

"Let's think about the movie The Red Room... No." ×~30, then truncation.

This matches 0xSero's documented loop rate doubling (3.6% → 7.2%). One serving note: the b12x image emits chain-of-thought in a reasoning field rather than reasoning_content, so account for that if you're parsing responses.
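
A rough sketch of how a parser could handle both points: read the chain-of-thought from whichever field the server used, then flag degenerate repetition. The helper names `get_reasoning` and `max_repeat` are hypothetical, the message is assumed to be a plain dict, and the sentence splitting is deliberately naive.

```python
import re
from collections import Counter


def get_reasoning(message: dict) -> str:
    """Return chain-of-thought from `reasoning_content` or, failing that, `reasoning`."""
    return message.get("reasoning_content") or message.get("reasoning") or ""


def max_repeat(text: str) -> int:
    """Count how often the most frequent sentence occurs (0 for empty text)."""
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return max(Counter(sentences).values(), default=0)


looping = "Let's think about the movie The Red Room... No. " * 30
print(max_repeat(get_reasoning({"reasoning": looping})))
# → 30
```

What counts as a loop is a judgment call; any threshold on `max_repeat` would need tuning against real traces.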


Root cause (per 0xSero's cards, corroborated)

Saliency for the base cut was computed from coding traces. Experts that fire primarily on reasoning and world knowledge are under-weighted by coding saliency, so they got pruned: the information was seen in training, but the experts that access it were removed. The -K variant re-includes the 8 highest-priority knowledge-exclusive experts per layer that coding saliency drops, with Router-KD on 6× the calibration data. It helps meaningfully but doesn't close the gap fully. The card says exactly that, and our test agrees.
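
To make the selection idea concrete, here is a toy sketch for a single layer. It is only an illustration of the description above, not 0xSero's actual pruning code: the `select_experts` function, the saliency lists, and the simplification of "knowledge-exclusive" to "highest knowledge saliency among the experts that coding saliency dropped" are all my assumptions.

```python
def select_experts(coding_saliency, knowledge_saliency, keep, extra=8):
    """Pick expert indices to retain for one layer (toy illustration).

    coding_saliency / knowledge_saliency: per-expert scores (hypothetical inputs).
    keep: number of experts retained by coding saliency alone.
    extra: number of dropped experts re-included by knowledge saliency.
    """
    n = len(coding_saliency)
    by_coding = sorted(range(n), key=lambda i: coding_saliency[i], reverse=True)
    kept, dropped = by_coding[:keep], by_coding[keep:]
    # Among experts that coding saliency would prune, rescue the ones
    # that matter most for knowledge recall.
    rescued = sorted(dropped, key=lambda i: knowledge_saliency[i], reverse=True)[:extra]
    return sorted(kept + rescued)


coding = [0.9, 0.8, 0.1, 0.2, 0.05, 0.7]
knowledge = [0.1, 0.2, 0.9, 0.3, 0.8, 0.1]
print(select_experts(coding, knowledge, keep=3, extra=1))
# → [0, 1, 2, 5]
```

In the toy run, expert 2 scores low on coding but highest on knowledge, so it is the one that gets rescued.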


thanks for sharing the results!
