Distilled Reasoning or Just XML Cosplay?

#2
by SuperSonnix71 - opened

So let me get this right.
The model is trained on 4,659 rows from one developer’s Claude Code sessions, mostly narrow web and game dev work.
The “reasoning traces” are not clearly verified as real raw reasoning traces.
Some sources apparently had redacted thinking blocks, so the CoT may be reconstructed or added after the fact.
There are no formal evals published.
The tool names do not reliably match the original Claude Code tools, so it may need a wrapper just to behave properly.
And the pitch is still basically: “we distilled Claude-level agentic reasoning into Qwen.”
That is not verified reasoning.
That is format imitation vs actual capability.
XML cosplay vs engineering evidence.

You're right on most of these. Almost all of them are stated explicitly in the model card — I'd rather we agree on what this is than have it overclaimed.

Specifically:

  • Narrow distribution: yes, the "Honest scope" and "Dataset provenance" sections both call this out. ~4,659 rows, one developer's Claude Code sessions across web/game/physics work, plus a Boeing 747 trace. Not broad.
  • Reasoning traces verification: also acknowledged. armand0e/claude-fable-5-claude-code and victor/fable-5-boeing-747-trace (the two Fable-5 sources I tried first) had 100% redacted thinking blocks — Anthropic's preview-model IP protection. Only Glint-Research/Fable-5-traces ships cleartext CoT, and per Glint's own README they added it themselves post-hoc. That's documented in the provenance chain and the "Note on the other Fable-5 sources" subsection.
  • No formal evals: pending. Every row in the Evaluation table is 🚧 in progress. Standing rule on this project: blank-until-verified, omit-rather-than-mislead. Numbers when they're real.
  • Tool names don't bind to Claude Code's inventory: yes, the model emits str_replace_editor and read_file instead of Edit and Read. That's called out in "Tool names are not bound to the Claude Code inventory." Downstream consumers define their own tool registry, but yes, this matters.
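
For the tool-name point, here is a minimal sketch of what a downstream registry shim could look like. It is not code from the repo: the alias table, the `resolve_tool` helper, and the stub handlers are all hypothetical, and the only part taken from the card is that the model emits str_replace_editor and read_file where Claude Code uses Edit and Read.

```python
# Hypothetical alias table: name the model emits -> name the harness registers.
# Only these two pairs are mentioned in this thread; anything else is unknown.
TOOL_ALIASES = {
    "str_replace_editor": "Edit",
    "read_file": "Read",
}


def resolve_tool(emitted_name, registry):
    """Look up the handler for an emitted tool name.

    Names without an alias are passed through unchanged, so a harness that
    already registers the emitted names keeps working.
    """
    canonical = TOOL_ALIASES.get(emitted_name, emitted_name)
    if canonical not in registry:
        raise KeyError(f"no handler registered for tool {emitted_name!r}")
    return registry[canonical]


# Illustrative registry with stub handlers (assumed, not real tool code).
registry = {
    "Edit": lambda args: f"edit {args['path']}",
    "Read": lambda args: f"read {args['path']}",
}

handler = resolve_tool("read_file", registry)
print(handler({"path": "main.py"}))
```

Failing loudly on an unknown name is a deliberate choice in this sketch: a silently dropped tool call would look like a capability failure in an eval rather than a wiring problem.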

The pitch is NOT "we distilled Claude-level agentic reasoning into Qwen." The TL;DR explicitly says reasoning ability comes from the Opus 4.7 step in the chain, not from Fable-5. Fable-5 contributes the agentic tool-use axis — system-prompt-conditional, narrower, and the card says so. The TL;DR even spells out the fallback: bare prompts produce markdown code blocks, not XML tool calls.

On "format imitation vs actual capability" — fair distinction, and one I'd answer with evals, not arguments. SWE-bench Lite with an OpenHands harness is in flight (the proxy + runner are committed in the source repo at training/swe_bench/). When that number lands, we'll know if this is engineering evidence or XML cosplay. Until then I won't claim more than the card already does.

If any specific line reads as overclaiming, point at it and I'll tighten it.
