CFM-Proof-3B · MorphMind

A control model that reads a mathematical proof and tells you where it breaks. Give CFM-Proof-3B a theorem and its proof and it returns a structured verdict (support or refute), tries to pinpoint the offending step, and explains why. It is built as a high-recall reviewer: it is tuned to flag as many questionable proofs as possible so that few errors slip past a human reader.

CFM-Proof-3B is the first release in MorphMind's Control Foundation Model (CFM) line: models whose job is not to generate science but to check it.

By MorphMind. Research preview.

Benchmark: proof-error recall vs. frontier models

Recall (share of injected proof errors caught) on the same 150-proof held-out sample. Every model was given JSON output and an adequate token budget, so the comparison is like-for-like:

| Model | Recall (errors caught) | Size |
|---|---|---|
| base Qwen2.5-3B (zero-shot) | 0.04 | 3B |
| Claude Opus 4.8 | 0.61 | frontier |
| GPT-5.4 | 0.84 | frontier |
| CFM-Proof-3B (ours) | 0.88 | 3B · single GPU |

On this held-out sample, CFM-Proof-3B is competitive with frontier models on error catch rate at roughly 1/100 the size, running on a single GPU. On the full 1,977-proof test set and on an entirely held-out domain, its robust recall is 0.83 and 0.82 respectively (localization 0.30 and 0.28), and it is consistent across fields (cs.CC 0.87 · cs.IT 0.84 · cs.LG 0.84 · math.OC 0.84 · stat 0.80). Read the table as a recall screen, not a verdict on overall capability: the models sit at different precision/recall trade-offs. Claude Opus 4.8 is more conservative (higher precision, lower recall), while CFM-Proof-3B and GPT-5.4 favor recall, the right bias for a first-pass screen that must not miss errors.
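To make the two metrics concrete, here is a minimal sketch of how recall and precision fall out of per-proof flags. The function name and the toy labels are illustrative assumptions; this is not the evaluation harness behind the table.

```python
def recall_precision(has_error, flagged):
    """Compute (recall, precision) from two parallel lists of booleans.

    has_error[i]: True if proof i contains an injected error (ground truth).
    flagged[i]:   True if the reviewer returned "refute" for proof i.
    """
    pairs = list(zip(has_error, flagged))
    tp = sum(e and f for e, f in pairs)        # errors caught
    fn = sum(e and not f for e, f in pairs)    # errors missed
    fp = sum(f and not e for e, f in pairs)    # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy data, made up for illustration: 4 flawed proofs and 4 sound ones.
has_error = [True, True, True, True, False, False, False, False]
flagged = [True, True, True, False, True, True, True, False]
recall, precision = recall_precision(has_error, flagged)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A recall-first reviewer accepts more false alarms (lower precision) in exchange for fewer missed errors (higher recall), which is the trade-off described above.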

When & how to use it

Use CFM-Proof-3B as a fast first-pass reviewer: to catch slips before a human deep-read, to triage a stack of submissions, or to vet AI-generated proofs. It is most valuable wherever a missed error is expensive, such as refereeing, internal review, grading, and automated theorem generation.

The unit of review is one claim plus its proof, not a whole paper. For a long paper, screen it piece by piece:

  1. Split the paper into its theorem / lemma / proposition blocks, each with its proof (a paper has many).
  2. Run CFM-Proof-3B on each block independently.
  3. Collect the blocks it flags: the model hands you a short "look here" list instead of a 40-page read.

This keeps every input short (one proof, the form it was trained on) and scales cleanly to long papers and large batches. Because it is tuned for recall, treat its flags as "worth a human's 30 seconds": it is a screen, not a final judge.
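Step 1 of the list above (splitting a paper into blocks) could be sketched as follows. This is a hypothetical implementation of a splitter, assuming the paper is LaTeX source that uses standard amsthm-style environment names and that each proof directly follows the claim it proves. Claims without a following proof environment are skipped, and real papers with custom environments or deferred proofs would need more care.

```python
import re

# Assumption: the paper uses these common environment names for its claims.
CLAIM_ENVS = ("theorem", "lemma", "proposition", "corollary")

# Matches one claim or proof environment at a time (starred variants included).
_ENV_RE = re.compile(
    r"\\begin\{(" + "|".join(CLAIM_ENVS + ("proof",)) + r")\*?\}(.*?)\\end\{\1\*?\}",
    re.DOTALL,
)


def split_into_proof_blocks(latex):
    """Return a list of (claim, proof) pairs found in LaTeX source."""
    blocks = []
    pending = None  # most recent claim still waiting for its proof
    for match in _ENV_RE.finditer(latex):
        env, body = match.group(1), match.group(2).strip()
        if env == "proof":
            if pending is not None:
                blocks.append((pending, body))
                pending = None
        else:
            pending = body  # a newer claim replaces an unproved older one
    return blocks


paper = r"""
\begin{lemma}
If $x > 0$ then $x^2 > 0$.
\end{lemma}
\begin{proof}
Multiply both sides of $x > 0$ by $x$.
\end{proof}
\begin{theorem}
Stated without proof.
\end{theorem}
"""
blocks = split_into_proof_blocks(paper)
print(len(blocks), "block(s) to review")
```

Each resulting pair is one unit of review: a single claim and its proof.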

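from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the tokenizer and the model in bfloat16.
# device_map="auto" needs the accelerate package installed.
tok = AutoTokenizer.from_pretrained("MorphMind-AI/CFM-Proof-3B")
model = AutoModelForCausalLM.from_pretrained("MorphMind-AI/CFM-Proof-3B",
                                             torch_dtype=torch.bfloat16, device_map="auto")

# System prompt that fixes the JSON schema of the verdict.
SYSTEM = ("You are a scientific correctness reviewer. Review the theorem and proof and respond ONLY "
          "with JSON: {\"analysis\":...,\"verdict\":\"support|refute\","
          "\"error_spans\":[{\"text\":...,\"why\":...}],\"action\":\"accept|suggest_edit\"}")

def review(theorem, proof):
    """Return the model's raw JSON verdict (a string) for one theorem and its proof."""
    msgs = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"THEOREM:\n{theorem}\n\nPROOF:\n{proof}"}]
    # Tokenize with the chat template; return_dict=True also yields the attention mask.
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                     return_tensors="pt", return_dict=True).to(model.device)
    # Greedy decoding, capped at 320 new tokens.
    out = model.generate(**inputs, max_new_tokens=320, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# For a long paper, review each block independently. split_into_proof_blocks is a
# helper you supply; it is not part of transformers.
# for theorem, proof in split_into_proof_blocks(paper):
#     print(review(theorem, proof))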
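The snippet above returns raw text, so a caller still has to turn it into a decision. Below is a minimal sketch of that step, assuming the output follows the JSON schema in the system prompt. The helper names `parse_review` and `needs_human_look` are hypothetical, and treating unparseable output (for example, JSON cut off by the 320-token cap) as a flag is an assumed recall-first policy, not documented behavior.

```python
import json


def parse_review(raw):
    """Extract the JSON verdict from raw model text; return None if unusable."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Only the two verdicts named in the system prompt are accepted.
    if not isinstance(data, dict) or data.get("verdict") not in ("support", "refute"):
        return None
    return data


def needs_human_look(raw):
    """Assumed policy: flag on "refute" and on anything that cannot be parsed."""
    parsed = parse_review(raw)
    return parsed is None or parsed["verdict"] == "refute"


# Hand-written example output, not produced by the model.
example = ('{"analysis": "Step 2 reverses the inequality.", "verdict": "refute", '
           '"error_spans": [{"text": "a <= b", "why": "should be >="}], '
           '"action": "suggest_edit"}')
print(needs_human_look(example))
```

Flagging unparseable output keeps the screen conservative: a malformed response costs a human a quick look instead of silently passing a proof.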

How it was built

A short supervised warm-start, then RLVR (Reinforcement Learning from Verifiable Rewards): the model proposes a verdict, an automatic checker validates it against ground truth, and only verifiably correct answers are reinforced. No model-as-judge is involved. The model was trained on public arXiv LaTeX proofs across statistics, probability, optimization, CS theory, and ML theory.
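The idea of a verifiable reward can be illustrated with a small sketch. The card does not describe the actual checker or any reward shaping, so the function below is an assumption: a binary reward that pays out only when the output is valid JSON and its verdict matches the known label.

```python
import json


def verifiable_reward(model_output, gold_verdict):
    """Hypothetical binary reward: 1.0 if the verdict matches ground truth, else 0.0."""
    try:
        parsed = json.loads(model_output)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # malformed output earns nothing
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if parsed.get("verdict") == gold_verdict else 0.0


# A proof with an injected error has the gold label "refute".
print(verifiable_reward('{"verdict": "refute"}', "refute"))
print(verifiable_reward('{"verdict": "support"}', "refute"))
```

Because the label comes from a known injected error and not from another model's opinion, the reward can be checked mechanically.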

Limitations

CFM-Proof-3B is a recall-first screen, and is deliberately built that way:

  • It over-flags (precision ≈ 0.5) by design. It is far cheaper to dismiss a false alarm in seconds than to ship a missed error, so it errs toward flagging. Keep a human in the loop.
  • It catches ≈83% of errors, not 100%: a strong screen, not a proof of correctness.
  • It localizes the exact step ≈30% of the time; otherwise it tells you the proof is suspect and why, and you scan.
  • It was trained on representative injected errors (reversed inequalities, sign flips, altered constants); real-world mistakes outside these patterns may be missed, and coverage is expected to broaden with each release.
  • This is a research preview; a larger, permissively licensed CFM-Proof-7B is in training.
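To show what an injected error of the kind listed above might look like, here is a hypothetical sketch that reverses the first inequality in a LaTeX proof. The card does not describe the real injection pipeline; the function name, the small set of relations handled, and the return format are all assumptions.

```python
import re

# Assumption: a tiny illustrative subset of relations to swap.
_FLIP = {r"\leq": r"\geq", r"\geq": r"\leq", "<": ">", ">": "<"}
_REL_RE = re.compile(r"\\leq|\\geq|<|>")


def reverse_first_inequality(proof):
    """Flip the first inequality found.

    Returns (corrupted_proof, original_relation), or None if the proof
    contains no relation from the supported subset.
    """
    match = _REL_RE.search(proof)
    if match is None:
        return None
    corrupted = proof[:match.start()] + _FLIP[match.group(0)] + proof[match.end():]
    return corrupted, match.group(0)


proof = r"Since $a \leq b$, we get $a + c \leq b + c$."
corrupted, relation = reverse_first_inequality(proof)
print(corrupted)
```

Recording which relation was changed gives a ground-truth location for the error, which is what makes both the verdict and the localization checkable.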

License

Released under the MorphMind CFM Research License (see LICENSE), which incorporates the Qwen Research License of the underlying Qwen2.5-3B base. Research / non-commercial use, with attribution to MorphMind and Qwen. For commercial licensing, contact MorphMind (morphmind.ai).

Citation

MorphMind. CFM-Proof-3B: a control foundation model for scientific-proof correctness. 2026.
