CFM-Proof-7B · MorphMind

A control model that reads a mathematical proof and tells you where it breaks. Give CFM-Proof-7B a theorem and its proof and it returns a structured verdict — support or refute — pinpoints the offending step, and explains why. It is built as a high-recall reviewer: it surfaces nearly every questionable step so a human misses almost nothing.

CFM-Proof-7B is the flagship of MorphMind's Control Foundation Model (CFM) line — models whose job is not to generate science but to check it. It is a full-parameter fine-tune that scales up CFM-Proof-3B, lifting recall from 0.83 to 0.95.

By MorphMind. Research preview.

Benchmark — proof-error recall vs. frontier models

CFM-Proof-7B proof-error recall

Recall (share of injected proof errors caught) on the same 150-proof held-out sample — every model given JSON output and an adequate token budget, so the comparison is like-for-like:

Model Recall (errors caught) Size
base Qwen (untuned) 0.04 —
Claude Opus 4.8 0.61 frontier
GPT-5.4 0.84 frontier
CFM-Proof-3B 0.88 3B
CFM-Proof-7B (ours) 0.96 7B · single GPU

On the full 1,977-proof test and an entirely held-out domain, CFM-Proof-7B's recall is 0.95 / 0.95 — it catches 95% of injected errors while running on a single GPU, at a fraction of the cost of frontier APIs. Read the table as a recall screen, not a verdict on overall capability: the models sit at different precision/recall trade-offs — Opus is more conservative (higher precision, lower recall), while CFM-Proof-7B favors recall, the right bias for a first-pass screen that must not miss errors.

When & how to use it

Use CFM-Proof-7B as a fast first-pass reviewer — to catch slips before a human deep-read, to triage a stack of submissions, or to vet AI-generated proofs. It is most valuable wherever a missed error is expensive: refereeing, internal review, grading, automated theorem generation.

The unit of review is one claim + its proof — not a whole paper. For a long paper, screen it piece by piece:

  1. Split the paper into its theorem / lemma / proposition blocks, each with its proof.
  2. Run CFM-Proof-7B on each block independently.
  3. Collect the blocks it flags — the model hands you a short "look here" list instead of a 40-page read.

This keeps every input short (one proof, the form it was trained on) and scales cleanly to long papers and large batches. Because it is tuned for recall, treat its flags as "worth a human's 30 seconds" — it is a screen, not a final judge.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("MorphMind-AI/CFM-Proof-7B")
model = AutoModelForCausalLM.from_pretrained("MorphMind-AI/CFM-Proof-7B",
                                             torch_dtype=torch.bfloat16, device_map="auto")
SYSTEM = ("You are a scientific correctness reviewer. Review the theorem and proof and respond ONLY "
          "with JSON: {\"analysis\":...,\"verdict\":\"support|refute\","
          "\"error_spans\":[{\"text\":...,\"why\":...}],\"action\":\"accept|suggest_edit\"}")

def review(theorem, proof):
    msgs=[{"role":"system","content":SYSTEM},
          {"role":"user","content":f"THEOREM:\n{theorem}\n\nPROOF:\n{proof}"}]
    ids=tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out=model.generate(ids, max_new_tokens=320, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# For a long paper: for theorem, proof in split_into_proof_blocks(paper): review(theorem, proof)

How it was built

A full-parameter fine-tune of Qwen2.5-7B-Instruct, trained with RLVR — Reinforcement Learning from Verifiable Rewards: the model proposes a verdict, an automatic checker validates it against ground truth, and only verifiably-correct answers are reinforced. No model-as-judge. Trained on public arXiv LaTeX proofs across statistics, probability, optimization, CS-theory, and ML theory.

Limitations

CFM-Proof-7B is a recall-first screen, and is deliberately built that way:

  • It over-flags (precision ≈ 0.5) — by design. It is far cheaper to dismiss a false alarm in seconds than to ship a missed error, so it errs toward flagging. Keep a human in the loop.
  • It catches ≈95% of errors, not 100% — a strong screen, not a proof of correctness.
  • It localizes the exact step ≈10% of the time; otherwise it tells you the proof is suspect and why, and you scan.
  • It was trained on representative injected errors (reversed inequalities, sign flips, altered constants); coverage of every real-world mistake will keep improving with each release.
  • This is a research preview; a multi-domain CFM-7B (adding methodology-conformance and novelty checks) is in training.

License

Released under the MorphMind CFM Research License (see LICENSE). The underlying Qwen2.5-7B base is Apache-2.0; this fine-tune is distributed for research / non-commercial use, with attribution to MorphMind and Qwen. For commercial licensing, contact MorphMind (morphmind.ai).

Citation

MorphMind. CFM-Proof-7B: a control foundation model for scientific-proof correctness. 2026.

Downloads last month
-
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for MorphMind-AI/CFM-Proof-7B

Base model

Qwen/Qwen2.5-7B
Finetuned
(2604)
this model