How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MorphMind-AI/CFM-Methods-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
# Qwen2.5-3B is a text-only causal LM, so AutoModelForCausalLM is the matching class
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MorphMind-AI/CFM-Methods-3B")
model = AutoModelForCausalLM.from_pretrained("MorphMind-AI/CFM-Methods-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
CFM-Methods-3B · MorphMind

A tiny control model that reads a methods section and tells you exactly where the methodology is unsound. Give it a methods or experimental-design block from any empirical-science paper --- statistics, machine learning, quantitative biology, econometrics, materials science, or chemical physics --- and it returns a structured verdict (support or refute), pinpoints the offending statement, and explains why. It is a high-recall screen: it surfaces methodological red flags --- data leakage, p-hacking, uncorrected multiple comparisons, train/test contamination, optional stopping, correlation-as-causation, post-hoc outlier removal, unblinded scoring, and more --- so that a human reviewer misses almost nothing.

At just 3B parameters, CFM-Methods-3B delivers frontier-level methodology screening that runs on a single GPU, on-premise, at a tiny fraction of the cost of a frontier API. It is the compact member of MorphMind's Control Foundation Model (CFM) line --- models whose job is not to generate science but to check it.

By MorphMind. Research preview.

Benchmark --- methodology-flaw detection vs. frontier models

Evaluated on flaw types the model never trained on (24 flaw families used for training, 12 held out for evaluation), benchmarked head-to-head against frontier commercial models on the same held-out set:

| Model | Recall | Precision | Localization | False-positive rate (clean) |
|---|---|---|---|---|
| base Qwen2.5-3B | 0.30 | --- | 0.42 | 0.07 |
| GPT-4o | 0.86 | 0.64 | 0.94 | 0.47 |
| Claude Opus 4 | 0.96 | 0.78 | 0.97 | 0.28 |
| CFM-Methods-3B (ours) | 0.98 | 1.00 | 0.97 | 0.005 |

CFM-Methods-3B matches frontier recall and localization, with the cleanest false-alarm rate --- effectively zero (0.5%). It catches 98% of methodological flaws from flaw types it has never seen (Claude Opus 4: 96%, GPT-4o: 86%) and pinpoints the exact flawed statement 97% of the time, level with Claude Opus 4 (97%) and slightly ahead of GPT-4o (94%), while the frontier models over-flag clean methods heavily (Opus 28%, GPT-4o 47% false-positive rate). So it delivers frontier-grade methodology screening with the precision of a careful expert --- on-prem, in a 3B model, at a tiny fraction of the cost.
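
For readers who want to reproduce numbers of this kind on their own data, the sketch below shows one way the four columns could be computed from confusion counts. The card does not spell out its exact definitions, so the formulas are assumptions (standard recall, precision, and false-positive rate, plus localization measured over detected flaws). The `Counts` class, the `screening_metrics` function, and the counts themselves are hypothetical and are not the benchmark's data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical confusion counts for a screening run (not benchmark data)."""
    tp: int         # flawed blocks flagged as "refute"
    fn: int         # flawed blocks passed as "support"
    fp: int         # clean blocks flagged as "refute"
    tn: int         # clean blocks passed as "support"
    localized: int  # flagged flawed blocks whose error span matches the true flaw


def screening_metrics(c: Counts) -> dict:
    """Compute the four table columns under assumed, standard definitions."""
    return {
        "recall": c.tp / (c.tp + c.fn),
        "precision": c.tp / (c.tp + c.fp),
        # Assumption: localization is measured over detected flaws only.
        "localization": c.localized / c.tp,
        # Measured on clean blocks only, matching the "(clean)" column label.
        "false_positive_rate": c.fp / (c.fp + c.tn),
    }


# Illustrative counts only; chosen for easy arithmetic, not taken from the card.
example = screening_metrics(Counts(tp=90, fn=10, fp=5, tn=95, localized=81))
print(example)
```

Because precision depends on the ratio of flawed to clean blocks in the evaluation set while the false-positive rate does not, the two columns are worth reading together.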

Worked example

1 --- it catches an uncorrected multiple-comparisons flaw. Given this methods block:

"We screened 60 candidate protein markers for association with disease status. Each marker was tested individually with a univariate logistic regression at the 0.05 level. The 14 markers reaching p < 0.05 in univariate tests are reported as significant and carried forward as the disease signature."

CFM-Methods-3B returns (verbatim model output):

{
  "analysis": "There is a methodological flaw: Many tests without correction inflate the false-positive rate.",
  "verdict": "refute",
  "error_spans": [
    {
      "text": "The 14 markers reaching p < 0.05 in univariate tests are reported as significant",
      "why": "Many tests without correction inflate the false-positive rate."
    }
  ],
  "action": "suggest_edit"
}

It pinpoints the exact offending sentence and names the failure mode --- 60 simultaneous tests at α = 0.05 with no correction.
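
The arithmetic behind this flag can be checked in a few lines. The sketch below is an illustration, not part of the model: it assumes the 60 tests are independent and that no marker is truly associated with disease status (the global null), and it uses the Bonferroni correction as one common remedy. The function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def family_wise_error_rate(n_tests: int, alpha: float) -> float:
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_tests, alpha = 60, 0.05  # values taken from the methods block above
print("expected false positives:", expected_false_positives(n_tests, alpha))
print("family-wise error rate:  ", family_wise_error_rate(n_tests, alpha))
print("Bonferroni threshold:    ", bonferroni_threshold(n_tests, alpha))
```

Under these assumptions about 3 of the 60 markers would reach p < 0.05 by chance alone, and at least one false positive is almost certain, which is why reporting all 14 as significant without correction is unsound.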

2 --- and it passes clean methods without crying wolf. Given a sound instrumental-variables design:

"We estimate the causal effect of schooling on wages with two-stage least squares, instrumenting years of education with quarter-of-birth. We report first-stage F-statistics to confirm instrument strength and cluster standard errors at the state level."

{
  "analysis": "The methodology is sound: instrument strength is verified and standard errors are clustered appropriately.",
  "verdict": "support",
  "error_spans": [],
  "action": "accept"
}

No false flag --- the near-zero false-positive rate in the benchmark above is what this looks like in practice.

When & how to use it

Use it as a fast, private, first-pass methodology screen --- a pre-submission self-check for researchers, triage for journals / reviewers / grant panels, QA over a stack of submissions, or a check on AI-generated experimental designs. Review one methods block at a time (split a paper into its method / experiment / analysis sections and run each). Because it is tuned for recall, treat its flags as "worth a human's 30 seconds."

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("MorphMind-AI/CFM-Methods-3B")
model = AutoModelForCausalLM.from_pretrained(
    "MorphMind-AI/CFM-Methods-3B", torch_dtype=torch.bfloat16, device_map="auto"
)

# System prompt that pins the JSON output schema shown in the worked examples.
SYS = ("You are a scientific methodology reviewer. Review the methods and respond ONLY with JSON: "
       "{\"analysis\":...,\"verdict\":\"support|refute\","
       "\"error_spans\":[{\"text\":...,\"why\":...}],\"action\":\"accept|suggest_edit\"}")

def review(methods):
    """Review one methods block and return the model's raw JSON string."""
    msgs = [{"role": "system", "content": SYS},
            {"role": "user", "content": "METHODS:\n" + methods}]
    # return_dict=True yields input_ids plus attention_mask, ready to unpack into generate().
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                     return_dict=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=320, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
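
Since `review` returns a raw string, downstream code still has to parse it and decide what reaches a human. The sketch below is one possible way to do that for a paper split into sections, as suggested above. `parse_review`, `triage`, and the `"unparsed"` verdict are hypothetical helpers introduced here, not part of the model or the Transformers library. Treating unparseable output as flagged is an assumption chosen to keep the screen conservative.

```python
import json
from typing import Callable, Dict, List


def parse_review(raw: str) -> dict:
    """Extract the JSON object from the model's raw output, tolerating stray text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {"verdict": "unparsed", "error_spans": [], "raw": raw}
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        # Assumption: malformed output is routed to a human rather than dropped.
        return {"verdict": "unparsed", "error_spans": [], "raw": raw}


def triage(sections: Dict[str, str], review_fn: Callable[[str], str]) -> List[dict]:
    """Run one review per section and keep everything not explicitly supported."""
    flagged = []
    for name, text in sections.items():
        result = parse_review(review_fn(text))
        if result.get("verdict") != "support":
            flagged.append({
                "section": name,
                "verdict": result.get("verdict"),
                "spans": result.get("error_spans", []),
            })
    return flagged
```

With the model loaded as above, a call would look like `triage({"methods": methods_text, "analysis": analysis_text}, review)`, where the section names and texts are supplied by the caller.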

How it was built

A full-parameter fine-tune of Qwen2.5-3B-Instruct, trained with RLVR (Reinforcement Learning from Verifiable Rewards) under a localization-gated reward --- a verdict is reinforced only if the model also points to the actual flawed statement, which teaches genuine reasoning rather than blanket flagging. Trained on public arXiv methods sections across statistics, machine learning, quantitative biology, econometrics, materials science, and chemical physics, with injected, paraphrased methodological flaws; evaluated on held-out flaw families.
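
The idea of a localization gate can be illustrated with a small reward function. This is a sketch of the general idea only, not MorphMind's training code: the token-overlap measure, the 0.5 threshold, the binary reward values, and the names `token_overlap` and `gated_reward` are all assumptions made for illustration.

```python
import json
from typing import Optional


def token_overlap(pred: str, gold: str) -> float:
    """Fraction of gold-span tokens that also appear in the predicted span."""
    gold_tokens = gold.lower().split()
    if not gold_tokens:
        return 0.0
    pred_tokens = set(pred.lower().split())
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)


def gated_reward(raw_output: str, gold_verdict: str,
                 gold_span: Optional[str], min_overlap: float = 0.5) -> float:
    """Return 1.0 only if the verdict is right AND, for flaws, the span is located."""
    try:
        out = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns nothing
    if not isinstance(out, dict) or out.get("verdict") != gold_verdict:
        return 0.0  # wrong verdict earns nothing
    if gold_verdict == "support":
        return 1.0  # clean block: nothing to localize (assumed convention)
    # Gate: a correct "refute" is rewarded only when some predicted span
    # overlaps the injected flaw, so blanket flagging earns nothing.
    spans = out.get("error_spans") or []
    located = any(
        token_overlap(s.get("text", ""), gold_span or "") >= min_overlap
        for s in spans if isinstance(s, dict)
    )
    return 1.0 if located else 0.0
```

Under this gate, a model that answers "refute" on everything without naming the right statement collects no reward on flawed examples, which is the behaviour the paragraph above describes.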

Notes

  • A high-recall screen for first-pass review: ~98% of flaws surfaced with a near-zero false-alarm rate, designed to keep an expert in the loop for the final call.
  • Generalizes to methodological flaws it has never seen, across six empirical-science families.
  • Part of MorphMind's growing Control Foundation Model family.

License

Released under the MorphMind CFM Research License (see LICENSE), incorporating the Qwen Research License of the Qwen2.5-3B base. Research / non-commercial use, with attribution to MorphMind and Qwen. For commercial licensing, contact MorphMind (morphmind.ai).
