Qwen3-8B BTProp tree-gen SFT (`main_modified1_verify` prompts)

Fine-tuned from Qwen3-8B to perform belief-tree generation for the BTProp fact-checking pipeline. Distilled from Qwen3.5-397B (≈50× larger) trajectories on 9 fact-checking datasets so the BTProp tree-gen stage no longer needs a 397B teacher at inference time.

Training data


Teacher model	Qwen3.5-397B-A17B (FP8)
Datasets	averitec, felm, factcheckgpt, ifqa, wikibio, factors, quantemp, factchd, wiqa
Source pipeline	BTProp `main_modified1_verify` — L1 strategy_select + Step B from `prompts.py`, L2 `mod1_verify` from `prompts_rule_modified_1.py`
Scope	first 10 % examples of each dataset's `data_utils` loader
Positive filter	HMM posterior matches ground-truth label on root segment
SFT step types	`decompose_nodes`, `support_nodes`, `oppose_nodes`, `mod1_verify`
Total samples	4 059 (prompt, response) pairs (90/10 train/val split at root-segment level)
Format	sharegpt JSONL; Q3.5 plaintext "Thinking Process:" CoT wrapped in `<think>…</think>` to match Qwen3 chat template

Training

Hyperparameter	Value
Framework	LLaMA-Factory v0.9.3, full FT
Precision	bf16 + DeepSpeed Zero3
Hardware	8× H100 (97 GB)
Sequence length	12 288 (drops < 0.5 % of samples)
Per-device batch	1
Gradient accumulation	8 → effective batch 64
Optimizer	AdamW
Learning rate	2e-5, cosine, 3 % warmup
Weight decay	0.01
Epochs	5 (best-by-val_loss → epoch 3 selected)
Final eval_loss	0.3378 (epoch 3 best)

Eval loss trajectory

epoch	eval_loss
1	0.3726
2	0.3425
3	0.3378 ← best
4	0.3439 (overfit start)
5	0.3501

load_best_model_at_end=True restored the epoch-3 weights before final save_model(). The safetensors in this repo correspond to the epoch-3 weights repackaged in inference-ready format (no DeepSpeed shards, no optimizer state).

Results

Headline weighted-AVG AUROC on the 10–15 % held-out test slice (the FT model never saw these examples during training), under the BTProp downstream stack (Q3-8B conf model + BM25 + Dense FAISS + MCP retrieval), with HMM emission tables tuned globally on the NEW table:

Method	Q3-8B baseline	Q3.5-397B	Q3-8B-FT (this model)
baseline+search (no tree, no HMM)	77.74	77.33	77.77
HMM+search (NEW emission)	74.33	76.43	75.97
Mix (0.5·baseline + 0.5·HMM_NEW)	76.99	77.55	77.67
FOL+search	74.58	74.86	75.07

The FT model recovers and exceeds the Q3-8B baseline across every method while using the same parameter budget — i.e. the SFT successfully distills the Q3.5 tree-gen behavior into Q3-8B. See the full comparison for all 9 datasets × 4 metrics × 6 methods × 2 scopes.

Usage

This model uses the Qwen3 chat template with enable_thinking=True. It expects BTProp tree-gen prompts (prompts.qwen_decompose_nodes / qwen_support_nodes / qwen_oppose_nodes and prompts_rule_modified_1.qwen_rule_verify) — it is not a general instruction-following model.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
# Then use the BTProp prompts from github.com/BENGAL-UCSB/BTProp

Drop-in vLLM serve (the way BTProp pipeline uses it)

The BTProp pipeline calls Q3.5 via QWEN35_BASE_URL + QWEN35_MODEL_NAME env vars. Serve this model under the same model name so the pipeline picks it up without code changes:

vllm serve RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT \
  --served-model-name qwen3.5-397b-a17b \
  --tensor-parallel-size 1 --data-parallel-size 8 \
  --dtype bfloat16 --max-model-len 12288 \
  --gpu-memory-utilization 0.40 --enable-prefix-caching \
  --trust-remote-code --host 127.0.0.1 --port 9003

# Then run the BTProp pipeline:
QWEN35_BASE_URL=http://127.0.0.1:9003/v1 \
QWEN35_MODEL_NAME=qwen3.5-397b-a17b \
  bash experiments/run_main_modified1_verify_p5.sh

Limitations

Trained on first 10 % of each dataset; generalization to other fact-checking benchmarks is untested.
~52 % of test root statements still get strategy=stop from the L1 prompt (model decides "can't decompose"), so HMM/FOL only adds signal on the other ~48 %.
Acc@0.5 is essentially identical to baseline+search on most datasets — the tree pipeline's lift shows up mainly in AUROC/PRAUC, not threshold-fixed Acc.

Citation

@article{hou2024probabilistic,
  title  = {A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation},
  author = {Hou, Bairu and Zhang, Yang and Andreas, Jacob and Chang, Shiyu},
  journal= {arXiv preprint arXiv:2406.06950},
  year   = {2024}
}

Model tree for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-8B

Finetuned

(1653)

this model

Paper for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation

Paper • 2406.06950 • Published Feb 8, 2025

RyanFoxW
/

Qwen3-8B-BTProp-mainmod1verify-SFT

Qwen3-8B BTProp tree-gen SFT (`main_modified1_verify` prompts)

Training data

Training

Eval loss trajectory

Results

Usage

Drop-in vLLM serve (the way BTProp pipeline uses it)

Limitations

Citation

Links

Model tree for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

Paper for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation

Qwen3-8B BTProp tree-gen SFT (main_modified1_verify prompts)

Training data

Training

Eval loss trajectory

Results

Usage

Drop-in vLLM serve (the way BTProp pipeline uses it)

Limitations

Citation

Links

Model tree for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

Paper for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

Qwen3-8B BTProp tree-gen SFT (`main_modified1_verify` prompts)