Qwen3-8B BTProp tree-gen SFT (main_modified1_verify prompts)

Fine-tuned from Qwen3-8B to perform belief-tree generation for the BTProp fact-checking pipeline. Distilled from Qwen3.5-397B (≈50× larger) trajectories on 9 fact-checking datasets so the BTProp tree-gen stage no longer needs a 397B teacher at inference time.

Training data

Teacher model Qwen3.5-397B-A17B (FP8)
Datasets averitec, felm, factcheckgpt, ifqa, wikibio, factors, quantemp, factchd, wiqa
Source pipeline BTProp main_modified1_verify — L1 strategy_select + Step B from prompts.py, L2 mod1_verify from prompts_rule_modified_1.py
Scope first 10 % examples of each dataset's data_utils loader
Positive filter HMM posterior matches ground-truth label on root segment
SFT step types decompose_nodes, support_nodes, oppose_nodes, mod1_verify
Total samples 4 059 (prompt, response) pairs (90/10 train/val split at root-segment level)
Format sharegpt JSONL; Q3.5 plaintext "Thinking Process:" CoT wrapped in <think>…</think> to match Qwen3 chat template

Training

Hyperparameter Value
Framework LLaMA-Factory v0.9.3, full FT
Precision bf16 + DeepSpeed Zero3
Hardware 8× H100 (97 GB)
Sequence length 12 288 (drops < 0.5 % of samples)
Per-device batch 1
Gradient accumulation 8 → effective batch 64
Optimizer AdamW
Learning rate 2e-5, cosine, 3 % warmup
Weight decay 0.01
Epochs 5 (best-by-val_loss → epoch 3 selected)
Final eval_loss 0.3378 (epoch 3 best)

Eval loss trajectory

epoch eval_loss
1 0.3726
2 0.3425
3 0.3378 ← best
4 0.3439 (overfit start)
5 0.3501

load_best_model_at_end=True restored the epoch-3 weights before final save_model(). The safetensors in this repo correspond to the epoch-3 weights repackaged in inference-ready format (no DeepSpeed shards, no optimizer state).

Results

Headline weighted-AVG AUROC on the 10–15 % held-out test slice (the FT model never saw these examples during training), under the BTProp downstream stack (Q3-8B conf model + BM25 + Dense FAISS + MCP retrieval), with HMM emission tables tuned globally on the NEW table:

Method Q3-8B baseline Q3.5-397B Q3-8B-FT (this model)
baseline+search (no tree, no HMM) 77.74 77.33 77.77
HMM+search (NEW emission) 74.33 76.43 75.97
Mix (0.5·baseline + 0.5·HMM_NEW) 76.99 77.55 77.67
FOL+search 74.58 74.86 75.07

The FT model recovers and exceeds the Q3-8B baseline across every method while using the same parameter budget — i.e. the SFT successfully distills the Q3.5 tree-gen behavior into Q3-8B. See the full comparison for all 9 datasets × 4 metrics × 6 methods × 2 scopes.

Usage

This model uses the Qwen3 chat template with enable_thinking=True. It expects BTProp tree-gen prompts (prompts.qwen_decompose_nodes / qwen_support_nodes / qwen_oppose_nodes and prompts_rule_modified_1.qwen_rule_verify) — it is not a general instruction-following model.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
# Then use the BTProp prompts from github.com/BENGAL-UCSB/BTProp

Drop-in vLLM serve (the way BTProp pipeline uses it)

The BTProp pipeline calls Q3.5 via QWEN35_BASE_URL + QWEN35_MODEL_NAME env vars. Serve this model under the same model name so the pipeline picks it up without code changes:

vllm serve RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT \
  --served-model-name qwen3.5-397b-a17b \
  --tensor-parallel-size 1 --data-parallel-size 8 \
  --dtype bfloat16 --max-model-len 12288 \
  --gpu-memory-utilization 0.40 --enable-prefix-caching \
  --trust-remote-code --host 127.0.0.1 --port 9003

# Then run the BTProp pipeline:
QWEN35_BASE_URL=http://127.0.0.1:9003/v1 \
QWEN35_MODEL_NAME=qwen3.5-397b-a17b \
  bash experiments/run_main_modified1_verify_p5.sh

Limitations

  • Trained on first 10 % of each dataset; generalization to other fact-checking benchmarks is untested.
  • ~52 % of test root statements still get strategy=stop from the L1 prompt (model decides "can't decompose"), so HMM/FOL only adds signal on the other ~48 %.
  • Acc@0.5 is essentially identical to baseline+search on most datasets — the tree pipeline's lift shows up mainly in AUROC/PRAUC, not threshold-fixed Acc.

Citation

@article{hou2024probabilistic,
  title  = {A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation},
  author = {Hou, Bairu and Zhang, Yang and Andreas, Jacob and Chang, Shiyu},
  journal= {arXiv preprint arXiv:2406.06950},
  year   = {2024}
}

Links

Downloads last month
16
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT

Finetuned
Qwen/Qwen3-8B
Finetuned
(1653)
this model

Paper for RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT