Qwen3-8B BTProp tree-gen SFT (main_modified1_verify prompts)
Fine-tuned from Qwen3-8B to perform belief-tree generation for the BTProp fact-checking pipeline. Distilled from Qwen3.5-397B (≈50× larger) trajectories on 9 fact-checking datasets so the BTProp tree-gen stage no longer needs a 397B teacher at inference time.
Training data
| Teacher model | Qwen3.5-397B-A17B (FP8) |
| Datasets | averitec, felm, factcheckgpt, ifqa, wikibio, factors, quantemp, factchd, wiqa |
| Source pipeline | BTProp main_modified1_verify — L1 strategy_select + Step B from prompts.py, L2 mod1_verify from prompts_rule_modified_1.py |
| Scope | first 10 % examples of each dataset's data_utils loader |
| Positive filter | HMM posterior matches ground-truth label on root segment |
| SFT step types | decompose_nodes, support_nodes, oppose_nodes, mod1_verify |
| Total samples | 4 059 (prompt, response) pairs (90/10 train/val split at root-segment level) |
| Format | sharegpt JSONL; Q3.5 plaintext "Thinking Process:" CoT wrapped in <think>…</think> to match Qwen3 chat template |
Training
| Hyperparameter | Value |
|---|---|
| Framework | LLaMA-Factory v0.9.3, full FT |
| Precision | bf16 + DeepSpeed Zero3 |
| Hardware | 8× H100 (97 GB) |
| Sequence length | 12 288 (drops < 0.5 % of samples) |
| Per-device batch | 1 |
| Gradient accumulation | 8 → effective batch 64 |
| Optimizer | AdamW |
| Learning rate | 2e-5, cosine, 3 % warmup |
| Weight decay | 0.01 |
| Epochs | 5 (best-by-val_loss → epoch 3 selected) |
| Final eval_loss | 0.3378 (epoch 3 best) |
Eval loss trajectory
| epoch | eval_loss |
|---|---|
| 1 | 0.3726 |
| 2 | 0.3425 |
| 3 | 0.3378 ← best |
| 4 | 0.3439 (overfit start) |
| 5 | 0.3501 |
load_best_model_at_end=True restored the epoch-3 weights before final save_model().
The safetensors in this repo correspond to the epoch-3 weights repackaged in
inference-ready format (no DeepSpeed shards, no optimizer state).
Results
Headline weighted-AVG AUROC on the 10–15 % held-out test slice (the FT model never saw these examples during training), under the BTProp downstream stack (Q3-8B conf model + BM25 + Dense FAISS + MCP retrieval), with HMM emission tables tuned globally on the NEW table:
| Method | Q3-8B baseline | Q3.5-397B | Q3-8B-FT (this model) |
|---|---|---|---|
| baseline+search (no tree, no HMM) | 77.74 | 77.33 | 77.77 |
| HMM+search (NEW emission) | 74.33 | 76.43 | 75.97 |
| Mix (0.5·baseline + 0.5·HMM_NEW) | 76.99 | 77.55 | 77.67 |
| FOL+search | 74.58 | 74.86 | 75.07 |
The FT model recovers and exceeds the Q3-8B baseline across every method while using the same parameter budget — i.e. the SFT successfully distills the Q3.5 tree-gen behavior into Q3-8B. See the full comparison for all 9 datasets × 4 metrics × 6 methods × 2 scopes.
Usage
This model uses the Qwen3 chat template with enable_thinking=True. It expects BTProp
tree-gen prompts (prompts.qwen_decompose_nodes / qwen_support_nodes /
qwen_oppose_nodes and prompts_rule_modified_1.qwen_rule_verify) — it is not a
general instruction-following model.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
# Then use the BTProp prompts from github.com/BENGAL-UCSB/BTProp
Drop-in vLLM serve (the way BTProp pipeline uses it)
The BTProp pipeline calls Q3.5 via QWEN35_BASE_URL + QWEN35_MODEL_NAME env vars.
Serve this model under the same model name so the pipeline picks it up without code
changes:
vllm serve RyanFoxW/Qwen3-8B-BTProp-mainmod1verify-SFT \
--served-model-name qwen3.5-397b-a17b \
--tensor-parallel-size 1 --data-parallel-size 8 \
--dtype bfloat16 --max-model-len 12288 \
--gpu-memory-utilization 0.40 --enable-prefix-caching \
--trust-remote-code --host 127.0.0.1 --port 9003
# Then run the BTProp pipeline:
QWEN35_BASE_URL=http://127.0.0.1:9003/v1 \
QWEN35_MODEL_NAME=qwen3.5-397b-a17b \
bash experiments/run_main_modified1_verify_p5.sh
Limitations
- Trained on first 10 % of each dataset; generalization to other fact-checking benchmarks is untested.
- ~52 % of test root statements still get
strategy=stopfrom the L1 prompt (model decides "can't decompose"), so HMM/FOL only adds signal on the other ~48 %. - Acc@0.5 is essentially identical to baseline+search on most datasets — the tree pipeline's lift shows up mainly in AUROC/PRAUC, not threshold-fixed Acc.
Citation
@article{hou2024probabilistic,
title = {A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation},
author = {Hou, Bairu and Zhang, Yang and Andreas, Jacob and Chang, Shiyu},
journal= {arXiv preprint arXiv:2406.06950},
year = {2024}
}
Links
- 🐙 Code: github.com/BENGAL-UCSB/BTProp (branch
exp/qwen35_ds_full) - 🏗 Base model: Qwen/Qwen3-8B
- 📊 Headline comparison:
results/comparisons/main_FINAL_oldvsnew_emission.md - 📖 Full guide:
RESULTS_GUIDE.md
- Downloads last month
- 16