Qwen2.5Math-IVON-SFT-7B

📦 Code: insait-institute/c3po

Qwen2.5-Math-7B after supervised fine-tuning (SFT) with the variational optimizer IVON, from the paper "Parameter Exploration for RLVR via Variational Learning".

This is a warm-start checkpoint. Supervised fine-tuning with IVON yields not just point-estimate weights but an approximate Gaussian posterior over them, described by a mean and a diagonal Hessian (precision) estimate. That posterior serves as the learned prior that seeds the 3PO RLVR runs (B3PO / M3PO / C3PO), in which weight perturbations sampled from it drive parameter-space exploration.
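To make the sampling step concrete, here is a toy, framework-free sketch of drawing one weight perturbation from a diagonal Gaussian posterior. It assumes the usual IVON parameterization, in which each weight's variance is 1 / (ESS * (Hessian + weight decay)). The function name `sample_weights`, the toy values, and the weight-decay setting are hypothetical and are not taken from the companion code.

```python
import math
import random


def sample_weights(mean, hess, ess, weight_decay, rng):
    """Draw one sample from N(mean, diag(1 / (ess * (hess + weight_decay)))).

    Assumed parameterization: a diagonal Gaussian whose per-weight precision
    is the Hessian estimate plus weight decay, scaled by the ESS.
    """
    sample = []
    for m, h in zip(mean, hess):
        sigma = 1.0 / math.sqrt(ess * (h + weight_decay))
        sample.append(m + sigma * rng.gauss(0.0, 1.0))
    return sample


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]    # toy posterior mean (hypothetical values)
hess = [1e-4, 1e-2, 1.0]   # toy diagonal Hessian estimate (hypothetical values)

# Hypothetical ESS and weight decay, chosen only to keep the toy readable.
perturbed = sample_weights(mean, hess, ess=1e6, weight_decay=1e-4, rng=rng)
print(perturbed)
```

Under this parameterization, weights with a small Hessian estimate are perturbed more strongly than weights with a large one, and a larger ESS shrinks every perturbation toward the mean.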

Training

Foundation model: Qwen/Qwen2.5-Math-7B
Stage: Warm-start SFT
Data: Llama-Nemotron Post-Training Dataset (SFT subset)
Optimizer: IVON, lr 50.0, ESS (λ) 1e10
Hardware: 8× NVIDIA H200 (144 GB)

Usage

Loads as a standard causal LM:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub repository ID of this checkpoint
repo_id = "BayesRL/Qwen2.5Math-IVON-SFT-7B"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tok = AutoTokenizer.from_pretrained(repo_id)

To use it as the warm-start prior for 3PO RLVR, load the IVON optimizer state via IVON_INIT_METHOD=trained in the companion code's run_rl.sh.

Citation

@misc{venkatkrishna2026parameter,
      title={Parameter Exploration for RLVR via Variational Learning},
      author={Vatsal Venkatkrishna and Nico Daheim and Iryna Gurevych},
      year={2026},
}