⚡ Olmo 3.1 7B Think

A drop-in upgrade for Olmo 3 7B Think — one extra epoch of RLVR, no recipe changes.
Stronger instruction following and safety, reasoning held steady. Also the compute-matched control for OlmoLogic.

📝 Blog  •  💻 Training Code  •  📊 Eval Code  •  🧠 OlmoLogic 7B Think

TL;DR

Olmo 3.1 7B Think is a continued-RLVR extension of allenai/Olmo-3-7B-Think. We take the official one-epoch Olmo 3 7B Think checkpoint and train it for roughly one additional epoch (1,850 RLVR steps) on the original Olmo-3 RLVR mixture (allenai/Dolci-Think-RL-7B) — no recipe changes, no new data.

The result is a drop-in upgrade for downstream use: stronger instruction following and safety, with reasoning held steady.

⚖️ Olmo 3.1 was also built as the compute-matched control for our OlmoLogic experiments — same total step budget, but without the SLR logic data. See the blog post for the full story.


📊 What changed vs. Olmo 3 7B Think

Benchmark (avg)       | Olmo-3-7B-Think | Olmo 3.1 7B Think | Δ
----------------------|-----------------|-------------------|--------
Instruction Following | 64.9            | 71.5              | +6.6 🔥
Safety                | 70.7            | 74.5              | +3.8
Reasoning             | 75.8            | 76.7              | +0.9
SLR-Bench             | 15.1            | 15.7              | +0.6
Logic                 | 59.1            | 59.1              | +0.0
Math                  | 71.1            | 70.5              | −0.5
Knowledge             | 49.2            | 48.7              | −0.5
Coding                | 76.6            | 75.0              | −1.6
Chat                  | 52.1            | 41.6              | −10.5

The trade-off: the main regression is on open-ended Chat (−10.5), a known cost of extensive RLVR optimization. Coding (−1.6) and Knowledge (−0.5) shift within noise. If you care more about reasoning and instruction following than open-ended chat, this is a clean upgrade.

All numbers come from a single reproducible OLMES pipeline.
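
As a quick sanity check on the table, the Δ column can be recomputed from the two score columns. The sketch below copies the values from the table as printed; the dictionary layout and function names are illustrative and not part of the OLMES pipeline.

```python
# Scores copied from the table: (Olmo-3-7B-Think, Olmo 3.1 7B Think, reported delta)
SCORES = {
    "Instruction Following": (64.9, 71.5, +6.6),
    "Safety": (70.7, 74.5, +3.8),
    "Reasoning": (75.8, 76.7, +0.9),
    "SLR-Bench": (15.1, 15.7, +0.6),
    "Logic": (59.1, 59.1, 0.0),
    "Math": (71.1, 70.5, -0.5),
    "Knowledge": (49.2, 48.7, -0.5),
    "Coding": (76.6, 75.0, -1.6),
    "Chat": (52.1, 41.6, -10.5),
}


def recomputed_deltas(scores):
    """Return {category: new - old}, rounded to one decimal like the table."""
    return {name: round(new - old, 1) for name, (old, new, _) in scores.items()}


def mismatches(scores):
    """List categories whose recomputed delta differs from the reported one."""
    deltas = recomputed_deltas(scores)
    return [
        name
        for name, (_, _, reported) in scores.items()
        if deltas[name] != round(reported, 1)
    ]


if __name__ == "__main__":
    # Print categories from largest regression to largest gain.
    for name, delta in sorted(recomputed_deltas(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:<22} {delta:+.1f}")
    print("differs from reported:", mismatches(SCORES))
```

Recomputed from the rounded scores, every row matches the reported Δ except Math, which comes out 0.1 lower. That would be consistent with the Δ column having been computed from unrounded scores, though that is an assumption rather than something stated here.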


⚙️ Training

  • Base model: allenai/Olmo-3-7B-Think
  • Algorithm: GRPO via Slurm-adapted open-instruct (DeepSpeed ZeRO-3)
  • Data: allenai/Dolci-Think-RL-7B (the original Olmo-3 RLVR mix, unchanged)
  • Added training: ~1 epoch / 1,850 steps (3,350 total, matching OlmoLogic)
  • Settings: default Olmo-3 RLVR config — β = 0, constant LR 1e-6, global batch 512 (64 prompts × 8 rollouts), vLLM temperature 1.0
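
The batch layout in the settings (64 prompts × 8 rollouts = 512) is what gives GRPO its baseline: each rollout's reward is compared against the other rollouts sampled for the same prompt. Below is a minimal sketch of that group-relative advantage, assuming binary verifier rewards and mean/std normalization. The helper name and the `eps` value are hypothetical, and this is not the open-instruct implementation, which may differ in details. With β = 0 there is no KL penalty term, so none appears here.

```python
from statistics import mean, pstdev

PROMPTS_PER_BATCH = 64    # from the settings list
ROLLOUTS_PER_PROMPT = 8   # from the settings list
GLOBAL_BATCH = PROMPTS_PER_BATCH * ROLLOUTS_PER_PROMPT


def group_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards against the group mean and std.

    Hypothetical helper: it illustrates the group-relative baseline only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when all rollouts get the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: 8 rollouts for one prompt, 2 of which passed the verifier.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
```

Passing rollouts receive a positive advantage and failing ones a negative advantage. A group where every rollout gets the same reward yields all-zero advantages and therefore contributes no gradient signal.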

🚀 Inference

vLLM

from vllm import LLM, SamplingParams

model_id = "LukasHug/Olmo-3.1-7B-Think"
llm = LLM(model=model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

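# Use chat() so the model's chat template is applied to the prompt.
messages = [{"role": "user", "content": "Explain why the square root of 2 is irrational."}]
outputs = llm.chat(messages, sampling_params=sampling_params)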
print(outputs[0].outputs[0].text)

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LukasHug/Olmo-3.1-7B-Think"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain why the square root of 2 is irrational."}]
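# return_dict=True yields input_ids plus attention_mask, which generate() can unpack.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature and top_p to take effect.
out = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.6, top_p=0.95)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))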

💡 This is a Think model with long chain-of-thought — allow a generous max_tokens (16k–32k) for hard tasks.
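
Because the completion contains the chain-of-thought followed by the final answer, downstream code usually wants the two separated. The sketch below assumes the reasoning is wrapped in `<think>` and `</think>` tags; that delimiter is an assumption here, so check the model's chat template before relying on it. The helper name is hypothetical.

```python
def split_reasoning(text, close_tag="</think>", open_tag="<think>"):
    """Split a completion into (reasoning, answer).

    Assumes the chain-of-thought ends with `close_tag`. If the tag is missing,
    for example because generation hit max_tokens mid-thought, the whole text
    is treated as reasoning and the answer is empty.
    """
    head, sep, tail = text.partition(close_tag)
    reasoning = head.replace(open_tag, "", 1).strip()
    if not sep:
        return reasoning, ""
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>Assume sqrt(2) = p/q ...</think> It is irrational.")
```

An empty answer is a useful signal that the token budget was too small for the task and the generation should be retried with a larger `max_tokens`.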


✅ Takeaways

  • Pure continued RLVR. Same recipe, same data, one more epoch — a clean upgrade path, not a new model family.
  • Instruction following and safety improve; reasoning holds. The cost is concentrated in open-ended chat.
  • A faithful control. Compute-matched to OlmoLogic, so the SLR ablation isolates the effect of logic data, not extra steps.
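
The compute-matching claim comes down to simple bookkeeping, sketched below. The base step count is derived from the stated numbers (3,350 total minus 1,850 added) rather than quoted, and the rollout totals assume every step consumes the full global batch, which is an assumption.

```python
BASE_STEPS = 1_500      # derived: 3,350 total minus 1,850 added
ADDED_STEPS = 1_850     # continued-RLVR steps reported in this card
TOTAL_STEPS = 3_350     # stated total, matching OlmoLogic
GLOBAL_BATCH = 64 * 8   # prompts x rollouts per step


def rollouts(steps, batch=GLOBAL_BATCH):
    """Rollouts consumed, assuming every step uses the full global batch."""
    return steps * batch


# The two phases must add up to the shared budget.
assert BASE_STEPS + ADDED_STEPS == TOTAL_STEPS

print("rollouts in the added epoch:", rollouts(ADDED_STEPS))
print("rollouts over the full budget:", rollouts(TOTAL_STEPS))
```

Since both models stop at the same total step count with the same batch size, a difference between them can be attributed to the data mixture rather than to extra optimization.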

Model Details

  • Developed by: Artificial Intelligence and Machine Learning Lab, Technical University of Darmstadt (TU Darmstadt)
  • Model type: Transformer autoregressive LM with long chain-of-thought
  • Language: English
  • License: Apache 2.0
  • Base model: allenai/Olmo-3-7B-Think

Citation

This work is based on the following two papers. If you build on it, please cite:

For SLR-Bench, please cite:

@inproceedings{helff2025slr,
  title     = {{SLR: Automated Synthesis for Scalable Logical Reasoning}},
  author    = {Helff, Lukas and Omar, Ahmad and Friedrich, Felix and W{\"u}st, Antonia
               and Shindo, Hikaru and Woydt, Tim and Mitchell, Rupert
               and Schramowski, Patrick and Stammer, Wolfgang and Kersting, Kristian},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=omMnuTTEn7}
}

For the Reward Hacking paper, please cite:

@inproceedings{helff2026llms,
  title     = {{LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking}},
  author    = {Lukas Helff and Quentin Delfosse and David Steinmann and Ruben H{\"a}rle
               and Hikaru Shindo and Patrick Schramowski and Wolfgang Stammer
               and Kristian Kersting and Felix Friedrich},
  booktitle = {ICLR 2026 Workshop on Logical Reasoning of Large Language Models},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4B3WfRNqe3}
}

Acknowledgments

Supported by DFKI and the hessian.AI Innovation Lab (BMFTR grant 16IS22091), the hessian.AISC Service Center (BMBF grant 01IS22091), and CERTAIN, with further support from TAILOR (EU Horizon 2020, GA 952215), the Hessian LOEWE program, NHR4CES, the BMWK project SOOFI (13IPC040G), the Cluster of Excellence "Reasonable AI" (DFG, EXC-3057), DFG SPP 2422, the AlephAlpha Collaboration Lab 1141, and OpenAI Research Credits.
