OlmoLogic 7B Think

🧠 OlmoLogic 7B Think

The first fully open model to bring the ILP paradigm into RLVR.
We wire a Prolog interpreter straight into the reward loop and execute logic programs to grade the model.

📝 Blog  •  💻 Training Code  •  📊 Eval Code  •  🤗 SLR-Bench  •  🤗 Olmo 3.1 7B Think

TL;DR

Open RLVR recipes center on math and code, and logical reasoning gets left behind. OlmoLogic fixes that. Starting from Olmo-3-7B-Think-DPO, we wire the paradigm of Inductive Logic Programming into Olmo 3's RLVR receipe. OlmoLogic is post-train from-scratch on 56×H100 for 6 days straight (3,350 optimization steps) broadly outperming Olmo-3 7B Think with large gains on logical reasoning.

📊 Results

Benchmark Suite Olmo-3-7B-Think OlmoLogic 7B Think Δ
SLR-Bench 15.1 45.1 +30.0 🔥
Logic (avg) 59.1 64.4 +5.4
Reasoning (avg) 75.8 76.6 +0.8
Math (avg) 71.1 73.0 +1.9
Instruction Following 64.9 66.6 +1.7
Knowledge (avg) 49.2 49.5 +0.3
Safety (avg) 70.7 74.0 +3.3

Held-out logic suite: LogiGLUE, KOR-Bench, bAbI-16, CLUTRR, FOLIO, ProntoQA, RuleBERT, and abductive reasoning. All numbers from a single reproducible OLMES pipeline.

📖 Full ablations, training dynamics, and the compute-matched control (Olmo 3.1 7B Think) are in the blog post.


RLVR with Inductive Logic Programming

The trick is what grades the model. Instead of a judge or a learned reward, we ground every reward in real symbolic execution: a Prolog interpreter runs the model's proposed rule against the task, exactly the way a Python interpreter runs code against tests. No judge model, no learned reward — just a gold-standard oracle.

🚀 SLR-Bench tripled: 15.1 → 45.1 (+30.0) 📈 +5.4 avg on held-out logic benchmarks never seen in training ⚖️ Math, code, and instruction-following held steady


🧩 What an SLR task looks like

The model plays a train classifier: trains travel east- or westbound, each made of cars with properties (color, length, wall type). Given positive and negative examples plus background facts, it must write the shortest Prolog rule that perfectly separates them.

% Background knowledge (excerpt)
eastbound(train0).            westbound(train1).
has_car(train0, car0_1).      has_car(train1, car1_1).
car_color(car0_1, red).       car_color(car1_1, blue).
car_len(car0_1, long).        car_len(car1_1, short).

% A valid induced hypothesis:
eastbound(Train) :- has_car(Train, Car), car_color(Car, red).
% "A train is eastbound if it carries a red car."

That rule is then executed against the task to check completeness and consistency — and that execution is the reward.


⚙️ Training Recipe

Base model: allenai/Olmo-3-7B-Think-DPO Algorithm: GRPO via a Slurm-adapted open-instruct (DeepSpeed ZeRO-3).

Data mixture

A 1:1 dataset-weighted blend of allenai/Dolci-Think-RL-7B (the original Olmo-3 RLVR mix) and AIML-TUDA/SLR-Bench:v1-All. SLR's verbose background theories push prompts long, so we raised the prompt cap to 5,000 tokens (up from Olmo-3's 2,048). Rewards are routed by source: SLR → Prolog verifier; Dolci → original code/math/judge verifiers (Qwen/Qwen3-32B as LLM judge).

Source Prompts Share
IF Multi-Constraint 29,813 26.8%
OMEGA Math 15,000 13.5%
AceCoder 10,107 9.1%
SLR-Bench 9,402 8.4%
Tulu 3 Rewritten 7,109 6.4%
Multi-Subject RLVR 7,106 6.4%
AceReason-Math 6,598 5.9%
WildChat English 6,421 5.8%
KlearReasoner Code 6,272 5.6%
SYNTHETIC-2 / PrimeIntellect 3,000 2.7%
MathSub-30K 2,999 2.7%
ORZ Math 2,999 2.7%
DAPO-Math 2,584 2.3%
Llama-Nemotron Post-Training 2,006 1.8%
Total 111,416 100%

🎯 Reward design

Rewards come straight from Prolog execution. We use the isomorphic verifier variant (from LLMs Gaming Verifiers) to block reward hacking. Since every SLR task is balanced binary classification, anything at or below a coin flip is information-negative and earns nothing.

Let $p \in [0,1]$ be rule-classification accuracy, $s \in [0,1]$ the simplicity bonus, gate $g = 0.5$, exponent $k = 4$:

R(p,s)  =  {9.5+0.5sif p=1(solved)min ⁣(9,  p4(9+s))if 0.5p<1(partial)0if p<0.5(no reward) R(p,\, s) \;=\; \begin{cases} 9.5 \,+\, 0.5\,s & \text{if } p = 1 \quad (\text{solved}) \\[4pt] \min\!\bigl(9,\; p^{4}\,(9 \,+\, s)\bigr) & \text{if } 0.5 \le p < 1 \quad (\text{partial}) \\[4pt] 0 & \text{if } p < 0.5 \quad (\text{no reward}) \end{cases}

Three things this shape buys us:

  • Fourth-power compression on partial credit — 50% → ~0.016, 80% → 0.26, 95% → 0.74. The model is rewarded for getting almost everything right, not most things.
  • A hard gate at 0.5 zeros out anything worse than a coin flip.
  • A clean separation between partial (≤ 9.0) and fully correct (≥ 9.5) so a correct rollout's GRPO advantage is never undercut by a simpler-but-partial sibling.

The simplicity bonus enters multiplicatively, nudging toward shorter rules. Syntax validity is tracked but earns no reward — any rule that executes with $p > 0$ is valid by definition.

Hyperparameters

Steps 3,350 (~2 epochs)
Learning rate 1e-6 (constant)
KL anchor (β) 0
Global batch 512 (64 prompts × 8 rollouts)
Clip-higher 0.272
IS ratio cap 2.0
vLLM temperature 1.0
Max prompt / response 5k / 25k tokens (packed to 35.8k)

Why two epochs? Olmo-3's published recipe runs one (~1,500 steps). We observed rewards still climbing well past the one-epoch mark — both the SLR head and overall verifiable reward only level off in epoch two — so we trained the full 3,350 steps.

Throughput: 7× H100 nodes (8 GPUs each) — one judge node (Qwen3-32B + Prolog verifier API), one Ray head running the trainer, five worker nodes hosting 40 vLLM engines, with async rollouts and inflight weight updates.


🚀 Inference

vLLM

from vllm import LLM, SamplingParams

model_id = "AIML-TUDA/OlmoLogic-7B-Think"
llm = LLM(model=model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

prompt = "Who would win in a fight — a dinosaur or a cow named MooMoo?"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIML-TUDA/OlmoLogic-7B-Think"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Induce the shortest rule that separates these examples..."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=32768, temperature=0.6, top_p=0.95)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

💡 This is a Think model with long chain-of-thought. Give it room — max_tokens of 16k–32k is recommended for hard reasoning tasks.


✅ Takeaways

  • One dataset, one verifier — no training-stack changes. SLR drops into an existing RLVR mix and teaches broad logical reasoning.
  • Reasoning transfers. Gains show up on held-out logic benchmarks the model never saw.
  • Logic-program execution is a faithful, efficient oracle. No judge model, no learned reward, no proxy.

Model Details

  • Developed by: Artificial Intelligence and Machine Learning Lab, Technical University of Darmstadt (TU Darmstadt)
  • Model type: Transformer autoregressive LM with long chain-of-thought
  • Language: English
  • License: Apache 2.0
  • Base model: allenai/Olmo-3-7B-Think-DPO
  • Training framework: GRPO via DeepSpeed ZeRO-3 on a custom Slurm deployment of open-instruct

Sources

Companion checkpoint

Olmo 3.1 7B Think — a compute-matched extension of Olmo-3-Think on the original mix (no SLR). It serves as the clean control for the SLR ablation and as a stronger general-purpose Olmo-3-Think base for downstream use.

Citation

This work is based on the following two papers. If you build on it, please cite:

For the SLR-Bench, please cite:

@inproceedings{helff2025slr,
  title     = {{SLR: Automated Synthesis for Scalable Logical Reasoning}},
  author    = {Helff, Lukas and Omar, Ahmad and Friedrich, Felix and W{\"u}st, Antonia
               and Shindo, Hikaru and Woydt, Tim and Mitchell, Rupert
               and Schramowski, Patrick and Stammer, Wolfgang and Kersting, Kristian},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=omMnuTTEn7}
}

For the Reward Hacking paper, please cite:

@inproceedings{helff2026llms,
  title     = {{LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking}},
  author    = {Lukas Helff and Quentin Delfosse and David Steinmann and Ruben H{\"a}rle
               and Hikaru Shindo and Patrick Schramowski and Wolfgang Stammer
               and Kristian Kersting and Felix Friedrich},
  booktitle = {ICLR 2026 Workshop on Logical Reasoning of Large Language Models},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4B3WfRNqe3}
}

Acknowledgments

Supported by DFKI and the hessian.AI Innovation Lab (BMFTR grant 16IS22091), the hessian.AISC Service Center (BMBF grant 01IS22091), and CERTAIN. This work further benefited from TAILOR (EU Horizon 2020, GA 952215), the Hessian LOEWE program ("WhiteBox"), the HMWK clusters "Adaptive Mind" and "Third Wave of AI", NHR4CES, the BMWK project SOOFI (13IPC040G), early stages of the Cluster of Excellence "Reasonable AI" (DFG, EXC-3057), DFG SPP 2422, the AlephAlpha Collaboration Lab 1141, and OpenAI Research Credits.

Downloads last month
3
Safetensors
Model size
528k params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AIML-TUDA/OlmoLogic-7B-Think

Finetuned
(4)
this model

Datasets used to train AIML-TUDA/OlmoLogic-7B-Think

Collection including AIML-TUDA/OlmoLogic-7B-Think

Papers for AIML-TUDA/OlmoLogic-7B-Think