Instructions to use AIML-TUDA/OlmoLogic-7B-Think with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AIML-TUDA/OlmoLogic-7B-Think with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AIML-TUDA/OlmoLogic-7B-Think")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AIML-TUDA/OlmoLogic-7B-Think")
model = AutoModelForCausalLM.from_pretrained("AIML-TUDA/OlmoLogic-7B-Think")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use AIML-TUDA/OlmoLogic-7B-Think with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AIML-TUDA/OlmoLogic-7B-Think"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIML-TUDA/OlmoLogic-7B-Think",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/AIML-TUDA/OlmoLogic-7B-Think

SGLang

How to use AIML-TUDA/OlmoLogic-7B-Think with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AIML-TUDA/OlmoLogic-7B-Think" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIML-TUDA/OlmoLogic-7B-Think",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AIML-TUDA/OlmoLogic-7B-Think" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIML-TUDA/OlmoLogic-7B-Think",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AIML-TUDA/OlmoLogic-7B-Think with Docker Model Runner:
```
docker model run hf.co/AIML-TUDA/OlmoLogic-7B-Think
```

OlmoLogic 7B Think

🧠 OlmoLogic 7B Think

The first fully open model to bring the ILP paradigm into RLVR.
We wire a Prolog interpreter straight into the reward loop and execute logic programs to grade the model.

📝 Blog • 💻 Training Code • 📊 Eval Code • 🤗 SLR-Bench • 🤗 Olmo 3.1 7B Think

TL;DR

Open RLVR recipes center on math and code, and logical reasoning gets left behind. OlmoLogic fixes that. Starting from Olmo-3-7B-Think-DPO, we wire the paradigm of Inductive Logic Programming into Olmo 3's RLVR receipe. OlmoLogic is post-train from-scratch on 56×H100 for 6 days straight (3,350 optimization steps) broadly outperming Olmo-3 7B Think with large gains on logical reasoning.

📊 Results

Benchmark Suite	Olmo-3-7B-Think	OlmoLogic 7B Think	Δ
SLR-Bench	15.1	45.1	+30.0 🔥
Logic (avg)	59.1	64.4	+5.4
Reasoning (avg)	75.8	76.6	+0.8
Math (avg)	71.1	73.0	+1.9
Instruction Following	64.9	66.6	+1.7
Knowledge (avg)	49.2	49.5	+0.3
Safety (avg)	70.7	74.0	+3.3

Held-out logic suite: LogiGLUE, KOR-Bench, bAbI-16, CLUTRR, FOLIO, ProntoQA, RuleBERT, and abductive reasoning. All numbers from a single reproducible OLMES pipeline.

📖 Full ablations, training dynamics, and the compute-matched control (Olmo 3.1 7B Think) are in the blog post.

RLVR with Inductive Logic Programming

The trick is what grades the model. Instead of a judge or a learned reward, we ground every reward in real symbolic execution: a Prolog interpreter runs the model's proposed rule against the task, exactly the way a Python interpreter runs code against tests. No judge model, no learned reward — just a gold-standard oracle.

🚀 SLR-Bench tripled: 15.1 → 45.1 (+30.0) 📈 +5.4 avg on held-out logic benchmarks never seen in training ⚖️ Math, code, and instruction-following held steady

🧩 What an SLR task looks like

The model plays a train classifier: trains travel east- or westbound, each made of cars with properties (color, length, wall type). Given positive and negative examples plus background facts, it must write the shortest Prolog rule that perfectly separates them.

% Background knowledge (excerpt)
eastbound(train0).            westbound(train1).
has_car(train0, car0_1).      has_car(train1, car1_1).
car_color(car0_1, red).       car_color(car1_1, blue).
car_len(car0_1, long).        car_len(car1_1, short).

% A valid induced hypothesis:
eastbound(Train) :- has_car(Train, Car), car_color(Car, red).
% "A train is eastbound if it carries a red car."

That rule is then executed against the task to check completeness and consistency — and that execution is the reward.

⚙️ Training Recipe

Base model: allenai/Olmo-3-7B-Think-DPO Algorithm: GRPO via a Slurm-adapted open-instruct (DeepSpeed ZeRO-3).

Data mixture

A 1:1 dataset-weighted blend of allenai/Dolci-Think-RL-7B (the original Olmo-3 RLVR mix) and AIML-TUDA/SLR-Bench:v1-All. SLR's verbose background theories push prompts long, so we raised the prompt cap to 5,000 tokens (up from Olmo-3's 2,048). Rewards are routed by source: SLR → Prolog verifier; Dolci → original code/math/judge verifiers (Qwen/Qwen3-32B as LLM judge).

Source	Prompts	Share
IF Multi-Constraint	29,813	26.8%
OMEGA Math	15,000	13.5%
AceCoder	10,107	9.1%
SLR-Bench	9,402	8.4%
Tulu 3 Rewritten	7,109	6.4%
Multi-Subject RLVR	7,106	6.4%
AceReason-Math	6,598	5.9%
WildChat English	6,421	5.8%
KlearReasoner Code	6,272	5.6%
SYNTHETIC-2 / PrimeIntellect	3,000	2.7%
MathSub-30K	2,999	2.7%
ORZ Math	2,999	2.7%
DAPO-Math	2,584	2.3%
Llama-Nemotron Post-Training	2,006	1.8%
Total	111,416	100%

🎯 Reward design

Rewards come straight from Prolog execution. We use the isomorphic verifier variant (from LLMs Gaming Verifiers) to block reward hacking. Since every SLR task is balanced binary classification, anything at or below a coin flip is information-negative and earns nothing.

Let $p \in [0,1]$ be rule-classification accuracy, $s \in [0,1]$ the simplicity bonus, gate $g = 0.5$, exponent $k = 4$:

$R(p,\, s) \;=\; \begin{cases} 9.5 \,+\, 0.5\,s & \text{if } p = 1 \quad (\text{solved}) \\[4pt] \min\!\bigl(9,\; p^{4}\,(9 \,+\, s)\bigr) & \text{if } 0.5 \le p < 1 \quad (\text{partial}) \\[4pt] 0 & \text{if } p < 0.5 \quad (\text{no reward}) \end{cases}$

Three things this shape buys us:

Fourth-power compression on partial credit — 50% → ~0.016, 80% → 0.26, 95% → 0.74. The model is rewarded for getting almost everything right, not most things.
A hard gate at 0.5 zeros out anything worse than a coin flip.
A clean separation between partial (≤ 9.0) and fully correct (≥ 9.5) so a correct rollout's GRPO advantage is never undercut by a simpler-but-partial sibling.

The simplicity bonus enters multiplicatively, nudging toward shorter rules. Syntax validity is tracked but earns no reward — any rule that executes with $p > 0$ is valid by definition.

Hyperparameters


Steps	3,350 (~2 epochs)
Learning rate	1e-6 (constant)
KL anchor (β)	0
Global batch	512 (64 prompts × 8 rollouts)
Clip-higher	0.272
IS ratio cap	2.0
vLLM temperature	1.0
Max prompt / response	5k / 25k tokens (packed to 35.8k)

Why two epochs? Olmo-3's published recipe runs one (~1,500 steps). We observed rewards still climbing well past the one-epoch mark — both the SLR head and overall verifiable reward only level off in epoch two — so we trained the full 3,350 steps.

Throughput: 7× H100 nodes (8 GPUs each) — one judge node (Qwen3-32B + Prolog verifier API), one Ray head running the trainer, five worker nodes hosting 40 vLLM engines, with async rollouts and inflight weight updates.

🚀 Inference

vLLM

from vllm import LLM, SamplingParams

model_id = "AIML-TUDA/OlmoLogic-7B-Think"
llm = LLM(model=model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

prompt = "Who would win in a fight — a dinosaur or a cow named MooMoo?"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIML-TUDA/OlmoLogic-7B-Think"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Induce the shortest rule that separates these examples..."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=32768, temperature=0.6, top_p=0.95)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

💡 This is a Think model with long chain-of-thought. Give it room — max_tokens of 16k–32k is recommended for hard reasoning tasks.

✅ Takeaways

One dataset, one verifier — no training-stack changes. SLR drops into an existing RLVR mix and teaches broad logical reasoning.
Reasoning transfers. Gains show up on held-out logic benchmarks the model never saw.
Logic-program execution is a faithful, efficient oracle. No judge model, no learned reward, no proxy.

Model Details

Developed by: Artificial Intelligence and Machine Learning Lab, Technical University of Darmstadt (TU Darmstadt)
Model type: Transformer autoregressive LM with long chain-of-thought
Language: English
License: Apache 2.0
Base model: allenai/Olmo-3-7B-Think-DPO
Training framework: GRPO via DeepSpeed ZeRO-3 on a custom Slurm deployment of open-instruct

Sources

Blog: https://huggingface.co/blog/LukasHug/olmo-logic
Training code: https://github.com/lukashelff/open-instruct-slurm
Eval code: https://github.com/lukashelff/olmes-slurm
SLR-Bench: https://huggingface.co/datasets/AIML-TUDA/SLR-Bench
Papers: SLR · Reward Hacking

Companion checkpoint

Olmo 3.1 7B Think — a compute-matched extension of Olmo-3-Think on the original mix (no SLR). It serves as the clean control for the SLR ablation and as a stronger general-purpose Olmo-3-Think base for downstream use.

Citation

This work is based on the following two papers. If you build on it, please cite:

For the SLR-Bench, please cite:

@inproceedings{helff2025slr,
  title     = {{SLR: Automated Synthesis for Scalable Logical Reasoning}},
  author    = {Helff, Lukas and Omar, Ahmad and Friedrich, Felix and W{\"u}st, Antonia
               and Shindo, Hikaru and Woydt, Tim and Mitchell, Rupert
               and Schramowski, Patrick and Stammer, Wolfgang and Kersting, Kristian},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=omMnuTTEn7}
}

For the Reward Hacking paper, please cite:

@inproceedings{helff2026llms,
  title     = {{LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking}},
  author    = {Lukas Helff and Quentin Delfosse and David Steinmann and Ruben H{\"a}rle
               and Hikaru Shindo and Patrick Schramowski and Wolfgang Stammer
               and Kristian Kersting and Felix Friedrich},
  booktitle = {ICLR 2026 Workshop on Logical Reasoning of Large Language Models},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4B3WfRNqe3}
}

Acknowledgments

Supported by DFKI and the hessian.AI Innovation Lab (BMFTR grant 16IS22091), the hessian.AISC Service Center (BMBF grant 01IS22091), and CERTAIN. This work further benefited from TAILOR (EU Horizon 2020, GA 952215), the Hessian LOEWE program ("WhiteBox"), the HMWK clusters "Adaptive Mind" and "Third Wave of AI", NHR4CES, the BMWK project SOOFI (13IPC040G), early stages of the Cluster of Excellence "Reasonable AI" (DFG, EXC-3057), DFG SPP 2422, the AlephAlpha Collaboration Lab 1141, and OpenAI Research Credits.