OpenEnv documentation
Collecting rollouts with OpenEnv for supervised training
Collecting rollouts with OpenEnv for supervised training
OpenEnv environments are not only useful for RL training — they are also a natural tool for collecting rollouts that become supervised training data. The environment handles episode management, automatic scoring, and reproducibility, so you get a reward-labeled dataset without writing any of that infrastructure yourself.
This tutorial shows the full pipeline:
- Run a strong teacher model inside an OpenEnv environment to collect rollouts.
- Use the environment’s reward signal to filter out incorrect examples automatically.
- Train a smaller student model on the filtered rollouts with TRL’s
SFTTrainer.
As a concrete application, the resulting checkpoint is used as a warm-start for GRPO: once the student
reliably produces valid tool calls, GRPO’s reward_std is non-zero from the first batch and the reward
curve climbs immediately.
Why use an environment to collect training data
Building a supervised dataset usually means writing a custom collection loop, a scorer, and episode bookkeeping. An OpenEnv environment gives you all three out of the box:
- Automatic scoring — every
step()returns a reward. Filter byreward == 1.0and you have a clean, correct dataset with no manual labelling. - Reproducible episodes —
reset(seed=42, size=N)produces the same sequence of problems every run. Anyone can regenerate the exact dataset. - Configurable difficulty — adjust
DATASET_CONFIGto control problem complexity without changing any collection code. - Portable across environments — the same collect → filter → train pipeline works for any OpenEnv environment. Swap the env and the tool definition; everything else stays the same.
What you’ll use
| Student model | Qwen/Qwen3-1.7B |
| Teacher model | gpt-5-mini via the OpenAI API |
| Environment | reasoning_gym_env / chain_sum |
| SFT trainer | TRL SFTTrainer |
| Next step | End-to-end walkthrough with GRPO |
1. Install dependencies
!pip install -q openai trl
!pip install -q openenv
!pip install -q --no-deps git+https://huggingface.co/spaces/sergiopaniego/reasoning_gym
!pip install -Uq "transformers>=5.3.0"2. Set your credentials
import getpass, os
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")You’ll also need a Hugging Face login to download the base model and push both the collected dataset and the fine-tuned checkpoint:
from huggingface_hub import notebook_login
notebook_login()YOUR_HF_USERNAME = "your-username" # replace with your Hugging Face username
assert YOUR_HF_USERNAME != "your-username", "Replace YOUR_HF_USERNAME with your Hugging Face username"3. Define the system prompt
Use the same prompt as the GRPO tutorial so the SFT-trained model is a drop-in replacement when you continue with GRPO.
SYSTEM_PROMPT = """You are a careful arithmetic assistant.
You will be given a chain of integer additions. Compute the result and submit it as a single number.
Rules:
1. Read the question carefully.
2. Use the tool `answer` exactly once with your final number.
3. The answer must be a single integer with no units or explanation.
"""4. Configure data collection
DATASET_CONFIG controls the difficulty of the chain_sum problems the environment generates:
min_terms/max_terms set how many integers are added together, and min_digits/max_digits set
how many digits each integer has. At these settings each problem is a sum of 2–3 two-digit numbers
— easy enough for gpt-5-mini to answer correctly ~90% of the time, which gives a clean training
signal after filtering.
N_EPISODES is the number of problems to collect. 300 is enough to get ~270 correct examples after
filtering, which is sufficient for format compliance training.
DATASET_CONFIG = {
"min_terms": 2,
"max_terms": 3,
"min_digits": 2,
"max_digits": 2,
}
N_EPISODES = 3005. Collect rollouts with openenv collect
openenv collect runs the teacher model inside the environment and records every episode — the
environment’s step() reward is written alongside the messages, so filtering by correctness requires
no additional scoring code.
import json, shlex
dataset_config_arg = shlex.quote(json.dumps(DATASET_CONFIG))
system_prompt_arg = shlex.quote(SYSTEM_PROMPT)
hub_repo_arg = shlex.quote(f"{YOUR_HF_USERNAME}/chain-sum-rollouts")
!openenv collect reasoning_gym:chain_sum \
--base-url https://sergiopaniego-reasoning-gym.hf.space \
--provider openai \
--model gpt-5-mini \
--num-episodes {N_EPISODES} \
--max-tokens 1024 \
--dataset-config {dataset_config_arg} \
--system-prompt {system_prompt_arg} \
--push-to-hub {hub_repo_arg} \
--output-dir ./rolloutsThe command prints a live progress summary and pushes the collected episodes to the Hub as
{YOUR_HF_USERNAME}/chain-sum-rollouts. Pull them back to start filtering:
from datasets import load_dataset
ds = load_dataset(f"{YOUR_HF_USERNAME}/chain-sum-rollouts", split="train")
raw_rollouts = list(ds)
print(f"Collected {len(raw_rollouts)} episodes")The messages field stores the full conversation in standard OpenAI format (assistant messages have
a tool_calls list). Convert to Qwen3’s <tool_call> text format before training — GRPOTrainer
produces this same format during RL, so the SFT checkpoint becomes a direct drop-in:
def to_qwen3_messages(record):
converted = []
for msg in record["messages"]:
if msg["role"] == "tool":
continue # strip environment responses; SFT only needs the assistant turn
if msg["role"] == "assistant" and msg.get("tool_calls"):
tc = msg["tool_calls"][0]
args = json.loads(tc["function"]["arguments"])
answer_str = args.get("answer", "")
tool_call_text = (
"<tool_call>\n"
+ json.dumps({"name": "answer", "arguments": {"answer": answer_str}})
+ "\n</tool_call>"
)
converted.append({"role": "assistant", "content": tool_call_text})
else:
converted.append(msg)
return {"messages": converted, "reward": record["reward"]}
rollouts = [to_qwen3_messages(r) for r in raw_rollouts]6. Filter the dataset
Keep only episodes where the teacher answered correctly. The environment’s reward signal does the labelling — no manual annotation needed.
correct = [r for r in rollouts if r["reward"] == 1.0]
print(f"Correct: {len(correct)} / {len(rollouts)} ({len(correct)/len(rollouts):.1%})")gpt-5-mini typically scores above 90% on chain_sum at this difficulty, so you should end up with
~270 examples from 300 rollouts.
7. Inspect the dataset before training
Always look at your data before training. Automated collection can introduce unexpected patterns that the student model will learn to imitate.
import random
for row in random.sample(correct, 3):
question = row["messages"][0]["content"]
response = row["messages"][1]["content"]
print(f"Q: {question}")
print(f"A: {response}")
print()Things to check:
- Does every response contain a valid
<tool_call>block? - Are the answers integers with no extra text?
- Is there any reasoning in the assistant message that you don’t want the student to learn? (For example: an internal monologue, disclaimers, or repeated phrasing that the teacher leaked from its own system prompt.)
8. Measure token lengths
Set max_length in SFTConfig to cover nearly all examples without wasting GPU memory on padding.
The 99th percentile is a good target: you truncate fewer than 1% of examples while keeping batches tight.
import numpy as np
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
lengths = []
for row in correct:
text = tokenizer.apply_chat_template(
row["messages"], tokenize=False, add_generation_prompt=False
)
ids = tokenizer.encode(text)
lengths.append(len(ids))
lengths = np.array(lengths)
MAX_SEQ_LEN = int(np.percentile(lengths, 99)) + 16
print(
f"p50={np.percentile(lengths, 50):.0f} "
f"p95={np.percentile(lengths, 95):.0f} "
f"p99={np.percentile(lengths, 99):.0f} "
f"max={lengths.max()}"
)
print(f"Setting MAX_SEQ_LEN = {MAX_SEQ_LEN}")9. Fine-tune with SFTTrainer
assistant_only_loss=True in SFTConfig masks the prompt tokens so the loss is computed only on the
assistant response — the <tool_call> block. This is more efficient than full-sequence training and avoids
accidentally reinforcing the system prompt wording.
from datasets import Dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
dataset = Dataset.from_list([{"messages": r["messages"]} for r in correct])
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
sft_config = SFTConfig(
output_dir="reasoning-gym-chain-sum-Qwen3-1.7B-sft",
max_length=MAX_SEQ_LEN,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=2e-5,
warmup_steps=10,
lr_scheduler_type="cosine",
logging_steps=5,
save_strategy="no",
assistant_only_loss=True,
)
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
processing_class=tokenizer,
args=sft_config,
)
trainer.train()
trainer.push_to_hub(commit_message="SFT warm-up on reasoning_gym chain_sum")Training ~270 examples for 3 epochs takes around 5 minutes on a single A100 (40 GB). The goal is format compliance, not task mastery — a handful of epochs is enough. Mastery comes from GRPO.
10. Evaluate: before vs after
Run both the base model and the SFT checkpoint on a held-out set and compare. The key metric for a
warm-up evaluation is format compliance — how often the model uses <tool_call> correctly — as
well as overall accuracy.
import re
from transformers import pipeline
from reasoning_gym_env.client import ReasoningGymEnv
from reasoning_gym_env.models import ReasoningGymAction
async def evaluate_model(model_name, n_eval=50, seed=999):
gen = pipeline(
"text-generation",
model=model_name,
tokenizer=model_name,
device_map="auto",
dtype="auto",
)
gen.model.generation_config.max_length = None
tok = AutoTokenizer.from_pretrained(model_name)
env = ReasoningGymEnv(base_url="https://sergiopaniego-reasoning-gym.hf.space")
obs = await env.reset(
dataset_name="chain_sum",
dataset_config=DATASET_CONFIG,
seed=seed,
size=n_eval,
)
rewards, format_hits = [], 0
for i in range(n_eval):
if i > 0:
obs = await env.reset()
question = obs.observation.question
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
]
prompt = tok.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
completion = gen(prompt, max_new_tokens=64)[0]["generated_text"][len(prompt):]
m = re.search(r'"answer"\s*:\s*"?(\d+)"?', completion)
if m:
format_hits += 1
answer = m.group(1)
else:
nums = re.findall(r"\b(\d+)\b", completion)
answer = nums[-1] if nums else "0"
result = await env.step(ReasoningGymAction(answer=answer))
rewards.append(float(result.observation.score or 0.0))
await env.close()
del gen # free GPU memory before loading the next model
return {
"accuracy": sum(rewards) / len(rewards),
"format_compliance": format_hits / n_eval,
}
base_metrics = await evaluate_model("Qwen/Qwen3-1.7B")
sft_metrics = await evaluate_model(f"{YOUR_HF_USERNAME}/reasoning-gym-chain-sum-Qwen3-1.7B-sft")
print(f"\n{'Metric':<25} {'Base model':>12} {'After SFT':>12} {'Delta':>10}")
print("-" * 62)
for key, label in [("format_compliance", "Format compliance"), ("accuracy", "Accuracy")]:
b, s = base_metrics[key], sft_metrics[key]
print(f"{label:<25} {b:>12.1%} {s:>12.1%} {(s - b) * 100:>+9.1f} pp")A successful warm-up looks like this:
| Metric | Base model | After SFT | Delta |
|---|---|---|---|
| Format compliance | ~0% | ~68% | +68 pp |
| Accuracy | ~4% | ~68% | +64 pp |
Format compliance should jump sharply from near-zero — that’s the primary goal. Qwen3-1.7B produces
essentially no valid <tool_call> blocks out of the box. After SFT on ~270 examples, the model reliably
uses the format, and accuracy rises in lockstep because correct format is a prerequisite for the
environment’s scorer to award any credit.
11. Where to go next: GRPO
The SFT checkpoint is ready to use as the starting model for GRPO. In the end-to-end walkthrough, change one line in section 8:
# Before (cold-start from the base model):
MODEL_NAME = "Qwen/Qwen3-1.7B"
# After (warm-start from your SFT checkpoint):
MODEL_NAME = f"{YOUR_HF_USERNAME}/reasoning-gym-chain-sum-Qwen3-1.7B-sft"With format compliance already near 100%, GRPO’s reward_std will be non-zero from the very first
batch and the reward curve will climb immediately — no cold-start stall.
Other directions:
- Harder tasks. Increase
max_termsormax_digitsinDATASET_CONFIGand collect a new SFT set. Once the student handles easier examples reliably, a harder GRPO phase can push further. - Different environments. The same pipeline — teacher collects → filter → SFT → GRPO — applies to
any OpenEnv environment. Swap
reasoning_gym_envand theanswertool definition for your env’s tool surface. - Larger teacher.
gpt-5orclaude-opus-4as teacher will yield higher-quality examples, especially for tasks wheregpt-5-ministruggles.