OpenEnv documentation
Rubrics: Composable Reward Computation
Rubrics are OpenEnv’s first-class abstraction for computing rewards. They let you build multi-criteria reward functions from small reusable pieces. This tutorial walks through the API end-to-end, from a one-line rubric to a full environment that introspects its reward signal at training time.
Why Rubrics?
Before rubrics, each environment rolled its own reward logic. Three pain points surfaced repeatedly:
- No standard interface. Every environment author invented their own compute_reward(...) shape, so reusing a reward component across environments meant copy-pasting.
- Multi-criteria evaluation was ad-hoc. “Code must compile, tests must pass, style matters a bit” becomes a tangle of nested if/else and hand-rolled weighted averages. There was no consistent way to ask which criterion caused a low reward.
- LLM judges and sandboxed checks are slow. Without a framework-level concept of “reward component”, batch evaluation couldn’t parallelise the I/O-bound pieces.
The Rubric API is small: you subclass, implement forward, and the framework gives you composition, introspection, and parallel evaluation for free.
Your First Rubric
A rubric is a callable with a forward(action, observation) -> float method.
from openenv.core.rubrics import Rubric


class MessageLengthRubric(Rubric):
    """Reward 1.0 if the message is 5–20 characters long, else 0.0."""

    def forward(self, action, observation) -> float:
        length = len(action.message)
        return 1.0 if 5 <= length <= 20 else 0.0

That’s the whole contract. Instantiate it and call it:
rubric = MessageLengthRubric()
score = rubric(action, observation)  # runs forward + hooks
print(rubric.last_score)             # latest score is cached on the rubric

Rubric.__call__ runs pre- and post-hooks around your forward, caches the result on self.last_score, and supports async forward implementations transparently. (If you’ve used torch.nn.Module, the subclass-and-implement-forward pattern will feel familiar: children assigned as instance attributes auto-register with the parent.)
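To make that call sequence concrete, here is a minimal stand-in for the behaviour just described: pre-hooks, then forward, then post-hooks, with the result cached on last_score. MiniRubric is a hypothetical class written for illustration, not OpenEnv's implementation, and it covers only the synchronous path; the action is a plain string rather than an Action object.

```python
class MiniRubric:
    """Hypothetical stand-in for the Rubric call sequence (sync path only)."""

    def __init__(self):
        self.last_score = None
        self._pre_hooks = []
        self._post_hooks = []

    def register_forward_pre_hook(self, hook):
        self._pre_hooks.append(hook)

    def register_forward_hook(self, hook):
        self._post_hooks.append(hook)

    def forward(self, action, observation) -> float:
        raise NotImplementedError

    def __call__(self, action, observation) -> float:
        # 1. Pre-hooks see the inputs only.
        for hook in self._pre_hooks:
            hook(self, action, observation)
        # 2. The subclass computes the score, which is cached.
        score = self.forward(action, observation)
        self.last_score = score
        # 3. Post-hooks see the inputs and the returned score.
        for hook in self._post_hooks:
            hook(self, action, observation, score)
        return score


class MessageLength(MiniRubric):
    def forward(self, action, observation) -> float:
        return 1.0 if 5 <= len(action) <= 20 else 0.0


events = []
rubric = MessageLength()
rubric.register_forward_pre_hook(lambda r, a, o: events.append("pre"))
rubric.register_forward_hook(lambda r, a, o, s: events.append(("post", s)))

score = rubric("hello world", None)  # 11 characters, inside the 5-20 window
```

After the call, events holds the pre-hook marker followed by the post-hook entry carrying the score, and rubric.last_score matches the returned value.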
Optional hooks for observability
You can attach hooks without subclassing — useful for logging every component’s score without polluting forward. Post-hooks run after forward completes and see the returned score; pre-hooks run before forward and are handy for input validation or instrumentation. When a rubric is async, hooks are awaited transparently.
def log_score(rubric, action, obs, result):
    print(f"{type(rubric).__name__}: {result:.2f}")


rubric.register_forward_hook(log_score)                  # fires after forward()
rubric.register_forward_pre_hook(lambda r, a, o: None)   # fires before forward()

State dict
Rubrics implement state_dict() / load_state_dict(state) so their configuration (thresholds, prompt templates, etc.) can be serialised alongside model checkpoints. The default implementations return an empty dict — override them when your rubric has tunable parameters.
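As a sketch of what such an override might look like, the rubric below exposes its two thresholds through state_dict() and restores them in load_state_dict(). The base class is a hypothetical stand-in so the snippet runs on its own; with OpenEnv installed you would subclass Rubric instead. The key names min_len and max_len are illustrative.

```python
import json


class RubricStandIn:
    """Hypothetical stand-in: a rubric with no saved state by default."""

    def state_dict(self) -> dict:
        return {}

    def load_state_dict(self, state: dict) -> None:
        pass  # assumption: nothing to restore by default


class LengthRubric(RubricStandIn):
    def __init__(self, min_len: int = 5, max_len: int = 20):
        self.min_len = min_len
        self.max_len = max_len

    def forward(self, action, observation) -> float:
        return 1.0 if self.min_len <= len(action) <= self.max_len else 0.0

    def state_dict(self) -> dict:
        # Only plain, JSON-friendly values, so the dict can sit next to a checkpoint.
        return {"min_len": self.min_len, "max_len": self.max_len}

    def load_state_dict(self, state: dict) -> None:
        self.min_len = state["min_len"]
        self.max_len = state["max_len"]


tuned = LengthRubric(min_len=10, max_len=40)
payload = json.dumps(tuned.state_dict())       # what you would write to disk

restored = LengthRubric()                      # starts with default thresholds
restored.load_state_dict(json.loads(payload))  # now matches the tuned rubric
```

Keeping the state to plain numbers and strings means the same dict round-trips through JSON or any checkpoint format without custom serialisers.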
Composing Rubrics
The real power shows up when you stack rubrics. openenv.core.rubrics ships with four containers.
WeightedSum — multi-criteria averaging
Use when several independent criteria each contribute to the final score.
from openenv.core.rubrics import WeightedSum


class TestsPassRubric(Rubric):
    def forward(self, action, observation) -> float:
        return observation.tests_passed / max(observation.tests_total, 1)


class StyleRubric(Rubric):
    def forward(self, action, observation) -> float:
        return 1.0 if action.code.count("\n\n\n") == 0 else 0.6


reward = WeightedSum(
    [TestsPassRubric(), StyleRubric()],
    weights=[0.7, 0.3],
)

Weights must sum to 1.0. WeightedSum evaluates its children with asyncio.gather when any of them is async, so an LLM-backed child does not block the synchronous ones.
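The arithmetic behind that composite is a dot product of child scores and weights. The helper below is an illustrative re-implementation, not the library's code; the tolerance used for the sum-to-1.0 check is an assumption.

```python
import math


def weighted_sum(scores, weights, tol=1e-6):
    """Combine child scores into one reward; weights must sum to 1.0."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per score")
    if not math.isclose(sum(weights), 1.0, abs_tol=tol):
        raise ValueError("weights must sum to 1.0")
    return sum(s * w for s, w in zip(scores, weights))


tests_score = 3 / 4   # 3 of 4 tests pass
style_score = 0.6     # style penalty applied
reward = weighted_sum([tests_score, style_score], [0.7, 0.3])
# 0.7 * 0.75 + 0.3 * 0.6 = 0.525 + 0.18 = 0.705
```

Because the weights sum to 1.0, the result stays in the same range as the child scores, which keeps composites comparable when you nest them.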
Gate — hard constraints
Use when a child score below a threshold should short-circuit the reward to zero.
from openenv.core.rubrics import Gate

reward = Gate(TestsPassRubric(), threshold=0.5)  # 0.0 if fewer than half the tests pass

Gate returns 0.0 when the child score is below the threshold, and passes the child score through unchanged otherwise.
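The boundary case is worth spelling out. The function below is a stand-in for the rule as stated, reading "below" as strictly less than, so a score exactly equal to the threshold passes. That reading is an assumption, though it is the one under which a threshold of 1.0 on a 0-or-1 check can ever pass.

```python
def gate(child_score: float, threshold: float) -> float:
    """Return 0.0 below the threshold, otherwise the child score unchanged."""
    return child_score if child_score >= threshold else 0.0


below = gate(0.4, threshold=0.5)     # 0.0: fewer than half the tests pass
at_edge = gate(0.5, threshold=0.5)   # 0.5: exactly at the threshold passes
above = gate(0.8, threshold=0.5)     # 0.8: passed through unchanged
binary = gate(1.0, threshold=1.0)    # 1.0: an all-or-nothing check that succeeded
```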
Sequential — fail-fast pipeline
Use when criteria are ordered: a later criterion only matters if the earlier ones passed. Sequential returns 0.0 the moment any child returns 0.0 and does not evaluate the remaining children — great for gating expensive checks like sandboxed test runs or LLM calls.
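The fail-fast behaviour can be sketched with plain functions and a call counter. This is an illustrative stand-in, not the library's code; in particular, returning the last child's score when every child passes is an assumption, since the text above only specifies the zero case.

```python
def sequential(children, action, observation) -> float:
    """Run children in order and stop at the first 0.0."""
    score = 0.0
    for child in children:
        score = child(action, observation)
        if score == 0.0:
            return 0.0   # fail fast: later children are never called
    return score         # assumption: the last child's score is the result


calls = {"cheap": 0, "expensive": 0}


def compiles(action, observation) -> float:
    calls["cheap"] += 1
    return 1.0 if "def " in action else 0.0


def expensive_tests(action, observation) -> float:
    calls["expensive"] += 1   # imagine a sandboxed test run or an LLM call here
    return 0.75


bad = sequential([compiles, expensive_tests], "print('hi')", None)
good = sequential([compiles, expensive_tests], "def f(): pass", None)
```

After both calls the cheap check has run twice and the expensive one only once, which is the saving that ordering buys you.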
from openenv.core.rubrics import Sequential

reward = Sequential(
    Gate(CompilesRubric(), threshold=1.0),   # skip everything if it doesn't compile
    Gate(TestsPassRubric(), threshold=0.5),  # and skip style if tests are failing
    WeightedSum([TestsPassRubric(), StyleRubric()], [0.7, 0.3]),
)

RubricList and RubricDict: dynamic dispatch
When the right rubric depends on the current observation (e.g. one rubric per game in a multi-game environment), wrap the options in a RubricList or RubricDict and dispatch in your parent rubric’s forward.
from openenv.core.rubrics import Rubric, RubricDict


class MultiGameRubric(Rubric):
    def __init__(self):
        super().__init__()
        self.games = RubricDict({
            "pong": PongRubric(),
            "breakout": BreakoutRubric(),
        })

    def forward(self, action, observation) -> float:
        return self.games[observation.game_id](action, observation)

RubricList and RubricDict do not aggregate on their own; calling them directly raises an error. Their job is auto-registration (so their children show up in named_rubrics()) and indexed access. Reach for them when the parent rubric needs to pick a child at runtime based on the observation. If the set of children is fixed, plain attributes are simpler.
Introspection: named_rubrics()
Assigning a child rubric as an attribute auto-registers it with the parent. Training code can then walk the tree:
composite = WeightedSum(
    [Gate(CompilesRubric(), 1.0), TestsPassRubric(), StyleRubric()],
    [0.2, 0.5, 0.3],
)

for name, child in composite.named_rubrics():
    print(f"{name:30s} last_score={child.last_score}")

After running the composite once, every component’s most recent score is cached on last_score, so no manual bookkeeping is needed.
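For intuition, attribute-based auto-registration can be sketched in a few lines, in the style of torch.nn.Module. Node, Leaf, Average, and Root are hypothetical classes, and the dotted path names are an assumption about how nested children might be labelled; OpenEnv's actual naming may differ.

```python
class Node:
    """Hypothetical base: child Nodes assigned as attributes register themselves."""

    def __init__(self):
        object.__setattr__(self, "_children", {})
        self.last_score = None

    def __setattr__(self, name, value):
        if isinstance(value, Node):
            self._children[name] = value   # remember the child under its attribute name
        object.__setattr__(self, name, value)

    def named_rubrics(self, prefix=""):
        # Depth-first walk yielding (dotted path, node) pairs.
        for name, child in self._children.items():
            path = f"{prefix}.{name}" if prefix else name
            yield path, child
            yield from child.named_rubrics(path)


class Leaf(Node):
    def __init__(self, value: float):
        super().__init__()
        self.value = value

    def __call__(self, action, observation) -> float:
        self.last_score = self.value
        return self.last_score


class Average(Node):
    def __init__(self, tests: Node, style: Node):
        super().__init__()
        self.tests = tests
        self.style = style

    def __call__(self, action, observation) -> float:
        total = self.tests(action, observation) + self.style(action, observation)
        self.last_score = total / 2
        return self.last_score


class Root(Node):
    def __init__(self):
        super().__init__()
        self.quality = Average(Leaf(1.0), Leaf(0.5))

    def __call__(self, action, observation) -> float:
        self.last_score = self.quality(action, observation)
        return self.last_score


root = Root()
root(None, None)
report = {name: child.last_score for name, child in root.named_rubrics()}
```

The resulting report maps quality, quality.tests, and quality.style to their cached scores, with no registration calls anywhere in the subclasses.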
LLM-as-judge: LLMJudge
When a criterion is too subjective for a handwritten heuristic (“is this argument persuasive?”, “is this explanation clear?”), use an LLM as the judge. LLMJudge wraps an LLMClient with a prompt template and a score extractor.
Any OpenAI-compatible endpoint works: hosted OpenAI / Anthropic, or open-weight models served through vLLM, Ollama, Hugging Face Inference Providers, etc. Pick a client and hand it to LLMJudge:
import os

from openenv.core.llm_client import OpenAIClient, create_llm_client
from openenv.core.rubrics import LLMJudge

# Option 1: hosted OpenAI (the factory also supports "anthropic").
client = create_llm_client(
    "openai",
    model="gpt-4.1-mini",
    api_key=os.environ["OPENAI_API_KEY"],
)

# Option 2: open-weight model served via a local OpenAI-compatible endpoint
# (vLLM, Ollama, Hugging Face Inference Providers, …). Point OpenAIClient
# at the base URL and the model id the server exposes. `api_key` is optional
# and defaults to "not-needed" for local endpoints.
client = OpenAIClient(
    endpoint="http://localhost",
    port=8000,
    model="Qwen/Qwen3-1.7B",
)

clarity_judge = LLMJudge(
    client=client,
    prompt_template=(
        "Rate the clarity of this explanation on a 0-10 scale. "
        "Reply with the number only.\n\n"
        "Explanation:\n{action}\n"
    ),
    score_pattern=r"(\d+(?:\.\d+)?)",
    normalize=True,  # clamps extracted score to [0, 1]
)

LLMJudge.forward is async. When you put it inside WeightedSum or Sequential, the container awaits it transparently. A few caveats worth stating up front:
- Cost and latency scale with the number of episodes and the number of rubric calls per step. Putting a Sequential + Gate stage earlier in the pipeline is the usual answer.
- Determinism is not free. Cache scores when you can, and consider temperature 0 for repeatable eval runs.
- API keys belong in environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, …), not in code that ships to the Hub.
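One detail in the judge example above deserves a second look. Its comment describes normalize=True as clamping the extracted score to [0, 1]. If clamping is all it does, a reply on a 0-10 scale would saturate at 1.0 for any rating of 1 or more, so it may be worth either asking the judge for a 0-1 score or rescaling before clamping. The helper below sketches the second option outside the library; extract_score, scale_max, and default are hypothetical names, and only the regular expression is taken from the example.

```python
import re

SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)")   # same pattern as score_pattern above


def extract_score(reply: str, scale_max: float = 10.0, default: float = 0.0) -> float:
    """Pull the first number out of a judge reply and map it onto [0, 1]."""
    match = SCORE_PATTERN.search(reply)
    if match is None:
        return default                    # the judge did not reply with a number
    raw = float(match.group(1))
    scaled = raw / scale_max              # rescale 0-10 to 0-1 before clamping
    return min(max(scaled, 0.0), 1.0)


plain = extract_score("7")                       # 0.7
decorated = extract_score("Score: 8.5/10")       # 0.85, first number wins
missing = extract_score("I cannot rate this.")   # falls back to the default
overshoot = extract_score("12")                  # clamped to 1.0
```

Because the pattern takes the first number it finds, a chatty reply that restates the scale before giving the rating would be misread, which is one reason the prompt asks for the number only.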
Delayed Rewards: TrajectoryRubric
Some signals only materialise at the end of an episode — chess win/loss, unit-test suite success, a goal reached after many steps. TrajectoryRubric accumulates (action, observation) pairs internally and only invokes your scoring logic on the terminal observation.
from openenv.core.rubrics import TrajectoryRubric


class WinLossRubric(TrajectoryRubric):
    def score_trajectory(self, trajectory) -> float:
        _, final_obs = trajectory[-1]
        return final_obs.reward  # +1 win, -1 loss, 0 draw

    def compute_step_rewards(self):
        # Credit assignment: distribute the final score across steps however you like.
        final = self.score_trajectory(self._trajectory)
        return [final] * len(self._trajectory)

forward(action, obs) returns intermediate_reward (default 0.0) until observation.done is True, then calls score_trajectory. After the episode ends, call rubric.compute_step_rewards() to get one reward per step, the same length as the trajectory. This is the hook for credit assignment: training code feeds these per-step rewards back into advantage estimation, return-to-go, or whatever your optimizer expects. ExponentialDiscountingTrajectoryRubric precomputes gamma^(T-1-t) * final_score for you; override compute_step_rewards in your subclass if you want a different strategy (all-to-last, equal split, task-specific shaping).
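The discounting formula is easy to check by hand. The function below is a standalone worked example of gamma^(T-1-t) * final_score, where T is the number of steps and t runs from 0 to T-1; the function name and the default gamma are illustrative, not part of the library.

```python
def discounted_step_rewards(final_score: float, num_steps: int, gamma: float = 0.99) -> list:
    """Give step t the reward gamma ** (T - 1 - t) * final_score."""
    last = num_steps - 1
    return [gamma ** (last - t) * final_score for t in range(num_steps)]


# A won game (+1) over four steps with gamma = 0.5:
rewards = discounted_step_rewards(final_score=1.0, num_steps=4, gamma=0.5)
# step 0: 0.5 ** 3 = 0.125
# step 1: 0.5 ** 2 = 0.25
# step 2: 0.5 ** 1 = 0.5
# step 3: 0.5 ** 0 = 1.0
```

The final step always receives the full score and earlier steps receive geometrically less, so a smaller gamma concentrates credit near the end of the episode.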
If observation.done never becomes True, score_trajectory is never called and the trajectory grows unbounded in memory. Make sure step flips done on every terminal transition, and call self._reset_rubric() in Environment.reset so trajectories do not leak across episodes.
For the common exponentially-discounted case, subclass ExponentialDiscountingTrajectoryRubric instead and only implement score_trajectory:
from openenv.core.rubrics import ExponentialDiscountingTrajectoryRubric


class ChessOutcomeRubric(ExponentialDiscountingTrajectoryRubric):
    def score_trajectory(self, trajectory) -> float:
        _, final_obs = trajectory[-1]
        return final_obs.reward  # already +1 / 0 / -1 from the engine

This is exactly the pattern the built-in envs/chess_env/ uses; see envs/chess_env/server/rubrics.py for the complete real-world example.
The TrajectoryRubric keeps the trajectory in CPU memory. If your observation carries GPU tensors (images, embeddings), detach and move them to CPU before returning from step(); otherwise the trajectory holds onto GPU memory across the whole episode.
Wiring a Rubric into an Environment
Rubrics are server-side. Each environment declares its rubric in __init__, and step runs it via the _apply_rubric helper. The base Environment class accepts the rubric through its constructor and stores it as self.rubric.
Here is a complete minimal environment that composes a Sequential gate-then-WeightedSum pipeline and exposes the reward through its observation:
from openenv.core.env_server.interfaces import Environment
from openenv.core.env_server.types import Action, Observation, State
from openenv.core.rubrics import Gate, Rubric, Sequential, WeightedSum


class CodeAction(Action):
    code: str


class CodeObservation(Observation):
    compiles: bool = False
    tests_passed: int = 0
    tests_total: int = 0


class CodeState(State):
    attempts: int = 0


class CompilesRubric(Rubric):
    def forward(self, action, observation) -> float:
        return 1.0 if observation.compiles else 0.0


class TestsPassRubric(Rubric):
    def forward(self, action, observation) -> float:
        if observation.tests_total == 0:
            return 0.0
        return observation.tests_passed / observation.tests_total


class StyleRubric(Rubric):
    def forward(self, action, observation) -> float:
        return 1.0 if action.code.count("\n\n\n") == 0 else 0.6


def build_code_rubric() -> Rubric:
    return Sequential(
        Gate(CompilesRubric(), threshold=1.0),  # gate everything on compilation
        WeightedSum(
            [
                TestsPassRubric(),
                StyleRubric(),
            ],
            weights=[0.7, 0.3],
        ),
    )


class CodeEnvironment(Environment[CodeAction, CodeObservation, CodeState]):
    def __init__(self):
        super().__init__(rubric=build_code_rubric())
        self._state = CodeState()

    def reset(self, seed=None, episode_id=None, **kwargs) -> CodeObservation:
        self._reset_rubric()  # clear any trajectory / cached last_score
        self._state = CodeState()
        return CodeObservation()

    def step(self, action: CodeAction, timeout_s=None, **kwargs) -> CodeObservation:
        self._state.attempts += 1
        obs = self._run_code(action)  # your domain-specific execution
        obs.reward = self._apply_rubric(action, obs)
        return obs

    @property
    def state(self) -> CodeState:
        return self._state

    def _run_code(self, action: CodeAction) -> CodeObservation:
        # Placeholder for whatever your environment actually does.
        compiles = "def " in action.code
        return CodeObservation(
            compiles=compiles,
            tests_passed=3 if compiles else 0,
            tests_total=3,
        )

The three pieces the base class expects from you:
- Pass the rubric to super().__init__(rubric=...) so self.rubric is set.
- Call self._reset_rubric() from reset so trajectory state does not leak between episodes.
- Call self._apply_rubric(action, obs) from step and attach the result to obs.reward. There is also _apply_rubric_async for step_async.
Some environments already compute obs.reward from game mechanics or a handcrafted multi-component signal (see envs/chess_env/ and envs/carla_env/). In that case, call self._apply_rubric(action, obs) without assigning its return value. The rubric still accumulates the trajectory for compute_step_rewards() and still exposes per-component scores via named_rubrics(), but obs.reward stays authoritative.
Inspecting rewards from training code
Because children are auto-registered, the training loop can walk the rubric tree and log component-level diagnostics without the environment exposing a custom API:
env = CodeEnvironment()
obs = env.reset()
obs = env.step(CodeAction(code="def solution(): return 42"))

for name, component in env.rubric.named_rubrics():
    print(f"{name:30s} last_score={component.last_score:.2f}")

That snippet works for any OpenEnv environment that sets self.rubric, regardless of whether the rubric is a single scalar or a deeply nested composition.
Where the reward ends up during training
Training frameworks consume the reward through the same channel as any other OpenEnv observation field: step() returns an Observation whose reward is the rubric’s output, and the client delivers it via result.reward.
With TRL, the recommended path is GRPOTrainer’s environment_factory: you define a thin wrapper class with tool methods that call the OpenEnv client, store self.reward = result.observation.reward after each step, and a plain reward function reads it off the environments parameter. The TRL OpenEnv integration guide has the full recipe, and examples/scripts/openenv/ ships ready-to-run scripts. The same observation shape works with torchforge and other OpenEnv-compatible training stacks.
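The shape of that wrapper can be sketched without TRL or a running server. Everything below is hypothetical scaffolding: StubClient stands in for a real OpenEnv client, CodeToolEnv and its submit tool method are made-up names, and the reward function's signature simply follows the description above. Check the TRL OpenEnv integration guide for the exact interface GRPOTrainer expects.

```python
from dataclasses import dataclass


@dataclass
class StubObservation:
    reward: float


@dataclass
class StubResult:
    observation: StubObservation


class StubClient:
    """Hypothetical stand-in for an OpenEnv client talking to a server."""

    def step(self, code: str) -> StubResult:
        # A real client would send the action to the environment server,
        # where the rubric computes the reward.
        return StubResult(StubObservation(reward=1.0 if "def " in code else 0.0))


class CodeToolEnv:
    """Thin wrapper: tool methods call the client and remember the reward."""

    def __init__(self, client=None):
        self.client = client or StubClient()
        self.reward = 0.0

    def submit(self, code: str) -> str:
        """Tool method the policy can call with a candidate solution."""
        result = self.client.step(code)
        self.reward = result.observation.reward   # stored after every step
        return f"reward={self.reward:.2f}"


def env_reward(environments, **kwargs):
    """Plain reward function: read the stored reward off each environment."""
    return [env.reward for env in environments]


envs = [CodeToolEnv(), CodeToolEnv()]
envs[0].submit("def f(): pass")
envs[1].submit("print(1)")
rewards = env_reward(environments=envs)
```

The reward function itself does no scoring; it only reports what the environment's rubric already decided, which keeps reward logic on the server side.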
named_rubrics() is orthogonal: use it to log per-component scores (to Weights & Biases, TensorBoard, trackio, …) while training, without changing the reward the optimiser sees.
Using Rubrics for Evaluation
A rubric is just a callable — nothing forces you to run it inside a training loop. Drop it into a for-loop over a static dataset and you have a multi-criteria scoring function for offline eval:
rubric = build_code_rubric()
scores = []

for action, obs in eval_dataset:
    scores.append(rubric(action, obs))

print(f"mean reward: {sum(scores) / len(scores):.3f}")

for name, component in rubric.named_rubrics():
    print(f"  {name:30s} last_score={component.last_score:.3f}")

The same rubric object used to compute training rewards doubles as the eval metric, giving you one source of truth for “what is a good response”. Per-component last_score gives you a per-criterion breakdown for free (useful for regression dashboards and failure analysis). Because last_score holds only the most recent call, the loop above reports the breakdown for the final sample; read it inside the loop if you want one breakdown per sample. When a component like LLMJudge is async, wrap the loop with asyncio.run(...) and await rubric(action, obs) so the judge calls can overlap.
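For the async case, one way to let judge calls overlap is asyncio.gather with a semaphore as a concurrency cap. The sketch below uses only the standard library; slow_judge is a hypothetical stand-in for an async rubric, with a short sleep in place of network latency, and max_concurrency is an illustrative parameter.

```python
import asyncio


async def slow_judge(action, observation) -> float:
    """Hypothetical async rubric: pretend to wait on a remote judge."""
    await asyncio.sleep(0.05)                 # simulated network latency
    return min(len(action) / 10, 1.0)


async def evaluate(rubric, dataset, max_concurrency: int = 8):
    """Score every (action, observation) pair with overlapping calls."""
    semaphore = asyncio.Semaphore(max_concurrency)   # cap in-flight requests

    async def score_one(action, observation):
        async with semaphore:
            return await rubric(action, observation)

    # gather returns results in input order, whatever order calls finish in.
    return await asyncio.gather(*(score_one(a, o) for a, o in dataset))


dataset = [("hello", None), ("hello world!", None), ("hi", None)]
scores = asyncio.run(evaluate(slow_judge, dataset))
mean_reward = sum(scores) / len(scores)
```

One caveat with this pattern: when many calls share a single rubric object, a cached last_score reflects whichever call finished most recently, so collect per-sample breakdowns from the returned values rather than from the cache.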
Next Steps
- Real-world trajectory example: walk through envs/chess_env/server/rubrics.py and chess_environment.py to see ExponentialDiscountingTrajectoryRubric wired into a game environment.
- Design details: RFC 004 covers the rationale for the composable API and the “rewards inside the environment” invariant.
- Reward design basics: the Reward Design guide covers sparse-vs-dense signals and common pitfalls that still apply on top of any rubric composition.
- Training loop integration: see the RL Framework Integration guide and the TRL OpenEnv integration guide for the recommended environment_factory pattern.