How to use OpenRubrics/RubricARROW-8B-Judge with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="OpenRubrics/RubricARROW-8B-Judge")
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("OpenRubrics/RubricARROW-8B-Judge")
model = AutoModelForCausalLM.from_pretrained("OpenRubrics/RubricARROW-8B-Judge")
messages = [
{"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
How to use OpenRubrics/RubricARROW-8B-Judge with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OpenRubrics/RubricARROW-8B-Judge"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "OpenRubrics/RubricARROW-8B-Judge",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
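The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the previous step is listening on localhost:8000; the helper names `build_payload` and `chat` are illustrative and not part of any library.

```python
import json
import urllib.request

MODEL_ID = "OpenRubrics/RubricARROW-8B-Judge"


def build_payload(content, model=MODEL_ID):
    """Build the same JSON body that the curl example sends."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(content, base_url="http://localhost:8000"):
    """POST one user message to the OpenAI-compatible endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the model's reply as a string. For an SGLang server started on port 30000, pass `base_url="http://localhost:30000"`.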
How to use OpenRubrics/RubricARROW-8B-Judge with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "OpenRubrics/RubricARROW-8B-Judge" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "OpenRubrics/RubricARROW-8B-Judge",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
# Alternatively, start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "OpenRubrics/RubricARROW-8B-Judge" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "OpenRubrics/RubricARROW-8B-Judge",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
How to use OpenRubrics/RubricARROW-8B-Judge with Docker Model Runner:
docker model run hf.co/OpenRubrics/RubricARROW-8B-Judge
This is the 8B RubricARROW-Judge model, fine-tuned from Qwen/Qwen3-8B and introduced in the paper "RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains".
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "OpenRubrics/RubricARROW-8B-Judge"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
To use the model as a judge, build the input message in the following format.
The rubric_item text should be generated with a RubricARROW-Rubric model.
JUDGE_PROMPT_TEMPLATE = """
Your job is to look at a conversation and a set of rubric items, and score the last turn (i.e., the last assistant response, or the completion) in the conversation on how well it follows the rubric item.
# Conversation
<<conversation>>
# Rubric item
<<rubric_item>>
# Instructions
Return a json object. For each rubric item i (starting from 1), keys must be exactly "explanation_i" and "criteria_met_i" for each i and it includes two top-level fields in the JSON object:
- The "explanation_i" field should be a string explaining why the response does or does not meet the criteria of the rubric item.
- The "criteria_met_i" field should be a boolean indicating (true/false) whether the response meets the criteria of the rubric item. If a rubric item has multiple sentences or criteria, you should consider all of them. If any of the criteria is not met, the answer should be false. Only return true is all of the criteria are met.
- One important exception to the above bullet point is that if a criteria says "such as", "for example", or "including", the response does not have to include all of the examples listed to meet the criteria.
# Final Output Format (a single JSON object, not an array)
{
"explanation_1": "...",
"criteria_met_1": true/false,
"explanation_2": "...",
"criteria_met_2": true/false,
... repeat this pattern for every rubric item i in order (i = 1, 2, 3, ...)
}
# Final instruction
Return just the json object. Do not include any other text in the response.
""".strip()
# instruction, response, and rubric_item are strings supplied by the caller.
conversation = f"user: {instruction}\nassistant: {response}"
user_text = (
JUDGE_PROMPT_TEMPLATE
.replace("<<conversation>>", conversation)
.replace("<<rubric_item>>", rubric_item)
)
messages_list = [
{"role": "user", "content": user_text},
]
message = tok.apply_chat_template(
messages_list,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
# Remaining step: Use either HF or vLLM for evaluation.
# ...
# ...
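Whichever backend produces the generation, the reply still has to be read back into per-item verdicts. The following is a minimal sketch of that parsing step, assuming the judge returns the single JSON object described in the prompt above; the helper name `parse_judge_output` and the sample reply are hypothetical, and stripping a `<think>` block is a defensive assumption rather than documented model behavior.

```python
import json
import re


def parse_judge_output(text):
    """Turn the judge's JSON reply into a list of per-item verdicts."""
    # Assumption: drop a <think>...</think> block if one appears before the JSON.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    obj = json.loads(text[start:end + 1])

    verdicts = []
    i = 1
    # Keys are numbered from 1: explanation_1, criteria_met_1, explanation_2, ...
    while f"criteria_met_{i}" in obj:
        verdicts.append({
            "explanation": obj.get(f"explanation_{i}", ""),
            "criteria_met": obj[f"criteria_met_{i}"] is True,
        })
        i += 1
    return verdicts


# Hypothetical reply for a rubric with two items.
sample = (
    '{"explanation_1": "Cites a source.", "criteria_met_1": true, '
    '"explanation_2": "Exceeds the word limit.", "criteria_met_2": false}'
)
print([v["criteria_met"] for v in parse_judge_output(sample)])  # [True, False]
```

Treating only a literal JSON `true` as met keeps a malformed value such as the string "false" from being counted as a pass.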
For probability-based scoring, we compute the final score as follows:
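def weight(tags):
    # Normalize tags so matching ignores case and surrounding whitespace.
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0

def group_score(rubric_outputs):
    # Sum the weighted margin (true_prob - false_prob) over all rubric items.
    return sum(
        (x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
        for x in rubric_outputs
    )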
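As a worked example of the scoring rule above, the sketch below repeats the two functions so that it runs on its own and applies them to three rubric items. The probabilities and tags are made-up illustration values, not results from the model.

```python
def weight(tags):
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0


def group_score(rubric_outputs):
    return sum(
        (x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
        for x in rubric_outputs
    )


# Hypothetical per-item outputs for one response.
rubric_outputs = [
    {"true_prob": 0.9, "false_prob": 0.1, "tags": ["Hard Rule"]},  # (0.9 - 0.1) * 3.0 = 2.4
    {"true_prob": 0.3, "false_prob": 0.7, "tags": ["Principle"]},  # (0.3 - 0.7) * 1.0 = -0.4
    {"true_prob": 0.8, "false_prob": 0.2, "tags": []},             # no known tag, weight 0.0
]

print(round(group_score(rubric_outputs), 6))  # 2.0
```

Items tagged as a hard rule count three times as much as principles, and items with neither tag contribute nothing, so the score ranges from minus to plus the sum of the weights.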
If you find our work helpful, please consider citing our paper:
@misc{jiang2026rubric,
title={RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains},
author={Haoxiang Jiang and Zihan Dong and Tianci Liu and Wanying Wang and Ran Xu and Tony Yu and Linjun Zhang and Haoyu Wang},
year={2026},
eprint={2605.29156},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.29156},
}