humor-r1 reward model — Qwen2.5-VL-3B + LoRA + scalar head (OOD-usable)
A Bradley-Terry preference reward model for cartoon captions. Given an image and a candidate caption, it returns a scalar funniness score. It is used as the reward signal for GRPO fine-tuning of caption-writing policies.
This adapter sits on top of Qwen/Qwen2.5-VL-3B-Instruct. We add a
scalar score head that takes the hidden state of the last non-pad token
and maps it to a score with a single nn.Linear(2048, 1, bias=False). The
LoRA adapter and the head are trained together with the standard
Bradley-Terry loss
-log σ(score(chosen) - score(rejected)).
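As a minimal sketch of the objective above (not the training code behind this model), the per-pair loss can be written in plain Python. The function name `bt_loss` is an illustrative assumption; a real training loop would compute the same quantity on batched tensors, for example with `torch.nn.functional.logsigmoid`.

```python
import math


def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one pair: -log(sigmoid(chosen - rejected))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal scores give log(2): the model is indifferent between the two captions.
print(bt_loss(0.0, 0.0))
# A correctly ordered pair is penalized less than a reversed one.
print(bt_loss(1.0, 0.0), bt_loss(0.0, 1.0))
```

The loss depends only on the margin, so adding the same constant to both scores leaves it unchanged.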
What's different from rm-qwen25vl-3b-20k
The earlier RM used the dataset's prompt field, which carries
GPT-4o-generated Scene/Twist/Location/Entities descriptions for the
specific 813 cartoons in caption_sft_train. Including them amounted to
hand-feeding the answer to a vision-language model, and it made the RM
unusable on cartoons without those annotations.
This RM ignores the prompt field: every cartoon gets the same generic "Write a funny one-line caption" framing, so it can be applied to any single-panel cartoon image without annotations.
The cost is about 2 pp of pairwise accuracy on the same held-out eval set (0.682 → 0.664), which we consider worth it for OOD usability.
Headline numbers
Held-out validation, n=2000 BT pairs (image + chosen vs rejected):
| metric | value |
|---|---|
| pairwise accuracy | 0.6635 |
| reward margin (chosen − rejected) | +0.626 ± 1.42 std |
| BT loss | 0.626 |
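For reference, the three metrics in the table above can be computed from raw scores roughly as follows. This is a sketch, not the evaluation script used for this model: the function name `pairwise_metrics`, counting ties as incorrect, and using the population standard deviation are all assumptions, and the toy scores are made up.

```python
import math
import statistics


def pairwise_metrics(chosen: list[float], rejected: list[float]) -> dict[str, float]:
    """Summarize reward-model scores over (chosen, rejected) pairs."""
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need two non-empty score lists of equal length")
    margins = [c - r for c, r in zip(chosen, rejected)]
    # softplus(-m) == -log(sigmoid(m)), written to stay stable for large |m|.
    losses = [max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins]
    return {
        # Assumption: a tie (margin == 0) counts as a wrong ranking.
        "pairwise_accuracy": sum(m > 0 for m in margins) / len(margins),
        "margin_mean": statistics.fmean(margins),
        # Assumption: population standard deviation of the margins.
        "margin_std": statistics.pstdev(margins),
        "bt_loss": statistics.fmean(losses),
    }


# Toy scores for four pairs (illustrative values only).
print(pairwise_metrics([1.0, 0.5, -0.2, 2.0], [0.0, 1.0, -0.4, 0.5]))
```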
In-flight val tracking (n=200, every 125 steps): 0.675, 0.625, 0.660, 0.670, 0.640 (mean 0.654). Margins grew overall (+0.39 → +0.69 → +0.63), rising early and then leveling off. Since accuracy stayed roughly flat, this suggests the model mainly gained confidence on the pairs it already ranks correctly.
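The arithmetic behind these in-flight numbers can be checked with a short sketch. It assumes that one epoch consumes all 20 000 pairs at the effective batch size of 32 given in the training recipe, and it treats validation pairs as independent draws when estimating noise (an approximation, since several pairs can share a cartoon). The helper name `accuracy_std_error` is illustrative.

```python
import math

TRAIN_PAIRS = 20_000
EFFECTIVE_BATCH = 32      # per-device 4 x grad-accum 8 x 1 GPU
EVAL_EVERY = 125          # optimizer steps between in-flight evaluations

steps = TRAIN_PAIRS // EFFECTIVE_BATCH       # optimizer steps in one epoch
num_evals = steps // EVAL_EVERY              # number of in-flight evaluations

in_flight = [0.675, 0.625, 0.660, 0.670, 0.640]
mean_acc = sum(in_flight) / len(in_flight)


def accuracy_std_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate p measured on n pairs."""
    return math.sqrt(p * (1.0 - p) / n)


se_in_flight = accuracy_std_error(mean_acc, 200)   # noise of one n=200 check
se_held_out = accuracy_std_error(0.6635, 2000)     # noise of the n=2000 eval
print(steps, num_evals, round(mean_acc, 3), round(se_in_flight, 3), round(se_held_out, 3))
```

Under these assumptions the in-flight values all lie within about one standard error of their mean, so the checkpoint-to-checkpoint fluctuation is consistent with sampling noise on 200 pairs.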
Training recipe
| setting | value |
|---|---|
| backbone | Qwen/Qwen2.5-VL-3B-Instruct (LoRA-adapted) |
| LoRA | r=32, α=32, target_modules="all-linear", bias=none |
| score head | nn.Linear(2048, 1, bias=False), zero-initialized |
| pooling | last non-pad token of the (single) user turn |
| message format | one user turn: image + "Write a funny one-line caption ... Candidate caption: {X} ... Judge how funny this caption is for the cartoon." |
| optimizer | AdamW (fused), weight_decay=0 |
| LR | 2e-4 constant, no warmup |
| max_grad_norm | 1.0 |
| effective batch size | 32 (per-device 4 × accum 8 × 1 GPU) |
| precision | bf16, FlashAttention-2, no gradient checkpointing |
| epochs | 1 |
| image preprocessing | long-edge resized to 448 px |
| training pairs | 20 000 BT pairs (3-σ filter, ≤1000/contest, from caption_sft_train) |
| hardware | 1 × NVIDIA A100-SXM4-80GB |
| wall clock | ~62 min |
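The image preprocessing row can be approximated with the sketch below, which may be useful for matching the training resolution before scoring. The helper names, the rounding rule, the bicubic filter, and the choice to upscale small images are assumptions; the recipe only states that the long edge is resized to 448 px.

```python
def long_edge_size(width: int, height: int, target: int = 448) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side equals `target`."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    # Rounding and the 1 px minimum are assumptions; the recipe does not specify them.
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_long_edge(image, target: int = 448):
    """Resize a PIL image so its long edge is `target` px, keeping aspect ratio."""
    from PIL import Image  # imported lazily so the size helper works without Pillow

    size = long_edge_size(image.width, image.height, target)
    # Bicubic resampling is an assumption; the training filter is not documented.
    return image.resize(size, Image.BICUBIC)


print(long_edge_size(1280, 960))
print(long_edge_size(600, 900))
```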
Files
- backbone_adapter/: LoRA adapter on Qwen2.5-VL-3B-Instruct
- processor/: Qwen2.5-VL processor (image processor + tokenizer)
- reward_head.pt: nn.Linear(2048, 1, bias=False) state_dict
- reward_model_config.json: base model id + score head shape
- eval_2k.json: 2K-pair eval JSON
Usage
import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from PIL import Image
from torch import nn
from transformers import AutoModel, AutoProcessor

# Download the adapter, processor, and score head from the Hub.
local = snapshot_download("HumorR1/rm-qwen25vl-3b-nodesc")

# Load the base model and attach the LoRA adapter.
# (Older transformers releases expect torch_dtype= instead of dtype=.)
base = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    dtype=torch.bfloat16, attn_implementation="sdpa",
)
backbone = PeftModel.from_pretrained(base, f"{local}/backbone_adapter")

# Rebuild the scalar score head and load its trained weights.
score_head = nn.Linear(base.config.text_config.hidden_size, 1, bias=False).to(torch.bfloat16)
score_head.load_state_dict(torch.load(f"{local}/reward_head.pt", map_location="cpu"))

processor = AutoProcessor.from_pretrained(f"{local}/processor")
backbone.eval().to("cuda")
score_head.eval().to("cuda")


@torch.no_grad()
def score(image: Image.Image, caption: str) -> float:
    # Same generic framing as in training; no per-cartoon description.
    text = (
        "Write a funny one-line caption for this New Yorker-style cartoon.\n\n"
        f"Candidate caption: {caption}\n\n"
        "Judge how funny this caption is for the cartoon."
    )
    messages = [
        {"role": "user", "content": [
            {"type": "image"}, {"type": "text", "text": text},
        ]}
    ]
    chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    inputs = processor(text=[chat], images=[image], return_tensors="pt", padding=True).to("cuda")
    out = backbone(**inputs, return_dict=True)
    last_hidden = out.last_hidden_state
    # Index of the last non-pad token. This assumes right padding; a single
    # example, as here, has no padding at all.
    last_idx = inputs["attention_mask"].long().sum(dim=1) - 1
    pooled = last_hidden[
        torch.arange(last_hidden.size(0), device=last_hidden.device), last_idx
    ]
    return float(score_head(pooled.to(score_head.weight.dtype)).item())
A higher score means "funnier". Only the difference between two scores is meaningful, not the absolute value, because the Bradley-Terry loss is invariant to adding a constant to all scores.
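Because the model is trained with a Bradley-Terry loss, a score difference can be read as the log-odds that one caption is preferred over another, to the extent that the model is calibrated (which this card does not measure). The sketch below shows that reading and a simple ranking of candidates for a single cartoon. The function names and the toy scores are hypothetical; in practice the scores would come from `score(image, caption)` above. Comparing captions of the same cartoon matches the pair format used in training (image + chosen vs rejected); comparisons across cartoons are not covered by that training signal as we read it.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that caption A is preferred over caption B."""
    diff = score_a - score_b
    # Numerically stable sigmoid(diff).
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rank_captions(scores: dict[str, float]) -> list[str]:
    """Order captions for ONE cartoon from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for three captions of the same cartoon.
toy_scores = {"caption A": 0.8, "caption B": -0.1, "caption C": 0.3}
print(rank_captions(toy_scores))
print(preference_probability(toy_scores["caption A"], toy_scores["caption B"]))
```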
Limitations
- Trained on a New Yorker–specific humor distribution; OOD performance on other cartoon styles is unverified.
- BT pairs were filtered to a 3-σ rating gap, so the RM has only been trained on clear-cut preferences; expect lower accuracy on subtle pairs.
- The model is 3B + LoRA; the same recipe should carry over to Qwen2.5-VL-7B, which may give stronger ranking accuracy, but this has not been tested here.