humor-r1 reward model — Qwen2.5-VL-3B + LoRA + scalar head (OOD-usable)

A Bradley-Terry preference reward model for cartoon captions. Given an image and a candidate caption, it returns a scalar funniness score. Used as the reward signal for GRPO fine-tuning of caption-writing policies.

This adapter sits on top of Qwen/Qwen2.5-VL-3B-Instruct. We add a scalar score head that maps the hidden state of the last non-pad token through a single nn.Linear(2048, 1, bias=False), and train the LoRA adapter and the head together with the standard Bradley-Terry loss -log σ(score(chosen) - score(rejected)).
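
As a rough illustration of the objective above, here is the per-pair loss in plain Python, written in a numerically stable form. This is a sketch, not the training code (which is not part of this card); the name `bt_loss` is illustrative.

```python
import math


def bt_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss: -log sigmoid(score(chosen) - score(rejected))."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")
    total = 0.0
    for s_chosen, s_rejected in zip(chosen_scores, rejected_scores):
        d = s_chosen - s_rejected
        # -log sigmoid(d) == softplus(-d) == max(-d, 0) + log1p(exp(-|d|)),
        # which avoids overflow for large |d|.
        total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / len(chosen_scores)
```

With the zero-initialized score head listed under Training recipe, every score is 0 at the first step, so this loss starts at ln 2 ≈ 0.693. Adding the same constant to all scores leaves the loss unchanged.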

What's different from `rm-qwen25vl-3b-20k`

The earlier RM used the dataset's prompt field, which carries GPT-4o-generated Scene/Twist/Location/Entities descriptions for the specific 813 cartoons in caption_sft_train. Including them hand-fed the answer to a vision-language model and made the RM unusable on cartoons without those annotations.

This RM ignores the prompt field — every cartoon gets the same generic "Write a funny one-line caption" framing — so it generalizes to any single-panel cartoon image.

The cost: about 2 pp of pairwise accuracy on the same held-out eval set (0.682 → 0.664). Worth it for OOD usability.

Headline numbers

Held-out validation, n=2000 BT pairs (image + chosen vs rejected):

metric	value
pairwise accuracy	0.6635
reward margin (chosen − rejected)	+0.626 ± 1.42 std
BT loss	0.626
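
The three metrics above can all be derived from per-pair scores. The sketch below shows one way to do that; it assumes ties count as errors and that the reported std is the population standard deviation, neither of which the card states. `pairwise_metrics` is an illustrative name, not a function shipped with this repo.

```python
import math
from statistics import mean, pstdev


def pairwise_metrics(chosen_scores, rejected_scores):
    """Summarize a list of (chosen, rejected) score pairs."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return {
        # Fraction of pairs where the chosen caption scores strictly higher
        # (assumption: a tie counts as a miss).
        "pairwise_accuracy": mean(1.0 if m > 0 else 0.0 for m in margins),
        "margin_mean": mean(margins),
        # Assumption: population std; the card does not say which was used.
        "margin_std": pstdev(margins),
        # Stable -log sigmoid(margin), averaged over pairs.
        "bt_loss": mean(
            max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins
        ),
    }
```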

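In-flight val tracking (n=200, every 125 steps): 0.675, 0.625, 0.660, 0.670, 0.640 (mean 0.654). Accuracy stayed roughly flat while the margin rose from +0.39 to a peak of +0.69 and ended at +0.63, so the model grew more confident on the pairs it already ranks correctly rather than ranking more pairs correctly.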

Training recipe


setting	value
backbone	`Qwen/Qwen2.5-VL-3B-Instruct` (LoRA-adapted)
LoRA	r=32, α=32, `target_modules="all-linear"`, bias=none
score head	`nn.Linear(2048, 1, bias=False)`, zero-initialized
pooling	last non-pad token of the (single) user turn
message format	one user turn: image + "Write a funny one-line caption ... Candidate caption: {X} ... Judge how funny this caption is for the cartoon."
optimizer	AdamW (fused), weight_decay=0
LR	2e-4 constant, no warmup
max_grad_norm	1.0
effective batch size	32 (per-device 4 × accum 8 × 1 GPU)
precision	bf16, FlashAttention-2, no gradient checkpointing
epochs	1
image preprocessing	long-edge resized to 448 px
training pairs	20 000 BT pairs (3-σ filter, ≤1000/contest, from `caption_sft_train`)
hardware	1 × NVIDIA A100-SXM4-80GB
wall clock	~62 min
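
To match the image preprocessing row above at inference time, images can be resized before scoring. The sketch below assumes the long edge is scaled to exactly 448 px in both directions (up or down) with the aspect ratio preserved, and uses bicubic resampling; the card specifies neither detail, and both helper names are illustrative.

```python
def long_edge_size(width, height, target=448):
    """Return (new_width, new_height) with the longer side equal to `target`."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_long_edge(image, target=448):
    """Resize a PIL image so its long edge is `target` px, keeping aspect ratio."""
    # Imported lazily so long_edge_size stays dependency-free.
    # Image.Resampling requires Pillow >= 9.1.
    from PIL import Image

    new_size = long_edge_size(*image.size, target)
    # Assumption: bicubic resampling; the training filter is not documented.
    return image.resize(new_size, Image.Resampling.BICUBIC)
```

With the `score` function from the Usage section, this would be called as `score(resize_long_edge(img), caption)`.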

Files

backbone_adapter/ — LoRA adapter on Qwen2.5-VL-3B-Instruct
processor/ — Qwen2.5-VL processor (image processor + tokenizer)
reward_head.pt — nn.Linear(2048, 1, bias=False) state_dict
reward_model_config.json — base model id + score head shape
eval_2k.json — 2K-pair eval JSON

Usage

```python
import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from PIL import Image
from torch import nn
from transformers import AutoModel, AutoProcessor

# Download the adapter, processor, and score head from the Hub.
local = snapshot_download("HumorR1/rm-qwen25vl-3b-nodesc")
base = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    dtype=torch.bfloat16, attn_implementation="sdpa",
)
backbone = PeftModel.from_pretrained(base, f"{local}/backbone_adapter")

# Scalar head: hidden_size -> 1, no bias (hidden_size is 2048 for the 3B model).
score_head = nn.Linear(base.config.text_config.hidden_size, 1, bias=False).to(torch.bfloat16)
score_head.load_state_dict(torch.load(f"{local}/reward_head.pt", map_location="cpu"))

processor = AutoProcessor.from_pretrained(f"{local}/processor")
backbone.eval().to("cuda")
score_head.eval().to("cuda")


@torch.no_grad()
def score(image: Image.Image, caption: str) -> float:
    """Return the scalar funniness score for one (image, caption) pair."""
    text = (
        "Write a funny one-line caption for this New Yorker-style cartoon.\n\n"
        f"Candidate caption: {caption}\n\n"
        "Judge how funny this caption is for the cartoon."
    )
    messages = [
        {"role": "user", "content": [
            {"type": "image"}, {"type": "text", "text": text},
        ]}
    ]
    # Single user turn, no generation prompt: the head reads the end of this turn.
    chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    inputs = processor(text=[chat], images=[image], return_tensors="pt", padding=True).to("cuda")
    out = backbone(**inputs, return_dict=True)
    last_hidden = out.last_hidden_state
    # Index of the last non-pad token; this arithmetic assumes right padding
    # (a batch of one, as here, has no padding at all).
    last_idx = inputs["attention_mask"].long().sum(dim=1) - 1
    pooled = last_hidden[
        torch.arange(last_hidden.size(0), device=last_hidden.device), last_idx
    ]
    return float(score_head(pooled.to(score_head.weight.dtype)).item())
```

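A higher score means "funnier". Only the difference between two scores for the same cartoon is meaningful, not the absolute value: the Bradley-Terry loss is invariant to adding a constant to every score, and training pairs always compare two captions for the same image.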
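
Two common ways to consume score differences are sketched below: converting a difference into a Bradley-Terry preference probability, and standardizing scores within a group of sampled captions as GRPO does. Both helpers are illustrative and not shipped with this repo; in particular, the use of population std and the `eps` value are assumptions, since the policy trainer's exact normalization is not described here.

```python
import math
from statistics import mean, pstdev


def preference_probability(score_a, score_b):
    """Bradley-Terry probability that caption A is preferred over caption B."""
    d = score_a - score_b
    # Numerically stable sigmoid(d).
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def group_advantages(scores, eps=1e-6):
    """GRPO-style advantages: scores standardized within one group of samples.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]
```

Both functions depend only on differences between scores, so the arbitrary offset of the reward model cancels out. For scale, a difference of 0.626 (the mean held-out margin reported above) corresponds to a preference probability of roughly 0.65.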

Limitations

Trained on a New Yorker-specific humor distribution; accuracy on other cartoon styles is unverified.
BT pairs were filtered to a 3-σ rating gap, so the RM is most reliable on clear-cut preferences and less accurate on subtle pairs.
The model is 3B + LoRA; the same recipe should carry over to Qwen2.5-VL-7B (with the score head sized to that model's hidden dimension), which may rank more accurately, though no 7B results are reported here.
