IROH: Jokes on Gemma4-31B - Humor Retrieval Judge

CLEF 2026 · JOKER Track · Task 1 English · Team VANGUARD

Ana-Maria Luisa Mocanu · Sebastian Mocanu · Ciprian-Octavian Truică · Elena-Simona Apostol

Model Description

A QLoRA-finetuned gemma-4-31b-it adapter, trained as a Stage 3 LLM judge in the IROH humor retrieval pipeline. Given a query describing a humor topic and a candidate text, the model returns a soft YES/NO probability indicating whether the candidate is a relevant joke, pun, or wordplay.

The model is trained on generic rationales: one-sentence explanations of why a text is or is not a joke, generated by Gemma 4 using a lightweight "General Wordplay" query placeholder. The simplicity of this prompt produces more consistent supervision than the structured typed alternative. The model serves as a complementary corrector to the primary Qwen judge.

Models

Adapter folder	Base model	LoRA r	Training data	Ensemble weight	MAP (standalone)
adapter_model.safetensors	gemma-4-31b-it	32	Generic rationales, no aug	0.30	0.5718
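
The ensemble weight in the table can be read as this judge's coefficient when its score is combined with those of other judges. The sketch below shows one plausible reading, a weighted average normalised by the sum of the weights; the card does not confirm that this is the exact combination rule used in the pipeline. Only the 0.30 weight comes from the table. The judge names, the 0.70 weight for the Qwen judge, and the example scores are placeholders chosen for illustration.

```python
def ensemble_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-judge YES probabilities."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * p for name, p in scores.items()) / total


# 0.30 is taken from the table above; 0.70 is an assumed placeholder.
weights = {"gemma": 0.30, "qwen": 0.70}
# Made-up per-judge probabilities for a single (query, candidate) pair.
scores = {"gemma": 0.9, "qwen": 0.4}

combined = ensemble_score(scores, weights)
```

Normalising by the sum of the weights keeps the result in [0, 1] even when only a subset of judges has scored a candidate.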

Usage

from transformers import AutoTokenizer, AutoModelForImageTextToText, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "google/gemma-4-31b-it"
adapter_id = "DS4AI-UPB/jokes-on-gemma4-31b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(
    AutoModelForImageTextToText.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",
    ),
    adapter_id,
)
model.eval()

SYSTEM = (
    "You are a humor and wordplay detection judge. You evaluate whether a text is relevant to a "
    "query AND contains humor, jokes, puns, wordplay, or any form of linguistic wit (double "
    "meanings, homophones, malapropisms, ironic twists). Answer only YES or NO."
)

def score(query: str, text: str) -> float:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f'Query: "{query}"\nText: "{text}"\nIs this a relevant joke? Answer YES or NO.'},
    ]
    tokenized = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    )
    ids = tokenized["input_ids"].to(model.device)
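    with torch.no_grad():
        # Logits for the first token the model would generate.
        logits = model(input_ids=ids).logits[:, -1, :]
    yes_id = tokenizer.convert_tokens_to_ids("YES")
    no_id  = tokenizer.convert_tokens_to_ids("NO")
    # Strings that are not single vocabulary tokens typically map to the
    # unknown-token id (or None), which would silently corrupt the score.
    if tokenizer.unk_token_id in (yes_id, no_id):
        raise ValueError("YES/NO are not single tokens in this tokenizer")
    # Renormalise over the two answer tokens only; compute in float32.
    pair = torch.stack([logits[0, yes_id], logits[0, no_id]]).float()
    return torch.softmax(pair, dim=0)[0].item()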
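
The value returned by score is a softmax restricted to the YES and NO logits, so probability mass on every other token is ignored. For two logits this is the same as a sigmoid of their difference, which the following sketch checks in plain Python. The logit values are made up for illustration and do not come from the model.

```python
import math


def two_way_softmax(yes_logit: float, no_logit: float) -> float:
    """P(YES) under a softmax over just the YES and NO logits."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for the YES and NO tokens.
yes_logit, no_logit = 3.2, 1.1
p_yes = two_way_softmax(yes_logit, no_logit)
p_yes_via_sigmoid = sigmoid(yes_logit - no_logit)
```

Because only the difference between the two logits matters, the score is unaffected by how confident the model is that the answer is one of YES or NO at all.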

Requirements

pip install -U transformers peft bitsandbytes accelerate

Requires transformers >= 5.5.0, peft >= 0.14, bitsandbytes >= 0.43. Requires a CUDA GPU with ~30GB VRAM for 4-bit quantization (e.g. A100 on Colab Pro).
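
As a rough sanity check on the VRAM figure, the sketch below estimates a lower bound for the quantized weights alone. It assumes about 31 billion parameters (read off the model name), that every parameter is stored in NF4, and one 32-bit scaling constant per block of 64 weights, which is my understanding of the bitsandbytes defaults when double quantization is off. In practice some layers stay in higher precision, and activations, the KV cache, and the CUDA context need additional memory, which is where the gap to ~30GB would come from.

```python
def nf4_weight_gb(n_params: float, block_size: int = 64, absmax_bits: int = 32) -> float:
    """Approximate size in GB (1e9 bytes) of NF4 weights plus per-block scales."""
    # 4 bits per weight, plus one scaling constant shared by each block.
    bits_per_param = 4 + absmax_bits / block_size
    return n_params * bits_per_param / 8 / 1e9


# 31e9 is an assumption based on the "31b" in the model name.
estimate_gb = nf4_weight_gb(31e9)
```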

Training Data

Query-document pairs from the JOKER 2025 and 2026 Task 1 corpora, deduplicated across editions and balanced between joke and non-joke examples. Each pair is annotated with a one-sentence rationale generated by Gemma 4 (gemma4:e4b via Ollama). Rationale generation scripts are available in the code repository.
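
The deduplication and balancing step could look roughly like the sketch below. It is not the released preprocessing script. The field names query, text, and label, the lowercase-and-strip normalisation, and the choice to balance by downsampling the larger class are all assumptions made for illustration.

```python
import random


def dedup_and_balance(pairs, seed=0):
    """Drop repeated (query, text) pairs, then downsample to equal class sizes.

    pairs: list of dicts with hypothetical keys 'query', 'text', 'label'
    (1 = joke, 0 = non-joke).
    """
    seen = set()
    unique = []
    for p in pairs:
        key = (p["query"].strip().lower(), p["text"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(p)

    pos = [p for p in unique if p["label"] == 1]
    neg = [p for p in unique if p["label"] == 0]
    n = min(len(pos), len(neg))

    rng = random.Random(seed)  # fixed seed for reproducible sampling
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


# Toy stand-ins for the two corpus editions; one pair appears in both.
pairs_2025 = [
    {"query": "cats", "text": "I'm feline fine.", "label": 1},
    {"query": "cats", "text": "Cats are mammals.", "label": 0},
]
pairs_2026 = [
    {"query": "cats", "text": "I'm feline fine.", "label": 1},
    {"query": "math", "text": "Parallel lines have so much in common.", "label": 1},
    {"query": "math", "text": "A triangle has three sides.", "label": 0},
    {"query": "math", "text": "Seven is a prime number.", "label": 0},
]

balanced = dedup_and_balance(pairs_2025 + pairs_2026)
```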

Intended Use

Intended: a Stage 3 LLM judge in a multi-stage humor retrieval pipeline, used in a weighted ensemble alongside jokes-on-qwen2.5-7b.
Out of scope: General-purpose text classification; production deployment without safety validation; languages other than English.

Limitations

English only - training data, prompts, and taxonomy are English-specific.
Binary YES/NO framing - may be poorly calibrated on borderline cases; graded relevance training is a promising future direction.
Optimized for short jokes, puns, and wordplay in the JOKER corpus.

Citation

@InProceedings{Mocanu2026IROH,
    author    = {Mocanu, Ana-Maria Luisa and Mocanu, Sebastian and Truică, Ciprian-Octavian and Apostol, Elena-Simona},
    title     = {IROH: Insightful Ranking Of Humor using Multi-Stage Hybrid Retrieval with Rationale-Distilled LLM Judges for JOKER 2026 Track Task 1 English},
    booktitle = {Working Notes of CLEF 2026},
    month     = {September},
    year      = {2026}
}

Links

Resource	Link
Paper	WIP — will be updated when proceedings are published
arXiv	WIP
Code	GitHub — DS4AI-UPB/VANGUARD-CLEF2026-JOKER
Primary judge	jokes-on-qwen2.5-7b

Evaluation results

MAP, Generic (standalone), self-reported: 0.572
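
For readers unfamiliar with the metric, MAP (mean average precision) averages, over queries, the precision measured at each rank where a relevant document appears. The sketch below uses the textbook definition with made-up document ids and relevance judgments. The official JOKER evaluation tooling may handle details such as ties or unjudged documents differently, so this will not necessarily reproduce the reported figure.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query; ranked_ids should not contain repeats."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all judged queries."""
    return sum(
        average_precision(runs.get(q, []), rel) for q, rel in qrels.items()
    ) / len(qrels)


# Made-up rankings and relevance judgments.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}

map_score = mean_average_precision(runs, qrels)
```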