IROH: Jokes on Qwen2.5-7B - Humor Retrieval Judge (Generic Rationales)

CLEF 2026 · JOKER Track · Task 1 English · Team VANGUARD

Ana-Maria Luisa Mocanu · Sebastian Mocanu · Ciprian-Octavian Truică · Elena-Simona Apostol



Model Description

A QLoRA fine-tuned Qwen2.5-7B-Instruct adapter trained as the Stage 3 LLM judge in the IROH humor retrieval pipeline. Given a query describing a humor topic and a candidate text, it returns a soft YES/NO probability indicating whether the candidate is a relevant joke, pun, or wordplay.

This is the primary judge in the winning ensemble (weight 0.60), outperforming every Gemma-4-31B configuration despite being 4× smaller. We attribute this to better score calibration: the lighter model produces smoother probability distributions that blend more effectively with the upstream cross-encoder signal.
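As a rough illustration of how a judge probability can be blended with an upstream cross-encoder score, here is a minimal sketch. Only the 0.60 judge weight comes from this card. Giving the remaining 0.40 entirely to the cross-encoder, the min-max normalization, and the function names `min_max` and `blend` are assumptions made for the example, not the published ensemble recipe.

```python
def min_max(scores):
    """Rescale raw scores to [0, 1]; a constant list maps to 0.5 everywhere."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def blend(judge_probs, cross_encoder_scores, judge_weight=0.60):
    """Weighted average of judge P(YES) and normalized cross-encoder scores.

    Assumption: the weight not given to the judge goes to the cross-encoder.
    """
    ce = min_max(cross_encoder_scores)
    return [
        judge_weight * p + (1.0 - judge_weight) * c
        for p, c in zip(judge_probs, ce)
    ]


# Two candidates for one query: judge probabilities and raw cross-encoder logits.
final_scores = blend([0.9, 0.2], [3.0, -1.0])
```

Normalizing per query keeps the unbounded cross-encoder logits on the same scale as the judge probabilities before they are averaged.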

Trained on generic rationales - one-sentence explanations of why a text is or is not a joke, generated by Gemma 4 using a lightweight "General Wordplay" query placeholder. The simplicity of this prompt produces more consistent supervision than the structured typed alternative.

This variant uses no hard-negative augmentation.


Models

| File | Base model | LoRA r | Training data | MAP |
| --- | --- | --- | --- | --- |
| adapter_model.safetensors | Qwen2.5-7B-Instruct | 64 | Generic rationales, no augmentation | 0.6055 |
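For readers unfamiliar with the MAP column above, the following sketch computes Mean Average Precision from ranked lists and relevance judgments using the standard definition. The function names and the `runs`/`qrels` dictionary layout are assumptions for illustration; the official JOKER evaluation tooling may differ in details such as cutoff depth or tie handling.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query; assumes ranked_ids has no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant hit
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: query -> ranked doc ids; qrels: query -> relevant doc ids."""
    aps = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0


runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
qrels = {"q1": ["a", "c"], "q2": ["y"]}
map_score = mean_average_precision(runs, qrels)
```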

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "DS4AI-UPB/jokes-on-qwen2.5-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",
    ),
    adapter_id,
)
model.eval()

SYSTEM = (
    "You are a humor and wordplay detection judge. You evaluate whether a text is relevant to a "
    "query AND contains humor, jokes, puns, wordplay, or any form of linguistic wit (double "
    "meanings, homophones, malapropisms, ironic twists). Answer only YES or NO."
)

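def score(query: str, text: str) -> float:
    """Return P(YES) for a query/candidate pair, renormalized over the YES and NO tokens."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f'Query: "{query}"\nText: "{text}"\nIs this a relevant joke? Answer YES or NO.'},
    ]
    # Render the chat template and add the assistant prefix so the next token is the answer.
    tokenized = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    )
    ids = tokenized["input_ids"].to(model.device)
    with torch.no_grad():
        # Keep only the logits for the token that would follow the prompt.
        logits = model(ids).logits[:, -1, :]
    # Assumes "YES" and "NO" each map to a single token in the Qwen2.5 vocabulary.
    yes_id = tokenizer.convert_tokens_to_ids("YES")
    no_id  = tokenizer.convert_tokens_to_ids("NO")
    # Two-way softmax over the YES/NO logits; index 0 is P(YES).
    return torch.softmax(torch.stack([logits[0, yes_id], logits[0, no_id]]), dim=0)[0].item()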
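
A possible way to use `score` for reranking is sketched below. The helper `rerank`, its `top_k` parameter, and the stub scorer are hypothetical and not part of the released code. The scoring function is passed in as an argument so the example runs without loading the model; in practice you would pass the `score` function defined above.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and return the top_k, best first."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stub scorer for demonstration only; replace with the real `score` function.
def stub_score(query: str, text: str) -> float:
    return 1.0 if "pun" in text.lower() else 0.1


ranked = rerank(
    "bakery",
    ["The oven was hot.", "A bread pun: I knead you."],
    stub_score,
    top_k=2,
)
```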

Requirements

pip install -U transformers peft bitsandbytes accelerate

Tested with transformers >= 4.45, peft >= 0.14, bitsandbytes >= 0.43. Requires a CUDA GPU for 4-bit quantization.


Training Data

The training set consists of query-document pairs from the JOKER 2025 and 2026 Task 1 corpora, deduplicated across editions and balanced between joke (label 1) and non-joke (label 0) examples. Each pair is annotated with a one-sentence rationale generated by Gemma 4 (gemma4:e4b via Ollama) explaining why the text is or is not a relevant joke. Hard negatives (literal rewrites, defused jokes, wrong-topic jokes) are excluded from this variant because augmentation consistently degraded official evaluation performance.
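
The deduplication and balancing steps could look roughly like the sketch below. The field names (`query`, `text`, `label`), the lowercase-and-strip matching key, and downsampling the majority class are all assumptions made for illustration; the card does not specify how either step was actually implemented.

```python
import random


def dedupe_and_balance(pairs, seed=0):
    """Drop repeated (query, text) pairs, then downsample to equal label counts.

    pairs: iterable of dicts with hypothetical keys 'query', 'text', 'label'
    (1 = joke, 0 = non-joke).
    """
    seen, unique = set(), []
    for p in pairs:
        key = (p["query"].strip().lower(), p["text"].strip().lower())
        if key not in seen:  # keep the first occurrence across editions
            seen.add(key)
            unique.append(p)
    pos = [p for p in unique if p["label"] == 1]
    neg = [p for p in unique if p["label"] == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


pairs = [
    {"query": "cats", "text": "A purr-fect day.", "label": 1},
    {"query": "cats", "text": "a purr-fect day. ", "label": 1},  # duplicate
    {"query": "cats", "text": "Cats are mammals.", "label": 0},
    {"query": "dogs", "text": "Dogs bark.", "label": 0},
]
balanced = dedupe_and_balance(pairs)
```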


Intended Use

  • Intended: Stage 3 LLM judge in a multi-stage humor retrieval pipeline, after hybrid sparse-dense retrieval and cross-encoder reranking.
  • Out of scope: General-purpose text classification; production deployment without safety validation; languages other than English.

Limitations

  • English only - training data, prompts, and taxonomy are English-specific.
  • Binary YES/NO framing - may be poorly calibrated on borderline cases; graded relevance training is a promising future direction.
  • Optimized for short jokes, puns, and wordplay in the JOKER corpus.
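
One way to check the calibration concern noted above is to compute an expected calibration error (ECE) over held-out judge probabilities. This is a minimal sketch of a standard equal-width-bin ECE for the positive class; the function name, the ten-bin default, and the toy inputs are assumptions and do not describe how the authors measured calibration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary P(YES) predictions against 0/1 labels."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_prob = sum(p for p, _ in b) / len(b)  # average predicted P(YES)
        pos_rate = sum(y for _, y in b) / len(b)   # observed fraction of YES
        ece += (len(b) / total) * abs(mean_prob - pos_rate)
    return ece


# Toy example: confident predictions that all match their labels.
ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

A large gap in the bins around 0.5 would point to the borderline-case miscalibration described in the list above.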

Citation

@InProceedings{Mocanu2026IROH,
    author    = {Mocanu, Ana-Maria Luisa and Mocanu, Sebastian and Truică, Ciprian-Octavian and Apostol, Elena-Simona},
    title     = {IROH: Insightful Ranking Of Humor using Multi-Stage Hybrid Retrieval with Rationale-Distilled LLM Judges for JOKER 2026 Track Task 1 English},
    booktitle = {Working Notes of CLEF 2026},
    month     = {September},
    year      = {2026}
}

Links

| Resource | Link |
| --- | --- |
| Paper | WIP, will be updated when proceedings are published |
| arXiv | WIP |
| Code | GitHub: DS4AI-UPB/VANGUARD-CLEF2026-JOKER |
| Gemma judge | jokes-on-gemma4-31b |