IROH: Jokes on Qwen2.5-7B - Humor Retrieval Judge (Generic Rationales)
CLEF 2026 · JOKER Track · Task 1 English · Team VANGUARD
Ana-Maria Luisa Mocanu · Sebastian Mocanu · Ciprian-Octavian Truică · Elena-Simona Apostol
Model Description
A QLoRA-fine-tuned Qwen2.5-7B-Instruct, trained as the Stage 3 LLM judge in the IROH humor retrieval pipeline. Given a query describing a humor topic and a candidate text, it returns a soft relevance score (the probability of YES, normalized over the YES and NO answer tokens) indicating whether the candidate is a relevant joke, pun, or wordplay.
This is the primary judge in the winning ensemble (weight 0.60), outperforming every Gemma-4-31B configuration despite being roughly 4× smaller (7B vs. 31B parameters). We attribute this to better score calibration: the smaller model produces smoother probability distributions that blend more effectively with the upstream cross-encoder signal.
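Why calibration matters for blending can be illustrated with a weighted linear fusion of per-candidate scores. This is a minimal sketch under stated assumptions, not the team's ensemble: it assumes the remaining 0.40 weight goes to a single cross-encoder signal and that each signal is min-max normalized per query before mixing. `minmax` and `fuse` are hypothetical helper names.

```python
from typing import Dict


def minmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale a {doc_id: score} map to [0, 1] within one query (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied: map every score to 0.0 to avoid division by zero.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(judge: Dict[str, float], cross_encoder: Dict[str, float],
         judge_weight: float = 0.60) -> Dict[str, float]:
    """Weighted linear fusion; assumes both maps cover the same candidate ids.

    The 1 - judge_weight share given to the cross-encoder is a hypothetical choice.
    """
    j, c = minmax(judge), minmax(cross_encoder)
    return {doc: judge_weight * j[doc] + (1.0 - judge_weight) * c[doc] for doc in judge}


# Toy scores: judge outputs P(YES); the cross-encoder outputs unbounded logits.
judge_scores = {"d1": 0.92, "d2": 0.40, "d3": 0.05}
ce_scores = {"d1": 3.1, "d2": 5.7, "d3": -1.2}
fused = fuse(judge_scores, ce_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Normalizing first keeps a poorly scaled signal from dominating the mix. A judge whose probabilities are spread smoothly across candidates contributes more useful ordering information after rescaling than one that saturates near 0 or 1.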
Trained on generic rationales: one-sentence explanations of why a text is or is not a joke, generated by Gemma 4 using a lightweight "General Wordplay" query placeholder. The simplicity of this prompt yields more consistent supervision than the structured, typed alternative.
Models
| File | Base model | LoRA rank (r) | Training data | MAP (standalone) |
|---|---|---|---|---|
| `adapter_model.safetensors` | Qwen2.5-7B-Instruct | 64 | Generic rationales, no augmentation | 0.6055 |
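MAP is the mean, over queries, of average precision (AP) on the ranked candidate list. The following is a minimal sketch of the standard binary-relevance definition, included only to make the reported number concrete. The official JOKER scorer may differ in details such as rank cutoffs. `average_precision` and `mean_average_precision` are hypothetical helper names.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a relevant doc."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    # Divide by the total number of relevant docs, so unretrieved ones count as zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """MAP: average of per-query AP over every query in the relevance judgments."""
    return sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)


runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
example_map = mean_average_precision(runs, qrels)
```

For the toy data, q1 has AP = (1 + 2/3) / 2 = 5/6 and q2 has AP = 1/2, so MAP = 2/3.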
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "DS4AI-UPB/jokes-on-qwen2.5-7b"

# 4-bit NF4 quantization (QLoRA-style) so the 7B base model fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load the quantized base model, then attach the LoRA adapter on top of it.
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",
    ),
    adapter_id,
)
model.eval()

SYSTEM = (
    "You are a humor and wordplay detection judge. You evaluate whether a text is relevant to a "
    "query AND contains humor, jokes, puns, wordplay, or any form of linguistic wit (double "
    "meanings, homophones, malapropisms, ironic twists). Answer only YES or NO."
)

# Token ids of the two answer words (assumes each is a single token in the Qwen2.5 vocabulary).
YES_ID = tokenizer.convert_tokens_to_ids("YES")
NO_ID = tokenizer.convert_tokens_to_ids("NO")


def score(query: str, text: str) -> float:
    """Return P(YES) for a (query, text) pair, normalized over the YES and NO tokens."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f'Query: "{query}"\nText: "{text}"\nIs this a relevant joke? Answer YES or NO.'},
    ]
    tokenized = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    )
    ids = tokenized["input_ids"].to(model.device)
    with torch.no_grad():
        # Logits for the first token the model would generate after the prompt.
        logits = model(ids).logits[:, -1, :]
    # Two-way softmax over the YES/NO logits; index 0 is P(YES).
    return torch.softmax(torch.stack([logits[0, YES_ID], logits[0, NO_ID]]), dim=0)[0].item()
```
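The score returned by `score` is a softmax restricted to two logits, which equals a logistic sigmoid of their difference: P(YES) = exp(z_yes) / (exp(z_yes) + exp(z_no)) = sigmoid(z_yes - z_no). Only the gap between the YES and NO logits matters, and the result always lies strictly between 0 and 1. The dependency-free sketch below checks this identity on made-up logit values. `two_way_softmax` and `sigmoid` are illustrative helpers, not part of the released code.

```python
import math


def two_way_softmax(z_yes: float, z_no: float) -> float:
    """P(YES) from a softmax over just the YES and NO logits (numerically stable form)."""
    m = max(z_yes, z_no)  # subtract the max before exponentiating to avoid overflow
    e_yes, e_no = math.exp(z_yes - m), math.exp(z_no - m)
    return e_yes / (e_yes + e_no)


def sigmoid(x: float) -> float:
    """Logistic function, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Made-up logit pairs, including a large-magnitude one that would overflow a naive exp().
for z_yes, z_no in [(2.5, -1.0), (0.3, 0.3), (-4.0, 1.5), (120.0, 100.0)]:
    assert math.isclose(two_way_softmax(z_yes, z_no), sigmoid(z_yes - z_no))
```

In PyTorch terms, the last line of `score` is therefore equivalent to `torch.sigmoid(logits[0, YES_ID] - logits[0, NO_ID]).item()`.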
Requirements
```bash
pip install -U transformers peft bitsandbytes accelerate
```
Tested with transformers >= 4.45, peft >= 0.14, and bitsandbytes >= 0.43. Requires a CUDA GPU for 4-bit quantization.
Training Data
Query-document pairs from the JOKER 2025 and 2026 Task 1 corpora, deduplicated across editions and balanced between joke (label 1) and non-joke (label 0) examples. Each pair is annotated with a one-sentence rationale generated by Gemma 4 (gemma4:e4b via Ollama) explaining why the text is or is not a relevant joke. Hard negatives (literal rewrites, defused jokes, wrong-topic jokes) are excluded from this variant because augmenting with them consistently degraded performance on the official evaluation.
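The preparation steps above (merging both editions, dropping duplicate pairs, balancing the two labels) can be sketched as follows. This is a hypothetical reconstruction, not the team's documented procedure. The record fields (`query`, `text`, `label`), the case- and whitespace-normalized dedup key, and balancing by random downsampling of the majority class are all assumptions.

```python
import random
from typing import Dict, List, Tuple

Pair = Dict[str, object]  # hypothetical record: {"query": str, "text": str, "label": int}


def _norm(s: object) -> str:
    """Lowercase and collapse whitespace so near-identical copies collide."""
    return " ".join(str(s).lower().split())


def dedup_key(pair: Pair) -> Tuple[str, str]:
    return _norm(pair["query"]), _norm(pair["text"])


def merge_and_balance(editions: List[List[Pair]], seed: int = 0) -> List[Pair]:
    seen, merged = set(), []
    for pairs in editions:  # the first occurrence wins when a pair reappears in a later edition
        for p in pairs:
            key = dedup_key(p)
            if key not in seen:
                seen.add(key)
                merged.append(p)
    pos = [p for p in merged if p["label"] == 1]
    neg = [p for p in merged if p["label"] == 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    balanced = rng.sample(pos, n) + rng.sample(neg, n)  # downsample the majority class
    rng.shuffle(balanced)
    return balanced


joker_2025 = [
    {"query": "cats", "text": "I'm feline fine.", "label": 1},
    {"query": "cats", "text": "Cats sleep a lot.", "label": 0},
    {"query": "cats", "text": "Cats are mammals.", "label": 0},
]
joker_2026 = [
    {"query": "Cats", "text": "I'm  feline fine.", "label": 1},  # duplicate after normalization
    {"query": "bread", "text": "The baker quit: he couldn't make enough dough.", "label": 1},
]
train = merge_and_balance([joker_2025, joker_2026])
```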
Intended Use
- Intended: Stage 3 LLM judge in a multi-stage humor retrieval pipeline, after hybrid sparse-dense retrieval and cross-encoder reranking.
- Out of scope: General-purpose text classification; production deployment without safety validation; languages other than English.
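The judge's place in the pipeline can be sketched as a rerank of the cross-encoder's top candidates. This illustration rests on several assumptions. `rerank_with_judge`, the `top_k` cutoff, and the choice to append candidates beyond the cutoff in their upstream order are all hypothetical. In practice, `judge` would be the `score(query, text)` function from the Usage section, possibly fused with the cross-encoder score. A stub judge is used here so the example runs without a GPU.

```python
from typing import Callable, List, Tuple


def rerank_with_judge(
    query: str,
    candidates: List[Tuple[str, str]],   # (doc_id, text), already ordered by the cross-encoder
    judge: Callable[[str, str], float],  # e.g. score(query, text) -> P(YES)
    top_k: int = 50,
) -> List[Tuple[str, float]]:
    """Re-score the top_k candidates with the judge and sort them by its score.

    Candidates beyond the cutoff keep their upstream order and are appended with a
    score of 0.0 (a hypothetical convention).
    """
    head, tail = candidates[:top_k], candidates[top_k:]
    scored = [(doc_id, judge(query, text)) for doc_id, text in head]
    scored.sort(key=lambda item: item[1], reverse=True)  # stable: ties keep upstream order
    return scored + [(doc_id, 0.0) for doc_id, _ in tail]


def stub_judge(query: str, text: str) -> float:
    # Placeholder for the real model: pretend any text mentioning "pun" is a joke.
    return 0.9 if "pun" in text.lower() else 0.1


cands = [("d1", "A plain fact about bread."), ("d2", "A pun about bread rolls."), ("d3", "Another fact.")]
ranked = rerank_with_judge("bread", cands, stub_judge, top_k=2)
```

Limiting the judge to a `top_k` head keeps the expensive 7B forward passes proportional to the cutoff rather than to the full retrieved set.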
Limitations
- English only: training data, prompts, and taxonomy are English-specific.
- Binary YES/NO framing: may be poorly calibrated on borderline cases; graded relevance training is a promising future direction.
- Optimized for short jokes, puns, and wordplay in the JOKER corpus.
Citation
```bibtex
@InProceedings{Mocanu2026IROH,
  author    = {Mocanu, Ana-Maria Luisa and Mocanu, Sebastian and Truică, Ciprian-Octavian and Apostol, Elena-Simona},
  title     = {IROH: Insightful Ranking Of Humor using Multi-Stage Hybrid Retrieval with Rationale-Distilled LLM Judges for JOKER 2026 Track Task 1 English},
  booktitle = {Working Notes of CLEF 2026},
  month     = {September},
  year      = {2026}
}
```
Links
| Resource | Link |
|---|---|
| Paper | WIP - will be updated when proceedings are published |
| arXiv | WIP |
| Code | GitHub — DS4AI-UPB/VANGUARD-CLEF2026-JOKER |
| Gemma judge | jokes-on-gemma4-31b |
Evaluation results
- MAP (standalone, self-reported): 0.606