DiffRetriever — LLaDA-8B (multi-representation)

Multi-representation (K_q=4, K_p=4) ColBERT-style retriever fine-tuned from GSAI-ML/LLaDA-8B-Instruct, released with the paper "DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models" (arXiv:2605.07210; code: https://github.com/ielab/diffretriever).

DiffRetriever uses a diffusion language model's masked-position prediction interface directly for retrieval. It appends K masked positions after a retrieval prompt (K_q=4 for queries, K_p=4 for passages) and reads the hidden states (dense) and next-token logit vectors (sparse) from a single bidirectional forward pass (Fwd=1). With K>1 this gives ColBERT-style multi-representation retrieval at near single-pass encoding cost, whereas an autoregressive model would have to decode each representation sequentially.
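
The layout described above can be illustrated with a toy sketch in plain Python. Everything in it is hypothetical: `MASK_ID`, `build_input`, and `read_representations` are illustrative names, the token ids are made up, and the real prompt template and mask token come from the repo's own `tokenize()`. It only shows that the K representations are the states at the K appended masked positions, all available after one forward pass.

```python
# Toy illustration only: names and ids below are hypothetical, not the repo's API.
MASK_ID = -1  # stand-in for the backbone's mask token id


def build_input(prompt_ids, k):
    """Append k masked positions after the prompt; return ids and mask positions."""
    ids = list(prompt_ids) + [MASK_ID] * k
    positions = list(range(len(prompt_ids), len(ids)))
    return ids, positions


def read_representations(hidden_states, positions):
    """Pick the per-token states at the masked positions, giving k vectors."""
    return [hidden_states[i] for i in positions]


prompt_ids = [101, 7, 42, 9]  # made-up token ids for a retrieval prompt
ids, positions = build_input(prompt_ids, k=4)

# Stand-in for one bidirectional forward pass: one 2-d "hidden state" per token.
hidden_states = [[float(i), float(i) * 0.5] for i in range(len(ids))]

reps = read_representations(hidden_states, positions)
print(positions)  # [4, 5, 6, 7]
print(len(reps))  # 4
```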

This repo ships the LoRA adapter only (~tens of MB). The base backbone is downloaded automatically from GSAI-ML/LLaDA-8B-Instruct the first time you load the model.

Model summary


| | |
|---|---|
| Backbone | `GSAI-ML/LLaDA-8B-Instruct` (LLaDA 8B, diffusion LM) |
| Adapter | LoRA (r=16, α=64), merged at load time |
| Representations | K_q=4 query, K_p=4 passage |
| Denoising steps | 1 (single forward pass) |
| Embedding dim | 4096 |
| Max input length | 156 tokens |
| Recommended scoring | ColBERT-style MaxSim (`multi_dense`) |
| Also supports | sparse (`sparse_max`) and hybrid fusion |

Results

Results are for the fine-tuned checkpoint. **Dense** is the recommended headline score; sparse and hybrid scores come from the same single forward pass, since this checkpoint was trained with sparse supervision (sparse weight 1.0).

In-domain (MS MARCO dev, TREC DL19/DL20)

| Benchmark | Metric | Dense | Sparse | Hybrid |
|---|---|---|---|---|
| MS MARCO dev | MRR@10 | .427 | .348 | .408 |
| TREC DL19 | NDCG@10 | .718 | .636 | .718 |
| TREC DL20 | NDCG@10 | .721 | .614 | .698 |

Out-of-domain — BEIR-7 (NDCG@10, dense)

| NQ | HQA | SciFact | COVID | FiQA | ArguAna | Quora | Avg |
|---|---|---|---|---|---|---|---|
| .622 | .647 | .744 | .846 | .443 | .412 | .798 | .645 |

See the paper for the full comparison against PromptReps, DiffEmbed, RepLLaMA, and BM25, and for latency analysis.

Usage

This repo is self-contained: the model code ships with it, so one call loads everything (the base LLaDA backbone is pulled from the Hub automatically and the LoRA adapter is attached on top).

pip install "transformers==4.54.0" peft torch    # + accelerate, safetensors

import torch
import torch.nn.functional as F
from transformers import AutoModel

# trust_remote_code runs the modeling code shipped in this repo.
model = AutoModel.from_pretrained("ielabgroup/diffretriever-llada-8b-multi-q4-p4", trust_remote_code=True)
model.eval()

# A tiny query / passage set.
queries = ["what causes the seasons on earth?"]
passages = [
    "The tilt of Earth's axis relative to its orbital plane drives the seasons.",
    "Photosynthesis converts carbon dioxide and water into glucose using sunlight.",
]

# Encode — one forward pass per batch (tokenize() builds the prompt + masks).
def encode(texts, is_query):
    ids, mask = model.tokenize(texts, is_query=is_query)
    dev = next(model.backbone.parameters()).device
    with torch.inference_mode():
        return model.encode(ids.to(dev), mask.to(dev),
                            is_query=is_query, compute_sparse=False)

q = encode(queries,  is_query=True)
p = encode(passages, is_query=False)

# ── Scoring: ColBERT MaxSim over the K-vector outputs (multi_dense) ─────────
qv = F.normalize(q["repr_hidden"].float(), dim=-1)   # [Q, K_q=4, H]
pv = F.normalize(p["repr_hidden"].float(), dim=-1)   # [P, K_p=4, H]
sim = torch.einsum("qkh,pdh->qkpd", qv, pv)          # [Q, K_q, P, K_p]
scores = sim.max(dim=-1).values.clamp(min=0).sum(dim=1)   # [Q, P]

print(scores)   # [Q, P] — higher = more relevant
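
Written out, the score computed by the snippet above is, for unit-normalized query vectors $\mathbf{q}_i$ and passage vectors $\mathbf{p}_j$:

$$
s(q, p) = \sum_{i=1}^{K_q} \max\Bigl(0,\; \max_{1 \le j \le K_p} \mathbf{q}_i^{\top} \mathbf{p}_j\Bigr)
$$

The inner maximum corresponds to `sim.max(dim=-1)`, the clamp at zero to `.clamp(min=0)`, and the outer sum over the K_q query vectors to `.sum(dim=1)`.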

To rank a corpus, encode all passages once (offline), then encode each query and take scores.topk(k). For sharded encoding, the sparse/hybrid modes, and full BEIR/MS MARCO evaluation, see scripts/encode.py and scripts/evaluate_sweep.py in https://github.com/ielab/diffretriever.
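
The offline/online split can be sketched with a dependency-free toy example. It uses plain Python lists instead of tensors so it runs without the model: the 2-d vectors are made up, `normalize` and `maxsim` are illustrative helpers rather than functions from the repo, and `heapq.nlargest` stands in for `scores.topk(k)`.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def maxsim(query_vecs, passage_vecs):
    """Sum over query vectors of the best (zero-clamped) cosine match."""
    total = 0.0
    for q in query_vecs:
        best = max(sum(a * b for a, b in zip(q, p)) for p in passage_vecs)
        total += max(best, 0.0)
    return total


# Offline step: K_p vectors per passage, normalized once (made-up values).
corpus = {
    "p0": [normalize([1.0, 0.0]), normalize([1.0, 1.0])],
    "p1": [normalize([0.0, 1.0]), normalize([-1.0, 1.0])],
    "p2": [normalize([-1.0, 0.0]), normalize([-1.0, -1.0])],
}

# Online step: K_q vectors for one incoming query.
query = [normalize([1.0, 0.0]), normalize([2.0, 2.0])]

scores = {pid: maxsim(query, vecs) for pid, vecs in corpus.items()}
top2 = heapq.nlargest(2, scores, key=scores.get)
print(top2)  # ['p0', 'p1']
```

For a real corpus the passage representations would be stored as tensors (or in an index) and scored in batches; the loop here only makes the arithmetic explicit.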

Scoring modes

The encoder returns repr_hidden (dense, [B, K, H]) and — with compute_sparse=True — sparse_indices/sparse_values (sparse lexical weights). These support the paper's five modes: single_dense, multi_dense, sparse_max, fusion_single_sparse_max, fusion_multi_sparse_max. This checkpoint is tuned for ColBERT-style MaxSim (multi_dense); scripts/evaluate_sweep.py runs all five in one pass.
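
As a rough illustration of how sparse lexical weights are typically scored, the sketch below takes a dot product over shared vocabulary ids. The exact layout of `sparse_indices`/`sparse_values` is defined by the repo's modeling code; this example assumes, purely for illustration, that each text yields parallel lists of vocabulary ids and weights. The helper names and all numbers are made up, and this is not the repo's implementation.

```python
def to_weights(indices, values):
    """Pair up parallel lists of vocabulary ids and weights (assumed layout)."""
    return dict(zip(indices, values))


def sparse_dot(a, b):
    """Dot product over the vocabulary ids that two sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[i] for i, w in a.items() if i in b)


# Made-up vocabulary ids and weights for one query and two passages.
q = to_weights([11, 42, 7], [0.9, 0.5, 0.2])
p_relevant = to_weights([42, 11, 300], [1.0, 0.8, 0.4])
p_other = to_weights([500, 7], [1.2, 0.1])

print(sparse_dot(q, p_relevant))  # 0.9*0.8 + 0.5*1.0, about 1.22
print(sparse_dot(q, p_other))     # 0.2*0.1, about 0.02
```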

Training details


| | |
|---|---|
| Objective | InfoNCE (dense, and sparse when sparse_weight>0), temperature τ=0.01 |
| Negatives | 1 positive + 15 hard negatives per query, plus in-batch negatives |
| Data | Tevatron/msmarco-passage-aug (MS MARCO passage, augmented triples) |
| Adapter | LoRA r=16, α=64 (query/key/value/output + MLP projections) |
| Sparse weight | 1.0 |
| Representations | K_q=4, K_p=4, 1 denoising step |
| Max length | 156 tokens; embeddings L2-normalized |
| Schedule | 3 epochs, AdamW, cosine schedule |
| Infrastructure | DeepSpeed ZeRO-2, single H100 node |
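
To give a feel for the objective and the small temperature, here is a minimal sketch of InfoNCE for a single query, assuming the standard form (negative log-softmax of the positive over all candidates, with scores divided by τ). The function name and the scores are made up, and the actual training code may differ, for example in how the dense and sparse terms are combined.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.01):
    """Negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


# 1 positive + 15 hard negatives per query (scores are made up).
hard_negs = [0.60] * 15
loss_tied = info_nce(0.60, hard_negs)       # positive tied with negatives
loss_separated = info_nce(0.90, hard_negs)  # positive clearly ahead
print(loss_tied)       # log(16), about 2.77
print(loss_separated)  # very close to 0
```

In-batch negatives would simply extend `neg_scores` with the scores of the other queries' passages. With τ=0.01, even a modest score gap between the positive and the negatives drives the loss close to zero.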

For diffusion backbones the query/passage budgets (K_q, K_p) are selected on MS MARCO train; the paper uses (4, 16) for Dream and (4, 4) for LLaDA.

Related checkpoints

Citation

@article{wang2026diffretriever,
  title={DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models},
  author={Wang, Shuai and Yin, Yu and Zhuang, Shengyao and Koopman, Bevan and Zuccon, Guido},
  journal={arXiv preprint arXiv:2605.07210},
  year={2026}
}

License

MIT. The base model is subject to its own license — see GSAI-ML/LLaDA-8B-Instruct.
