DiffRetriever - LLaDA-8B (single-representation)

Single-representation (K=1) dense + sparse retriever fine-tuned on GSAI-ML/LLaDA-8B-Instruct, released with "DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models" (arXiv:2605.07210 · code).

DiffRetriever uses a diffusion language model's masked-position prediction interface directly for retrieval: it appends a single masked position (K=1) after a retrieval prompt and reads the hidden states (dense) and next-token logit vectors (sparse) from a single bidirectional forward pass (Fwd=1). With K=1 this is a fast single-vector dense + sparse retriever. The autoregressive equivalent must decode each representation sequentially.

This repo ships the LoRA adapter only (~tens of MB). The base backbone is downloaded automatically from GSAI-ML/LLaDA-8B-Instruct the first time you load the model.

Model summary

| Field | Value |
|---|---|
| Backbone | GSAI-ML/LLaDA-8B-Instruct (LLaDA 8B, diffusion LM) |
| Adapter | LoRA (r=16, α=64), merged at load time |
| Representations | K=1 (single) |
| Denoising steps | 1 (single forward pass) |
| Embedding dim | 4096 |
| Max input length | 156 tokens |
| Recommended scoring | single-vector dense (single_dense) |
| Also supports | sparse (sparse_max) and hybrid fusion |
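
As a rough illustration of what "merged at load time" means for the adapter settings above, the sketch below applies the standard LoRA update W + (α/r)·B·A to a toy weight matrix. With r=16 and α=64 the scaling factor is 4; the toy example uses rank 1 with α=4 to get the same factor. The function name `lora_merge` and the nested-list matrices are illustrative assumptions, not code from this repo, and the repo's loader may merge differently.

```python
def lora_merge(W, A, B, r, alpha):
    """Return W + (alpha / r) * (B @ A) for nested-list matrices.

    W is [out, in], B is [out, r], A is [r, in]. Hypothetical helper for
    illustration only; real adapters are merged by the PEFT library.
    """
    scale = alpha / r
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged


# Toy example: rank 1 with alpha=4 gives the same scale (4) as r=16, alpha=64.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0]]
merged = lora_merge(W, A, B, r=1, alpha=4)
```

After merging, inference runs on a single dense weight matrix, so the adapter adds no extra cost per forward pass.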

Results

Fine-tuned results. **Dense** is the recommended/headline score for this
checkpoint; sparse and hybrid are also available from the same single forward
pass when the checkpoint was trained with sparse supervision.

In-domain (MS MARCO dev, TREC DL19/DL20)

| Benchmark | Metric | Dense | Sparse | Hybrid |
|---|---|---|---|---|
| MS MARCO dev | MRR@10 | .424 | .347 | .405 |
| TREC DL19 | NDCG@10 | .715 | .621 | .704 |
| TREC DL20 | NDCG@10 | .715 | .624 | .701 |

Out-of-domain: BEIR-7 (NDCG@10, dense)

| NQ | HQA | SciFact | COVID | FiQA | ArguAna | Quora | Avg |
|---|---|---|---|---|---|---|---|
| .620 | .640 | .733 | .840 | .453 | .414 | .799 | .643 |

See the paper for the full comparison against PromptReps, DiffEmbed, RepLLaMA, and BM25, and for latency analysis.
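
For readers unfamiliar with the two metrics reported in this section, here is a minimal per-query sketch of MRR@10 and NDCG@10 using their standard definitions. It assumes linear gain for NDCG (the convention used by trec_eval); whether the paper's evaluation tooling uses exactly this variant is an assumption. The function names are illustrative, and the reported numbers are means of these per-query values over all queries.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, graded_rels, k=10):
    """NDCG@k with linear gain; graded_rels maps doc id -> relevance grade."""
    def dcg(gains):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [graded_rels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(graded_rels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For example, a ranking whose first relevant passage sits at position 2 contributes 0.5 to MRR@10, and a ranking that orders all judged passages by descending grade scores an NDCG@10 of 1.0.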

Usage

This repo is self-contained: the model code ships with it, so one call loads everything (the base LLaDA backbone is pulled from the Hub automatically and the LoRA adapter is attached on top).

Install the dependencies:

pip install "transformers==4.54.0" peft torch    # + accelerate, safetensors

Then load the model and score a query against a few passages:

import torch
import torch.nn.functional as F
from transformers import AutoModel

# trust_remote_code runs the modeling code shipped in this repo.
model = AutoModel.from_pretrained("ielabgroup/diffretriever-llada-8b-single", trust_remote_code=True)
model.eval()

# A tiny query / passage set.
queries = ["what causes the seasons on earth?"]
passages = [
    "The tilt of Earth's axis relative to its orbital plane drives the seasons.",
    "Photosynthesis converts carbon dioxide and water into glucose using sunlight.",
]

# Encode: one forward pass per batch (tokenize() builds the prompt + masks).
def encode(texts, is_query):
    ids, mask = model.tokenize(texts, is_query=is_query)
    dev = next(model.backbone.parameters()).device
    with torch.inference_mode():
        return model.encode(ids.to(dev), mask.to(dev),
                            is_query=is_query, compute_sparse=False)

q = encode(queries,  is_query=True)
p = encode(passages, is_query=False)

# --- Scoring: single-vector dense (single_dense) ---
# K=1: L2-normalize the single representation, then dot product.
qv = F.normalize(q["repr_hidden"].float(), dim=-1).mean(dim=1)   # [Q, H]
pv = F.normalize(p["repr_hidden"].float(), dim=-1).mean(dim=1)   # [P, H]
scores = qv @ pv.T                                              # [Q, P]

print(scores)   # [Q, P]; higher = more relevant

To rank a corpus, encode all passages once (offline), then encode each query and take scores.topk(k). For sharded encoding, the sparse/hybrid modes, and full BEIR/MS MARCO evaluation, see scripts/encode.py and scripts/evaluate_sweep.py in https://github.com/ielab/diffretriever.
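
The ranking step described above can be sketched without any dependencies. The snippet assumes passage vectors were already encoded and L2-normalized offline; `top_k` is a hypothetical helper, and with real tensors the same result comes from `scores.topk(k)` as noted above.

```python
import heapq


def top_k(query_vec, passage_vecs, k):
    """Return the k best (passage_index, score) pairs by dot product."""
    scores = [
        sum(q * p for q, p in zip(query_vec, passage_vec))
        for passage_vec in passage_vecs
    ]
    # nlargest avoids sorting the whole corpus when k is small.
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Toy stand-ins for precomputed, L2-normalized embeddings.
passage_vecs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
query_vec = [1.0, 0.0]
ranking = top_k(query_vec, passage_vecs, k=2)
```

Because passages are encoded once and reused, only the query encoding and this scoring step run at search time.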

Scoring modes

The encoder returns repr_hidden (dense, [B, K, H]) and, with compute_sparse=True, sparse_indices/sparse_values (sparse lexical weights). These support the paper's five modes: single_dense, multi_dense, sparse_max, fusion_single_sparse_max, fusion_multi_sparse_max. This checkpoint is tuned for single-vector dense (single_dense); scripts/evaluate_sweep.py runs all five in one pass.
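
To make the sparse and hybrid modes more concrete, the sketch below scores one query against passages from (index, value) pairs and then fuses dense and sparse scores. Two things here are assumptions rather than the repo's confirmed behavior: that each text's sparse output can be treated as a flat list of vocabulary indices with weights, and that fusion is a min-max normalized weighted sum. The helper names are hypothetical; scripts/evaluate_sweep.py is the reference for the actual modes.

```python
def sparse_dot(q_idx, q_val, p_idx, p_val):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    passage_weights = dict(zip(p_idx, p_val))
    return sum(v * passage_weights.get(i, 0.0) for i, v in zip(q_idx, q_val))


def min_max(scores):
    """Rescale scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense_scores, sparse_scores, weight=0.5):
    """Assumed fusion rule: weighted sum of min-max normalized score lists."""
    dense_norm = min_max(dense_scores)
    sparse_norm = min_max(sparse_scores)
    return [weight * d + (1 - weight) * s
            for d, s in zip(dense_norm, sparse_norm)]
```

Normalizing before fusing matters because dense cosine scores and sparse lexical scores live on very different scales.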

Training details

| Setting | Value |
|---|---|
| Objective | InfoNCE (dense, and sparse when sparse_weight>0), temperature τ=0.01 |
| Negatives | 1 positive + 15 hard negatives per query, plus in-batch negatives |
| Data | Tevatron/msmarco-passage-aug (MS MARCO passage, augmented triples) |
| Adapter | LoRA r=16, α=64 (query/key/value/output + MLP projections) |
| Sparse weight | 1.0 |
| Representations | K=1, 1 denoising step |
| Max length | 156 tokens; embeddings L2-normalized |
| Schedule | 3 epochs, AdamW, cosine schedule |
| Infrastructure | DeepSpeed ZeRO-2, single H100 node |
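
The InfoNCE objective listed above can be written out for a single query as follows. This is a minimal sketch of the standard loss with the stated temperature of 0.01, taking similarity scores as plain floats; it ignores in-batch negatives and the sparse term, and `info_nce` is an illustrative name rather than a function from the training code.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.01):
    """-log softmax probability of the positive among positive + negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with max subtraction; needed because 1/0.01 inflates logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# One positive and 15 hard negatives, matching the setup in the table.
uninformative = info_nce(0.5, [0.5] * 15)   # all candidates tie
confident = info_nce(0.9, [0.1] * 15)       # positive clearly ahead
```

When all 16 candidates tie, the loss equals log(16); a small temperature such as 0.01 sharpens the softmax so that even modest score gaps drive the loss toward zero.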

For diffusion backbones the query/passage budgets (K_q, K_p) are selected on MS MARCO train; the paper uses (4, 16) for Dream and (4, 4) for LLaDA.

Related checkpoints

Citation

@article{wang2026diffretriever,
  title={ DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models },
  author={Wang, Shuai and Yin, Yu and Zhuang, Shengyao and Koopman, Bevan and Zuccon, Guido},
  journal={arXiv preprint arXiv:2605.07210},
  year={2026}
}

License

MIT. The base model is subject to its own license; see GSAI-ML/LLaDA-8B-Instruct.
