Mixed-Distill enru-ru (English–Russian)

A cross-lingual ColBERT late-interaction retriever (XLM-RoBERTa-large backbone) for English ⇄ Russian web search. The model is distilled from a reranker teacher (mT5-XXL / monoT5) and trained on code-switched (mixed-language) queries, following the Mixed-Distill recipe from the MiLQ paper.

The repo name encodes the training direction: enru-ru = code-switched en+ru queries → ru documents. It is designed to be robust when bilingual users issue mixed English+Russian queries against Russian-language documents.

What "Mixed-Distill" means

  • Mixed: training queries are code-switched. English tokens are randomly mixed into Russian queries (MUSE-based, mixing ratio ~0.5), so the model handles Russian, English, and mixed-language queries.
  • Distill: trained with knowledge distillation, minimizing the KL divergence between the model's scores and teacher relevance scores (mT5-XXL / monoT5 over mMARCO), with 6 passages scored per query (n-way 6).
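
As a rough illustration of the "Mixed" step, the sketch below swaps Russian query tokens for English ones with probability 0.5 using a bilingual lexicon. The toy RU_EN_LEXICON dictionary, the code_switch helper, and whitespace tokenization are illustrative assumptions, not the actual MiLQ pipeline or its MUSE resources.

```python
import random

# Toy Russian -> English lexicon; a hypothetical stand-in for a MUSE bilingual dictionary.
RU_EN_LEXICON = {
    "погода": "weather",
    "завтра": "tomorrow",
    "рецепт": "recipe",
    "борща": "borscht",
}

def code_switch(query, lexicon, mix_ratio=0.5, rng=None):
    """Replace each token found in the lexicon with its translation, with probability mix_ratio."""
    rng = rng or random.Random()
    out = []
    for tok in query.split():
        translation = lexicon.get(tok.lower())
        if translation is not None and rng.random() < mix_ratio:
            out.append(translation)   # switch this token to English
        else:
            out.append(tok)           # keep the original Russian token
    return " ".join(out)

# Seeded RNG so the example is reproducible.
mixed = code_switch("погода в Москве завтра", RU_EN_LEXICON, mix_ratio=0.5, rng=random.Random(0))
print(mixed)
```

With mix_ratio=0.0 the query is returned unchanged, and with mix_ratio=1.0 every token covered by the lexicon is replaced, which gives a quick way to sanity-check the helper.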

Intended use

  • Queries: English, Russian, or code-switched English+Russian.
  • Documents: Russian-language passages.
  • Scoring: ColBERT late interaction (MaxSim over per-token embeddings).
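
The late-interaction scoring mentioned above can be sketched in a few lines: for each query token, take its maximum cosine similarity over all document tokens, then sum over query tokens. This is a minimal pure-Python sketch on toy 2-dimensional vectors; the maxsim_score and l2_normalize helpers are hypothetical names, and the real model operates on batched 128-dimensional tensors.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_embs, doc_embs):
    """Sum over query tokens of the maximum cosine similarity to any document token."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score

# Toy per-token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for both query tokens
doc_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token contributes its own best match, a document that covers both the English and the Russian parts of a mixed query scores higher than one that covers only part of it.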

Specs

  • Base model: xlm-roberta-large
  • Architecture: ColBERT (late interaction)
  • Projection dim: 128
  • Similarity: cosine
  • Query max length: 32
  • Doc max length: 180
  • Training: KD (KLD), n-way 6, teacher: mT5-XXL/monoT5 on mMARCO
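
The training setup listed above (KL-divergence distillation over 6 passages per query) can be sketched as follows. This is a minimal pure-Python illustration for a single query, assuming the loss is KL(teacher || student) over softmax-normalized scores with no temperature; the kd_kl_loss helper and all score values are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the passages scored for one query."""
    p = softmax(teacher_scores)  # target distribution from the teacher reranker
    q = softmax(student_scores)  # distribution from the student's late-interaction scores
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

# One query with 6 candidate passages (n-way 6); values are illustrative only.
teacher = [9.1, 4.0, 3.2, 1.0, 0.5, -2.0]
student = [20.3, 18.9, 17.5, 15.0, 16.2, 12.1]

loss = kd_kl_loss(student, teacher)
print(loss)
```

The loss is zero only when the student's softmax distribution over the 6 passages matches the teacher's, so the student is pushed to reproduce the teacher's relative ranking rather than its absolute score scale.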

Usage

Load with the ColBERT library:

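from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint; replace <your-username> with the account that hosts the model.
ckpt = Checkpoint("<your-username>/ColBERT-XLMR-Mixed-Distill-enru-ru",
                  colbert_config=ColBERTConfig())

# Encode queries and documents into per-token embeddings for late-interaction scoring.
Q = ckpt.queryFromText(["mixed English+Russian query ..."])
D = ckpt.docFromText(["Russian document passage ..."])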

Citation

@misc{kim2025milqbenchmarkingirmodels,
      title={MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries},
      author={Jonghwi Kim and Deokhyung Kang and Seonjeong Hwang and Yunsu Kim and Jungseul Ok and Gary Lee},
      year={2025},
      eprint={2505.16631},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.16631},
}