Paper: MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries (arXiv:2505.16631)
A cross-lingual ColBERT late-interaction retriever (XLM-RoBERTa-large backbone) for English ⇄ Russian web search. The model is distilled from a strong reranker teacher (mT5-XXL/monoT5 on mMARCO) and trained on code-switched (mixed-language) queries, following the Mixed-Distill recipe from the MiLQ paper.
The repo name encodes the training direction: enru-ru = code-switched en+ru queries → ru documents.
The model is designed to stay robust when bilingual users issue mixed English+Russian queries against Russian-language documents.
| Property | Value |
|---|---|
| Base model | xlm-roberta-large |
| Architecture | ColBERT (late interaction) |
| Projection dim | 128 |
| Similarity | cosine |
| Query max length | 32 |
| Doc max length | 180 |
| Training | KD (KLD), n-way 6, teacher: mT5-XXL/monoT5 on mMARCO |
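The Training row can be read roughly as follows: for each query the student scores a small set of candidate passages, and its score distribution is pulled toward the teacher's with a KL-divergence loss. Below is a minimal pure-Python sketch of such an objective. It assumes that "n-way 6" means six candidate passages per query, that scores are turned into distributions with a softmax, and that the loss direction is KL(teacher || student). The function names and example scores are hypothetical, and any temperature or scaling used in the paper is not reproduced here.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kd_kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over one query's candidate passages."""
    log_p_teacher = log_softmax(teacher_scores)
    log_p_student = log_softmax(student_scores)
    return sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )


# Hypothetical scores for one query and its 6 candidates (assumed n-way 6).
teacher = [9.1, 4.3, 3.8, 1.2, 0.5, -2.0]
student = [7.4, 5.0, 2.9, 1.5, 0.1, -1.1]
loss = kd_kl_loss(student, teacher)
```

The loss is zero when the student ranks and weights the candidates exactly as the teacher does, and grows as the two distributions diverge.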
Load with the ColBERT library:
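from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Match the sequence lengths listed in the table above.
config = ColBERTConfig(query_maxlen=32, doc_maxlen=180)
ckpt = Checkpoint("<your-username>/ColBERT-XLMR-Mixed-Distill-enru-ru",
                  colbert_config=config)

# Encode queries and passages into per-token embeddings.
Q = ckpt.queryFromText(["mixed English+Russian query ..."])
D = ckpt.docFromText(["Russian document passage ..."])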
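Q and D hold one embedding per token. Late interaction then scores a query-document pair with MaxSim: each query token takes its best cosine match among the document tokens, and those maxima are summed. The sketch below illustrates this on toy 3-dimensional vectors in pure Python (the model itself projects to 128 dimensions). `maxsim_score` and the toy embeddings are hypothetical illustrations rather than part of the ColBERT library, and padding or mask handling is left out.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_embs, doc_embs):
    """Sum over query tokens of the max cosine similarity to any doc token."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    score = 0.0
    for q_tok in q:
        score += max(
            sum(a * b for a, b in zip(q_tok, d_tok)) for d_tok in d
        )
    return score


# Toy token embeddings (hypothetical, 3 dimensions for readability).
query = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
doc_match = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
doc_other = [[0.0, 0.0, 5.0]]

score_match = maxsim_score(query, doc_match)  # each query token has an aligned doc token
score_other = maxsim_score(query, doc_other)  # no query token aligns with the doc
```

Because every query token contributes independently, a document can score well on a mixed-language query as long as each token, English or Russian, finds a close match somewhere in the passage.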
@misc{kim2025milqbenchmarkingirmodels,
  title={MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries},
  author={Jonghwi Kim and Deokhyung Kang and Seonjeong Hwang and Yunsu Kim and Jungseul Ok and Gary Lee},
  year={2025},
  eprint={2505.16631},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.16631},
}