Paper: MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries (arXiv 2505.16631)
A cross-lingual ColBERT late-interaction retriever (XLM-RoBERTa-large backbone) for English ⇄ Persian (Farsi) web search. The model is distilled from a strong reranker and trained on code-switched (mixed-language) queries, following the Mixed-Distill recipe from the MiLQ paper.
The `enfa-fa` suffix in the repo name encodes the training direction: code-switched English+Persian (en+fa) queries retrieving Persian (fa) documents.
The model is designed to stay robust when bilingual users issue mixed English+Persian queries against Persian (Farsi)-language documents.
| Property | Value |
| --- | --- |
| Base model | FacebookAI/xlm-roberta-large |
| Architecture | ColBERT (late interaction) |
| Projection dim | 128 |
| Similarity | cosine |
| Query max length | 32 tokens |
| Doc max length | 180 tokens |
| Training | Knowledge distillation (KL-divergence loss), n-way 6, teacher: mT5-XXL/monoT5 on mMARCO |
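The training row above mentions knowledge distillation with a KL-divergence loss over 6 candidates per query. As a rough illustration only, the sketch below computes KL(teacher || student) over softmax-normalized scores for a single query in plain Python. The function names, the example scores, and the use of a plain softmax without temperature are assumptions made for this example; the source does not specify the exact loss implementation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kd_kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over one query's candidate passages.

    Hypothetical helper: both score lists are turned into distributions
    with softmax, then compared with KL divergence.
    """
    log_p_teacher = log_softmax(teacher_scores)
    log_p_student = log_softmax(student_scores)
    return sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )


# One query with 6 candidate passages (n-way 6); scores are made up.
teacher_scores = [9.1, 2.3, 1.7, 0.4, -0.8, -1.5]  # reranker (teacher)
student_scores = [5.0, 4.2, 1.0, 0.9, 0.1, -0.3]   # retriever (student)

loss = kd_kl_loss(student_scores, teacher_scores)
print(f"KD loss for this query: {loss:.4f}")
```

The loss is zero when the student's score distribution matches the teacher's and grows as the two rankings diverge, which is what pushes the retriever toward the reranker's preferences.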
Load with the ColBERT library:
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Load the checkpoint; replace <your-username> with the actual repo owner.
ckpt = Checkpoint(
    "<your-username>/ColBERT-XLMR-Mixed-Distill-enfa-fa",
    colbert_config=ColBERTConfig(),
)

# Encode queries and documents into per-token embeddings.
Q = ckpt.queryFromText(["mixed English+Persian query ..."])
D = ckpt.docFromText(["Persian (Farsi) document passage ..."])
```
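`Q` and `D` hold one embedding per token rather than a single vector per text. In ColBERT-style late interaction, a query-document score is obtained by matching each query token to its most similar document token and summing those maxima (MaxSim). The sketch below shows this with cosine similarity on toy 3-dimensional vectors in plain Python. The function names and vectors are hypothetical, and the sketch ignores the padding and masking that the ColBERT library handles internally.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max cosine similarity to any doc token."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        # Best-matching document token for this query token.
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy example: 2 query tokens, 3 document tokens, 3 dimensions each.
query_vecs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
doc_vecs = [[3.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

score = maxsim_score(query_vecs, doc_vecs)
print(f"MaxSim score: {score:.4f}")
```

Because every query token is matched independently, an English token and a Persian token in the same mixed query can each align with a different part of the Persian passage.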
@misc{kim2025milqbenchmarkingirmodels,
title={MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries},
author={Jonghwi Kim and Deokhyung Kang and Seonjeong Hwang and Yunsu Kim and Jungseul Ok and Gary Lee},
year={2025},
eprint={2505.16631},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2505.16631},
}