Mixed-Distill enfa-fa (English–Persian)

A cross-lingual ColBERT late-interaction retriever (XLM-RoBERTa-large backbone) for English ⇄ Persian (Farsi) web search. The model is distilled from a strong reranker and trained on code-switched (mixed-language) queries, following the Mixed-Distill recipe from the MiLQ paper.

The repo name encodes the training direction: enfa-fa = code-switched en+fa queries → fa documents. The model is designed to stay robust when bilingual users issue mixed English+Persian queries against Persian (Farsi) documents.

What "Mixed-Distill" means

Mixed: training queries are code-switched. English tokens are randomly mixed into Persian queries (MUSE-based, mixing ratio of about 0.5), so the model handles Persian, English, and mixed-language queries.
Distill: the model is trained with knowledge distillation, minimizing the KL divergence between its scores and teacher relevance scores (mT5-XXL / monoT5 over mMARCO), with 6-way passage scoring (six candidate passages per query).
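
The "Mixed" step can be pictured as a lexicon-driven token substitution. The following is a minimal sketch, not the MiLQ authors' pipeline: the function name `code_switch`, the toy Persian-to-English lexicon, and the per-token sampling rule are all assumptions made for illustration. In practice the lexicon would come from a MUSE-style bilingual dictionary.

```python
import random


def code_switch(query_tokens, lexicon, ratio=0.5, seed=None):
    """Swap each translatable token for a lexicon entry with probability `ratio`.

    `lexicon` maps a source-language token to a list of candidate translations.
    Tokens without an entry are always kept unchanged.
    """
    rng = random.Random(seed)
    mixed = []
    for tok in query_tokens:
        translations = lexicon.get(tok)
        if translations and rng.random() < ratio:
            mixed.append(rng.choice(translations))
        else:
            mixed.append(tok)
    return mixed


# Hypothetical toy lexicon (illustration only, not taken from MUSE).
toy_lexicon = {"هتل": ["hotel"], "ارزان": ["cheap"], "تهران": ["Tehran"]}
query = "بهترین هتل ارزان در تهران".split()

# With ratio=0.5, about half of the translatable tokens become English.
print(" ".join(code_switch(query, toy_lexicon, ratio=0.5, seed=0)))
```

Setting `ratio=0.0` leaves the query untouched and `ratio=1.0` replaces every token that has a lexicon entry, which makes the mixing ratio easy to sanity-check.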

Intended use

Queries: English, Persian, or code-switched English+Persian.
Documents: Persian (Farsi) passages.
Scoring: ColBERT late interaction (MaxSim over per-token embeddings).
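
To make the scoring rule concrete, here is a small pure-Python sketch of MaxSim with cosine similarity: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The function names and the 2-dimensional toy vectors are illustrative assumptions. The real model produces 128-dimensional embeddings and the ColBERT library computes this on batched tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Toy embeddings: two query tokens, two candidate documents.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has an exact match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for both

print(maxsim_score(query_embs, doc_a))
print(maxsim_score(query_embs, doc_b))
```

In this toy case `doc_a` outranks `doc_b` because every query token finds a perfectly aligned document token.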

Specs


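| Property | Value |
|---|---|
| Base model | `xlm-roberta-large` |
| Architecture | ColBERT (late interaction) |
| Projection dim | 128 |
| Similarity | cosine |
| Query max length | 32 tokens |
| Doc max length | 180 tokens |
| Training | Knowledge distillation (KL divergence), n-way 6, teacher: mT5-XXL/monoT5 on mMARCO |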
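
The training objective listed above can be sketched for a single query with six candidate passages. This is an illustration under assumptions, not the released training code: the helper names and the example scores are made up, and the direction KL(teacher || student) is assumed because it is the usual choice for score distillation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kd_kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate passages of one query."""
    log_p_t = log_softmax(teacher_scores)
    log_p_s = log_softmax(student_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))


# Hypothetical scores for one query and six passages (n-way 6).
teacher = [4.0, 1.0, 0.5, 0.0, -1.0, -2.0]   # reranker relevance scores
student = [2.0, 2.5, 0.0, 0.0, -0.5, -1.0]   # ColBERT MaxSim scores

print(kd_kl_loss(student, teacher))
```

The loss is zero when the student's score distribution over the six passages matches the teacher's, and it is unaffected by adding a constant to all student scores, since only the softmax-normalized distribution matters.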

Usage

Load with the ColBERT library:

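```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint by repo id; replace <your-username> with the repo owner.
ckpt = Checkpoint("<your-username>/ColBERT-XLMR-Mixed-Distill-enfa-fa",
                  colbert_config=ColBERTConfig())

# Encode queries and documents into per-token embeddings for late interaction.
Q = ckpt.queryFromText(["mixed English+Persian query ..."])
D = ckpt.docFromText(["Persian (Farsi) document passage ..."])
```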

Citation

@misc{kim2025milqbenchmarkingirmodels,
      title={MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries},
      author={Jonghwi Kim and Deokhyung Kang and Seonjeong Hwang and Yunsu Kim and Jungseul Ok and Gary Lee},
      year={2025},
      eprint={2505.16631},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.16631},
}
