Mixed-Distill enfa-fa (English–Persian)

A cross-lingual ColBERT late-interaction retriever (XLM-RoBERTa-large backbone) for English ⇄ Persian (Farsi) web search. The model is distilled from the relevance scores of a reranker teacher (mT5-XXL / monoT5 over mMARCO) and trained on code-switched (mixed-language) queries, following the Mixed-Distill recipe from the MiLQ paper.

The repo name encodes the training direction: enfa-fa = code-switched en+fa queries → fa documents. The model is designed to stay robust when bilingual users issue mixed English+Persian queries against Persian (Farsi) documents.

What "Mixed-Distill" means

  • Mixed: training queries are code-switched (English tokens are randomly mixed into Persian queries, MUSE-based, at a mixing ratio of about 0.5), so the model handles native Persian, English, and mixed-language queries.
  • Distill: the model is trained with knowledge distillation, minimizing a KL-divergence loss against teacher relevance scores (mT5-XXL / monoT5 over mMARCO), with 6-way passage scoring.
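
The "Mixed" step can be pictured as dictionary-based token replacement. The sketch below is only an illustration and not the authors' pipeline: it assumes each Persian token is swapped for an English translation independently with probability equal to the mixing ratio. The names `code_switch` and `toy_dict` are hypothetical, and the hand-written toy lexicon merely stands in for a MUSE-style bilingual dictionary.

```python
import random


def code_switch(tokens, bilingual_dict, ratio=0.5, seed=None):
    """Replace each translatable token with a dictionary translation
    with probability `ratio`; tokens without an entry are kept as-is."""
    rng = random.Random(seed)
    mixed_tokens = []
    for tok in tokens:
        translations = bilingual_dict.get(tok)
        if translations and rng.random() < ratio:
            mixed_tokens.append(rng.choice(translations))
        else:
            mixed_tokens.append(tok)
    return mixed_tokens


# Toy Persian -> English lexicon (hypothetical entries for illustration).
toy_dict = {
    "کتاب": ["book"],
    "خرید": ["buy", "purchase"],
}

query = ["خرید", "کتاب", "ارزان"]  # roughly: "buy cheap book"
mixed = code_switch(query, toy_dict, ratio=0.5, seed=0)
print(" ".join(mixed))
```

With `ratio=0.0` the query stays fully Persian, and with `ratio=1.0` every token that has a dictionary entry is replaced, so the ratio controls how heavily the query is mixed.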

Intended use

  • Queries: English, Persian, or code-switched English+Persian.
  • Documents: Persian (Farsi)-language passages.
  • Scoring: ColBERT late interaction (MaxSim over per-token embeddings).
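
As a rough illustration of the scoring rule above, the following sketch computes a MaxSim score with cosine similarity: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. It uses plain Python lists and tiny 2-dimensional vectors for readability; the function names are hypothetical, and the real model produces 128-dimensional embeddings on tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: sum over query tokens of the maximum
    cosine similarity to any document token."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Two query tokens and three document tokens (toy values).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
score = maxsim_score(query_embs, doc_embs)
print(round(score, 4))  # 1.0 (first token) + 0.8 (second token) = 1.8
```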

Specs

Base model: xlm-roberta-large
Architecture: ColBERT (late interaction)
Projection dim: 128
Similarity: cosine
Query max length: 32
Doc max length: 180
Training: KD (KLD), n-way 6, teacher: mT5-XXL/monoT5 on mMARCO
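
The training entry above (KD with a KL-divergence loss, n-way 6) can be sketched for a single query as follows. This is a hedged illustration rather than the actual training code: it assumes that teacher and student scores over the 6 candidate passages are each turned into a distribution with a softmax and that the loss is KL(teacher || student). The function names and score values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate passages."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)


# One query with 6 candidate passages (n-way 6); values are made up.
teacher_scores = [9.0, 4.0, 3.5, 1.0, 0.5, -2.0]  # reranker teacher
student_scores = [6.0, 5.0, 2.0, 1.5, 0.0, -1.0]  # ColBERT MaxSim scores
loss = kd_kl_loss(teacher_scores, student_scores)
print(loss)
```

The loss is zero when the student reproduces the teacher's distribution over the 6 passages and grows as the two rankings diverge.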

Usage

Load with the ColBERT library:

from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Load the checkpoint from the model repo.
ckpt = Checkpoint("<your-username>/ColBERT-XLMR-Mixed-Distill-enfa-fa",
                  colbert_config=ColBERTConfig())
# Encode queries and documents into per-token embeddings.
Q = ckpt.queryFromText(["mixed English+Persian query ..."])
D = ckpt.docFromText(["Persian (Farsi) document passage ..."])

Citation

@misc{kim2025milqbenchmarkingirmodels,
      title={MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries},
      author={Jonghwi Kim and Deokhyung Kang and Seonjeong Hwang and Yunsu Kim and Jungseul Ok and Gary Lee},
      year={2025},
      eprint={2505.16631},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.16631},
}

Model size: 0.6B params (Safetensors, tensor type F32)