---
language:
- en
- zh
- fa
- ru
tags:
- clir
- colbertx
- plaidx
- xlm-roberta-large
datasets:
- ms_marco
- hltcoe/tdist-msmarco-scores
task_categories:
- text-retrieval
- information-retrieval
task_ids:
- passage-retrieval
- cross-language-retrieval
license: mit
---

# ColBERT-X for English-Chinese/Persian/Russian MLIR using Multilingual Translate-Distill

## MLIR Model Setting

- Query language: English
- Query length: 32 tokens max
- Document language: Chinese/Persian/Russian
- Document length: 180 tokens max (please use MaxP to aggregate the passage scores if needed)

## Model Description

Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation.
`plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng` is trained with a KL-divergence loss against scores from the `mt5xxl` MonoT5 reranker [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k), obtained by running inference on English MS MARCO training queries and passages.
The teacher scores can be found in [`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz).

### Training Parameters

- learning rate: 5e-6
- update steps: 200,000
- nway (number of passages per query): 6 (randomly selected from 50; 2 if using `round-robin-entires`, see below)
- per-device batch size (number of query-passage sets): 8
- training GPUs: 8 NVIDIA V100 with 32 GB memory

### Mixing Strategies

- `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training.
- `mix-entries`: all passages in a given query-passage set are assigned to the same randomly chosen language.
- `round-robin-entires`: for each query, the query-passage set is repeated `n` times to iterate through all languages.

## Usage

To properly load ColBERT-X models from the Hugging Face Hub, please use the following version of PLAID-X.

```bash
# Quote the requirement so the shell does not treat ">=" as an output redirection.
pip install "PLAID-X>=0.3.1"
```

The following code snippet loads the model through the Hugging Face API.

```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Load the checkpoint from the Hugging Face Hub with the default ColBERT configuration.
checkpoint = Checkpoint(
    'hltcoe/plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng',
    colbert_config=ColBERTConfig(),
)
```

For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb), which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

## BibTeX entry and Citation Info

Please cite the following two papers if you use the model.

```bibtex
@inproceedings{mtt,
  title = {Neural Approaches to Multilingual Information Retrieval},
  author = {Dawn Lawrie and Eugene Yang and Douglas W Oard and James Mayfield},
  booktitle = {Proceedings of the 45th European Conference on Information Retrieval (ECIR)},
  year = {2023},
  doi = {10.1007/978-3-031-28244-7_33},
  url = {https://arxiv.org/abs/2209.01335}
}
```

```bibtex
@inproceedings{mtd,
  author = {Eugene Yang and Dawn Lawrie and James Mayfield},
  title = {Distillation for Multilingual Information Retrieval},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)},
  year = {2024},
  url = {https://arxiv.org/abs/2405.00977}
}
```
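
## Example: MaxP Score Aggregation

The model setting above recommends MaxP for documents longer than the 180-token passage limit: each document is scored by its highest-scoring passage. The following is a minimal sketch of that aggregation, not part of PLAID-X. The function name `maxp_aggregate` and the passage ID format `<doc_id>_<passage_index>` are assumptions made for illustration; adapt the ID parsing to however your passages were split.

```python
from typing import Dict, Iterable, Tuple


def maxp_aggregate(passage_scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Collapse passage-level scores into document-level scores (MaxP).

    Assumes (hypothetically) that passage IDs look like "<doc_id>_<passage_index>".
    """
    doc_scores: Dict[str, float] = {}
    for passage_id, score in passage_scores:
        # Strip the trailing passage index to recover the document ID.
        doc_id = passage_id.rsplit("_", 1)[0]
        # Keep only the best passage score seen so far for this document.
        if doc_id not in doc_scores or score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return doc_scores


# Toy passage-level retrieval scores for one query (made-up values).
scores = [("docA_0", 12.5), ("docA_1", 17.0), ("docB_0", 15.25), ("docB_1", 9.0)]
doc_scores = maxp_aggregate(scores)

# Rank documents by their aggregated score, best first.
ranking = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # [('docA', 17.0), ('docB', 15.25)]
```

In practice, the passage scores would come from a PLAID-X search over the passage index, and the aggregated ranking would be truncated to the desired depth.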