---
language:
- en
- zh
tags:
- clir
- colbertx
- plaidx
- xlm-roberta-large
datasets:
- ms_marco
- hltcoe/tdist-msmarco-scores
task_categories:
- text-retrieval
- information-retrieval
task_ids:
- passage-retrieval
- cross-language-retrieval
license: mit
---

# ColBERT-X for English-Chinese CLIR using Translate-Distill

## CLIR Model Setting

- Query language: English
- Query length: 32 tokens max
- Document language: Chinese
- Document length: 180 tokens max (score longer documents passage by passage and aggregate the passage scores with MaxP if needed; a sketch appears at the end of this card)

## Model Description

Translate-Distill is a training technique that produces state-of-the-art CLIR dense retrieval models through translation and distillation. `plaidx-large-zho-tdist-mt5xxl-engeng` is trained with a KL-divergence loss that distills scores produced by the `mt5xxl` MonoT5 reranker on English MS MARCO training queries paired with English passages.

### Teacher Models

- `t53b`: [`castorini/monot5-3b-msmarco-10k`](https://huggingface.co/castorini/monot5-3b-msmarco-10k)
- `mt5xxl`: [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)

### Training Parameters

- learning rate: 5e-6
- update steps: 200,000
- nway (number of passages per query): 6 (randomly selected from 50)
- per-device batch size (number of query-passage sets): 8
- training GPUs: 8 NVIDIA V100 with 32 GB of memory each

A minimal sketch of the distillation loss appears at the end of this card.

## Usage

To properly load ColBERT-X models from the Huggingface Hub, please install the following version of PLAID-X.

```bash
pip install PLAID-X==0.3.1
```

The following code snippet loads the model through the Huggingface API.

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Download the weights from the Huggingface Hub and load them as a ColBERT-X checkpoint.
checkpoint = Checkpoint('hltcoe/plaidx-large-zho-tdist-mt5xxl-engeng',
                        colbert_config=ColBERTConfig())
```

For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb), which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial). An illustrative scoring sketch is also included at the end of this card.

## BibTeX entry and Citation Info

Please cite the following two papers if you use the model.

```bibtex
@inproceedings{colbert-x,
  author    = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
  title     = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
  booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
  year      = {2022},
  url       = {https://arxiv.org/abs/2201.08471}
}
```

```bibtex
@inproceedings{translate-distill,
  author    = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
  title     = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
  booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2401.04810}
}
```
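## Sketch: Distillation Loss (KL-Divergence)

To illustrate the training objective described under Training Parameters, here is a minimal, hypothetical PyTorch sketch of a KL-divergence distillation loss over `nway = 6` candidate passages per query. The tensor names and random scores are illustrative stand-ins only; this is not the PLAID-X training code.

```python
import torch
import torch.nn.functional as F

batch_size, nway = 8, 6  # per-device batch of query-passage sets, nway passages each

# Illustrative stand-ins: teacher_scores would come from the mt5xxl reranker,
# student_scores from the ColBERT-X model being trained.
teacher_scores = torch.randn(batch_size, nway)
student_scores = torch.randn(batch_size, nway, requires_grad=True)

# KL divergence between the softmax-normalized teacher and student
# score distributions over the nway passages of each query.
loss = F.kl_div(
    F.log_softmax(student_scores, dim=-1),
    F.log_softmax(teacher_scores, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()
```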
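## Sketch: Scoring a Passage with the Loaded Checkpoint

The snippet below sketches how the loaded checkpoint could score a Chinese passage against an English query with ColBERT-style MaxSim (late-interaction) scoring. It assumes PLAID-X retains the `queryFromText`/`docFromText` helpers of the upstream ColBERT `Checkpoint` class; for indexing and retrieval at scale, follow the tutorial notebook instead.

```python
import torch
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

checkpoint = Checkpoint('hltcoe/plaidx-large-zho-tdist-mt5xxl-engeng',
                        colbert_config=ColBERTConfig())

# Encode an English query and a Chinese passage into token-level embeddings.
Q = checkpoint.queryFromText(['what is the capital of China'])  # (1, query_len, dim)
D = checkpoint.docFromText(['北京是中华人民共和国的首都。'])       # (1, doc_len, dim)

# MaxSim: for each query token, take its best-matching document token,
# then sum over query tokens to obtain the passage score.
score = (Q @ D.transpose(1, 2)).max(dim=2).values.sum(dim=1)
print(score)
```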
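## Sketch: MaxP Score Aggregation

As noted in the CLIR Model Setting section, documents longer than 180 tokens should be split into passages whose scores are aggregated with MaxP, i.e., a document's score is the maximum of its passage scores. A minimal sketch, using hypothetical `(doc_id, score)` pairs:

```python
def maxp_aggregate(passage_scores):
    """Reduce (doc_id, passage_score) pairs to doc_id -> max passage score."""
    doc_scores = {}
    for doc_id, score in passage_scores:
        doc_scores[doc_id] = max(score, doc_scores.get(doc_id, float("-inf")))
    return doc_scores

# Example: three passages from two documents.
print(maxp_aggregate([("d1", 12.3), ("d1", 15.7), ("d2", 9.4)]))
# {'d1': 15.7, 'd2': 9.4}
```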