|
--- |
|
language: |
|
- en |
|
- zh |
|
- fa |
|
- ru |
|
tags: |
|
- clir |
|
- colbertx |
|
- plaidx |
|
- xlm-roberta-large |
|
datasets: |
|
- ms_marco |
|
- hltcoe/tdist-msmarco-scores |
|
task_categories: |
|
- text-retrieval |
|
- information-retrieval |
|
task_ids: |
|
- passage-retrieval |
|
- cross-language-retrieval |
|
license: mit |
|
--- |
|
|
|
# ColBERT-X for English-Chinese/Persian/Russian MLIR using Multilingual Translate-Distill |
|
|
|
## MLIR Model Setting |
|
|
|
- Query language: English |
|
- Query length: 32 tokens max |
|
- Document language: Chinese/Persian/Russian |
|
- Document length: 180 tokens max (use MaxP to aggregate passage scores for longer documents) |
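The MaxP aggregation mentioned above can be sketched as follows. This is a minimal illustration, not the PLAID-X implementation: `split_into_passages`, `maxp_score`, and the `score_passage` callback are hypothetical names, the passages are non-overlapping, and the tokens are treated as an already-tokenized list (the real model counts XLM-RoBERTa subword tokens).

```python
from typing import Any, Callable, List


def split_into_passages(tokens: List[str], max_len: int = 180) -> List[List[str]]:
    """Cut a token list into consecutive, non-overlapping passages of at most max_len tokens."""
    passages = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    # An empty document still yields one (empty) passage so that max() below is defined.
    return passages or [[]]


def maxp_score(
    query: Any,
    doc_tokens: List[str],
    score_passage: Callable[[Any, List[str]], float],  # hypothetical scorer, e.g. a wrapper around the model
    max_len: int = 180,
) -> float:
    """MaxP: the document score is the maximum of its passage scores."""
    passages = split_into_passages(doc_tokens, max_len)
    return max(score_passage(query, passage) for passage in passages)
```

In practice `score_passage` would wrap the retrieval model; any function that maps a query and a passage to a float fits this sketch.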
|
|
|
## Model Description |
|
|
|
Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation. |
|
`plaidx-large-neuclir-mtd-mix-passages-mt5xxl-engeng` is trained with KL-Divergence from the `mt5xxl` MonoT5 reranker |
|
[`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k) |
|
run on English MS MARCO training queries and passages. |
|
The teacher scores can be found in |
|
[`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz). |
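The distillation objective described above can be illustrated with a small, dependency-free sketch. It assumes the loss is KL(teacher || student) over softmax-normalized scores of the passages sampled for one query; the exact direction, temperature, and batching used in training are not stated here, so treat `kl_distill_loss` as a hypothetical helper rather than the training code.

```python
import math
from typing import List


def log_softmax(scores: List[float]) -> List[float]:
    """Numerically stable log-softmax over one query's passage scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(teacher_scores: List[float], student_scores: List[float]) -> float:
    """KL(teacher || student) between the two score distributions (assumed direction)."""
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same passages")
    log_p = log_softmax(teacher_scores)  # teacher (reranker) distribution
    log_q = log_softmax(student_scores)  # student (ColBERT-X) distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student ranks and weights the passages exactly as the teacher does, and it is unaffected by adding a constant to either score list.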
|
|
|
### Training Parameters |
|
|
|
- learning rate: 5e-6 |
|
- update steps: 200,000 |
|
- nway (number of passages per query): 6 (randomly selected from 50; 2 if using `round-robin-entries`, see below) |
|
- per device batch size (number of query-passage sets): 8 |
|
- training GPUs: 8 NVIDIA V100, 32 GB memory each |
|
|
|
### Mixing Strategies |
|
|
|
- `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training. |
|
- `mix-entries`: all passages in a given query-passage set are randomly assigned to the same language. |
|
- `round-robin-entries`: for each query, the query-passage set is repeated `n` times, once per document language, to iterate through all languages. |
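The three strategies listed above differ only in how a document language is attached to each sampled passage. The sketch below is one possible reading, not the training code: the function names, the `(passage_id, language)` representation, and the use of uniform, independent sampling are assumptions made for illustration.

```python
import random
from typing import List, Tuple

LANGUAGES = ["zh", "fa", "ru"]  # document languages of this model

Entry = List[Tuple[str, str]]  # (passage_id, language) pairs for one query


def mix_passages(passage_ids: List[str], rng: random.Random) -> Entry:
    """Each passage gets its own randomly drawn language (assumed uniform)."""
    return [(pid, rng.choice(LANGUAGES)) for pid in passage_ids]


def mix_entries(passage_ids: List[str], rng: random.Random) -> Entry:
    """One language is drawn per query-passage set and shared by all its passages."""
    lang = rng.choice(LANGUAGES)
    return [(pid, lang) for pid in passage_ids]


def round_robin_entries(passage_ids: List[str]) -> List[Entry]:
    """The set is repeated once per language, so every language is seen for the query."""
    return [[(pid, lang) for pid in passage_ids] for lang in LANGUAGES]
```

Because the round-robin variant multiplies each query-passage set by the number of languages, it pairs naturally with the smaller nway of 2 listed under the training parameters.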
|
|
|
## Usage |
|
|
|
To properly load ColBERT-X models from the Hugging Face Hub, use PLAID-X version 0.3.1 or later. |
|
```bash |
|
pip install "PLAID-X>=0.3.1" |
|
``` |
|
|
|
The following code snippet loads the model through the Hugging Face API. |
|
```python |
|
from colbert.modeling.checkpoint import Checkpoint |
|
from colbert.infra import ColBERTConfig |
|
|
|
checkpoint = Checkpoint('hltcoe/plaidx-large-neuclir-mtd-mix-passages-mt5xxl-engeng', colbert_config=ColBERTConfig()) |
|
``` |
|
|
|
For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb), |
|
which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial). |
|
|
|
## BibTeX entry and Citation Info |
|
|
|
Please cite the following two papers if you use the model. |
|
|
|
|
|
```bibtex |
|
@inproceedings{mtt, |
|
title = {Neural Approaches to Multilingual Information Retrieval}, |
|
author = {Dawn Lawrie and Eugene Yang and Douglas W Oard and James Mayfield}, |
|
booktitle = {Proceedings of the 45th European Conference on Information Retrieval (ECIR)}, |
|
year = {2023}, |
|
doi = {10.1007/978-3-031-28244-7_33}, |
|
url = {https://arxiv.org/abs/2209.01335} |
|
} |
|
``` |
|
|
|
```bibtex |
|
@inproceedings{mtd, |
|
author = {Eugene Yang and Dawn Lawrie and James Mayfield}, |
|
title = {Distillation for Multilingual Information Retrieval}, |
|
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)}, |
|
  year = {2024}, |
|
url = {https://arxiv.org/abs/2405.00977} |
|
} |
|
``` |
|
|