
ColBERT-X for English-Chinese/Persian/Russian MLIR using Multilingual Translate-Distill

MLIR Model Setting

  • Query language: English
  • Query length: 32 tokens max
  • Document language: Chinese/Persian/Russian
  • Document length: 180 tokens max (please use MaxP to aggregate the passage scores if needed)
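
As a rough illustration of the MaxP aggregation mentioned above, the sketch below splits a long document into passages of at most 180 tokens and keeps the highest passage score as the document score. The helper names `split_into_passages` and `maxp_aggregate` are hypothetical and not part of PLAID-X, and the token list stands in for whatever tokenization is actually used.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def split_into_passages(tokens: Sequence[str], max_len: int = 180) -> List[List[str]]:
    """Cut a token sequence into consecutive, non-overlapping passages.

    Non-overlapping windows are an assumption made for this sketch.
    """
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def maxp_aggregate(passage_hits: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """MaxP: a document's score is the maximum score among its passages."""
    doc_scores: Dict[str, float] = {}
    for doc_id, score in passage_hits:
        if doc_id not in doc_scores or score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return doc_scores


# A 400-token document yields passages of 180, 180, and 40 tokens.
passages = split_into_passages([f"tok{i}" for i in range(400)])

# Passage-level scores (made-up numbers) collapse to one score per document.
doc_scores = maxp_aggregate([("doc1", 0.2), ("doc1", 0.9), ("doc2", 0.5)])
```

Ranking the documents by the aggregated scores then gives the final document-level result list.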

Model Description

Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation. plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng is trained with a KL-divergence loss against scores from the mt5xxl MonoT5 reranker unicamp-dl/mt5-13b-mmarco-100k, which was run on English MS MARCO training queries and passages. The teacher scores can be found in hltcoe/tdist-msmarco-scores.
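
To make the distillation objective more concrete, here is a minimal sketch of a KL-divergence loss between teacher and student scores for the passages of a single query. It is not the PLAID-X training code: the function names are hypothetical, and the direction KL(teacher || student) and the absence of a temperature are assumptions, since the description above only states that KL-divergence is used.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one query's passage scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(teacher_scores: Sequence[float],
                    student_scores: Sequence[float]) -> float:
    """KL(teacher || student) over the passage distribution of one query.

    Assumption: both score lists are normalized with a plain softmax.
    """
    t_log = log_softmax(teacher_scores)
    s_log = log_softmax(student_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))


# Six passages per query (nway = 6); the scores are made-up numbers.
teacher = [4.1, 2.0, 0.5, -1.0, -2.3, -3.0]
student = [3.0, 2.5, 0.0, -0.5, -2.0, -2.8]
loss = kl_distill_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's score distribution and grows as the two distributions diverge.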

Training Parameters

  • learning rate: 5e-6
  • update steps: 200,000
  • nway (number of passages per query): 6 (randomly selected from 50; 2 if using round-robin-entires, see below)
  • per device batch size (number of query-passage sets): 8
  • training GPU: 8 NVIDIA V100 with 32 GB memory

Mixing Strategies

  • mix-passages: languages are randomly assigned to the 6 sampled passages for a given query during training.
  • mix-entries: all passages in a given query-passage set are randomly assigned to the same language.
  • round-robin-entires: for each query, the query-passage set is repeated n times to iterate through all languages.
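
The three strategies differ only in how a document language is assigned to the passages of a query-passage set. The sketch below is one way to express that assignment; the function names and the language codes are illustrative assumptions rather than the PLAID-X implementation, and the third function corresponds to the round-robin strategy listed above.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical language codes for Chinese, Persian, and Russian.
LANGUAGES = ["zho", "fas", "rus"]

Assignment = List[Tuple[str, str]]  # (passage_id, language) pairs


def mix_passages(passage_ids: Sequence[str], languages: Sequence[str],
                 rng: random.Random) -> Assignment:
    """Each passage independently gets a randomly chosen language."""
    return [(pid, rng.choice(languages)) for pid in passage_ids]


def mix_entries(passage_ids: Sequence[str], languages: Sequence[str],
                rng: random.Random) -> Assignment:
    """One random language is shared by every passage in the set."""
    lang = rng.choice(languages)
    return [(pid, lang) for pid in passage_ids]


def round_robin(passage_ids: Sequence[str],
                languages: Sequence[str]) -> List[Assignment]:
    """The set is repeated once per language, covering all languages."""
    return [[(pid, lang) for pid in passage_ids] for lang in languages]


rng = random.Random(0)
passage_ids = [f"p{i}" for i in range(6)]  # 6 sampled passages per query

per_passage = mix_passages(passage_ids, LANGUAGES, rng)
per_entry = mix_entries(passage_ids, LANGUAGES, rng)
repeated = round_robin(passage_ids[:2], LANGUAGES)  # 2 passages in this mode
```

In this sketch the query stays in English in all three cases; only the passage language varies.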

Usage

To properly load ColBERT-X models from the Hugging Face Hub, please use the following version of PLAID-X.

pip install "PLAID-X>=0.3.1"

The following code snippet loads the model through the Hugging Face API.

from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Download the weights from the Hub and wrap them in a ColBERT checkpoint.
checkpoint = Checkpoint('hltcoe/plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng', colbert_config=ColBERTConfig())

For a full tutorial, please refer to the PLAID-X Jupyter Notebook, which is part of the SIGIR 2023 CLIR Tutorial.

BibTeX entry and Citation Info

Please cite the following two papers if you use the model.

@inproceedings{mtt,
    title = {Neural Approaches to Multilingual Information Retrieval},
    author = {Dawn Lawrie and Eugene Yang and Douglas W Oard and James Mayfield},
    booktitle = {Proceedings of the 45th European Conference on Information Retrieval (ECIR)},
    year = {2023},
    doi = {10.1007/978-3-031-28244-7_33},
    url = {https://arxiv.org/abs/2209.01335}
}
@inproceedings{mtd,
    author = {Eugene Yang and Dawn Lawrie and James Mayfield},
    title = {Distillation for Multilingual Information Retrieval},
    booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)},
    year = {2024},
    url = {https://arxiv.org/abs/2405.00977}
}