	---
	language:
	- en
	- fa
	tags:
	- clir
	- colbertx
	- plaidx
	- xlm-roberta-large
	datasets:
	- ms_marco
	- eugene-yang/tdist-msmarco-scores
	task_categories:
	- text-retrieval
	- information-retrieval
	task_ids:
	- passage-retrieval
	- cross-language-retrieval
	license: mit
	---

	# ColBERT-X for English-Persian CLIR using Translate-Distill

	## Model Description

	Translate-Distill is a training technique that produces a state-of-the-art CLIR dense retrieval model through translation and distillation.
	`plaidx-large-fas-tdist-mt5xxl-engfas` is trained by minimizing the KL divergence against scores from the mt5xxl MonoT5 reranker,
	computed on English MS MARCO training queries paired with passages translated into Persian.
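
To make the distillation objective concrete, here is a minimal sketch of a KL-divergence loss over the candidate passages of a single query. It assumes the teacher and student scores are each turned into a distribution with a softmax and that the teacher distribution is the target. The function names and the example scores are hypothetical and are not taken from the PLAID-X code base.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate passages."""
    t_logp = log_softmax(teacher_scores)
    s_logp = log_softmax(student_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


# Hypothetical scores for one query and six candidate passages.
teacher = [4.2, 1.0, 0.3, -0.5, -1.2, -2.0]  # e.g. reranker scores
student = [2.0, 1.5, 0.1, 0.0, -0.7, -1.1]   # e.g. dense retriever scores
loss = kl_distill_loss(teacher, student)
```

The loss is zero when the student ranks and weights the passages exactly as the teacher does, and grows as the two distributions diverge.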

	### Teacher Models

	- `t53b`: [`castorini/monot5-3b-msmarco-10k`](https://huggingface.co/castorini/monot5-3b-msmarco-10k)
	- `mt5xxl`: [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)

	### Training Parameters

	- learning rate: 5e-6
	- update steps: 200,000
	- nway (number of passages per query): 6 (randomly selected from 50)
	- per-device batch size (number of query-passage sets): 8
	- training GPUs: 8 NVIDIA V100 with 32 GB memory
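
As a rough illustration of what these settings imply, the sketch below samples 6 passages from 50 candidates and works out the per-step workload. It assumes uniform sampling without replacement, plain data parallelism across the 8 GPUs, and no gradient accumulation. None of these details are stated above, and the constant and function names are hypothetical.

```python
import random

NWAY = 6                   # passages per query
CANDIDATES_PER_QUERY = 50  # pool the passages are drawn from
PER_DEVICE_BATCH = 8       # query-passage sets per GPU
NUM_GPUS = 8
UPDATE_STEPS = 200_000


def sample_passages(candidate_ids, nway=NWAY, rng=random):
    """Pick nway distinct passages at random from the candidate pool."""
    return rng.sample(candidate_ids, nway)


# Assumption: data parallelism with no gradient accumulation.
sets_per_step = PER_DEVICE_BATCH * NUM_GPUS  # query-passage sets per update
passages_per_step = sets_per_step * NWAY     # passages encoded per update
total_sets = sets_per_step * UPDATE_STEPS    # sets seen over the whole run

candidates = list(range(CANDIDATES_PER_QUERY))  # hypothetical passage ids
chosen = sample_passages(candidates, rng=random.Random(0))
```

Under these assumptions each update covers 64 query-passage sets, or 384 passages, and the full run covers 12.8 million sets.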

	## Usage

	To load ColBERT-X models from the Hugging Face Hub properly, please install the following version of PLAID-X.
	```bash
	pip install git+https://github.com/hltcoe/ColBERT-X.git@plaid-x
	```

	The following code snippet loads the model through the Hugging Face API.
	```python
	from colbert.modeling.checkpoint import Checkpoint
	from colbert.infra import ColBERTConfig

	# The Hub repository id includes the `hltcoe/` organization prefix.
	Checkpoint('hltcoe/plaidx-large-fas-tdist-mt5xxl-engfas', colbert_config=ColBERTConfig())
	```

	For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
	which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

	## BibTeX entry and Citation Info

	Please cite the following two papers if you use the model.


	```bibtex
	@inproceedings{colbert-x,
	author = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
	title = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022},
	url = {https://arxiv.org/abs/2201.08471}
	}
	```

	```bibtex
	@inproceedings{translate-distill,
	author = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
	title = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
	booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
	year = {2024},
	url = {tba}
	}
	```