	---
	pipeline_tag: sentence-similarity
	language: fr
	license: mit
	datasets:
	- unicamp-dl/mmarco
	metrics:
	- recall
	tags:
	- colbert
	- passage-retrieval
	base_model: camembert-base
	library_name: RAGatouille
	inference: false
	model-index:
	  - name: colbertv1-camembert-base-mmarcoFR
	    results:
	      - task:
	          type: sentence-similarity
	          name: Passage Retrieval
	        dataset:
	          type: unicamp-dl/mmarco
	          name: mMARCO-fr
	          config: french
	          split: validation
	        metrics:
	          - type: recall_at_1000
	            name: Recall@1000
	            value: 89.7
	          - type: recall_at_500
	            name: Recall@500
	            value: 88.4
	          - type: recall_at_100
	            name: Recall@100
	            value: 80.0
	          - type: recall_at_10
	            name: Recall@10
	            value: 54.2
	          - type: mrr_at_10
	            name: MRR@10
	            value: 29.5
	---

	# colbertv1-camembert-base-mmarcoFR

	This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for French that can be used for semantic search. It encodes queries and passages into matrices
	of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
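
As a rough illustration of the MaxSim operator mentioned above, the following minimal sketch scores one query against one passage with NumPy. The function name `maxsim_score` is hypothetical, and random matrices stand in for the token embeddings the model would actually produce; it is not the library's implementation.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one passage.

    For each query token, take the maximum cosine similarity over all
    passage tokens, then sum these maxima over the query tokens.
    """
    # L2-normalize each token embedding so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    sim = q @ p.T  # shape: (n_query_tokens, n_passage_tokens)
    return float(sim.max(axis=1).sum())


# Toy inputs (assumption): random embeddings of dimension 128.
rng = np.random.default_rng(0)
query = rng.normal(size=(32, 128))     # 32 query tokens
passage = rng.normal(size=(256, 128))  # 256 passage tokens
score = maxsim_score(query, passage)
```

Because every per-token maximum is a cosine similarity, the score is bounded above by the number of query tokens.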

	## Usage

	Here are some examples for using the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).

	### Using RAGatouille

	First, install the library:

	```bash
	pip install -U ragatouille
	```

	Then, you can use the model like this:

	```python
	from ragatouille import RAGPretrainedModel

	index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
	documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

	# Step 1: Indexing.
	RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
	RAG.index(name=index_name, collection=documents)

	# Step 2: Searching.
	RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
	RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
	```

	### Using colbert-ai

	First, you will need to install the following libraries:

	```bash
	pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
	```

	Then, you can use the model like this:

	```python
	from colbert import Indexer, Searcher
	from colbert.infra import Run, RunConfig

	n_gpu: int = 1 # Set your number of available GPUs
	experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
	index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
	documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

	# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
	with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
	    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
	    indexer.index(name=index_name, collection=documents)

	# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
	with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
	    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
	    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
	    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
	```
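
To turn the search output into readable hits, a small helper along these lines may be useful. It is a sketch under two assumptions: that `searcher.search` returns three parallel sequences (passage ids, ranks, scores), as in the colbert-ai README example, and that passage ids are positions in the `documents` list passed to the indexer. The helper `format_results` and the values in `example_results` are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple, Union

SearchOutput = Tuple[Sequence[int], Sequence[int], Sequence[float]]


def format_results(
    results: SearchOutput, documents: Sequence[str]
) -> List[Dict[str, Union[int, float, str]]]:
    """Pair each retrieved passage id with its rank, score, and text."""
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "score": score, "passage_id": pid, "text": documents[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


documents = ["Ceci est un premier document.", "Voici un second document.", "etc."]
# Made-up values standing in for the output of `searcher.search(...)`.
example_results = ([1, 0], [1, 2], [21.5, 18.25])

hits = format_results(example_results, documents)
for hit in hits:
    print(f"{hit['rank']}. [{hit['score']:.2f}] {hit['text']}")
```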

	## Evaluation

	The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
	8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
	Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French,
	check out the [DécouvrIR](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.

	| model | #Param.(↓) | Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
	|:------|-----------:|-----:|-----:|------:|-------:|------:|------:|-----:|-------:|
	| [colbertv2-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR) | 54M | 0.2GB | 32 | 9GB | 91.9 | 90.3 | 81.9 | 56.7 | 32.3 |
	| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2) | 111M | 0.4GB | 128 | 28GB | 90.0 | 88.9 | 81.2 | 57.1 | 32.4 |
	| colbertv1-camembert-base-mmarcoFR | 111M | 0.4GB | 128 | 28GB | 89.7 | 88.4 | 80.0 | 54.2 | 29.5 |

	NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.
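
For readers who want to reproduce numbers of this kind on their own runs, here is a minimal sketch of how R@k and MRR@k can be computed from ranked lists. The function names and the toy `run`/`qrels` dictionaries are hypothetical; this is not the evaluation script used for the table above.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mrr_at_k(ranked_ids: List[int], relevant_ids: Set[int], k: int) -> float:
    """Reciprocal rank of the first relevant passage in the top-k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(run: Dict[str, List[int]], qrels: Dict[str, Set[int]], k: int) -> Dict[str, float]:
    """Average both metrics over all queries and report them as percentages."""
    n = len(run)
    recall = sum(recall_at_k(run[q], qrels[q], k) for q in run) / n
    mrr = sum(mrr_at_k(run[q], qrels[q], k) for q in run) / n
    return {f"R@{k}": 100 * recall, f"MRR@{k}": 100 * mrr}


# Toy data (assumption): two queries, each with one relevant passage.
run = {"q1": [3, 7, 1], "q2": [5, 2, 9]}
qrels = {"q1": {7}, "q2": {4}}
metrics = evaluate(run, qrels, k=10)
```

In this toy run, only `q1` retrieves its relevant passage (at rank 2), so recall averages to 50% and MRR to 25%.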

	## Training

	#### Data

	We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
	a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
	We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).

	#### Implementation

	The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
	cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
	and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
	with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
	to 128, and the maximum sequence lengths for queries and passages were fixed to 32 and 256 tokens, respectively.
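
The combination of the two losses described above can be sketched in NumPy as follows. This is an illustrative approximation, not the training code: the equal weighting of the two terms, the use of only other queries' positives as in-batch negatives, and all function names are assumptions.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def pairwise_softmax_ce(pos_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    """Cross-entropy over [positive, hard negative] scores for each query."""
    pairs = np.stack([pos_scores, neg_scores], axis=1)  # shape: (batch, 2)
    return float(-log_softmax(pairs, axis=1)[:, 0].mean())


def in_batch_softmax_ce(score_matrix: np.ndarray) -> float:
    """Cross-entropy where entry (i, j) scores query i against the positive
    passage of query j, so the correct class for row i is column i."""
    return float(-np.diag(log_softmax(score_matrix, axis=1)).mean())


def combined_loss(score_matrix: np.ndarray, neg_scores: np.ndarray) -> float:
    """Sum of both terms (equal weighting is an assumption)."""
    pos_scores = np.diag(score_matrix)
    return pairwise_softmax_ce(pos_scores, neg_scores) + in_batch_softmax_ce(score_matrix)


# Toy batch of 4 queries with uninformative (all-zero) scores.
batch = 4
loss = combined_loss(np.zeros((batch, batch)), np.zeros(batch))
```

With all scores equal, the pairwise term reduces to log 2 and the in-batch term to the log of the batch size, which is a convenient sanity check for an untrained model.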

	## Citation

	```bibtex
	@online{louis2024decouvrir,
	  author    = {Antoine Louis},
	  title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	  publisher = {Hugging Face},
	  month     = {mar},
	  year      = {2024},
	  url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
	}
	```