+---
+pipeline_tag: sentence-similarity
+language: fr
+license: mit
+datasets:
+- unicamp-dl/mmarco
+metrics:
+- recall
+tags:
+- feature-extraction
+- sentence-similarity
+library_name: colbert
+inference: false
+---
+# colbertv2-camembert-L4-mmarcoFR
+This is a lightweight [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488) model for French, intended for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
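+As a sketch of the MaxSim scoring described above, following the late-interaction formulation of the ColBERT papers (the symbols $E_q$, $E_d$, $N$ and $M$ are notation introduced here for illustration), the relevance of a passage $d$ to a query $q$ can be written as:
+$$
+s(q, d) = \sum_{i=1}^{N} \max_{1 \le j \le M} \; E_{q_i} \cdot E_{d_j}
+$$
+where $E_{q_i}$ and $E_{d_j}$ are the embeddings of the $i$-th query token and the $j$-th passage token, and $N$ and $M$ are the numbers of query and passage tokens. Assuming the token embeddings are L2-normalized, each dot product equals a cosine similarity.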
+## Usage
+Here are some examples of using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
+### Using ColBERT-AI
+First, you will need to install the following libraries:
+```bash
+pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
+```
+Then, you can use the model like this:
+```python
+from colbert import Indexer, Searcher
+from colbert.infra import Run, RunConfig
+n_gpu: int = 1 # Set your number of available GPUs
+experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
+index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
+documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
+# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
+with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
+    indexer = Indexer(checkpoint="antoinelouis/colbertv2-camembert-L4-mmarcoFR")
+    indexer.index(name=index_name, collection=documents)
+# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
+with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
+    searcher = Searcher(index=index_name) # You don't need to specify the checkpoint again; the model name is stored in the index.
+    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
+    # results: tuple of three lists of length k: (passage_ids, passage_ranks, passage_scores)
+```
+### Using RAGatouille
+First, you will need to install the following libraries:
+```bash
+pip install -U ragatouille
+```
+Then, you can use the model like this:
+```python
+from ragatouille import RAGPretrainedModel
+index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
+documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
+# Step 1: Indexing.
+RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")
+RAG.index(index_name=index_name, collection=documents)
+# Step 2: Searching.
+RAG = RAGPretrainedModel.from_index(f".ragatouille/colbert/indexes/{index_name}") # if not already loaded; the path assumes RAGatouille's default index location
+results = RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
+```
+***
+## Evaluation
+The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its
+performance with other publicly available 🇫🇷 ColBERT models (as well as one single-vector representation model) fine-tuned on the same dataset. We report the
+mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
+| model                                                                                                      | #Param.(↓) |  Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
+|:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
+| **colbertv2-camembert-L4-mmarcoFR**                                                                        |        54M | 216MB |   32 |    GB |   91.9 |  90.3 |  81.9 | 56.7 |   32.3 |
+| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2)                                                |       110M | 443MB |  128 |    GB |        |       |       |      |        |
+| [colbertv1-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR) |       110M | 443MB |  128 |    GB |   89.7 |  88.4 |  80.0 | 54.2 |   29.5 |
+| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) |       110M | 443MB |  128 |    GB |      - |  89.1 |  77.8 | 51.5 |   28.5 |
+NB: The "Index" column reports the on-disk size of the mMARCO-fr index (8.8M passages).
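+For reference, both metrics follow their standard definitions, sketched below with notation introduced here for illustration; the evaluation script itself is not part of this card. $Q$ is the set of evaluation queries, $\mathrm{rank}_q$ the position of the first relevant passage retrieved for $q$, $\mathrm{Rel}_q$ the set of passages relevant to $q$, and $\mathrm{Top}_k(q)$ the $k$ highest-scoring passages:
+$$
+\mathrm{MRR@}10 = \frac{1}{|Q|} \sum_{q \in Q} \begin{cases} 1/\mathrm{rank}_q & \text{if } \mathrm{rank}_q \le 10 \\ 0 & \text{otherwise} \end{cases}
+\qquad
+\mathrm{R@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{|\mathrm{Rel}_q \cap \mathrm{Top}_k(q)|}{|\mathrm{Rel}_q|}
+$$
+The table reports both metrics as percentages.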
+***
+## Training
+#### Data
+We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of
+MS MARCO that contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official [triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset)
+but instead sample 62 harder negatives mined from 12 distinct dense retrievers for each query, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
+distillation dataset. Next, we collect the relevance scores of an expressive [cross-encoder reranker](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)
+for all our (query, paragraph) pairs using the [cross-encoder-ms-marco-MiniLM-L-6-v2-scores](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#cross-encoder-ms-marco-minilm-l-6-v2-scorespklgz) dataset.
+In the end, we obtain 10.4M different 64-way tuples of the form [query, (pos, pos_score), (neg1, neg1_score), ..., (neg62, neg62_score)] for training the model.
+#### Implementation
+The model is initialized from the [camembert-L4](https://huggingface.co/antoinelouis/camembert-L4) checkpoint and optimized with a combination of two losses: a KL-divergence loss
+that distills the cross-encoder scores into the model, and an in-batch sampled softmax cross-entropy loss applied to the positive score of each query against all
+passages corresponding to other queries in the same batch (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). The model is fine-tuned on one 80GB NVIDIA
+H100 GPU for 325k steps using the AdamW optimizer with a batch size of 32 and a peak learning rate of 1e-5, with warm-up over the first 20k steps and linear scheduling.
+The embedding dimension is set to 32, and the maximum sequence lengths for questions and passages are fixed to 32 and 160 tokens, respectively. We use
+cosine similarity to compute relevance scores.
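+The objective described above can be sketched as follows. The notation, the unweighted sum, and the absence of a temperature are assumptions made for illustration; this card does not specify how the two terms are weighted. For a query $q$ with candidate passages $p_1, \dots, p_{63}$ (one positive $p^{+}$ and 62 negatives), let $t_k$ be the cross-encoder score of $p_k$ and $s(q, p_k)$ the model's score:
+$$
+\mathcal{L}_{\mathrm{KL}} = \sum_{k=1}^{63} P_k \log \frac{P_k}{\hat{P}_k}, \qquad P_k = \frac{e^{t_k}}{\sum_{l} e^{t_l}}, \qquad \hat{P}_k = \frac{e^{s(q, p_k)}}{\sum_{l} e^{s(q, p_l)}}
+$$
+$$
+\mathcal{L}_{\mathrm{IB}} = -\log \frac{e^{s(q, p^{+})}}{e^{s(q, p^{+})} + \sum_{p \in \mathcal{B}_q} e^{s(q, p)}}, \qquad \mathcal{L} = \mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{IB}}
+$$
+where $\mathcal{B}_q$ denotes the passages attached to the other queries in the same batch.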
+***
+## Citation
+```bibtex
+@online{louis2023,
+   author    = {Antoine Louis},
+   title     = {colbertv2-camembert-L4-mmarcoFR: A Lightweight ColBERTv2 Model for French},
+   publisher = {Hugging Face},
+   month     = mar,
+   year      = {2024},
+   url       = {https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR},
+}
+```