---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
inference: false
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://github.com/stanford-futuredata/ColBERT) model: it encodes queries & passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. It can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

## Installation

To use this model, you will need to install the following libraries:
```
pip install colbert-ir @ git+https://github.com/stanford-futuredata/ColBERT.git faiss-gpu==1.7.2
```


## Usage

**Step 1: Indexing.** This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search. ⚠️ ColBERT indexing requires a GPU!
```
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    documents = [
      "Ceci est un premier document.",
      "Voici un second document.",
      ...
    ]
    indexer.index(name=index_name, collection=documents)

```

**Step 2: Searching.** Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
```
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = "" # Name of the folder where the logs and created indices will be stored
index_name: str = "" # Name of your previously created index where the documents you want to search are stored.
k: int = 10 # how many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
    query = "Comment effectuer une recherche avec ColBERT ?"
    results = searcher.search(query, k=k)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)

```

## Evaluation

We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared the model performance with a biencoder model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

| model                                                                                                                   | Vocab. | #Param. |  Size |   MRR@10 |   R@10 |   R@100(↑) |   R@500 |
|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
| **colbertv1-camembert-base-mmarcoFR**                                                                                   |     🇫🇷 |    110M | 443MB |    29.51 |  54.21 |      80.00 |   88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)              |     🇫🇷 |    110M | 438MB |    28.53 |  51.46 |      77.82 |   89.13 |

## Training

#### Details

We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it on a 500K sentence triples dataset in French via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated to a query. We trained the model on a single Tesla V100 GPU with 32GBs of memory during 200k steps using a batch size of 64. We used the AdamW optimizer with a constant learning rate of 3e-06. The passage length was limited to 256 tokens and the query length to 32 tokens.

#### Data

We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a large-scale IR dataset comprising:
- a corpus of 8.8M passages;
- a training set of ~533k queries (with at least one relevant passage);
- a development set of ~101k queries;
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)

## Citation

```bibtex
@online{louis2023,
   author    = 'Antoine Louis',
   title     = 'colbertv1-camembert-base-mmarcoFR: A ColBERTv1 Model Trained on French mMARCO',
   publisher = 'Hugging Face',
   month     = 'dec',
   year      = '2023',
   url       = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
}
```