---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- colbert
- passage-retrieval
base_model: camembert-base
library_name: RAGatouille
inference: false
model-index:
- name: colbertv1-camembert-base-mmarcoFR
  results:
    - task:
        type: sentence-similarity
        name: Passage Retrieval
      dataset:
        type: unicamp-dl/mmarco
        name: mMARCO-fr
        config: french
        split: validation
      metrics:
        - type: recall_at_1000
          name: Recall@1000
          value: 89.7
        - type: recall_at_500
          name: Recall@500
          value: 88.4
        - type: recall_at_100
          name: Recall@100
          value: 80.0
        - type: recall_at_10
          name: Recall@10
          value: 54.2
        - type: mrr_at_10
          name: MRR@10
          value: 29.5
---

# colbertv1-camembert-base-mmarcoFR

This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for **French** that can be used for semantic search. It encodes queries and passages into matrices 
of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
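
To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring: each query token embedding is matched with its most similar passage token embedding, and these maxima are summed. The function names and the 2-dimensional toy vectors are illustrative assumptions, not the model's actual implementation (the real embeddings have 128 dimensions and are computed on GPU).

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_emb: List[Vector], passage_emb: List[Vector]) -> float:
    """Sum, over query tokens, of the maximum similarity to any passage token."""
    return sum(max(dot(q, p) for p in passage_emb) for q in query_emb)


# Hypothetical unit-length token embeddings (toy values for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains an exact match for each query token
passage_b = [[0.6, 0.8]]                          # only a partial match for each query token

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a > score_b)  # passage_a ranks above passage_b
```

Because each passage is scored from its own token matrix, passage embeddings can be computed once at indexing time and reused for every query.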

## Usage

The examples below show how to use the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).

### Using RAGatouille

First, install the library:

```bash
pip install -U ragatouille
```

Then, you can use the model like this:

```python
from ragatouille import RAGPretrainedModel

index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)

# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```

### Using colbert-ai

First, you will need to install the following libraries:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

Then, you can use the model like this:

```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
```

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 
8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k). 
Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French, 
check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.

| model                                                                                                      | #Param.(↓) |  Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |     
|:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
| [colbertv2-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR)     |        54M | 0.2GB |   32 |   9GB |   91.9 |  90.3 |  81.9 | 56.7 |   32.3 | 
| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2)                                                |       111M | 0.4GB |  128 |  28GB |   90.0 |  88.9 |  81.2 | 57.1 |   32.4 |
| **colbertv1-camembert-base-mmarcoFR**                                                                      |       111M | 0.4GB |  128 |  28GB |   89.7 |  88.4 |  80.0 | 54.2 |   29.5 |  

NB: The Index column gives the on-disk size of the mMARCO-fr index (8.8M passages) when using ColBERTv2's residual compression mechanism.
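
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how MRR@k and R@k can be computed from ranked passage ids and relevance judgments. The `run` and `qrels` dictionaries are hypothetical toy data, and this is not the evaluation script used to produce the numbers above; the table reports the resulting averages as percentages.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: List[int], relevant: Set[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


# Hypothetical toy data: ranked passage ids per query, and the relevant ids per query.
run: Dict[str, List[int]] = {"q1": [7, 3, 9], "q2": [4, 8, 2]}
qrels: Dict[str, Set[int]] = {"q1": {3}, "q2": {2, 5}}

# Average each metric over all queries.
mean_mrr = sum(mrr_at_k(run[q], qrels[q], k=10) for q in run) / len(run)
mean_recall = sum(recall_at_k(run[q], qrels[q], k=10) for q in run) / len(run)
print(f"MRR@10 = {100 * mean_mrr:.1f}, R@10 = {100 * mean_recall:.1f}")
```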

## Training

#### Data

We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, 
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. 
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).

#### Implementation

The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax 
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832)) 
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU 
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set 
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
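
The two loss terms can be illustrated with a small pure-Python sketch. It assumes that the pairwise term is a two-way softmax cross-entropy over the (positive, negative) scores of each triple, that the in-batch term treats the positive passages of the other queries in the batch as negatives, and that the two terms are simply added. The exact weighting and negative set used in training are not stated above, so treat these choices, along with all names and toy scores, as assumptions.

```python
import math
from typing import List


def log_softmax_at(scores: List[float], target: int) -> float:
    """Numerically stable log-softmax of `scores`, evaluated at index `target`."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def pairwise_loss(pos_score: float, neg_score: float) -> float:
    """Softmax cross-entropy over one (positive, hard negative) score pair."""
    return -log_softmax_at([pos_score, neg_score], target=0)


def in_batch_loss(score_matrix: List[List[float]]) -> float:
    """score_matrix[i][j] is the score of query i against the positive passage of query j.

    The diagonal entries are the true pairs; off-diagonal entries act as in-batch negatives.
    """
    n = len(score_matrix)
    return -sum(log_softmax_at(row, i) for i, row in enumerate(score_matrix)) / n


# Hypothetical MaxSim scores for a batch of two triples.
pos_scores = [2.0, 1.0]
neg_scores = [0.0, 1.0]
batch_scores = [[2.0, 0.0], [0.0, 1.0]]

pairwise = sum(pairwise_loss(p, n) for p, n in zip(pos_scores, neg_scores)) / len(pos_scores)
total = pairwise + in_batch_loss(batch_scores)  # equal weighting is an assumption
print(f"total loss = {total:.4f}")
```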

## Citation

```bibtex
@online{louis2024decouvrir,
	author    = {Antoine Louis},
	title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	publisher = {Hugging Face},
	month     = {mar},
	year      = {2024},
	url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```