---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- colbert
- passage-retrieval
base_model: camembert-base
library_name: RAGatouille
inference: false
model-index:
  - name: colbertv1-camembert-base-mmarcoFR
    results:
      - task:
          type: sentence-similarity
          name: Passage Retrieval
        dataset:
          type: unicamp-dl/mmarco
          name: mMARCO-fr
          config: french
          split: validation
        metrics:
          - type: recall_at_1000
            name: Recall@1000
            value: 89.70
          - type: recall_at_500
            name: Recall@500
            value: 88.40
          - type: recall_at_100
            name: Recall@100
            value: 80.00
          - type: recall_at_10
            name: Recall@10
            value: 54.21
          - type: mrr_at_10
            name: MRR@10
            value: 29.51
---
# colbertv1-camembert-base-mmarcoFR
This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
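To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring: each query token embedding is matched with its most similar passage token embedding, and the per-token maxima are summed. The helper names (`l2_normalize`, `maxsim_score`) and the toy 2-dimensional vectors are illustrative assumptions, not part of the model's code; the actual model produces 128-dimensional embeddings and the libraries below compute this on tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the max cosine similarity with any passage token."""
    q = [l2_normalize(v) for v in query_embs]
    p = [l2_normalize(v) for v in passage_embs]
    score = 0.0
    for q_vec in q:
        score += max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
    return score


# Toy token embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
passage_b = [[-1.0, 0.0], [1.0, 1.0]]

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
# passage_a contains a close match for every query token, so it ranks first.
ranking = sorted([("a", score_a), ("b", score_b)], key=lambda x: x[1], reverse=True)
```

Because the score decomposes over query tokens, passage embeddings can be computed and indexed offline, which is what the indexing step in the usage examples does.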
## Usage
The examples below show how to use the model with [RAGatouille](https://github.com/bclavie/RAGatouille) or [colbert-ai](https://github.com/stanford-futuredata/ColBERT).
### Using RAGatouille
First, you will need to install the following library:
```bash
pip install -U ragatouille
```
Then, you can use the model like this:
```python
from ragatouille import RAGPretrainedModel
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv1-camembert-base-mmarcoFR")
RAG.index(name=index_name, collection=documents)
# Step 2: Searching.
RAG = RAGPretrainedModel.from_index(index_name) # if not already loaded
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```
### Using colbert-ai
First, you will need to install the following libraries:
```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```
Then, you can use the model like this:
```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig
n_gpu: int = 1 # Set your number of available GPUs
experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv1-camembert-base-mmarcoFR")
    indexer.index(name=index_name, collection=documents)
# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
***
## Evaluation
The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with that of a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
| model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|-------:|-----------:|--------:|
| **colbertv1-camembert-base-mmarcoFR** | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
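For readers unfamiliar with the metrics in the table, the sketch below computes MRR@k and R@k using their standard definitions, averaged over queries. The evaluation script behind the reported numbers is not shown in this card, so the function names and the `run`/`qrels` dictionaries are hypothetical; the table reports these values as percentages.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_over_queries(metric, run, qrels, k):
    """Average a per-query metric over all queries that have relevance labels."""
    values = [metric(run.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(values) / len(values)


# Hypothetical toy data: ranked passage ids per query and their relevant ids.
run = {"q1": [3, 7, 1], "q2": [5, 2, 9]}
qrels = {"q1": {7}, "q2": {4, 9}}

mrr10 = mean_over_queries(mrr_at_k, run, qrels, 10)
recall2 = mean_over_queries(recall_at_k, run, qrels, 2)
recall10 = mean_over_queries(recall_at_k, run, qrels, 10)
```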
***
## Training
#### Data
We use the French training set from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset,
a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries.
We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
#### Implementation
The model is initialized from the [camembert-base](https://huggingface.co/camembert-base) checkpoint and optimized via a combination of the pairwise softmax
cross-entropy loss computed over predicted scores for the positive and hard negative passages (as in [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832))
and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). It was trained on a single Tesla V100 GPU
with 32GB of memory for 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
to 128, and the maximum sequence lengths for questions and passages were fixed to 32 and 256 tokens, respectively.
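The combined objective can be illustrated with a small pure-Python sketch. It is not the training code: it assumes the two terms are summed with equal weight, and that the in-batch candidates for each query are the positive passages of the other queries in the batch. Neither detail is specified in this card, and all function names are hypothetical.

```python
import math


def log_softmax_at(scores, target):
    """Numerically stable log-softmax value of scores[target]."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def pairwise_loss(pos_scores, neg_scores):
    """Softmax cross-entropy over each (positive, hard negative) score pair."""
    losses = [-log_softmax_at([sp, sn], 0) for sp, sn in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


def in_batch_loss(score_matrix):
    """score_matrix[i][j]: score of query i against the positive passage of query j.

    The target for query i is its own positive, i.e. the diagonal entry.
    """
    losses = [-log_softmax_at(row, i) for i, row in enumerate(score_matrix)]
    return sum(losses) / len(losses)


def combined_loss(pos_scores, neg_scores, score_matrix):
    """Assumed equal-weight sum of the two terms."""
    return pairwise_loss(pos_scores, neg_scores) + in_batch_loss(score_matrix)


# Hypothetical MaxSim scores for a batch of two triples.
pos = [12.0, 9.5]
neg = [8.0, 9.0]
matrix = [[12.0, 4.0], [3.5, 9.5]]
loss = combined_loss(pos, neg, matrix)
```

Both terms push each positive passage's score above those of its competitors; the in-batch term simply provides extra negatives at no additional encoding cost.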
***
## Citation
```bibtex
@online{louis2023,
author = {Antoine Louis},
title = {colbertv1-camembert-base-mmarcoFR: The 1st ColBERT Model for French},
publisher = {Hugging Face},
month = {dec},
year = {2023},
url = {https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR},
}
```