antoinelouis committed on
Commit
99f1d46
1 Parent(s): 4612a7c

Create README.md

  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
+ ---
+ pipeline_tag: sentence-similarity
+ language: fr
+ license: mit
+ datasets:
+ - unicamp-dl/mmarco
+ metrics:
+ - recall
+ tags:
+ - feature-extraction
+ - sentence-similarity
+ library_name: colbert
+ inference: false
+ ---
+
+ # colbertv2-camembert-L4-mmarcoFR
+
+ This is a lightweight [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488) model for French that can be used for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
+
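To make the MaxSim operator mentioned above concrete, here is a minimal pure-Python sketch of late-interaction scoring. It assumes each query and passage has already been encoded into a list of token-level vectors; the function names and toy vectors are illustrative and are not part of the colbert-ai API, which performs this computation in batched tensor form.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_embs, passage_embs):
    """For each query token, keep its best cosine match among the passage tokens, then sum."""
    q = [l2_normalize(v) for v in query_embs]
    p = [l2_normalize(v) for v in passage_embs]
    return sum(
        max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
        for q_vec in q
    )


# Toy 2-dimensional token embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
passage_b = [[-1.0, 0.0], [1.0, 1.0]]

scores = {"a": maxsim_score(query, passage_a), "b": maxsim_score(query, passage_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Passages are then ranked by this score; in the toy example, `passage_a` ranks first because both query tokens find an exact match in it.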
+ ## Usage
+
+ Here are some examples of using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).
+
+ ### Using ColBERT-AI
+
+ First, you will need to install the following libraries:
+
+ ```bash
+ pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
+ ```
+
+ Then, you can use the model like this:
+
35
+ from colbert import Indexer, Searcher
36
+ from colbert.infra import Run, RunConfig
37
+
38
+ n_gpu: int = 1 # Set your number of available GPUs
39
+ experiment: str = "colbert" # Name of the folder where the logs and created indices will be stored
40
+ index_name: str = "my_index" # The name of your index, i.e. the name of your vector database
41
+ documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."] # Corpus
42
+
43
+ # Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
44
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
45
+ indexer = Indexer(checkpoint="antoinelouis/colbertv2-camembert-L4-mmarcoFR")
46
+ indexer.index(name=index_name, collection=documents)
47
+
48
+ # Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
49
+ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
50
+ searcher = Searcher(index=index_name) # You don't need to specify checkpoint again, the model name is stored in the index.
51
+ results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
52
+ # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
53
+ ```
54
+
+ ### Using RAGatouille
+
+ First, you will need to install the following libraries:
+
+ ```bash
+ pip install -U ragatouille
+ ```
+
+ Then, you can use the model like this:
+
+ ```python
+ from ragatouille import RAGPretrainedModel
+
+ index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
+ documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus
+
+ # Step 1: Indexing.
+ RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")
+ RAG.index(index_name=index_name, collection=documents)
+
+ # Step 2: Searching.
+ # from_index expects the path to the index; the path below assumes RAGatouille's default index root, adjust it if yours differs.
+ index_path = f".ragatouille/colbert/indexes/{index_name}"
+ RAG = RAGPretrainedModel.from_index(index_path)  # if not already loaded
+ RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
+ ```
+
+ ***
+
+ ## Evaluation
+
+ The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its
+ performance with other publicly available 🇫🇷 ColBERT models (as well as one single-vector representation model) fine-tuned on the same dataset. We report the
+ mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
+
+ | model                                                                                                      | #Param.(↓) |  Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
+ |:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
+ | **colbertv2-camembert-L4-mmarcoFR**                                                                        |        54M | 216MB |   32 |    GB |   91.9 |  90.3 |  81.9 | 56.7 |   32.3 |
+ | [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2)                                                |       110M | 443MB |  128 |    GB |        |       |       |      |        |
+ | [colbertv1-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR) |       110M | 443MB |  128 |    GB |   89.7 |  88.4 |  80.0 | 54.2 |   29.5 |
+ | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) |       110M | 443MB |  128 |    GB |      - |  89.1 |  77.8 | 51.5 |   28.5 |
+
+ NB: The Index column reports the on-disk size of the mMARCO-fr index (8.8M passages).
+
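For readers who want to reproduce numbers of the kind reported above, here is a minimal sketch of MRR@k and R@k. It assumes a `run` mapping each query ID to its ranked passage IDs and a `qrels` mapping each query ID to its set of relevant passage IDs; these names and the toy data are illustrative and do not come from the official mMARCO evaluation scripts.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage within the top k, or 0 if none is found."""
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Return the fraction of relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, recall_cutoffs=(10, 100, 500, 1000), mrr_cutoff=10):
    """Average both metrics over all judged queries and express them as percentages."""
    n_queries = len(qrels)
    metrics = {
        f"MRR@{mrr_cutoff}": 100 * sum(
            reciprocal_rank_at_k(run.get(qid, []), rel, mrr_cutoff) for qid, rel in qrels.items()
        ) / n_queries
    }
    for k in recall_cutoffs:
        metrics[f"R@{k}"] = 100 * sum(
            recall_at_k(run.get(qid, []), rel, k) for qid, rel in qrels.items()
        ) / n_queries
    return metrics


# Toy data (hypothetical): two queries with three retrieved passages each.
toy_run = {"q1": [3, 7, 1], "q2": [5, 2, 9]}
toy_qrels = {"q1": {7}, "q2": {4, 9}}
toy_metrics = evaluate(toy_run, toy_qrels, recall_cutoffs=(2, 3))
```

Queries with no retrieved passages contribute zero to both averages, which matches the usual convention of dividing by the number of judged queries.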
+ ***
+
+ ## Training
+
+ #### Data
+
+ We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of
+ MS MARCO that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official [triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset)
+ but instead sample 62 harder negatives per query, mined from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
+ distillation dataset. Next, we collect the relevance scores of an expressive [cross-encoder reranker](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)
+ for all our (query, passage) pairs using the [cross-encoder-ms-marco-MiniLM-L-6-v2-scores](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#cross-encoder-ms-marco-minilm-l-6-v2-scorespklgz) dataset.
+ In the end, we obtain 10.4M distinct 64-way tuples of the form [query, (pos, pos_score), (neg1, neg1_score), ..., (neg62, neg62_score)] for training the model.
+
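The tuple construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual preprocessing script: the helper names, the uniform sampling strategy, the fixed seed, and the in-memory dictionary layout (`negatives_by_retriever`, `ce_scores`) are all hypothetical simplifications.

```python
import random


def pool_hard_negatives(negatives_by_retriever, positive_id):
    """Merge the negatives mined by every retriever, dropping duplicates and the positive."""
    pooled = set()
    for passage_ids in negatives_by_retriever.values():
        pooled.update(passage_ids)
    pooled.discard(positive_id)
    return sorted(pooled)


def build_training_tuple(query, positive_id, negatives_by_retriever, ce_scores,
                         n_negatives=62, seed=0):
    """Return [query, (pos, pos_score), (neg1, neg1_score), ..., (negN, negN_score)]."""
    candidates = pool_hard_negatives(negatives_by_retriever, positive_id)
    # Uniform sampling without replacement is an assumption made for this sketch.
    negatives = random.Random(seed).sample(candidates, n_negatives)
    scored = [(pid, ce_scores[pid]) for pid in [positive_id] + negatives]
    return [query] + scored


# Toy data (hypothetical): 12 retrievers with partially overlapping negative lists.
toy_negatives = {f"retriever_{i}": list(range(100 + 8 * i, 110 + 8 * i)) for i in range(12)}
toy_scores = {pid: -pid / 100 for ids in toy_negatives.values() for pid in ids}
toy_scores[7] = 9.5  # cross-encoder score of the positive passage
example = build_training_tuple("example query", 7, toy_negatives, toy_scores)
```

Each resulting tuple has 64 elements: the query, one scored positive, and 62 scored negatives.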
+ #### Implementation
+
+ The model is initialized from the [camembert-L4](https://huggingface.co/antoinelouis/camembert-L4) checkpoint and optimized with a combination of two losses: a KL-divergence loss
+ that distills the cross-encoder scores into the model, and an in-batch sampled softmax cross-entropy loss applied to the positive score of each query against all
+ passages corresponding to other queries in the same batch (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). The model is fine-tuned on one 80GB NVIDIA
+ H100 GPU for 325k steps using the AdamW optimizer with a batch size of 32 and a peak learning rate of 1e-5, with warm-up over the first 20k steps and linear scheduling.
+ The embedding dimension is set to 32, and the maximum sequence lengths for questions and passages are fixed to 32 and 160 tokens, respectively. We use
+ cosine similarity to compute relevance scores.
+
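As a rough illustration of the two loss terms described above, the following pure-Python sketch computes a KL-divergence distillation loss for one query and an in-batch softmax cross-entropy over a small batch. The function names, the absence of a temperature, and any equal weighting of the two terms are assumptions made for this sketch; the actual training code operates on batched tensors and may differ in these details.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) between score distributions over one query's candidate passages."""
    log_p_teacher = log_softmax(teacher_scores)
    log_p_student = log_softmax(student_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))


def in_batch_cross_entropy(score_rows, positive_indices):
    """Mean cross-entropy where score_rows[i] holds query i's scores against every
    passage in the batch and positive_indices[i] marks its own positive passage."""
    losses = [-log_softmax(row)[pos] for row, pos in zip(score_rows, positive_indices)]
    return sum(losses) / len(losses)


# Toy scores (hypothetical): one query with a positive followed by two negatives.
teacher = [9.0, 2.0, -1.0]   # cross-encoder scores
student = [0.8, 0.5, 0.1]    # model scores for the same passages
distill = kl_distillation_loss(student, teacher)

# Two queries scored against the four passages present in the batch.
batch_scores = [[0.9, 0.2, 0.1, 0.3], [0.2, 0.1, 0.7, 0.4]]
contrastive = in_batch_cross_entropy(batch_scores, positive_indices=[0, 2])

total_loss = distill + contrastive  # equal weighting is an assumption of this sketch
```

The distillation term vanishes when the model's score distribution matches the teacher's, while the in-batch term pushes each query's positive above the passages of the other queries.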
+ ***
+
+ ## Citation
+
+ ```bibtex
+ @online{louis2023,
+     author    = {Antoine Louis},
+     title     = {colbertv2-camembert-L4-mmarcoFR: A Lightweight ColBERTv2 Model for French},
+     publisher = {Hugging Face},
+     month     = {mar},
+     year      = {2024},
+     url       = {https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR},
+ }
+ ```