antoinelouis committed
Commit c21a032
1 Parent(s): 973151c

Update README.md

Files changed (1):
  1. README.md (+19 -18)
README.md CHANGED
@@ -43,7 +43,8 @@ model-index:
 
 # colbertv1-camembert-base-mmarcoFR
 
-This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
 
 ## Usage
 
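The MaxSim operator mentioned in the model description above can be sketched with NumPy. This is an illustrative toy example only (random embeddings, toy shapes); the real model produces 128-dimensional token embeddings via its CamemBERT encoder:

```python
import numpy as np

# Toy token-embedding matrices standing in for the model's output:
# 32 query tokens and 256 passage tokens, 128 dims each (the model's
# actual embedding dimension and maximum sequence lengths).
rng = np.random.default_rng(0)
Q = rng.normal(size=(32, 128))
D = rng.normal(size=(256, 128))

# L2-normalize rows so dot products are cosine similarities.
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
D = D / np.linalg.norm(D, axis=1, keepdims=True)

# MaxSim: for each query token, keep its maximum similarity over all
# passage tokens, then sum these maxima over the query tokens.
score = (Q @ D.T).max(axis=1).sum()
```

In practice the library never scores passages one by one like this; it prunes candidates with an approximate-nearest-neighbour index over the passage token embeddings before applying MaxSim.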
@@ -105,18 +106,20 @@ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
 # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
 
-***
-
 ## Evaluation
 
-The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
 
-| model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
-|:------|:-------|--------:|------:|-------:|-----:|---------:|------:|
-| **colbertv1-camembert-base-mmarcoFR** | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
-| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
 
-***
 
 ## Training
 
@@ -134,17 +137,15 @@ and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://d
 with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
 to 128, and the maximum sequence lengths for questions and passages length were fixed to 32 and 256 tokens, respectively.
 
-***
-
 ## Citation
 
 ```bibtex
-@online{louis2023,
-  author = 'Antoine Louis',
-  title = 'colbertv1-camembert-base-mmarcoFR: The 1st ColBERT Model for French',
-  publisher = 'Hugging Face',
-  month = 'dec',
-  year = '2023',
-  url = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
 }
 ```
 
 # colbertv1-camembert-base-mmarcoFR
 
+This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for **French** that can be used for semantic search. It encodes queries and passages into matrices
+of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
 
 ## Usage
 
 
 # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
 
 ## Evaluation
 
+The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
+8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
+Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French,
+check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
 
+| model | #Param.(↓) | Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
+|:------|-----------:|-----:|-----:|------:|-------:|------:|------:|-----:|-------:|
+| [colbertv2-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR) | 54M | 0.2GB | 32 | 9GB | 91.9 | 90.3 | 81.9 | 56.7 | 32.3 |
+| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2) | 111M | 0.4GB | 128 | 28GB | 90.0 | 88.9 | 81.2 | 57.1 | 32.4 |
+| **colbertv1-camembert-base-mmarcoFR** | 111M | 0.4GB | 128 | 28GB | 89.7 | 88.4 | 80.0 | 54.2 | 29.5 |
 
+NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.
 
 ## Training
 
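As a reference for the MRR@10 column reported in the evaluation table, here is a minimal sketch of how that metric is computed: the reciprocal rank of the first relevant passage within each query's top 10, averaged over queries. The query and passage IDs below are toy values, not the actual mMARCO-fr judgments:

```python
def mrr_at_10(rankings, relevant):
    """rankings: {query_id: ranked list of passage_ids};
    relevant: {query_id: set of relevant passage_ids}."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, pid in enumerate(ranked_ids[:10], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank  # reciprocal rank of first hit
                break                # only the first relevant passage counts
    return total / len(rankings)     # queries with no hit contribute 0

# Toy example: q1's first relevant passage is at rank 2 (RR = 0.5),
# q2 has no relevant passage in its top 10 (RR = 0).
rankings = {"q1": ["p9", "p3", "p7"], "q2": ["p2", "p5", "p1"]}
relevant = {"q1": {"p3"}, "q2": {"p8"}}
print(mrr_at_10(rankings, relevant))  # (0.5 + 0) / 2 = 0.25
```

mMARCO dev queries mostly have a single judged relevant passage, which is why MRR (rather than NDCG or MAP alone) is the headline metric on this benchmark.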
 
 with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
 to 128, and the maximum sequence lengths for questions and passages length were fixed to 32 and 256 tokens, respectively.
 
 ## Citation
 
 ```bibtex
+@online{louis2024decouvrir,
+  author = 'Antoine Louis',
+  title = 'DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French',
+  publisher = 'Hugging Face',
+  month = 'mar',
+  year = '2024',
+  url = 'https://huggingface.co/spaces/antoinelouis/decouvrir',
 }
 ```