antoinelouis committed
Commit c21a032
1 Parent(s): 973151c

Update README.md

Files changed (1):
  1. README.md (+19 -18)
README.md CHANGED
@@ -43,7 +43,8 @@ model-index:
 
 # colbertv1-camembert-base-mmarcoFR
 
-This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
 
 ## Usage
 
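The MaxSim operator mentioned in the model description above can be sketched with NumPy. This is an illustrative toy example only (random embeddings, toy shapes); the real model produces 128-dimensional token embeddings via its CamemBERT encoder:

```python
import numpy as np

# Toy token-embedding matrices standing in for the model's output:
# 32 query tokens and 256 passage tokens, 128 dims each (the model's
# actual embedding dimension and maximum sequence lengths).
rng = np.random.default_rng(0)
Q = rng.normal(size=(32, 128))
D = rng.normal(size=(256, 128))

# L2-normalize rows so dot products are cosine similarities.
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
D = D / np.linalg.norm(D, axis=1, keepdims=True)

# MaxSim: for each query token, keep its maximum similarity over all
# passage tokens, then sum these maxima over the query tokens.
score = (Q @ D.T).max(axis=1).sum()
```

In practice the library never scores passages one by one like this; it prunes candidates with an approximate-nearest-neighbour index over the passage token embeddings before applying MaxSim.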
@@ -105,18 +106,20 @@ with Run().context(RunConfig(nranks=n_gpu,experiment=experiment)):
 # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
 
-***
-
 ## Evaluation
 
-The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compared its performance to a single-vector representation model fine-tuned on the same dataset. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
 
-| model | Vocab. | #Param. | Size | MRR@10 | R@10 | R@100(↑) | R@500 |
-|:------|:-------|--------:|------:|-------:|-----:|---------:|------:|
-| **colbertv1-camembert-base-mmarcoFR** | 🇫🇷 | 110M | 443MB | 29.51 | 54.21 | 80.00 | 88.40 |
-| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 🇫🇷 | 110M | 443MB | 28.53 | 51.46 | 77.82 | 89.13 |
 
-***
 
 ## Training
 
@@ -134,17 +137,15 @@ and the in-batch sampled softmax cross-entropy loss (as in [ColBERTv2](https://d
 with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
 to 128, and the maximum sequence lengths for questions and passages length were fixed to 32 and 256 tokens, respectively.
 
-***
-
 ## Citation
 
 ```bibtex
-@online{louis2023,
-  author = 'Antoine Louis',
-  title = 'colbertv1-camembert-base-mmarcoFR: The 1st ColBERT Model for French',
-  publisher = 'Hugging Face',
-  month = 'dec',
-  year = '2023',
-  url = 'https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR',
 }
 ```
 
 # colbertv1-camembert-base-mmarcoFR
 
+This is a [ColBERTv1](https://doi.org/10.48550/arXiv.2004.12832) model for **French** that can be used for semantic search. It encodes queries and passages into matrices
+of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
 
 ## Usage
 
 
 # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
 ```
 
 ## Evaluation
 
+The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
+8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
+Below, we compare its performance with other publicly available French ColBERT models fine-tuned on the same dataset. To see how it compares to other neural retrievers in French,
+check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
 
+| model | #Param.(↓) | Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
+|:------|-----------:|-----:|-----:|------:|-------:|------:|------:|-----:|-------:|
+| [colbertv2-camembert-L4-mmarcoFR](https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR) | 54M | 0.2GB | 32 | 9GB | 91.9 | 90.3 | 81.9 | 56.7 | 32.3 |
+| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2) | 111M | 0.4GB | 128 | 28GB | 90.0 | 88.9 | 81.2 | 57.1 | 32.4 |
+| **colbertv1-camembert-base-mmarcoFR** | 111M | 0.4GB | 128 | 28GB | 89.7 | 88.4 | 80.0 | 54.2 | 29.5 |
 
+NB: Index corresponds to the size of the mMARCO-fr index (8.8M passages) on disk when using ColBERTv2's residual compression mechanism.
 
 ## Training
 
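As a reference for the MRR@10 column reported in the evaluation table, here is a minimal sketch of how that metric is computed: the reciprocal rank of the first relevant passage within each query's top 10, averaged over queries. The query and passage IDs below are toy values, not the actual mMARCO-fr judgments:

```python
def mrr_at_10(rankings, relevant):
    """rankings: {query_id: ranked list of passage_ids};
    relevant: {query_id: set of relevant passage_ids}."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, pid in enumerate(ranked_ids[:10], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank  # reciprocal rank of first hit
                break                # only the first relevant passage counts
    return total / len(rankings)     # queries with no hit contribute 0

# Toy example: q1's first relevant passage is at rank 2 (RR = 0.5),
# q2 has no relevant passage in its top 10 (RR = 0).
rankings = {"q1": ["p9", "p3", "p7"], "q2": ["p2", "p5", "p1"]}
relevant = {"q1": {"p3"}, "q2": {"p8"}}
print(mrr_at_10(rankings, relevant))  # (0.5 + 0) / 2 = 0.25
```

mMARCO dev queries mostly have a single judged relevant passage, which is why MRR (rather than NDCG or MAP alone) is the headline metric on this benchmark.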
 
 with 32GBs of memory during 200k steps using a batch size of 64 and the AdamW optimizer with a constant learning rate of 3e-06. The embedding dimension was set
 to 128, and the maximum sequence lengths for questions and passages length were fixed to 32 and 256 tokens, respectively.
 
 ## Citation
 
 ```bibtex
+@online{louis2024decouvrir,
+  author = 'Antoine Louis',
+  title = 'DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French',
+  publisher = 'Hugging Face',
+  month = 'mar',
+  year = '2024',
+  url = 'https://huggingface.co/spaces/antoinelouis/decouvrir',
 }
 ```