antoinelouis
/

biencoder-mMiniLMv2-L12-mmarcoFR

@@ -12,7 +12,7 @@ tags:
 library_name: sentence-transformers
 ---
-# biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR
 This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
@@ -33,7 +33,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('antoinelouis/biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
@@ -58,8 +58,8 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['This is an example sentence', 'Each sentence is converted']
 # Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR')
-model = AutoModel.from_pretrained('antoinelouis/biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR')
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -80,17 +80,16 @@ print(sentence_embeddings)
 We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with other biencoder models fine-tuned on the same dataset. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
-|    | model                                                                                                                                                                            |  Size |   MRR@10 |   NDCG@10 |   MAP@10 |   R@10 |   R@100(↑) |   R@500 |
-|---:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|---------:|----------:|---------:|-------:|-----------:|--------:|
-|  1 | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)                                                                       | 443MB |    28.53 |     33.72 |    27.93 |  51.46 |      77.82 |   89.13 |
-|  2 | [biencoder-all-mpnet-base-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-all-mpnet-base-v2-mmarcoFR)                                                                 | 438MB |    28.04 |     33.28 |    27.5  |  51.07 |      77.68 |   88.67 |
-|  3 | [biencoder-sentence-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-sentence-camembert-base-mmarcoFR)                                                     | 443MB |    27.63 |     32.7  |    27.01 |  50.10 |      76.85 |   88.73 |
-|  4 | [biencoder-distilcamembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-distilcamembert-base-mmarcoFR)                                                           | 272MB |    26.80 |     31.87 |    26.23 |  49.20 |      76.44 |   87.87 |
-|  5 | **biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR**                                                                                                              | 471MB |    24.74 |     29.41 |    24.23 |  45.40 |      71.52 |   84.42 |
-|  6 | [biencoder-camemberta-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camemberta-base-mmarcoFR)                                                                     | 447MB |    24.78 |     29.24 |    24.23 |  44.58 |      69.59 |   82.18 |
-|  7 | [biencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR) | 440MB |    23.38 |     27.97 |    22.91 |  43.50 |      68.96 |   81.61 |
-|  8 | [biencoder-mMiniLM-L6-v2-mmarco-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLM-L6-v2-mmarco-mmarcoFR)                                                           | 428MB |    22.87 |     27.26 |    22.37 |  42.3  |      68.78 |   81.39 |
-|  9 | [biencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR)             | 428MB |    22.29 |     26.57 |    21.8  |  41.25 |      66.78 |   79.83 |
 ## Training
 ***
@@ -112,17 +111,15 @@ We used the French version of the [mMARCO](https://huggingface.co/datasets/unica
 - a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
 Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
 ## Citation
 ```bibtex
 @online{louis2023,
    author    = 'Antoine Louis',
-   title     = 'biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR: A Biencoder Model Trained on French mMARCO',
    publisher = 'Hugging Face',
    month     = 'may',
    year      = '2023',
-   url       = 'https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR',
 }
 ```

 library_name: sentence-transformers
 ---
+# biencoder-mMiniLMv2-L12-mmarcoFR
 This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]
+model = SentenceTransformer('antoinelouis/biencoder-mMiniLMv2-L12-mmarcoFR')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
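The model card mentions semantic search as a use case. As a rough sketch of what that involves once embeddings exist, the snippet below ranks passages against a query by cosine similarity. The toy 3-dimensional vectors stand in for the 384-dimensional output of `model.encode`, the helper names `cosine` and `rank_passages` are hypothetical, and the choice of cosine similarity is an assumption rather than something the card states.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy vectors standing in for real embeddings (assumption: illustrative values only).
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned
]

ranking = rank_passages(query, passages)
for index, score in ranking:
    print(f"passage {index}: {score:.3f}")
```

With real embeddings, `query` and `passages` would be replaced by the rows returned from `model.encode`; for large corpora a vectorized or approximate nearest-neighbor search would likely be preferable to this pure-Python loop.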
 sentences = ['This is an example sentence', 'Each sentence is converted']
 # Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-mMiniLMv2-L12-mmarcoFR')
+model = AutoModel.from_pretrained('antoinelouis/biencoder-mMiniLMv2-L12-mmarcoFR')
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its performance with other biencoder models fine-tuned on the same dataset. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
+|    | model                                                                                                                   | Vocab. | #Param. |  Size |   MRR@10 |   NDCG@10 |   MAP@10 |   R@10 |   R@100(↑) |   R@500 |
+|---:|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|---------:|----------:|---------:|-------:|-----------:|--------:|
+|  1 | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)              |     🇫🇷 |    110M | 443MB |    28.53 |     33.72 |    27.93 |  51.46 |      77.82 |   89.13 |
+|  2 | [biencoder-mpnet-base-all-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mpnet-base-all-v2-mmarcoFR)        |     🇬🇧 |    109M | 438MB |    28.04 |     33.28 |    27.50 |  51.07 |      77.68 |   88.67 |
+|  3 | [biencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-distilcamembert-mmarcoFR)            |     🇫🇷 |     68M | 272MB |    26.80 |     31.87 |    26.23 |  49.20 |      76.44 |   87.87 |
+|  4 | [biencoder-MiniLM-L6-all-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-MiniLM-L6-all-v2-mmarcoFR)          |     🇬🇧 |     23M |  91MB |    25.49 |     30.39 |    24.99 |  47.10 |      73.48 |   86.09 |
+|  5 | **biencoder-mMiniLMv2-L12-mmarcoFR**                                                                                    | 🇫🇷,99+ |    117M | 471MB |    24.74 |     29.41 |    24.23 |  45.40 |      71.52 |   84.42 |
+|  6 | [biencoder-camemberta-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camemberta-base-mmarcoFR)            |     🇫🇷 |    112M | 447MB |    24.78 |     29.24 |    24.23 |  44.58 |      69.59 |   82.18 |
+|  7 | [biencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-electra-base-french-mmarcoFR)    |     🇫🇷 |    110M | 440MB |    23.38 |     27.97 |    22.91 |  43.50 |      68.96 |   81.61 |
+|  8 | [biencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L6-mmarcoFR)                  | 🇫🇷,99+ |    107M | 428MB |    22.29 |     26.57 |    21.80 |  41.25 |      66.78 |   79.83 |
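
To make two of the reported metrics concrete, here is a minimal sketch of how MRR@k and R@k are commonly computed from a ranked list per query. It is not the evaluation script used for the table; the function names, the toy `run` and `qrels` dictionaries, and the passage ids are all hypothetical. The table values appear to be these averages expressed as percentages.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical toy data: retrieved passage ids per query, and the relevant ids.
run = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2", "p5"]}
qrels = {"q1": {"p7"}, "q2": {"p4"}}

# Average each metric over all queries.
mean_mrr = sum(mrr_at_k(run[q], qrels[q]) for q in run) / len(run)
mean_recall = sum(recall_at_k(run[q], qrels[q]) for q in run) / len(run)

print(f"MRR@10 = {mean_mrr:.2f}, R@10 = {mean_recall:.2f}")
```

In this toy run, the relevant passage for `q1` is found at rank 2 and nothing relevant is retrieved for `q2`, so the averages are 0.25 for MRR@10 and 0.50 for R@10.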
 ## Training
 ***
 - a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).
 Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
 ## Citation
 ```bibtex
 @online{louis2023,
    author    = 'Antoine Louis',
+   title     = 'biencoder-mMiniLMv2-L12-mmarcoFR: A Biencoder Model Trained on French mMARCO',
    publisher = 'Hugging Face',
    month     = 'may',
    year      = '2023',
+   url       = 'https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L12-mmarcoFR',
 }
 ```