hakim committed on
Commit
70ec7b2
1 Parent(s): 8a65935

update readme

Files changed (3)
  1. 1_Pooling/config.json +1 -2
  2. README.md +139 -43
  3. config.json +2 -2
1_Pooling/config.json CHANGED
@@ -5,6 +5,5 @@
5
  "pooling_mode_max_tokens": false,
6
  "pooling_mode_mean_sqrt_len_tokens": false,
7
  "pooling_mode_weightedmean_tokens": false,
8
- "pooling_mode_lasttoken": false,
9
- "include_prompt": true
10
  }
 
5
  "pooling_mode_max_tokens": false,
6
  "pooling_mode_mean_sqrt_len_tokens": false,
7
  "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false
 
9
  }
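
Every pooling key visible in this hunk is `false`, which suggests (the rest of the file is not shown here) that plain mean pooling is the active mode, matching the `mean_pooling` helper in the README changes of this commit. As a minimal, dependency-free sketch of what masked mean pooling computes: plain Python lists stand in for tensors, and the function name `masked_mean_pool` and the toy values are illustrative assumptions, not part of the repository.

```python
def masked_mean_pool(token_embeddings, attention_mask, eps=1e-9):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: list of sentences, each a list of token vectors
    attention_mask:   list of sentences, each a list of 0/1 flags
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        dim = len(vectors[0])
        sums = [0.0] * dim
        for vector, keep in zip(vectors, mask):
            for i in range(dim):
                # Padded positions (keep == 0) contribute nothing to the sum.
                sums[i] += vector[i] * keep
        # Clamp the token count so an all-padding row cannot divide by zero.
        count = max(float(sum(mask)), eps)
        pooled.append([s / count for s in sums])
    return pooled


# Toy batch: one sentence with three positions, the last one being padding.
toy_embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
pooled = masked_mean_pool(toy_embeddings, toy_mask)
print(pooled)
```

The padded vector is excluded from both the sum and the count, which is the reason the README helper multiplies by the expanded attention mask before dividing.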
README.md CHANGED
@@ -1,34 +1,54 @@
1
  ---
 
 
2
  library_name: sentence-transformers
3
- pipeline_tag: sentence-similarity
4
  tags:
5
- - sentence-transformers
6
- - feature-extraction
7
- - sentence-similarity
8
- - transformers
9
  datasets:
10
- - stsb_multi_mt
 
11
  ---
12
 
13
- # h4c5/sts-distilcamembert-base
14
 
15
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
 
16
 
17
- <!--- Describe your model here -->
18
 
19
- ## Usage (Sentence-Transformers)
 
20
 
21
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
22
 
23
  ```
24
  pip install -U sentence-transformers
25
  ```
26
 
27
- Then you can use the model like this:
28
-
29
  ```python
30
  from sentence_transformers import SentenceTransformer
31
- sentences = ["This is an example sentence", "Each sentence is converted"]
32
 
33
  model = SentenceTransformer('h4c5/sts-distilcamembert-base')
34
  embeddings = model.encode(sentences)
@@ -36,50 +56,86 @@ print(embeddings)
36
  ```
37
 
38
 
 
39
 
40
- ## Usage (HuggingFace Transformers)
41
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
 
42
 
43
  ```python
44
  from transformers import AutoTokenizer, AutoModel
45
  import torch
46
 
 
 
 
 
47
 
48
- #Mean Pooling - Take attention mask into account for correct averaging
49
  def mean_pooling(model_output, attention_mask):
50
- token_embeddings = model_output[0] #First element of model_output contains all token embeddings
51
- input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
52
- return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
53
 
 
 
54
 
55
- # Sentences we want sentence embeddings for
56
- sentences = ['This is an example sentence', 'Each sentence is converted']
57
 
58
- # Load model from HuggingFace Hub
59
- tokenizer = AutoTokenizer.from_pretrained('h4c5/sts-distilcamembert-base')
60
- model = AutoModel.from_pretrained('h4c5/sts-distilcamembert-base')
61
 
62
- # Tokenize sentences
63
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
64
 
65
- # Compute token embeddings
66
- with torch.no_grad():
67
- model_output = model(**encoded_input)
68
 
69
- # Perform pooling. In this case, mean pooling.
70
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
71
 
72
- print("Sentence embeddings:")
73
- print(sentence_embeddings)
74
- ```
 
75
 
 
 
76
 
 
77
 
78
- ## Evaluation Results
 
79
 
80
- <!--- Describe how your model was evaluated -->
 
 
 
 
 
 
81
 
82
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=h4c5/sts-distilcamembert-base)
83
 
84
 
85
  ## Training
@@ -96,11 +152,11 @@ The model was trained with the parameters:
96
 
97
  `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
98
 
99
- Parameters of the fit()-Method:
100
  ```
101
  {
102
  "epochs": 10,
103
- "evaluation_steps": 0,
104
  "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
105
  "max_grad_norm": 1,
106
  "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
@@ -116,6 +172,7 @@ Parameters of the fit()-Method:
116
 
117
 
118
  ## Full Model Architecture
 
119
  ```
120
  SentenceTransformer(
121
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
@@ -123,6 +180,45 @@ SentenceTransformer(
123
  )
124
  ```
125
 
126
- ## Citing & Authors
127
-
128
- <!--- Describe where people can find more information -->
 
1
  ---
2
+ language: fr
3
+ license: mit
4
  library_name: sentence-transformers
5
+ pipeline_tag: feature-extraction
6
  tags:
7
+ - sentence-transformers
8
+ - feature-extraction
9
+ - sentence-similarity
10
+ - transformers
11
  datasets:
12
+ - stsb_multi_mt
13
+ metrics:
14
+ - pearsonr
15
+ base_model: almanach/camembert-base
16
+ model-index:
17
+ - name: sts-distilcamembert-base
18
+ results:
19
+ - task:
20
+ name: Sentence Similarity
21
+ type: sentence-similarity
22
+ dataset:
23
+ name: STSb French
24
+ type: stsb_multi_mt
25
+ args: fr
26
+ metrics:
27
+ - name: Pearson Correlation - stsb_multi_mt fr
28
+ type: pearsonr
29
+ value: 0.8165
30
  ---
31
 
32
+ ## Description
33
 
34
+ This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning the
35
+ [`almanach/camembert-base`](https://huggingface.co/almanach/camembert-base) model with the
36
+ [sentence-transformers](https://www.SBERT.net) library.
37
 
38
+ It encodes a sentence or a paragraph (514 tokens maximum) into a 768-dimensional vector.
39
 
40
+ The [CamemBERT](https://arxiv.org/abs/1911.03894) model it is based on is a RoBERTa-type model that is
41
+ state of the art for the French language.
42
 
43
+ ## Usage with the `sentence-transformers` library
44
 
45
  ```
46
  pip install -U sentence-transformers
47
  ```
48
 
 
 
49
  ```python
50
  from sentence_transformers import SentenceTransformer
51
+ sentences = ["Ceci est un exemple", "deuxième exemple"]
52
 
53
  model = SentenceTransformer('h4c5/sts-distilcamembert-base')
54
  embeddings = model.encode(sentences)
 
56
  ```
57
 
58
 
59
+ ## Usage with the `transformers` library
60
 
61
+ ```
62
+ pip install -U transformers
63
+ ```
64
 
65
  ```python
66
  from transformers import AutoTokenizer, AutoModel
67
  import torch
68
 
69
+ tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
70
+ model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
71
+ model.eval()
72
+
73
 
74
+ # Mean Pooling
75
  def mean_pooling(model_output, attention_mask):
76
+ token_embeddings = model_output[
77
+ 0
78
+ ] # First element of model_output contains all token embeddings
79
+ input_mask_expanded = (
80
+ attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
81
+ )
82
+ return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
83
+ input_mask_expanded.sum(1), min=1e-9
84
+ )
85
+
86
+ # Tokenize the sentences and compute the token embeddings
87
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
88
+ model_output = model(**encoded_input)
89
+
90
+ # Mean pooling
91
+ sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
92
 
93
+ print(sentence_embeddings)
94
+ ```
95
 
 
 
96
 
97
+ ## Evaluation
 
 
98
 
99
+ The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:
 
100
 
101
+ ```python
102
+ from datasets import load_dataset
103
+ from sentence_transformers import InputExample, evaluation
104
 
 
 
105
 
106
+ def dataset_to_input_examples(dataset):
107
+ return [
108
+ InputExample(
109
+ texts=[example["sentence1"], example["sentence2"]],
110
+ label=example["similarity_score"] / 5.0,
111
+ )
112
+ for example in dataset
113
+ ]
114
+
115
+
116
+ sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
117
+ sts_test_examples = dataset_to_input_examples(sts_test_dataset)
118
+
119
+ sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
120
+ sts_test_examples, name="sts-test"
121
+ )
122
 
123
+ sts_test_evaluator(model, ".")
124
+ ```
125
 
126
+ ### Results
127
 
128
+ Below are the model's evaluation results on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
129
+ (`fr` data, `test` split):
130
 
131
+ | Model | Pearson Correlation | Parameters |
132
+ | :--------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | ---------: |
133
+ | [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
134
+ | [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
135
+ | [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
136
+ | [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 64M |
137
+ | [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 0.786 | 135M |
138
 
 
139
 
140
 
141
  ## Training
 
152
 
153
  `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
154
 
155
+ Parameters of the `fit()` method:
156
  ```
157
  {
158
  "epochs": 10,
159
+ "evaluation_steps": 1000,
160
  "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
161
  "max_grad_norm": 1,
162
  "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
 
172
 
173
 
174
  ## Full Model Architecture
175
+
176
  ```
177
  SentenceTransformer(
178
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
 
180
  )
181
  ```
182
 
183
+ ## Citing
184
+
185
+ @inproceedings{reimers-2019-sentence-bert,
186
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
187
+ author = "Reimers, Nils and Gurevych, Iryna",
188
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
189
+ month = "11",
190
+ year = "2019",
191
+ publisher = "Association for Computational Linguistics",
192
+ journal={"https://arxiv.org/abs/1908.10084"},
193
+ }
194
+
195
+ @inproceedings{sanh2019distilbert,
196
+ title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
197
+ author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
198
+ booktitle={NeurIPS EMC^2 Workshop},
199
+ journal={https://arxiv.org/abs/1910.01108},
200
+ year={2019}
201
+ }
202
+
203
+ @inproceedings{martin2020camembert,
204
+ title={CamemBERT: a Tasty French Language Model},
205
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
206
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
207
+ journal={https://arxiv.org/abs/1911.03894},
208
+ year={2020}
209
+ }
210
+
211
+ @inproceedings{delestre:hal-03674695,
212
+ TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
213
+ AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
214
+ URL = {https://hal.archives-ouvertes.fr/hal-03674695},
215
+ BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
216
+ ADDRESS = {Vannes, France},
217
+ YEAR = {2022},
218
+ MONTH = Jul,
219
+ KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
220
+ PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
221
+ HAL_ID = {hal-03674695},
222
+ HAL_VERSION = {v1},
223
+ journal={https://arxiv.org/abs/2205.11111},
224
+ }
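
The evaluation section added to the README reports a Pearson correlation against gold labels scaled to the 0-1 range (`similarity_score / 5.0`). The sketch below illustrates that kind of metric under stated assumptions: the predicted score is taken to be the cosine similarity of the two sentence embeddings, and the helper names and toy vectors are hypothetical. It is not the `EmbeddingSimilarityEvaluator` implementation, only a dependency-free illustration of the arithmetic.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical embedding pairs with gold scores on the 0-5 STS scale.
pairs = [
    ([1.0, 0.0], [1.0, 0.0], 5.0),
    ([1.0, 0.0], [1.0, 1.0], 3.0),
    ([1.0, 0.0], [0.0, 1.0], 0.0),
]
predicted = [cosine_similarity(u, v) for u, v, _ in pairs]
gold = [score / 5.0 for _, _, score in pairs]  # same scaling as in the README
correlation = pearson(predicted, gold)
print(round(correlation, 3))
```

Pearson correlation is invariant to linear rescaling, so dividing the gold scores by 5.0 does not change the reported value; the scaling matters for training with a cosine-based loss rather than for this metric.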
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "cmarkea/distilcamembert-base",
3
  "architectures": [
4
  "CamembertModel"
5
  ],
@@ -25,4 +25,4 @@
25
  "type_vocab_size": 1,
26
  "use_cache": true,
27
  "vocab_size": 32005
28
- }
 
1
  {
2
+ "_name_or_path": "h4c5/sts-distilcamembert-base",
3
  "architectures": [
4
  "CamembertModel"
5
  ],
 
25
  "type_vocab_size": 1,
26
  "use_cache": true,
27
  "vocab_size": 32005
28
+ }