dangvantuan committed on
Commit 1e61310
1 Parent(s): 1b23db2

Update README.md

Files changed (1)
  1. README.md +104 -4
README.md CHANGED
@@ -22,12 +22,12 @@ model-index:
   metrics:
   - name: Test Pearson correlation coefficient
     type: Pearson_correlation_coefficient
-    value: xx.xx
+    value: 87.14
  ---
+
  ## Pre-trained sentence embedding models are the state of the art of sentence embeddings for French.
  The model is fine-tuned using the pre-trained [flaubert/flaubert_base_uncased](https://huggingface.co/flaubert/flaubert_base_uncased) and
- [Siamese BERT-Networks with 'sentences-transformers'](https://www.sbert.net/) combine with Augmented SBERT on dataset [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train)
-
+ [Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) combined with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset, along with pair sampling strategies based on two models: [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large).
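To make the Augmented SBERT step described above concrete, here is a minimal sketch of how a cross-encoder can label extra sentence pairs as "silver" training data for the bi-encoder. The example pair is made up, and it assumes the listed cross-encoder returns a similarity score on the same 0–1 scale used for training:

```python
from sentence_transformers import CrossEncoder, InputExample

# Score unlabeled French sentence pairs with the cross-encoder mentioned above
# (the Augmented SBERT "silver data" step); the pair below is a placeholder.
cross_encoder = CrossEncoder("dangvantuan/CrossEncoder-camembert-large")
unlabeled_pairs = [("Un avion est en train de décoller.",
                    "Un avion décolle de la piste.")]
silver_scores = cross_encoder.predict(unlabeled_pairs)

# The silver examples are then mixed with the gold stsb pairs to fine-tune
# the FlauBERT-based bi-encoder.
silver_samples = [
    InputExample(texts=[s1, s2], label=float(score))
    for (s1, s2), score in zip(unlabeled_pairs, silver_scores)
]
```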
 
  ## Usage
  The model can be used directly (without a language model) as follows:
@@ -35,11 +35,111 @@ The model can be used directly (without a language model) as follows:
  ```python
  from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("Lajavaness/sentence-flaubert-base")
+
  sentences = ["Un avion est en train de décoller.",
               "Un homme joue d'une grande flûte.",
               "Un homme étale du fromage râpé sur une pizza.",
               "Une personne jette un chat au plafond.",
               "Une personne est en train de plier un morceau de papier.",
               ]
+
  embeddings = model.encode(sentences)
- ```
+ ```
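The embeddings can then be compared with cosine similarity; a minimal sketch using `util.cos_sim` from the same package, with a sentence pair taken from the snippet above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Lajavaness/sentence-flaubert-base")
embeddings = model.encode(["Un avion est en train de décoller.",
                           "Un homme joue d'une grande flûte."])

# Cosine similarity between the two sentence embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```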
+
+ ## Evaluation
+ The model can be evaluated as follows on the French test data of stsb.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.readers import InputExample
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
+ from datasets import load_dataset
+
+ model = SentenceTransformer("Lajavaness/sentence-flaubert-base")
+
+ def convert_dataset(dataset):
+     dataset_samples = []
+     for df in dataset:
+         score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
+         inp_example = InputExample(texts=[df['sentence1'],
+                                           df['sentence2']], label=score)
+         dataset_samples.append(inp_example)
+     return dataset_samples
+
+ # Loading the dataset for evaluation
+ df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
+ df_test = load_dataset("stsb_multi_mt", name="fr", split="test")
+
+ # Convert the dataset for evaluation
+
+ # For the dev set:
+ dev_samples = convert_dataset(df_dev)
+ val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
+ val_evaluator(model, output_path="./")
+
+ # For the test set:
+ test_samples = convert_dataset(df_test)
+ test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
+ test_evaluator(model, output_path="./")
+ ```
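As a cross-check on the evaluator output, Pearson and Spearman correlations can also be computed directly from cosine similarities with `scipy.stats`; a sketch reusing `model` and `df_test` from the snippet above:

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import util

# Gold scores, normalized to 0 ... 1 as in convert_dataset above
gold = [float(row['similarity_score']) / 5.0 for row in df_test]

# Cosine similarity for each sentence pair of the test split
emb1 = model.encode([row['sentence1'] for row in df_test])
emb2 = model.encode([row['sentence2'] for row in df_test])
predicted = [util.cos_sim(a, b).item() for a, b in zip(emb1, emb2)]

print("Pearson:", pearsonr(gold, predicted)[0])
print("Spearman:", spearmanr(gold, predicted)[0])
```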
+
+ **Test Result**:
+ The performance is measured using Pearson and Spearman correlation on the French STS benchmark:
+ - On dev
+
+
+ | Model | Pearson correlation | Spearman correlation | #params |
+ | ------------- | ------------- | ------------- | ------------- |
+ | [Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 87.14 | 87.10 | 137M |
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 86.88 | 86.73 | 110M |
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 86.73 | 86.54 | 110M |
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 85.85 | 85.71 | 137M |
+ | [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16 | 135M |
+
+
+ - On test: Pearson and Spearman correlations are evaluated on several different benchmark datasets:
+
+ **Pearson score**
+ | Model | [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) | [STS12-fr](https://huggingface.co/datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | #params |
+ |-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
+ | [Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 85.5 | 86.64 | 87.24 | 85.68 | 88.00 | 75.78 | 82.84 | 137M |
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 83.46 | 84.49 | 84.61 | 83.94 | 86.94 | 75.20 | 82.86 | 110M |
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 82.82 | 84.79 | 85.76 | 82.81 | 85.38 | 74.05 | 82.23 | 137M |
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 82.06 | 84.08 | 81.51 | 85.54 | 73.97 | 80.91 | 110M |
+ | [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 78.63 | 72.51 | 67.25 | 70.12 | 79.93 | 66.67 | 77.76 | 135M |
+ | [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 78.38 | 79.00 | 77.61 | 76.56 | 79.03 | 71.22 | 80.58 | 137M |
+ | [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 76.97 | 71.43 | 73.50 | 70.56 | 78.44 | 71.23 | 77.62 | 110M |
+
+
+ **Spearman score**
+ | Model | [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) | [STS12-fr](https://huggingface.co/datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | #params |
+ |-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
+ | [Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 85.67 | 80.00 | 86.91 | 84.59 | 88.10 | 77.84 | 77.55 | 137M |
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 83.07 | 77.34 | 85.88 | 80.96 | 85.70 | 76.43 | 77.00 | 137M |
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 82.92 | 77.71 | 84.19 | 81.83 | 87.04 | 76.81 | 76.36 | 110M |
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 81.64 | 75.45 | 83.86 | 78.63 | 85.66 | 75.36 | 74.18 | 110M |
+ | [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 77.49 | 69.80 | 68.85 | 68.17 | 80.27 | 70.04 | 72.49 | 135M |
+ | [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 76.93 | 68.96 | 77.62 | 71.87 | 79.33 | 72.86 | 73.91 | 137M |
+ | [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 75.55 | 66.89 | 73.90 | 67.14 | 78.78 | 72.64 | 72.03 | 110M |
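The STS-B column of the tables above can be re-checked by running the evaluator from the Evaluation section over several of the listed checkpoints; a sketch that reuses `convert_dataset` and `df_test` from that section and only covers the French STS-B test split:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Model IDs taken from the tables above
model_ids = [
    "Lajavaness/sentence-flaubert-base",
    "Lajavaness/sentence-camembert-base",
    "dangvantuan/sentence-camembert-base",
]

test_samples = convert_dataset(df_test)  # defined in the Evaluation section
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')

for model_id in model_ids:
    model = SentenceTransformer(model_id)
    # Depending on the sentence-transformers version, this returns a single
    # correlation score or a dict of metrics.
    print(model_id, evaluator(model))
```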
+
+
+ ## Citation
+
+ @article{reimers2019sentence,
+   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
+   author={Reimers, Nils and Gurevych, Iryna},
+   journal={arXiv preprint arXiv:1908.10084},
+   year={2019}
+ }
+
+ @article{martin2020camembert,
+   title={CamemBERT: a Tasty French Language Model},
+   author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+   journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+   year={2020}
+ }
+
+ @article{thakur2020augmented,
+   title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
+   author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
+   journal={arXiv e-prints},
+   pages={arXiv--2010},
+   year={2020}