---
pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- Text
- Sentence Similarity
- Sentence-Embedding
- camembert-base
license: apache-2.0
model-index:
- name: sentence-flaubert-base by Van Tuan DANG
  results:
  - task: 
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: Text Similarity fr
      type: stsb_multi_mt
      args: fr
    metrics:
       - name: Test Pearson correlation coefficient
         type: Pearson_correlation_coefficient
         value:  87.14
---

## State-of-the-art pre-trained sentence embedding model for French
This model is fine-tuned from the pre-trained [flaubert/flaubert_base_uncased](https://huggingface.co/flaubert/flaubert_base_uncased) using
[Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) combined with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset, with pair sampling strategies that rely on two models: [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large).
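
The Augmented SBERT idea referenced above can be summarised as: sample unlabeled sentence pairs, score them with a cross-encoder to obtain "silver" labels, then fine-tune the bi-encoder on the gold and silver pairs together. The following is a minimal sketch of the labelling step only, not the training script used for this model. `toy_cross_encoder_score` is a hypothetical stand-in (token overlap) for a real cross-encoder, the sentences and the gold score are illustrative, and random sampling is only one possible pair sampling strategy.

```python
import itertools
import random


def toy_cross_encoder_score(a: str, b: str) -> float:
    """Hypothetical stand-in for a cross-encoder: Jaccard token overlap in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def build_silver_pairs(sentences, n_pairs, seed=0):
    """Randomly sample unlabeled sentence pairs and label them with the scorer."""
    rng = random.Random(seed)
    candidates = list(itertools.combinations(sentences, 2))
    sampled = rng.sample(candidates, min(n_pairs, len(candidates)))
    return [(a, b, toy_cross_encoder_score(a, b)) for a, b in sampled]


# Illustrative gold pair with a made-up score (not taken from stsb).
gold_pairs = [("Un homme joue de la flûte.", "Un homme joue d'une grande flûte.", 0.8)]

unlabeled = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Une personne jette un chat au plafond.",
    "Un homme joue de la guitare.",
]

silver_pairs = build_silver_pairs(unlabeled, n_pairs=3)

# The bi-encoder would then be fine-tuned on the union of gold and silver pairs.
train_pairs = gold_pairs + silver_pairs
```

In a real setup the stand-in scorer would be replaced by a trained cross-encoder, and the sampled pairs would come from a larger unlabeled corpus.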

## Usage
The model can be used directly with the `sentence-transformers` library, without loading a separate language model:

```python
from sentence_transformers import SentenceTransformer
model =  SentenceTransformer("Lajavaness/sentence-flaubert-base")

sentences = ["Un avion est en train de décoller.",
          "Un homme joue d'une grande flûte.",
          "Un homme étale du fromage râpé sur une pizza.",
          "Une personne jette un chat au plafond.",
          "Une personne est en train de plier un morceau de papier.",
          ]

embeddings = model.encode(sentences)
```
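
Sentence similarity is typically obtained by comparing the resulting vectors with cosine similarity. The sketch below shows that computation with NumPy on small placeholder vectors so that it runs without downloading the model; `cosine_similarity_matrix` and `toy_embeddings` are illustrative names, and in practice the array returned by `model.encode(sentences)` would be passed instead.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity matrix of the row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors.
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Placeholder vectors standing in for real sentence embeddings.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 0.0],
])

similarities = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarities, 3))
```

Entry `[i, j]` of the result is the similarity between sentence `i` and sentence `j`; the diagonal is 1 because each vector is compared with itself.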

## Evaluation
The model can be evaluated on the French dev and test splits of stsb as follows:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("Lajavaness/sentence-flaubert-base")

def convert_dataset(dataset):
    """Convert stsb rows into InputExample objects with labels scaled to [0, 1]."""
    dataset_samples = []
    for row in dataset:
        score = float(row['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Load the French dev and test splits for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert each split and run the evaluator

# For the dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For the test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```
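
For intuition, the two metrics reported below compare the similarity scores predicted by the model with the gold labels. The following is a minimal sketch of Pearson and Spearman correlation with NumPy, using made-up scores; it is not the evaluator's own implementation, and the `ranks` helper assumes there are no tied values. The tables appear to report these correlations multiplied by 100.

```python
import numpy as np


def pearson(x, y) -> float:
    """Pearson correlation: linear agreement between two score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def ranks(values) -> np.ndarray:
    """Rank positions 0..n-1 of the values (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values only: gold labels in [0, 1] and predicted cosine scores.
gold = [0.0, 0.2, 0.5, 0.9, 1.0]
predicted = [0.1, 0.3, 0.35, 0.7, 0.95]

pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
print(f"Pearson: {pearson_score:.4f}, Spearman: {spearman_score:.4f}")
```

Spearman only depends on the ordering of the scores, so a model that ranks pairs correctly but is poorly calibrated can still score highly on it while scoring lower on Pearson.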

**Test Result**:
Performance is measured using Pearson and Spearman correlation on the STS benchmark:
- On dev:


| Model  | Pearson correlation | Spearman correlation  |  #params  |
| ------------- | ------------- | ------------- |------------- |
| [Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base)| **87.14** |**87.10** | 137M |
| [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base)| 86.88 |86.73 | 110M |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base)| 86.73 |86.54 | 110M |
| [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)| 85.85 |85.71 | 137M |
| [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16|135M |


- On test: Pearson and Spearman correlation are evaluated on several benchmark datasets:

**Pearson score**

| Model                                                      | [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train)   | [STS12-fr ](https://huggingface.co/datasets/Lajavaness/STS12-fr)| [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | params |
|-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
| [Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base)                        | **85.5** | **86.64**    | **87.24**    | **85.68**   | **88.00**    | **75.78**    | 82.84   | 137M   |
| [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base)                        | 83.46 | 84.49    | 84.61    | 83.94    | 86.94    | 75.20    | **82.86**   | 110M   |
| [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)                    | 82.82 | 84.79    | 85.76    | 82.81    | 85.38    | 74.05    | 82.23   | 137M   |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base)                       | 82.36 | 82.06    | 84.08    | 81.51    | 85.54    | 73.97    | 80.91   | 110M   |
| [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)| 78.63 | 72.51    | 67.25    | 70.12    | 79.93    | 66.67    | 77.76   | 135M   |
| [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts)                  | 78.38 | 79.00    | 77.61    | 76.56    | 79.03    | 71.22    | 80.58   | 137M   |
| [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)            | 76.97 | 71.43    | 73.50    | 70.56    | 78.44    | 71.23    | 77.62   | 110M   |


**Spearman score**

| Model                                                      | [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train)   | [STS12-fr ](https://huggingface.co/datasets/Lajavaness/STS12-fr)| [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | params |
|-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
| [Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base)                         | **85.67** | **80.00**    | **86.91**    | **84.59**    | **88.10**    | **77.84**    | **77.55**   | 137M   |
| [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)                     | 83.07 | 77.34    | 85.88    | 80.96    | 85.70    | 76.43    | 77.00   | 137M   |
| [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base)                         | 82.92 | 77.71    | 84.19    | 81.83    | 87.04    | 76.81    | 76.36   | 110M   |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base)                        | 81.64 | 75.45    | 83.86    | 78.63    | 85.66    | 75.36    | 74.18   | 110M   |
| [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 77.49 | 69.80    | 68.85    | 68.17    | 80.27    | 70.04    | 72.49   | 135M   |
| [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts)                   | 76.93 | 68.96    | 77.62    | 71.87    | 79.33    | 72.86    | 73.91   | 137M   |
| [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)             | 75.55 | 66.89    | 73.90    | 67.14    | 78.78    | 72.64    | 72.03   | 110M   |


## Citation


	@article{reimers2019sentence,
	   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
	   author={Reimers, Nils and Gurevych, Iryna},
	   journal={arXiv preprint arXiv:1908.10084},
	   year={2019}
	}


	@article{martin2020camembert,
	   title={CamemBERT: a Tasty French Language Model},
	   author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
	   journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
	   year={2020}
	}


	@article{thakur2020augmented,
	   title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
	   author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
	   journal={arXiv e-prints},
	   pages={arXiv--2010},
	   year={2020}
	}