Van Tuan DANG committed on
Commit
0ec4340
1 Parent(s): 81be38b

add description

Files changed (1)
  1. README.md +108 -1
README.md CHANGED
@@ -1 +1,108 @@
1
- Pre-trained sentence embedding models are the state-of-the-art of Sentence Embeddings for French:
1
+ ---
2
+ language: fr
3
+ datasets:
4
+ - stsb_multi_mt
5
+ tags:
6
+ - Text
7
+ - Text Similarity
8
+ - Sentence-Embedding
9
+ - camembert-large
10
+ license: apache-2.0
11
+ model-index:
12
+ - name: sentence-camembert-large by Van Tuan DANG
13
+   results:
14
+   - task:
15
+       name: Sentence-Embedding
16
+       type: Text Similarity
17
+     dataset:
18
+       name: Text Similarity fr
19
+       type: stsb_multi_mt
20
+       args: fr
21
+     metrics:
22
+     - name: Test Pearson correlation coefficient
23
+       type: Pearson_correlation_coefficient
24
+       value: xx.xx
25
+ ---
26
+
27
+ This pre-trained sentence embedding model targets state-of-the-art sentence embeddings for French.
28
+ The model is fine-tuned from the pre-trained [facebook/camembert-large](https://huggingface.co/camembert/camembert-large)
29
+ using [Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) and the [stsb](https://huggingface.co/datasets/stsb_multi_mt) dataset.
30
+
31
+
32
+ ## Usage
33
+ The model can be used directly (without a language model) as follows:
34
+
35
+ ```python
36
+ from sentence_transformers import SentenceTransformer
37
+ model = SentenceTransformer("dangvantuan/sentence-camembert-large")
38
+
39
+ sentences = ["Un avion est en train de décoller.",
40
+ "Un homme joue d'une grande flûte.",
41
+ "Un homme étale du fromage râpé sur une pizza.",
42
+ "Une personne jette un chat au plafond.",
43
+ "Une personne est en train de plier un morceau de papier.",
44
+ ]
45
+
46
+ embeddings = model.encode(sentences)
47
+ ```
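
The embeddings returned by `model.encode` are typically compared with cosine similarity. The block below is a minimal sketch of that comparison in plain Python. The three-dimensional `toy_embeddings` are hypothetical stand-ins for real model output (the model's actual vectors are much longer), and `cosine_similarity` is an illustrative helper, not part of the library.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for rows of `embeddings` (assumption: real vectors
# come from model.encode and have many more dimensions).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]

# Pairwise similarity matrix: similarity[i][j] compares sentence i with sentence j.
similarity = [
    [cosine_similarity(u, v) for v in toy_embeddings]
    for u in toy_embeddings
]
```

With real embeddings, each row of `embeddings` would take the place of a toy vector; a higher value means the two sentences are closer in meaning according to the model.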
48
+
49
+ ## Evaluation
50
+ The model can be evaluated as follows on the French test data of stsb.
51
+
52
+ ```python
53
+ from sentence_transformers import SentenceTransformer
54
+ from sentence_transformers.readers import InputExample
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
55
+ from datasets import load_dataset
56
+ def convert_dataset(dataset):
57
+     dataset_samples = []
58
+     for df in dataset:
59
+         score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
60
+         inp_example = InputExample(texts=[df['sentence1'],
61
+                                           df['sentence2']], label=score)
62
+         dataset_samples.append(inp_example)
63
+     return dataset_samples
64
+
65
+ # Load the model and the dataset for evaluation
+ model = SentenceTransformer("dangvantuan/sentence-camembert-large")
66
+ df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
67
+ df_test = load_dataset("stsb_multi_mt", name="fr", split="test")
68
+
69
+ # Convert the dataset for evaluation
70
+ dev_samples = convert_dataset(df_dev)
71
+ val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
72
+ val_evaluator(model, output_path="./")
73
+
74
+ test_samples = convert_dataset(df_test)
75
+ test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
76
+ test_evaluator(model, output_path="./")
77
+ ```
78
+
79
+ **Test Result**:
80
+ The performance is measured using Pearson and Spearman correlation:
81
+ - On dev
82
+ | Model | Pearson correlation | Spearman correlation |
83
+ | ------------- | ------------- | ------------- |
84
+ | [dangvantuan/sentence-camembert-large](https://huggingface.co/camembert/camembert-large)| 88.2 |88.02 |
85
+ | [distiluse-base-multilingual-cased-v1](https://www.sbert.net/examples/training/multilingual/README.html) | 81.15 | 81.15|
86
+ - On test
87
+ | Model | Pearson correlation | Spearman correlation |
88
+ | ------------- | ------------- | ------------- |
89
+ | [dangvantuan/sentence-camembert-large](https://huggingface.co/camembert/camembert-large)| 85.9 | 85.8|
90
+ | [distiluse-base-multilingual-cased-v1](https://www.sbert.net/examples/training/multilingual/README.html) | 79.16 | 77.73|
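
To make the two metrics in the tables above concrete, here is a minimal pure-Python sketch of Pearson and Spearman correlation between gold similarity labels and predicted similarity scores. The `gold` and `predicted` lists are made-up illustrative values, not results from this model, and the helper names are hypothetical. The table values appear to be correlations scaled by 100, which is an assumption about the reporting convention.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only (assumption): gold labels normalized to 0..1
# and cosine similarities predicted for the same sentence pairs.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.1, 0.3, 0.4, 0.9, 0.95]

pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
```

Pearson rewards a linear relationship between predicted and gold scores, while Spearman only looks at whether the pairs are ordered the same way, so a model can reach a perfect Spearman score with an imperfect Pearson score, as in this toy example.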
91
+
92
+
93
+ ## Citation
94
+
95
+
96
+ @article{reimers2019sentence,
97
+ title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
98
+ author={Reimers, Nils and Gurevych, Iryna},
99
+ journal={https://arxiv.org/abs/1908.10084},
100
+ year={2019}
101
+ }
102
+
103
+ @inproceedings{martin2020camembert,
104
+ title={CamemBERT: a Tasty French Language Model},
105
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
106
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
107
+ year={2020}
108
+ }