driwnet committed on
Commit
51266d8
1 Parent(s): 3e1ade0

Upload README.md

Files changed (1)
  1. README.md +48 -0
README.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ language: ca
+ datasets:
+ - stsb_multi_mt
+ tags:
+ - sentence-similarity
+ - sentence-transformers
+ ---
+ # distilbert-base-uncased trained for Semantic Textual Similarity in Catalan
+
+ This is a test model fine-tuned on the Catalan translation of the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt) in order to understand and benchmark STS models.
+
+ ## Model and training data description
+
+ This model was built by taking `distilbert-base-uncased` and training it on a Semantic Textual Similarity (STS) task with a modified version of the STS training script from Sentence Transformers (the modified script is included in the repo). It was trained on the Catalan translation of the Spanish datasets from [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt), which are the STSBenchmark datasets automatically translated into other languages using deepl.com and salt.gva.es. Refer to the dataset repository for more details.
+
+ ## Intended uses & limitations
+
+ This model was built only as a proof of concept for STS fine-tuning on Catalan data. It has no specific use other than getting a sense of how this training works.
+
+ ## How to use
+
+ You may use it like any other STS-trained model to extract sentence embeddings. Check the Sentence Transformers documentation.
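
As a minimal sketch of that usage, the snippet below embeds two sentences and compares them with cosine similarity. It assumes the model is loaded through `SentenceTransformer` from the `sentence-transformers` package. The model id is a placeholder, not this repository's actual name, and the helper names `cosine_similarity`, `score_pair` and `main` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / (norm_a * norm_b)


def score_pair(model, sentence1, sentence2):
    """Embed both sentences with `model.encode` and return their cosine similarity."""
    emb1, emb2 = model.encode([sentence1, sentence2])
    return cosine_similarity(emb1, emb2)


def main():
    # Imported lazily so the helpers above work without the package installed.
    # Requires: pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

    # Placeholder id (assumption): replace with this repository's model id.
    model = SentenceTransformer("your-username/your-model-id")
    print(score_pair(model, "El gat dorm al sofà.", "Un gat està dormint al sofà."))
```

Calling `main()` downloads the model and prints one similarity score. Higher values indicate sentences the model considers closer in meaning.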
+
+ ## Training procedure
+
+ Use the included script to train the base model in Catalan. You can also try training another model by passing its reference as the first argument, or train in another of the languages included in the training dataset.
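
The included script is not reproduced here, so the following is only a sketch of one data-preparation step it is likely to need. It assumes the script follows the stock Sentence Transformers STS example, which scales the 0 to 5 gold scores to the 0 to 1 range before training, and that rows use the `sentence1`, `sentence2` and `similarity_score` fields of stsb_multi_mt. The function name `to_training_pairs` is hypothetical.

```python
def to_training_pairs(rows, max_score=5.0):
    """Turn STS rows into (sentence1, sentence2, label) tuples with labels in [0, 1]."""
    pairs = []
    for row in rows:
        # STSBenchmark gold scores range from 0 to 5; dividing by 5 normalizes them.
        label = float(row["similarity_score"]) / max_score
        pairs.append((row["sentence1"], row["sentence2"], label))
    return pairs
```

Each tuple can then be wrapped in whatever example type the training script expects.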
+
+ ## Evaluation results
+
+ Evaluating `distilbert-base-uncased` on the Catalan test dataset before training results in:
+
+ ```
+ Cosine-Similarity : Pearson: 0.3180 Spearman: 0.4014
+ ```
+
+ The version fine-tuned with the training script's defaults on the Catalan training dataset results in:
+
+ ```
+ Cosine-Similarity : Pearson: 0.7368 Spearman: 0.7288
+ ```
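
For readers unfamiliar with these metrics, the sketch below shows how Pearson and Spearman correlations can be computed between predicted cosine similarities and gold scores. It is a plain-Python illustration of the two statistics, not the evaluator used to produce the figures above, and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Pearson measures how linearly the predicted similarities track the gold scores, while Spearman only checks that both orderings agree, which is why the two numbers can differ.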
+
+ ## Resources
+
+ - Training dataset [stsb_multi_mt](https://huggingface.co/datasets/stsb_multi_mt)
+ - Sentence Transformers [Semantic Textual Similarity](https://www.sbert.net/examples/training/sts/README.html)
+ - Check [sts_eval](https://github.com/eduardofv/sts_eval) for a comparison with TensorFlow and Sentence-Transformers models
+ - Check the [development environment to run the scripts and evaluation](https://github.com/eduardofv/ai-denv)