---
license: cc-by-4.0
---
# SHerbert large - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Training was based on the original paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084), with a slight modification of how the training data was used. The goal of the model is to generate different embeddings based on the semantic and topic similarity of the given text.

> Semantic textual similarity analyzes how similar two pieces of text are.

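To make the siamese setup above concrete, here is a minimal, hypothetical sketch (not the training code actually used for this model): the same encoder embeds both sentences of a pair, and a cosine-based objective pulls semantically related pairs together and pushes unrelated ones apart.

```python
import torch
import torch.nn as nn

# Toy shared (siamese) encoder; in SentenceBERT this role is played by the BERT/HerBERT network.
encoder = nn.Sequential(nn.Linear(768, 256), nn.Tanh())

emb_a = encoder(torch.randn(4, 768))        # embeddings of the first sentence in each pair
emb_b = encoder(torch.randn(4, 768))        # embeddings of the second sentence, same weights
labels = torch.tensor([1., 1., -1., -1.])   # 1 = semantically similar pair, -1 = dissimilar

# Cosine-embedding loss raises cosine similarity for positive pairs and lowers it for negatives.
loss = nn.CosineEmbeddingLoss()(emb_a, emb_b, labels)
loss.backward()
print(loss.item())
```

At inference time only the shared encoder is kept, and similarity is computed directly between the pooled sentence embeddings, as shown in the Usage section below.
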
Read more about how the model was prepared in our [blog post](https://voicelab.ai/blog/).

The base trained model is Polish HerBERT. HerBERT is a BERT-based language model. For more details, please refer to: "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

# Corpus
The model was trained solely on [Wikipedia](https://dumps.wikimedia.org/).

# Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.

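As a quick illustration (a sketch, not part of the original card), the fast tokenizer can be obtained through AutoTokenizer, and the `tokenize` call shows the CharBPE subword pieces it produces:

```python
from transformers import AutoTokenizer

# use_fast=True should yield HerbertTokenizerFast for this checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sherbert-large-cased", use_fast=True)

print(type(tokenizer).__name__)                 # expected: HerbertTokenizerFast
print(tokenizer.tokenize("Uczenie maszynowe"))  # subword pieces from the 50k CharBPE vocabulary
```
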
# Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("Voicelab/sherbert-large-cased")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sherbert-large-cased")

# Two related sentences about machine learning (s0, s1) and one off-topic sentence
# about Kasparov and Deep Blue (s2).
s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest skutkiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue."

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

# Pooled sentence embeddings, shape (3, hidden_size)
with torch.no_grad():
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output

# similarity between sentences s0 and s1
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # Result: 0.7952354

# similarity between sentences s0 and s2
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # Result: 0.42359722
```
# Results

| Model                    | Accuracy   | Source                                                    |
|--------------------------|------------|-----------------------------------------------------------|
| SBERT-WikiSec-base (EN)  | 80.42%     | https://arxiv.org/abs/1908.10084                          |
| SBERT-WikiSec-large (EN) | 80.78%     | https://arxiv.org/abs/1908.10084                          |
| SHerbert-base (PL)       | 82.31%     | https://huggingface.co/Voicelab/sherbert-base-cased       |
| **SHerbert-large (PL)**  | **84.42%** | **https://huggingface.co/Voicelab/sherbert-large-cased**  |

# License

CC BY 4.0

# Citation

If you use this model, please cite the following paper:


# Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us [here](https://voicelab.ai/contact/).