---
license: cc-by-4.0
---
# SHerbert large - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Training was based on the original paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084), with a slight modification of how the training data was used. The goal of the model is to generate different embeddings based on the semantic and topic similarity of the given text.

> Semantic textual similarity analyzes how similar two pieces of text are.

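To make the siamese setup above concrete, here is a minimal, hypothetical sketch (not the training code actually used for this model): the same encoder embeds both sentences of a pair, and a cosine-based objective pulls semantically related pairs together and pushes unrelated ones apart.

```python
import torch
import torch.nn as nn

# Toy shared (siamese) encoder; in SentenceBERT this role is played by the BERT/HerBERT network.
encoder = nn.Sequential(nn.Linear(768, 256), nn.Tanh())

emb_a = encoder(torch.randn(4, 768))        # embeddings of the first sentence in each pair
emb_b = encoder(torch.randn(4, 768))        # embeddings of the second sentence, same weights
labels = torch.tensor([1., 1., -1., -1.])   # 1 = semantically similar pair, -1 = dissimilar

# Cosine-embedding loss raises cosine similarity for positive pairs and lowers it for negatives.
loss = nn.CosineEmbeddingLoss()(emb_a, emb_b, labels)
loss.backward()
print(loss.item())
```

At inference time only the shared encoder is kept, and similarity is computed directly between the pooled sentence embeddings, as shown in the Usage section below.
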
Read more about how the model was prepared in our [blog post](https://voicelab.ai/blog/).

The base trained model is Polish HerBERT. HerBERT is a BERT-based language model. For more details, please refer to: "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

# Corpus
The model was trained solely on [Wikipedia](https://dumps.wikimedia.org/).

# Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.

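As a quick illustration (a sketch, not part of the original card), the fast tokenizer can be obtained through AutoTokenizer, and the `tokenize` call shows the CharBPE subword pieces it produces:

```python
from transformers import AutoTokenizer

# use_fast=True should yield HerbertTokenizerFast for this checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sherbert-large-cased", use_fast=True)

print(type(tokenizer).__name__)                 # expected: HerbertTokenizerFast
print(tokenizer.tokenize("Uczenie maszynowe"))  # subword pieces from the 50k CharBPE vocabulary
```
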
# Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("Voicelab/sherbert-large-cased")
tokenizer = AutoTokenizer.from_pretrained("Voicelab/sherbert-large-cased")

# Two related sentences about machine learning (s0, s1) and one off-topic sentence
# about Kasparov and Deep Blue (s2).
s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest skutkiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue."

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

# Pooled sentence embeddings, shape (3, hidden_size)
with torch.no_grad():
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output

# similarity between sentences s0 and s1
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # Result: 0.7952354

# similarity between sentences s0 and s2
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # Result: 0.42359722
```
# Results

| Model                    | Accuracy   | Source                                                    |
|--------------------------|------------|-----------------------------------------------------------|
| SBERT-WikiSec-base (EN)  | 80.42%     | https://arxiv.org/abs/1908.10084                          |
| SBERT-WikiSec-large (EN) | 80.78%     | https://arxiv.org/abs/1908.10084                          |
| SHerbert-base (PL)       | 82.31%     | https://huggingface.co/Voicelab/sherbert-base-cased       |
| **SHerbert-large (PL)**  | **84.42%** | **https://huggingface.co/Voicelab/sherbert-large-cased**  |

# License

CC BY 4.0

# Citation

If you use this model, please cite the following paper:


# Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us [here](https://voicelab.ai/contact/).