Update README.md

0dc2d91 about 2 years ago

4.22 kB

	---
	license: cc-by-4.0
	language:
	- pl
	datasets:
	- Wikipedia
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	widget:
	- source_sentence: "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
	sentences:
	- "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
	- "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "
	- "Samica o długości ciała 10–11 mm, szczoteczki na tylnych nogach służące do zbierania pyłku oraz włoski na końcu odwłoka jaskrawo pomarańczowoczerwone. "
	example_title: "Uczenie maszynowe"
	---
	<img src="https://public.3.basecamp.com/p/rs5XqmAuF1iEuW6U7nMHcZeY/upload/download/VL-NLP-short.png" alt="logo voicelab nlp" style="width:300px;"/>

	# SHerbert - Polish SentenceBERT
	SentenceBERT is a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. Training was based on the original paper [Siamese BERT models for the task of semantic textual similarity (STS)](https://arxiv.org/abs/1908.10084) with a slight modification of how the training data was used. The goal of the model is to generate different embeddings based on the semantic and topic similarity of the given text.

	> Semantic textual similarity analyzes how similar two pieces of texts are.

	Read more about how the model was prepared in our [blog post](https://voicelab.ai/blog/).

	The base trained model is a Polish HerBERT. HerBERT is a BERT-based Language Model. For more details, please refer to: "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

	# Corpus
	Te model was trained solely on [Wikipedia](https://dumps.wikimedia.org/).


	# Tokenizer

	As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with a tokenizers library.

	We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.

	# Usage

	```python
	from transformers import AutoTokenizer, AutoModel
	from sklearn.metrics import pairwise

	sbert = AutoModel.from_pretrained("Voicelab/sbert-base-cased-pl")
	tokenizer = AutoTokenizer.from_pretrained("Voicelab/sbert-base-cased-pl")

	s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
	s1 = "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
	s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "


	tokens = tokenizer([s0, s1, s2],
	padding=True,
	truncation=True,
	return_tensors='pt')
	x = sbert(tokens["input_ids"],
	tokens["attention_mask"]).pooler_output

	# similarity between sentences s0 and s1
	print(pairwise.cosine_similarity(x[0], x[1])) # Result: 0.7952354

	# similarity between sentences s0 and s2
	print(pairwise.cosine_similarity(x[0], x[2])) # Result: 0.42359722

	```
	# Results

	\| Model \| Accuracy \| Source \|
	\|--------------------------\|------------\|---------------------------------------------------------\|
	\| SBERT-WikiSec-base (EN) \| 80.42% \| https://arxiv.org/abs/1908.10084 \|
	\| SBERT-WikiSec-large (EN) \| 80.78% \| https://arxiv.org/abs/1908.10084 \|
	\| sbert-base-cased-pl \| 82.31% \| https://huggingface.co/Voicelab/sbert-base-cased-pl \|
	\| sbert-large-cased-pl \| 84.42% \| https://huggingface.co/Voicelab/sbert-large-cased-pl \|

	# License

	CC BY 4.0

	# Citation

	If you use this model, please cite the following paper:


	# Authors

	The model was trained by NLP Research Team at Voicelab.ai.

	You can contact us [here](https://voicelab.ai/contact/).