matus committed on
Commit
88729f4
1 Parent(s): d9b0c33

Update README.md

Files changed (1)
  1. README.md +37 -60
README.md CHANGED
@@ -1,19 +1,31 @@
  ---
+ language:
+ - sk
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
- - transformers
+ license: cc
+ datasets:
+ - glue
+ metrics:
+ - spearmanr
+ widget:
+ - text: "Kde tá ľudská duša drieme?"
  ---

- # kinit/slovakbert-sts-stsb

- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+ # Sentence similarity model based on SlovakBERT

- <!--- Describe your model here -->
+ This is a sentence similarity model based on [SlovakBERT](https://huggingface.co/gerulata/slovakbert). The model was fine-tuned on [STSbenchmark](http://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark) [Cer et al., 2017], translated to Slovak using [M2M100](https://huggingface.co/facebook/m2m100_1.2B). It can be used as a universal sentence encoder for Slovak sentences.

- ## Usage (Sentence-Transformers)
+ ## Results
+
+ The model was evaluated in [our paper](https://arxiv.org/abs/2109.15254) [Pikuliak et al., 2021, Section 4.3]. It achieves a Spearman correlation of \\(0.791\\).
+
+
+ ## Usage

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

@@ -33,61 +45,26 @@ print(embeddings)
  ```


+ ## Cite

- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-
-
- #Mean Pooling - Take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
-
- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('kinit/slovakbert-sts-stsb')
- model = AutoModel.from_pretrained('kinit/slovakbert-sts-stsb')
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
- ```
-
-
-
- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=kinit/slovakbert-sts-stsb)
-
-
-
- ## Full Model Architecture
  ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- )
+ @article{DBLP:journals/corr/abs-2109-15254,
+   author = {Mat{\'{u}}s Pikuliak and
+             Stefan Grivalsky and
+             Martin Konopka and
+             Miroslav Blst{\'{a}}k and
+             Martin Tamajka and
+             Viktor Bachrat{\'{y}} and
+             Mari{\'{a}}n Simko and
+             Pavol Bal{\'{a}}zik and
+             Michal Trnka and
+             Filip Uhl{\'{a}}rik},
+   title = {SlovakBERT: Slovak Masked Language Model},
+   journal = {CoRR},
+   volume = {abs/2109.15254},
+   year = {2021},
+   url = {https://arxiv.org/abs/2109.15254},
+   eprinttype = {arXiv},
+   eprint = {2109.15254},
+ }
  ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->
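
The Results text added in this commit reports a Spearman correlation, and the new metadata lists `spearmanr` as the metric. For readers unfamiliar with how such a score is usually obtained on an STS-style benchmark, here is a minimal sketch in plain Python: each sentence pair is scored by the cosine similarity of its two embeddings, and those scores are rank-correlated with the human similarity ratings. The vectors and gold ratings below are made-up toy values, not output of this model, and the helper names (`cosine_similarity`, `average_ranks`, `spearman`) are hypothetical; the paper's actual evaluation code may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy stand-ins for sentence-pair embeddings (assumption: real embeddings
# would come from the model and have far more dimensions).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.8, 0.3]),
]
# Toy human ratings on the 0-5 scale used by STSbenchmark.
gold = [4.8, 2.5, 0.2, 3.9]

predicted = [cosine_similarity(u, v) for u, v in pairs]
score = spearman(predicted, gold)
print(f"Spearman correlation: {score:.3f}")
```

Because Spearman correlation depends only on rank order, the cosine scores do not need to be rescaled to the 0-5 rating range before comparison.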