evilfreelancer committed
Commit a8fd3f6
1 parent: bc29382

Update README.md

Files changed (1): README.md (+86 -26)

README.md CHANGED

---
pipeline_tag: feature-extraction
tags:
- pytorch
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
language:
- ru
- en
datasets:
- evilfreelancer/opus-php-en-ru-cleaned
- evilfreelancer/golang-en-ru
- Helsinki-NLP/opus_books
---

# Enbeddrus v0.2 - English and Russian embedder

> This model was trained on parallel corpora of Russian and English texts.

This is a BERT (uncased) [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

- **Parameters**: 168 million
- **Layers**: 12
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 119,547
- **Maximum Sequence Length**: 512 tokens

The Enbeddrus model is designed to extract similar embeddings for comparable English and Russian phrases. It is based on the [bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-cased) model and was trained over 20 epochs on the following datasets:

- [evilfreelancer/opus-php-en-ru-cleaned](https://huggingface.co/datasets/evilfreelancer/opus-php-en-ru-cleaned) (train): 1.6k lines
- [evilfreelancer/golang-en-ru](https://huggingface.co/datasets/evilfreelancer/golang-en-ru) (train): 554 lines
- [Helsinki-NLP/opus_books](https://huggingface.co/datasets/Helsinki-NLP/opus_books/viewer/en-ru) (en-ru, train): 17.5k lines

The goal of this model is to generate identical or very similar embeddings regardless of whether the text is written in English or Russian.

The [Enbeddrus GGUF](https://ollama.com/evilfreelancer/enbeddrus) version is available via Ollama.
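
The GGUF build can be queried through Ollama's embeddings endpoint. Below is a minimal sketch; it assumes a local Ollama server on the default port and that the model was pulled under the tag shown on the Ollama page above:

```python
import requests

# Sketch: request an embedding from a local Ollama server.
# The model tag is an assumption based on the Ollama page linked above.
response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "evilfreelancer/enbeddrus",
        "prompt": "PHP is a scripting language widely used for web development.",
    },
)
embedding = response.json()["embedding"]
print(len(embedding))  # expected: 768, matching the hidden size listed above
```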

## Usage (Sentence-Transformers)

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed. Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = [
    "PHP является скриптовым языком программирования, широко используемым для веб-разработки.",
    "PHP is a scripting language widely used for web development.",
    "PHP поддерживает множество баз данных, таких как MySQL, PostgreSQL и SQLite.",
    "PHP supports many databases like MySQL, PostgreSQL, and SQLite.",
    "Функция echo в PHP используется для вывода текста на экран.",
    "The echo function in PHP is used to output text to the screen.",
    "Машинное обучение помогает создавать интеллектуальные системы.",
    "Machine learning helps to create intelligent systems.",
]

model = SentenceTransformer('evilfreelancer/enbeddrus-v0.1')
embeddings = model.encode(sentences)
print(embeddings)
```
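
Since the list above alternates Russian sentences with their English translations, you can check how close each pair's embeddings are. A small follow-up sketch continuing from the snippet above (the even/odd pairing is an assumption about the list layout):

```python
from sentence_transformers import util

# Compare each Russian sentence (even index) with its English counterpart
# (odd index); scores near 1.0 mean near-identical embeddings.
for i in range(0, len(sentences), 2):
    score = util.cos_sim(embeddings[i], embeddings[i + 1]).item()
    print(f"{score:.4f}  {sentences[i + 1]}")
```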

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean Pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = [
    "PHP является скриптовым языком программирования, широко используемым для веб-разработки.",
    "PHP is a scripting language widely used for web development.",
    "PHP поддерживает множество баз данных, таких как MySQL, PostgreSQL и SQLite.",
    "PHP supports many databases like MySQL, PostgreSQL, and SQLite.",
    "Функция echo в PHP используется для вывода текста на экран.",
    "The echo function in PHP is used to output text to the screen.",
    "Машинное обучение помогает создавать интеллектуальные системы.",
    "Machine learning helps to create intelligent systems.",
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('evilfreelancer/enbeddrus-v0.1')
model = AutoModel.from_pretrained('evilfreelancer/enbeddrus-v0.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling (here: mean pooling)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
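
A common follow-up (a sketch, not part of the original card) is to L2-normalize the pooled embeddings so that dot products become cosine similarities:

```python
import torch.nn.functional as F

# Normalize each embedding to unit length, then compare pairs via dot product.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T  # pairwise cosine similarity matrix
print(f"{similarity[0, 1].item():.4f}")  # first RU/EN pair from the list above
```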

## Evaluation Results

The model was tested on the `eval` split of the [evilfreelancer/opus-php-en-ru-cleaned](https://huggingface.co/datasets/evilfreelancer/opus-php-en-ru-cleaned) dataset, which contains 100 pairs of Russian and English sentences on the topic of PHP. The results are presented in the image below.

![Evaluation Results](./eval.png)

* **Left**: embedding similarity between Russian and English sentences before training (the points are spread out into two distinct clusters).
* **Center**: embedding similarity after training (the points representing similar phrases are very close to each other).
* **Right**: cosine distance before and after training.
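
To reproduce a comparison like this, a hypothetical sketch is shown below; the dataset column names ("en", "ru") and the choice of the base multilingual BERT as the "before" checkpoint are assumptions, not details taken from the card:

```python
import torch.nn.functional as F
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

pairs = load_dataset("evilfreelancer/opus-php-en-ru-cleaned", split="eval")

for name in ("google-bert/bert-base-multilingual-uncased",  # before training
             "evilfreelancer/enbeddrus-v0.1"):              # after training
    model = SentenceTransformer(name)
    en = model.encode(pairs["en"], convert_to_tensor=True)
    ru = model.encode(pairs["ru"], convert_to_tensor=True)
    # Cosine distance between each EN/RU pair, averaged over the split
    distance = 1 - F.cosine_similarity(en, ru, dim=1)
    print(f"{name}: mean cosine distance = {distance.mean().item():.4f}")
```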

## Training

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 556 with parameters:

```python
{
    'batch_size': 64,
    'sampler': 'torch.utils.data.sampler.RandomSampler',
    'batch_sampler': 'torch.utils.data.sampler.BatchSampler'
}
```

**Loss**:

`sentence_transformers.losses.MSELoss.MSELoss`

Parameters of the `fit()` method:

```
{
    "epochs": 20,
    ...
}
```
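
In sentence-transformers, `MSELoss` is typically used to make a student model reproduce target embeddings, which on parallel corpora corresponds to the multilingual distillation recipe. The sketch below is a plausible reconstruction of such a setup, not the author's actual training script; the teacher choice, data file, and warmup value are assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Hypothetical setup: distill a teacher's embeddings into a student using
# tab-separated "english<TAB>russian" sentence pairs.
teacher = SentenceTransformer('google-bert/bert-base-multilingual-uncased')
student = SentenceTransformer('google-bert/bert-base-multilingual-uncased')

train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('parallel-en-ru.tsv')  # hypothetical file name

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

# MSELoss pushes student(en) and student(ru) toward the teacher's embedding
# of the English sentence, aligning both languages in one vector space.
student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=20, warmup_steps=100)
```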

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel