koziev ilya committed
Commit 7468ff7
1 Parent(s): 6aab6f6

making readme more human-friendly

Files changed (1): README.md +17 -101
README.md CHANGED
@@ -16,122 +16,39 @@ widget:
 
 # SBERT_PQ
 
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps texts & questions to a 312-dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->
-
- ## Usage (Sentence-Transformers)
-
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
- ```
- pip install -U sentence-transformers
- ```
-
- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["Кошка ловит мышку.", "Чем занята кошка?"]
-
- model = SentenceTransformer('inkoziev/sbert_pq')
- embeddings = model.encode(sentences)
- print(embeddings)
 ```
-
- ## Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-
-
- # Mean pooling - take attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
-
- # Sentences we want sentence embeddings for
- sentences = ['This is an example sentence', 'Each sentence is converted']
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('inkoziev/sbert_pq')
- model = AutoModel.from_pretrained('inkoziev/sbert_pq')
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling. In this case, mean pooling.
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
 ```
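Since the model scores text-question relevance, the two embeddings above can be compared with cosine similarity directly in torch. A small sketch (not part of the original card) using the `sentence_embeddings` just computed:

```python
import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings computed above;
# for a text-question pair, a higher value means the text more likely answers the question.
score = F.cosine_similarity(sentence_embeddings[0:1], sentence_embeddings[1:2]).item()
print(score)
```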
 
-
- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=inkoziev/sbert_pq)
-
- ## Training
- The model was trained with the parameters:
-
- **DataLoader**:
-
- `torch.utils.data.dataloader.DataLoader` of length 2320 with parameters:
- ```
- {'batch_size': 2048, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```
-
- **Loss**:
-
- `sentence_transformers.losses.ContrastiveLoss.ContrastiveLoss` with parameters:
- ```
- {'distance_metric': 'SiameseDistanceMetric.COSINE_DISTANCE', 'margin': 0.5, 'size_average': True}
- ```
-
- Parameters of the fit() method:
- ```
- {
-     "callback": null,
-     "epochs": 10,
-     "evaluation_steps": 200,
-     "evaluator": "sentence_transformers.evaluation.BinaryClassificationEvaluator.BinaryClassificationEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'transformers.optimization.AdamW'>",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 2320,
-     "weight_decay": 1e-05
- }
- ```
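Put together, these parameters correspond roughly to the following sentence-transformers training setup. This is a minimal sketch assuming the library's standard `fit()` API; the text-question training pairs shown are hypothetical placeholders, since the actual training data is not part of this card:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical text-question pairs with binary relevance labels;
# the real training set is not published in this card.
train_examples = [
    InputExample(texts=["Кошка ловит мышку.", "Чем занята кошка?"], label=1),
    InputExample(texts=["Кошка ловит мышку.", "Какого цвета небо?"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2048)

# Base model named in the card; ContrastiveLoss with cosine distance and margin 0.5
model = SentenceTransformer('cointegrated/rubert-tiny2')
train_loss = losses.ContrastiveLoss(
    model=model,
    distance_metric=losses.SiameseDistanceMetric.COSINE_DISTANCE,
    margin=0.5,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    scheduler='WarmupLinear',
    warmup_steps=2320,
    evaluation_steps=200,
    max_grad_norm=1,
    weight_decay=1e-05,
)
```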
 
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 312, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- )
 ```
 
- ## Citing & Authors
 
 ```
 @MISC{rugpt_chitchat,
@@ -141,4 +58,3 @@ SentenceTransformer(
 year = 2022
 }
 ```
-
 
 
 # SBERT_PQ
 
+ This is a [sentence-transformers](https://www.SBERT.net) model designed to estimate the relevance of a short text (mostly a single sentence of up to 10-15 words) to a question.
+
+ For the text and the question, the model computes vectors of dimensionality 312. The cosine of the angle between these vectors scores whether the text contains an answer to the given question. In the [dialogue system project](https://github.com/Koziev/chatbot) it is used for semantic search over a fact database, retrieving records for a question asked by the interlocutor.
+
+ The model is based on [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2). It is very small and runs inference quickly even on a CPU.
 
+ ## Usage with the Sentence-Transformers library
+
+ For convenience, install [sentence-transformers](https://www.SBERT.net):
 
 ```
+ pip install -U sentence-transformers
 ```
 
+ To compute the relevance of a single text-question pair, you can use code like this:
 
 ```
+ import sentence_transformers
+
+ sentences = ["Кошка ловит мышку.", "Чем занята кошка?"]
+
+ model = sentence_transformers.SentenceTransformer('inkoziev/sbert_pq')
+ embeddings = model.encode(sentences)
+
+ s = sentence_transformers.util.cos_sim(a=embeddings[0], b=embeddings[1])
+ print('text={} question={} cossim={}'.format(sentences[0], sentences[1], s))
 ```
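For the fact-base search scenario mentioned above, here is a minimal sketch of ranking several stored facts against a question and picking the most relevant one. The fact list is an illustrative stand-in, not part of the card:

```python
import sentence_transformers

# Hypothetical mini fact base; in the dialogue system project this would be
# the stored premise texts.
facts = ["Кошка ловит мышку.", "Сегодня идет дождь.", "У Пети есть велосипед."]
question = "Чем занята кошка?"

model = sentence_transformers.SentenceTransformer('inkoziev/sbert_pq')
fact_embeddings = model.encode(facts)
question_embedding = model.encode([question])

# Cosine similarity of the question to every fact; higher means more relevant.
sims = sentence_transformers.util.cos_sim(question_embedding, fact_embeddings)[0]
best = int(sims.argmax())
print('question={} best_fact={} cossim={:.3f}'.format(question, facts[best], float(sims[best])))
```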
 
+ ## Contacts and citation
 
 ```
 @MISC{rugpt_chitchat,
 
 year = 2022
 }
 ```