aiana94 committed
Commit 0febe2a
1 Parent(s): cc119d1

Update README.md

Files changed (1): README.md +159 -1
README.md CHANGED
@@ -76,6 +76,59 @@ language:
  - yo
  - zh
  - zu
+ - af
+ - as
+ - az
+ - be
+ - bo
+ - ceb
+ - co
+ - cy
+ - eo
+ - eu
+ - fy
+ - ga
+ - gd
+ - gl
+ - haw
+ - hmn
+ - hr
+ - ht
+ - hy
+ - is
+ - jv
+ - ka
+ - kn
+ - ku
+ - ky
+ - la
+ - lb
+ - lo
+ - mi
+ - mn
+ - ml
+ - mr
+ - ms
+ - mt
+ - ny
+ - or
+ - rw
+ - si
+ - sk
+ - sl
+ - sm
+ - st
+ - su
+ - te
+ - tg
+ - th
+ - tk
+ - tl
+ - tt
+ - ug
+ - uz
+ - vi
+ - yi
  pipeline_tag: sentence-similarity
  tags:
  - bert
@@ -83,4 +136,109 @@ tags:
  - sentence-embedding
  - sentence-similarity
  - multilingual
- ---
+ ---
+ # NaSE (News-adapted Sentence Encoder)
+
+ This model is a news-adapted sentence encoder, domain-specialized starting from the pretrained massively multilingual sentence encoder [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true).
+
+ ## Model Details
+
+ ### Model Description
+
+ NaSE is a domain-adapted multilingual sentence encoder, initialized from [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true).
+ It was specialized to the news domain using two multilingual corpora, namely [PolyNews](https://huggingface.co/datasets/aiana94/polynews) and [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).
+ More specifically, NaSE was pretrained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.
+
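+ As a rough illustration of the denoising objective (the noise function below is a hypothetical toy example, not the actual noising scheme; see the training code linked under Technical Specifications), the model is trained to reconstruct a sentence from a corrupted version of it:
+
+ ```python
+ import torch
+
+ # Toy corruption for illustration only: randomly delete tokens, so that the
+ # model must reconstruct the original sentence from the corrupted input.
+ def corrupt(token_ids, drop_prob=0.15):
+     keep = torch.rand(len(token_ids)) >= drop_prob
+     return [t for t, k in zip(token_ids, keep.tolist()) if k]
+ ```
+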
+ ## Usage (HuggingFace Transformers)
+
+ Here is how to use this model to get the sentence embeddings of a given text in PyTorch:
+
+ ```python
+ import torch
+ from transformers import BertModel, BertTokenizerFast
+
+ tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
+ model = BertModel.from_pretrained('aiana94/NaSE')
+
+ # prepare input
+ sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # forward pass
+ with torch.no_grad():
+     output = model(**encoded_input)
+
+ # to get the sentence embeddings, use the pooler output
+ sentence_embeddings = output.pooler_output
+ ```
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import TFBertModel, BertTokenizerFast
+
+ tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
+ model = TFBertModel.from_pretrained('aiana94/NaSE')
+
+ # prepare input
+ sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='tf')
+
+ # forward pass (no gradient context is needed for inference in TensorFlow)
+ output = model(encoded_input)
+
+ # to get the sentence embeddings, use the pooler output
+ sentence_embeddings = output.pooler_output
+ ```
+
+ To compute the similarity between sentences, it is recommended to L2-normalize the embeddings first:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def similarity(embeddings_1, embeddings_2):
+     # L2-normalize the embeddings, then compute pairwise cosine similarities
+     normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
+     normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
+     return torch.matmul(normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1))
+ ```
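+
+ For example, applying this function to the embeddings computed in the snippets above (assuming `sentence_embeddings` holds the two example sentences):
+
+ ```python
+ # pairwise cosine similarities between the two example sentences
+ scores = similarity(sentence_embeddings, sentence_embeddings)
+ print(scores)  # a (2, 2) matrix with ones on the diagonal
+ ```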
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+
+ ## Technical Specifications
+
+ The model was pretrained on a single 40GB NVIDIA A100 GPU for a total of 100k steps. See the [training code](https://github.com/andreeaiana/nase) for all hyperparameters.
+
+
+ ## Citation [optional]
+
+ **BibTeX:**
+
+ [More Information Needed]