aiana94/NaSE

@@ -76,6 +76,59 @@ language:
 - yo
 - zh
 - zu
 pipeline_tag: sentence-similarity
 tags:
 - bert
@@ -83,4 +136,109 @@ tags:
 - sentence-embedding
 - sentence-similarity
 - multilingual
----

 - yo
 - zh
 - zu
+- af
+- as
+- az
+- be
+- bo
+- ceb
+- co
+- cy
+- eo
+- eu
+- fy
+- ga
+- gd
+- gl
+- haw
+- hmn
+- hr
+- ht
+- hy
+- is
+- jv
+- ka
+- kn
+- ku
+- ky
+- la
+- lb
+- lo
+- mi
+- mn
+- ml
+- mr
+- ms
+- mt
+- ny
+- or
+- rw
+- si
+- sk
+- sl
+- sm
+- st
+- su
+- te
+- tg
+- th
+- tk
+- tl
+- tt
+- ug
+- uz
+- vi
+- yi
 pipeline_tag: sentence-similarity
 tags:
 - bert
 - sentence-embedding
 - sentence-similarity
 - multilingual
+---
+# NaSE (News-adapted Sentence Encoder)
+This model is a news-adapted sentence encoder, domain-specialized starting from the pretrained massively multilingual sentence encoder [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true).
+## Model Details
+### Model Description
+NaSE is a domain-adapted multilingual sentence encoder, initialized from [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true).
+It was specialized to the news domain using two multilingual corpora, namely [PolyNews](https://huggingface.co/datasets/aiana94/polynews) and [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).
+More specifically, NaSE was pretrained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.
+## Usage (HuggingFace Transformers)
+Here is how to use this model to get the sentence embeddings of a given text in PyTorch:
+```python
+    import torch
+    from transformers import BertModel, BertTokenizerFast
+    tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
+    model = BertModel.from_pretrained('aiana94/NaSE')
+    # prepare input (pad so that both sentences fit into one batch tensor)
+    sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
+    encoded_input = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
+    # forward pass without gradient tracking
+    with torch.no_grad():
+        output = model(**encoded_input)
+    # to get the sentence embeddings, use the pooler output
+    sentence_embeddings = output.pooler_output
+```
+and in TensorFlow:
+```python
+    from transformers import TFBertModel, BertTokenizerFast
+    tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
+    model = TFBertModel.from_pretrained('aiana94/NaSE')
+    # prepare input (pad so that both sentences fit into one batch tensor)
+    sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
+    encoded_input = tokenizer(sentences, return_tensors='tf', padding=True, truncation=True)
+    # forward pass (inference only, so no gradient context is needed)
+    output = model(encoded_input)
+    # to get the sentence embeddings, use the pooler output
+    sentence_embeddings = output.pooler_output
+```
+To compute the similarity between sentences, L2-normalizing the embeddings first is recommended:
+```python
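+import torch.nn.functional as F
+def similarity(embeddings_1, embeddings_2):
+  # L2-normalize each row so that the dot product equals cosine similarity
+  normalized_1 = F.normalize(embeddings_1, p=2, dim=1)
+  normalized_2 = F.normalize(embeddings_2, p=2, dim=1)
+  return normalized_1 @ normalized_2.T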
+```
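To make the effect of the L2 normalization concrete without downloading the model, here is a minimal pure-Python sketch. It is only an illustration: the helper names `l2_normalize` and `similarity_matrix` are hypothetical, and the 3-dimensional vectors are made-up stand-ins, not real NaSE embeddings. It shows that once every vector is scaled to unit length, the plain dot product is the cosine similarity, so vectors pointing in the same direction score 1.0 regardless of their magnitude.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(embeddings_1, embeddings_2):
    """Pairwise dot products of L2-normalized rows, i.e. cosine similarities."""
    rows = [l2_normalize(v) for v in embeddings_1]
    cols = [l2_normalize(v) for v in embeddings_2]
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]

# Made-up stand-ins for sentence embeddings (assumption: NOT real model outputs).
english = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
german = [[2.0, 4.0, 4.0], [4.0, -3.0, 0.0]]

# scores[i][j] compares english[i] with german[j]
scores = similarity_matrix(english, german)
for row in scores:
    print([round(s, 4) for s in row])
```

Here `german[0]` is `english[0]` scaled by two, so their score is 1.0 even though the raw dot product would reward the larger magnitude. With real embeddings, the same comparison would be done on the `pooler_output` tensors.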
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+## Technical Specifications
+The model was pretrained on a single 40GB NVIDIA A100 GPU for a total of 100k steps. See the [training code](https://github.com/andreeaiana/nase) for all hyperparameters.
+## Citation [optional]
+**BibTeX:**
+[More Information Needed]