---
license: apache-2.0
datasets:
- aiana94/polynews-parallel
- aiana94/polynews
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- bo
- bs
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- or
- pa
- pl
- pt
- ro
- ru
- rw
- si
- sk
- sl
- sm
- sn
- so
- sw
- sq
- sr
- st
- sv
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- ug
- uk
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
- ay
- bm
- bbj
- ee
- fon
- guw
- ln
- lg
- luo
- pcm
- rn
- tet
- ti
- tn
- tw
- fil
- mos
- orm
pipeline_tag: sentence-similarity
tags:
- bert
- feature-extraction
- sentence-embedding
- sentence-similarity
- multilingual
---
# NaSE (News-adapted Sentence Encoder)
This model is a news-adapted sentence encoder, domain-specialized starting from the pretrained massively multilingual sentence encoder [LaBSE](https://aclanthology.org/2022.acl-long.62.pdf).
## Model Details
### Model Description
NaSE is a domain-adapted multilingual sentence encoder, initialized from [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true).
It was specialized to the news domain using two multilingual corpora, namely [PolyNews](https://huggingface.co/datasets/aiana94/polynews) and [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).
More specifically, NaSE was pretrained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.
## Usage (HuggingFace Transformers)
Here is how to use this model to get the sentence embeddings of a given text in PyTorch:
```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

# prepare input
sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='pt', padding=True)

# forward pass
with torch.no_grad():
    output = model(**encoded_input)

# to get the sentence embeddings, use the pooler output
sentence_embeddings = output.pooler_output
```
and in TensorFlow:
```python
from transformers import TFBertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = TFBertModel.from_pretrained('aiana94/NaSE')

# prepare input
sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='tf', padding=True)

# forward pass
output = model(encoded_input)

# to get the sentence embeddings, use the pooler output
sentence_embeddings = output.pooler_output
```
For similarity between sentences, L2-normalizing the embeddings before computing the cosine similarity is recommended:
```python
import torch
import torch.nn.functional as F
def cos_sim(a: torch.Tensor, b: torch.Tensor):
    a_norm = F.normalize(a, p=2, dim=1)
    b_norm = F.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))
```
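For example, continuing from the PyTorch snippet above (this assumes `sentence_embeddings` has already been computed; it is an illustrative usage, not part of the original scripts):
```python
# pairwise cosine similarities between all encoded sentences
similarities = cos_sim(sentence_embeddings, sentence_embeddings)
print(similarities)  # shape (2, 2); diagonal entries are 1.0
```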
### Intended Uses
Our model is intended to be used as a sentence encoder, and in particular as a news encoder. Given an input text, it outputs a vector that captures its semantic information.
The sentence vector may be used for sentence similarity, information retrieval, or clustering tasks.
## Training Details
### Training Data
NaSE was domain-adapted using two multilingual datasets: [PolyNews](https://huggingface.co/datasets/aiana94/polynews)
and the parallel [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).
We use the following procedure to smooth the per-language distribution when sampling data for model training:
* We sample only languages and language pairs that contain at least 100 texts in PolyNews and PolyNewsParallel, respectively;
* We sample texts from language _L_ according to the smoothed distribution _p(L) ∝ |L|^alpha_, where _|L|_ is the number of examples in _L_. We use a smoothing rate _alpha=0.3_ (i.e., we upsample low-resource languages and downsample high-resource languages); a short sketch of this sampling scheme follows below.
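The following is a minimal sketch of this exponential-smoothing scheme; the per-language counts are made-up numbers for illustration, not the actual PolyNews statistics:
```python
import numpy as np

# hypothetical number of texts per language (not actual PolyNews counts)
counts = {"en": 1_000_000, "de": 200_000, "sw": 5_000, "ha": 1_000}
alpha = 0.3  # smoothing rate used for NaSE training

langs = list(counts.keys())
sizes = np.array([counts[l] for l in langs], dtype=float)

# p(L) ∝ |L|^alpha: upsamples low-resource, downsamples high-resource languages
probs = sizes ** alpha
probs /= probs.sum()

for lang, p_raw, p_smooth in zip(langs, sizes / sizes.sum(), probs):
    print(f"{lang}: raw {p_raw:.3f} -> smoothed {p_smooth:.3f}")
```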
### Training Procedure
We initialize NaSE with the pretrained weights of the multilingual sentence encoder [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
Please refer to its [model card](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true) or the corresponding [paper](https://aclanthology.org/2022.acl-long.62.pdf)
for more detailed information about the pre-training procedure.
We adapt the multilingual sentence encoder to the news domain using two objectives:
* Denoising auto-encoding (DAE): reconstructs the original input sentence from its corrupted version obtained by adding discrete noise (see [TSDAE](https://aclanthology.org/2021.findings-emnlp.59.pdf) for details);
* Machine translation (MT): generates the target-language translation from the source-language input sentence (i.e., the source-language sentence constitutes the _corruption_ of the target sentence, which is to be _reconstructed_).
NaSE is trained sequentially, first on reconstruction, and then on translation, i.e., we continue training the NaSE encoder obtained with the DAE objective for translation on parallel data.
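For intuition, the corruption used in DAE-style objectives is typically random token deletion, as in TSDAE. The sketch below illustrates such a noise function; the deletion ratio of 0.6 is TSDAE's default and is shown only for illustration, not as NaSE's exact configuration:
```python
import random

def delete_tokens(tokens: list[str], del_ratio: float = 0.6) -> list[str]:
    """Corrupt a sentence by randomly deleting tokens (TSDAE-style noise)."""
    kept = [tok for tok in tokens if random.random() > del_ratio]
    # keep at least one token so the encoder always receives some input
    return kept if kept else [random.choice(tokens)]

# the encoder embeds the corrupted sentence; a decoder is trained to
# reconstruct the original sentence from that single sentence embedding
original = "Breaking news : markets rally after the announcement .".split()
print(delete_tokens(original))
```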
#### Training Hyperparameters
- **Training regime:** fp16 mixed precision
- **Training steps:** 100K (50K per objective), validating every 5K steps
- **Learning rate:** 3e-5
- **Optimizer:** AdamW
The full training scripts are available in the [training code repository](https://github.com/andreeaiana/nase).
## Technical Specifications
The model was pretrained on a single 40GB NVIDIA A100 GPU for a total of 100K steps.
## Citation
**BibTeX:**
```bibtex
@misc{iana2024news,
title={News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation},
author={Andreea Iana and Fabian David Schmidt and Goran Glavaš and Heiko Paulheim},
year={2024},
eprint={2406.12634},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2406.12634}
}
```