---
license: apache-2.0
datasets:
- aiana94/polynews-parallel
- aiana94/polynews
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- bo
- bs
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- or
- pa
- pl
- pt
- ro
- ru
- rw
- si
- sk
- sl
- sm
- sn
- so
- sw
- sq
- sr
- st
- sv
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- ug
- uk
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
- ay
- bm
- bbj
- ee
- fon
- guw
- ln
- lg
- luo
- pcm
- rn
- tet
- ti
- tn
- tw
- fil
- mos
- orm
pipeline_tag: sentence-similarity
tags:
- bert
- feature-extraction
- sentence-embedding
- sentence-similarity
- multilingual
---

# NaSE (News-adapted Sentence Encoder)

This model is a news-adapted sentence encoder, domain-specialized starting from the pretrained massively multilingual sentence encoder [LaBSE](https://aclanthology.org/2022.acl-long.62.pdf).

## Model Details

### Model Description

NaSE is a domain-adapted multilingual sentence encoder, initialized from [LaBSE](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true). It was specialized to the news domain using two multilingual corpora, namely [PolyNews](https://huggingface.co/datasets/aiana94/polynews) and [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel). More specifically, NaSE was pretrained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.

## Usage (HuggingFace Transformers)

Here is how to use this model to get the sentence embeddings of a given text in PyTorch:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

# prepare input
sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='pt', padding=True)

# forward pass
with torch.no_grad():
    output = model(**encoded_input)

# to get the sentence embeddings, use the pooler output
sentence_embeddings = output.pooler_output
```

and in TensorFlow:

```python
from transformers import TFBertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = TFBertModel.from_pretrained('aiana94/NaSE')

# prepare input
sentences = ["This is an example sentence", "Dies ist auch ein Beispielsatz in einer anderen Sprache."]
encoded_input = tokenizer(sentences, return_tensors='tf', padding=True)

# forward pass
output = model(**encoded_input)

# to get the sentence embeddings, use the pooler output
sentence_embeddings = output.pooler_output
```

For similarity between sentences, an L2-norm is recommended before calculating the similarity:

```python
import torch
import torch.nn.functional as F

def cos_sim(a: torch.Tensor, b: torch.Tensor):
    a_norm = F.normalize(a, p=2, dim=1)
    b_norm = F.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))
```

### Intended Uses

Our model is intended to be used as a sentence encoder and, in particular, as a news encoder. Given an input text, it outputs a vector which captures its semantic information. The sentence vector may be used for sentence similarity, information retrieval, or clustering tasks.
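
As a minimal end-to-end sketch (the query and document sentences below are purely illustrative and not part of the official examples), the snippets above can be combined to score cross-lingual similarity between news sentences:

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

def cos_sim(a: torch.Tensor, b: torch.Tensor):
    # L2-normalize before computing the pairwise cosine similarity
    a_norm = F.normalize(a, p=2, dim=1)
    b_norm = F.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

tokenizer = BertTokenizerFast.from_pretrained('aiana94/NaSE')
model = BertModel.from_pretrained('aiana94/NaSE')

# illustrative English query and candidate sentences in German and English
queries = ["The central bank raised interest rates again."]
documents = [
    "Die Zentralbank hat die Zinsen erneut erhöht.",
    "The football season starts next week.",
]

with torch.no_grad():
    query_emb = model(**tokenizer(queries, return_tensors='pt', padding=True)).pooler_output
    doc_emb = model(**tokenizer(documents, return_tensors='pt', padding=True)).pooler_output

# one row per query, one column per document; higher scores indicate higher similarity,
# so the German translation of the query is expected to score highest
print(cos_sim(query_emb, doc_emb))
```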
## Training Details

### Training Data

NaSE was domain-adapted using two multilingual datasets: [PolyNews](https://huggingface.co/datasets/aiana94/polynews) and the parallel [PolyNewsParallel](https://huggingface.co/datasets/aiana94/polynews-parallel).

We use the following procedure to smooth the per-language distribution when sampling for model training:

* We sample only languages and language pairs that contain at least 100 texts in PolyNews and PolyNewsParallel, respectively;
* We sample texts from language _L_ according to the modified distribution _p(L) ~ |L|^alpha_, where _|L|_ is the number of examples in language _L_. We use a smoothing rate _alpha=0.3_ (i.e., we upsample low-resource languages and downsample high-resource languages).

### Training Procedure

We initialize NaSE with the pretrained weights of the multilingual sentence encoder [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). Please refer to its [model card](https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true) or the corresponding [paper](https://aclanthology.org/2022.acl-long.62.pdf) for more detailed information about the pre-training procedure.

We adapt the multilingual sentence encoder to the news domain using two objectives:

* Denoising auto-encoding (DAE): reconstructs the original input sentence from its corrupted version, obtained by adding discrete noise (see [TSDAE](https://aclanthology.org/2021.findings-emnlp.59.pdf) for details);
* Machine translation (MT): generates the target-language translation from the source-language input sentence (i.e., the source-language sentence constitutes the _corruption_ of the target sentence, which is to be _reconstructed_ in the target language).

NaSE is trained sequentially, first on reconstruction and then on translation, i.e., we continue training the NaSE encoder obtained with the DAE objective for translation on parallel data.

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision
- **Training steps:** 100k (50k per objective), validating every 5k steps
- **Learning rate:** 3e-5
- **Optimizer:** AdamW

The full training scripts are accessible in the [training code](https://github.com/andreeaiana/nase).

## Technical Specifications

The model was pretrained on a single 40GB NVIDIA A100 GPU for a total of 100k steps.

## Citation

**BibTeX:**

```bibtex
@misc{iana2024news,
      title={News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation},
      author={Andreea Iana and Fabian David Schmidt and Goran Glavaš and Heiko Paulheim},
      year={2024},
      eprint={2406.12634},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2406.12634}
}
```