---
license: apache-2.0
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ba
pipeline_tag: sentence-similarity
---

This is a shallow (3-layer) BERT-like model, trained on the Bashkir language to compute sentence embeddings compatible with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and to perform masked language modelling.

The following code can be used to extract sentence embeddings:

```Python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('slone/LaBSE-shallow-distilled-bak')
tokenizer = AutoTokenizer.from_pretrained('slone/LaBSE-shallow-distilled-bak')

def embed(texts, max_length=512):
    # Tokenize, encode, and L2-normalize the pooled sentence representations.
    b = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    with torch.inference_mode():
        return torch.nn.functional.normalize(model(**b.to(model.device)).pooler_output).cpu().numpy()

embeddings = embed(['Сәләм, ғаләм!', 'Хәйерле көн, тыныслыҡ.', 'Бөгөн йома.'])
print(embeddings.shape)
# (3, 768)
print(embeddings.dot(embeddings.T).round(2))
# [[1.   0.56 0.18]
#  [0.56 1.   0.32]
#  [0.18 0.32 1.  ]]
```

For semantically equivalent sentence pairs, the dot products of these embeddings (which are also their cosine similarities, because the vectors are L2-normalized) are usually above 0.4.
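
Because the embeddings are intended to be compatible with LaBSE, one way to sanity-check them is to encode a Bashkir sentence with this model and its Russian translation with the original LaBSE, then compare the two vectors. The sketch below reuses the `embed` function defined above; the sentence pair and the `embed_labse` helper are illustrative assumptions, not part of this repository:

```Python
import torch
from transformers import AutoModel, AutoTokenizer

labse = AutoModel.from_pretrained('sentence-transformers/LaBSE')
labse_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/LaBSE')

def embed_labse(texts, max_length=512):
    # Same pooling and normalization as above, but using the full LaBSE encoder.
    b = labse_tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    with torch.inference_mode():
        return torch.nn.functional.normalize(labse(**b.to(labse.device)).pooler_output).cpu().numpy()

ba = embed(['Сәләм, ғаләм!'])       # Bashkir sentence, encoded by the shallow model
ru = embed_labse(['Привет, мир!'])  # Russian translation (illustrative), encoded by full LaBSE
print(ba.dot(ru.T)[0, 0])           # cosine similarity; for a good translation it is expected to be above 0.4, as noted above
```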
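
The model card also mentions masked language modelling. Below is a minimal sketch of filling in a masked token, assuming the checkpoint exposes an MLM head loadable via `AutoModelForMaskedLM`; the example sentence and the `fill_mask` helper are purely illustrative:

```Python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mlm_model = AutoModelForMaskedLM.from_pretrained('slone/LaBSE-shallow-distilled-bak')
tokenizer = AutoTokenizer.from_pretrained('slone/LaBSE-shallow-distilled-bak')

def fill_mask(text, top_k=5):
    # Return the top-k candidate tokens for the first [MASK] position in the text.
    b = tokenizer(text, return_tensors='pt')
    with torch.inference_mode():
        logits = mlm_model(**b).logits
    mask_pos = (b.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    top_ids = logits[0, mask_pos].topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

print(fill_mask('Бөгөн [MASK] көн.'))  # top-5 candidate tokens for the masked position
```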