---
license: apache-2.0
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ba
pipeline_tag: sentence-similarity
---

This is a shallow (3-layer) BERT-like model, trained on the Bashkir language to compute sentence embeddings compatible with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) and to perform masked language modelling.

The following code can be used to extract sentence embeddings:

```Python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('slone/LaBSE-shallow-distilled-bak')
tokenizer = AutoTokenizer.from_pretrained('slone/LaBSE-shallow-distilled-bak')

def embed(texts, max_length=512):
    # Tokenize, encode, and L2-normalize the pooled sentence representations.
    b = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    with torch.inference_mode():
        return torch.nn.functional.normalize(model(**b.to(model.device)).pooler_output).cpu().numpy()

embeddings = embed(['Сәләм, ғаләм!', 'Хәйерле көн, тыныслыҡ.', 'Бөгөн йома.'])
print(embeddings.shape)
# (3, 768)
print(embeddings.dot(embeddings.T).round(2))
# [[1.   0.56 0.18]
#  [0.56 1.   0.32]
#  [0.18 0.32 1.  ]]
```

For semantically equivalent sentence pairs, the dot products of these embeddings (which are also their cosine similarities, because the vectors are L2-normalized) are usually above 0.4.
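
Because the embeddings are intended to be compatible with LaBSE, one way to sanity-check them is to encode a Bashkir sentence with this model and its Russian translation with the original LaBSE, then compare the two vectors. The sketch below reuses the `embed` function defined above; the sentence pair and the `embed_labse` helper are illustrative assumptions, not part of this repository:

```Python
import torch
from transformers import AutoModel, AutoTokenizer

labse = AutoModel.from_pretrained('sentence-transformers/LaBSE')
labse_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/LaBSE')

def embed_labse(texts, max_length=512):
    # Same pooling and normalization as above, but using the full LaBSE encoder.
    b = labse_tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    with torch.inference_mode():
        return torch.nn.functional.normalize(labse(**b.to(labse.device)).pooler_output).cpu().numpy()

ba = embed(['Сәләм, ғаләм!'])       # Bashkir sentence, encoded by the shallow model
ru = embed_labse(['Привет, мир!'])  # Russian translation (illustrative), encoded by full LaBSE
print(ba.dot(ru.T)[0, 0])           # cosine similarity; for a good translation it is expected to be above 0.4, as noted above
```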
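
The model card also mentions masked language modelling. Below is a minimal sketch of filling in a masked token, assuming the checkpoint exposes an MLM head loadable via `AutoModelForMaskedLM`; the example sentence and the `fill_mask` helper are purely illustrative:

```Python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mlm_model = AutoModelForMaskedLM.from_pretrained('slone/LaBSE-shallow-distilled-bak')
tokenizer = AutoTokenizer.from_pretrained('slone/LaBSE-shallow-distilled-bak')

def fill_mask(text, top_k=5):
    # Return the top-k candidate tokens for the first [MASK] position in the text.
    b = tokenizer(text, return_tensors='pt')
    with torch.inference_mode():
        logits = mlm_model(**b).logits
    mask_pos = (b.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    top_ids = logits[0, mask_pos].topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

print(fill_mask('Бөгөн [MASK] көн.'))  # top-5 candidate tokens for the masked position
```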