# SciRus-tiny
---
license: mit
language:
  - ru
  - en
pipeline_tag: sentence-similarity
tags:
  - russian
  - fill-mask
  - pretraining
  - embeddings
  - masked-lm
  - tiny
  - feature-extraction
  - sentence-similarity
  - sentence-transformers
  - transformers
widget:
  - text: Метод опорных векторов
---

SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on eLibrary data using the contrastive techniques described in a Habr post, and achieves high metric values on the ruSciBench benchmark.

### How to get embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU


def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Join the title and abstract with the separator token, then tokenize
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform mean pooling over non-padding tokens
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings to unit length
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]


print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
```
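To see why the attention mask matters in the pooling step, here is a minimal self-contained sketch (dummy tensors only, no model download; the function signature is adapted to take token embeddings directly) showing that padded positions are excluded from the average:

```python
import torch

def mean_pooling(token_embeddings, attention_mask):
    # Average token embeddings over the sequence dimension, ignoring positions
    # where the attention mask is 0 (padding)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Batch of 1 sequence, 3 tokens, 2-dim embeddings; the last token is padding
emb = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pooling(emb, mask)
print(pooled)  # tensor([[2., 3.]]) -- the padding token is ignored
```

Without the mask, the large values in the padding position would distort the sentence embedding; the clamp guards against division by zero for an all-zero mask.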

Alternatively, you can use the sentence_transformers wrapper:

```python
from sentence_transformers import SentenceTransformer


model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['привет мир'])
print(embeddings[0].shape)
# (312,)
```
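Because the embeddings are L2-normalized, the cosine similarity between two texts is simply the dot product of their vectors. A minimal sketch of scoring text similarity, using random placeholder vectors in place of real model output:

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector to unit length, as the model does for its embeddings
    return v / np.linalg.norm(v)

# Placeholder 312-dim vectors standing in for real embeddings
a = l2_normalize(np.random.default_rng(0).normal(size=312))
b = l2_normalize(np.random.default_rng(1).normal(size=312))

# For unit vectors, the dot product equals the cosine similarity, in [-1, 1]
similarity = float(np.dot(a, b))
print(similarity)
```

In practice, `a` and `b` would come from `get_sentence_embedding` or `model.encode`, and higher values indicate more semantically similar texts.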

### Authors

The benchmark was developed by the MLSA Lab of the Institute for AI, MSU.

### Acknowledgement

We would like to thank eLibrary for the provided datasets. The research is part of project #23-Ш05-21 SES MSU, "Development of mathematical methods of machine learning for processing large-volume textual scientific information".

### Contacts

Nikolai Gerasimenko (nikgerasimenko@gmail.com), Alexey Vatolin (vatolinalex@gmail.com)