Edit model card

SciRus-tiny is a model to obtain embeddings of scientific texts in russian and english. Model was trained on eLibrary data with contrastive technics described in habr post. High metrics values were achieved on the ruSciBench benchmark.

How to get embeddings

from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch

tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Tokenize sentences
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]

print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)

Or you can use the sentence_transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
# (312,)


Benchmark developed by MLSA Lab of Institute for AI, MSU.


The research is part of the project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank eLibrary for provided datasets.


Nikolai Gerasimenko (nikgerasimenko@gmail.com), Alexey Vatolin (vatolinalex@gmail.com)

Downloads last month