README.md · kazzand/ru-longformer-base-4096 at 02c87f746718523a3ba6521005e88024674112b4

metadata

language:
  - ru

This is a base version of Russian Longformer model created from blinoff/roberta-base-russian-v0 weights with the length of context expanded to 4096 tokens. The model was fine-tuned on russian books dataset but also supports English as its source model. For a more comprehensive overview, please refer to this Habr post, which is available in Russian.

The model can be used as-is to produce text embeddings or it can be further fine-tuned for a specific downstream task.

Text embeddings can be produced as follows:

# pip install transformers sentencepiece
import torch
from transformers import LongformerForMaskedLM, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    #set global attention for cls token
    global_attention_mask = [
            [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
            for input_ids in batch["input_ids"]
        ]

    #add global attention mask to batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))
    return output.last_hidden_state[:,0,:]