This is a base Longformer model designed for Russian language. It was initialized from blinoff/roberta-base-russian-v0 weights and has been modified to support a context length of up to 4096 tokens. We fine-tuned it on a dataset of Russian books. For a detailed information check out our post on Habr.

Model attributes:

  • 12 attention heads
  • 12 hidden layers
  • 4096 tokens length of context

The model can be used as-is to produce text embeddings or it can be further fine-tuned for a specific downstream task.

Text embeddings can be produced as follows:

# pip install transformers sentencepiece
import torch
from transformers import LongformerForMaskedLM, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)
    batch = tokenizer(text, return_tensors='pt')

    #set global attention for cls token
    global_attention_mask = [
            [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
            for input_ids in batch["input_ids"]
        ]

    #add global attention mask to batch
    batch["global_attention_mask"] = torch.tensor(global_attention_mask)

    with torch.no_grad():
        output = model(**batch.to(device))
    return output.last_hidden_state[:,0,:]
Downloads last month
85
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.