---
language:
- ru
---

This is a base Longformer model designed for the Russian language. It was initialized from [blinoff/roberta-base-russian-v0](https://huggingface.co/blinoff/roberta-base-russian-v0) weights and modified to support a context length of up to 4096 tokens. We fine-tuned it on a dataset of Russian books. For detailed information, check out our post on Habr.

Model attributes:
* 12 attention heads
* 12 hidden layers
* Context length of 4096 tokens

The model can be used as-is to produce text embeddings, or it can be further fine-tuned for a specific downstream task.

Text embeddings can be produced as follows:

```python
# pip install transformers sentencepiece

import torch
from transformers import LongformerModel, LongformerTokenizerFast

model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')

def get_cls_embedding(text, model, tokenizer, device='cuda'):
    model.to(device)

    batch = tokenizer(text, return_tensors='pt')

    # Set global attention for the CLS token: 1 at CLS positions, 0 elsewhere
    global_attention_mask = (batch["input_ids"] == tokenizer.cls_token_id).long()

    # Add the global attention mask to the batch
    batch["global_attention_mask"] = global_attention_mask

    with torch.no_grad():
        output = model(**batch.to(device))

    # The hidden state at position 0 (the CLS token) serves as the text embedding
    return output.last_hidden_state[:, 0, :]
```
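
To make the global attention step more concrete, here is a minimal sketch of the mask construction in plain Python, without loading the model. The helper name `build_global_attention_mask` and the token IDs are hypothetical and chosen only for illustration; the real IDs come from the tokenizer.

```python
def build_global_attention_mask(batch_input_ids, cls_token_id):
    """Return 1 where a token is the CLS token and 0 elsewhere."""
    return [
        [1 if token_id == cls_token_id else 0 for token_id in input_ids]
        for input_ids in batch_input_ids
    ]


# Hypothetical token IDs, for illustration only:
# 0 = CLS, 2 = end of sequence, 1 = padding.
toy_batch = [
    [0, 15, 27, 2],
    [0, 42, 2, 1],
]

toy_mask = build_global_attention_mask(toy_batch, cls_token_id=0)
print(toy_mask)  # [[1, 0, 0, 0], [1, 0, 0, 0]]
```

Only the CLS position in each sequence receives global attention; every other token keeps the local sliding-window attention.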