kazzand committed on
Commit
02c87f7
1 Parent(s): b11a767

Create README.md

  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ language:
+ - ru
+ ---
+
+ This is the base version of a Russian Longformer model, initialized from the [blinoff/roberta-base-russian-v0](https://huggingface.co/blinoff/roberta-base-russian-v0) weights with the context length expanded to 4096 tokens.
+ The model was fine-tuned on a dataset of Russian books, and it also supports English, as its source model does.
+ For a more comprehensive overview, please refer to this Habr post, which is available in Russian.
+
+ The model can be used as-is to produce text embeddings, or it can be fine-tuned further for a specific downstream task.
+
+ Text embeddings can be produced as follows:
+
+ ```python
+ # pip install torch transformers sentencepiece
+ import torch
+ from transformers import LongformerModel, LongformerTokenizerFast
+
+ model = LongformerModel.from_pretrained('kazzand/ru-longformer-base-4096')
+ tokenizer = LongformerTokenizerFast.from_pretrained('kazzand/ru-longformer-base-4096')
+
+ def get_cls_embedding(text, model, tokenizer, device='cuda'):
+     model.to(device)
+     batch = tokenizer(text, return_tensors='pt')
+
+     # set global attention for the CLS token
+     global_attention_mask = [
+         [1 if token_id == tokenizer.cls_token_id else 0 for token_id in input_ids]
+         for input_ids in batch["input_ids"]
+     ]
+
+     # add the global attention mask to the batch
+     batch["global_attention_mask"] = torch.tensor(global_attention_mask)
+
+     with torch.no_grad():
+         output = model(**batch.to(device))
+     # the hidden state at position 0 (the CLS token) serves as the text embedding
+     return output.last_hidden_state[:,0,:]
+
+ ```
+
+
+
+
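The global-attention step inside `get_cls_embedding` can be checked in isolation, without downloading the model. The sketch below is a minimal illustration, not part of the commit: the helper name `build_global_attention_mask` and the toy token ids are hypothetical, and `0` merely stands in for the value the tokenizer reports as `cls_token_id`.

```python
from typing import List


def build_global_attention_mask(
    input_ids: List[List[int]], cls_token_id: int
) -> List[List[int]]:
    """Return a 0/1 mask with 1 wherever a token equals cls_token_id."""
    return [
        [1 if token_id == cls_token_id else 0 for token_id in sequence]
        for sequence in input_ids
    ]


# Toy batch (assumption): 0 plays the role of the CLS id here;
# real ids would come from the tokenizer.
toy_batch = [[0, 17, 42, 2], [0, 99, 2, 1]]
mask = build_global_attention_mask(toy_batch, cls_token_id=0)
print(mask)  # [[1, 0, 0, 0], [1, 0, 0, 0]]
```

In the README's function, the equivalent nested list is wrapped in `torch.tensor(...)` and stored under `batch["global_attention_mask"]`, so only the CLS position attends globally while all other tokens keep local windowed attention.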