---
language:
- ru
license: apache-2.0
---

# FRED-T5 large 820M (Full-scale Russian Enhanced Denoisers T5) 
The model architecture design, pretraining, and evaluation are documented in our preprint: [**A Family of Pretrained Transformer Language Models for Russian**](https://arxiv.org/abs/2309.10931).

The model was trained by [SberDevices](https://sberdevices.ru/).  

The architecture is based on T5.

It has 24 layers and a hidden size of 1024. See config.json for more details.

The model was trained on a mixture of 7 denoisers, similar to UL2 but with several differences (https://arxiv.org/abs/2205.05131).
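To make the denoising objective concrete, here is a minimal sketch of single-span corruption in the T5 style: a span of the input is replaced by a sentinel token, and the target is the sentinel followed by the removed span. The helper name `corrupt_span` and the whitespace tokenization are illustrative assumptions. The actual span lengths, corruption rates, and the differences from UL2 for the 7 denoisers are described in the preprint and are not reproduced here.

```python
def corrupt_span(words, start, length, sentinel="<extra_id_0>"):
    """Replace words[start:start + length] with a sentinel (hypothetical helper).

    Returns (corrupted_input, target) as strings.
    """
    if length <= 0 or start < 0 or start + length > len(words):
        raise ValueError("span must lie inside the word list")
    removed = words[start:start + length]
    corrupted = words[:start] + [sentinel] + words[start + length:]
    # The target is the sentinel followed by the words that were removed.
    target = [sentinel] + removed
    return " ".join(corrupted), " ".join(target)


words = "the model restores missing spans of text".split()
corrupted, target = corrupt_span(words, start=2, length=2)
print(corrupted)  # the model <extra_id_0> spans of text
print(target)     # <extra_id_0> restores missing
```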

It was trained on a Russian-language corpus (300 GB). The dataset is the same as for the ruT5 models.

The tokenizer is byte-level BPE (BBPE) with a vocabulary of 50257 tokens plus 107 special tokens. Prefix tokens: `<LM>`, `<SC1>`, ..., `<SC6>`.
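A prefix token is placed at the very start of the input string, with no separator, as in the usage examples further down. A minimal sketch of building such inputs follows; `PREFIXES` and `with_prefix` are hypothetical names introduced for illustration and are not part of any library.

```python
# Prefix tokens as listed in this card: <LM> and <SC1> through <SC6>.
PREFIXES = ["<LM>"] + [f"<SC{i}>" for i in range(1, 7)]


def with_prefix(prefix, text):
    """Prepend a denoiser prefix to the input text (hypothetical helper)."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix {prefix!r}; expected one of {PREFIXES}")
    return prefix + text


print(PREFIXES)
print(with_prefix("<SC1>", "Принялся Кутузов рассказывать свою историю <extra_id_0>."))
```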

For the first half of training, the model was trained on a small part of the dataset (1%, 3 GB) and without prefixes in each task.

For RSG (Russian SuperGLUE), we fine-tuned the model as described in the T5 paper: we first trained on all tasks jointly (multitask), then took the best checkpoint for each task and trained it further on that task.
RSG submission: https://russiansuperglue.com/login/submit_info/2060
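The second stage starts from whichever multitask checkpoint scored best on each task. As a rough sketch, that selection step could look like the following. The checkpoint names, task names, and scores are made-up placeholders, not results from our runs, and `best_checkpoint_per_task` is a hypothetical helper.

```python
def best_checkpoint_per_task(scores):
    """Map each task to the multitask checkpoint with the highest score.

    `scores` has the shape {checkpoint_name: {task_name: validation_score}}.
    """
    best = {}
    for checkpoint, task_scores in scores.items():
        for task, score in task_scores.items():
            if task not in best or score > best[task][1]:
                best[task] = (checkpoint, score)
    return {task: checkpoint for task, (checkpoint, _) in best.items()}


# Placeholder numbers, purely illustrative.
multitask_scores = {
    "step_1000": {"task_a": 0.61, "task_b": 0.70},
    "step_2000": {"task_a": 0.66, "task_b": 0.68},
}
starting_points = best_checkpoint_per_task(multitask_scores)
print(starting_points)
```

Each task would then be fine-tuned further from its own starting checkpoint.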

Total training time was around 35 days on 160 V100 GPUs plus 5 days on 80 A100 GPUs.
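For a rough sense of scale, these figures can be converted to GPU-days. The durations are approximate, and a V100 day is not equivalent to an A100 day, so the total below is only an order-of-magnitude indicator.

```python
# (days, number of GPUs) for each training phase, as stated above.
phases = {"V100": (35, 160), "A100": (5, 80)}

gpu_days = {gpu: days * count for gpu, (days, count) in phases.items()}
total_gpu_days = sum(gpu_days.values())

print(gpu_days)        # {'V100': 5600, 'A100': 400}
print(total_gpu_days)  # 6000
```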

## Usage (HuggingFace Models Repository)

```python
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

# This example loads the FRED-T5-1.7B checkpoint; change the repository ID
# to load a different FRED-T5 model.
tokenizer = GPT2Tokenizer.from_pretrained('ai-forever/FRED-T5-1.7B', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('ai-forever/FRED-T5-1.7B')
# Use the GPU when one is available, otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Prefix <LM>: the model continues the input text.
lm_text = '<LM>Принялся Кутузов рассказывать свою историю как он сюда попал. Началось'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
# outputs[0][1:] skips the first generated token (the decoder start token).
print(tokenizer.decode(outputs[0][1:]))

# Printed result: , как водится, с того, что он был в плену.</s>

# Prefix <SC1>: the model fills in the span marked by <extra_id_0>.
lm_text = '<SC1>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# Printed result: '<extra_id_0>, как он жил</s>'

# Prefix <SC5>: same input with a different denoiser prefix.
lm_text = '<SC5>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True, max_length=100)
print(tokenizer.decode(outputs[0][1:]))

# Printed result: '<extra_id_0> </s>'
```
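With the `<SC*>` prefixes, the model returns only the text for each sentinel rather than the completed sentence. A minimal sketch of splicing that output back into the masked input is shown below. `fill_spans` is a hypothetical helper, not part of `transformers`, and it assumes the output format seen in the examples above (each sentinel followed by its text, terminated by `</s>`).

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def fill_spans(masked_text, generated, eos="</s>"):
    """Insert generated spans into masked_text (hypothetical helper)."""
    # Drop everything from the end-of-sequence token onward.
    generated = generated.split(eos)[0]
    # re.split with one capture group yields [before, id, span, id, span, ...].
    parts = SENTINEL.split(generated)
    spans = {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}
    # Sentinels with no generated span are left untouched.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked_text)


masked = 'Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии.'
print(fill_spans(masked, '<extra_id_0>, как он жил</s>'))
```

Whitespace around each sentinel is kept exactly as it appears in the input, so the result may need light cleanup (for example, a space before a comma).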
## Authors
+ NLP core team RnD [Telegram channel](https://t.me/nlpcoreteam):
  + Dmitry Zmitrovich 
  + Andrei Kalmykov 
  + Vitaly Kadulin 
  + Mikhail Novikov
  + Alexey Khoroshilov

[Salute AI Community](https://t.me/SaluteTechGroup).  


## Cite us
```
@misc{zmitrovich2023family,
      title={A Family of Pretrained Transformer Language Models for Russian}, 
      author={Dmitry Zmitrovich and Alexander Abramov and Andrey Kalmykov and Maria Tikhonova and Ekaterina Taktasheva and Danil Astafurov and Mark Baushenko and Artem Snegirev and Tatiana Shavrina and Sergey Markov and Vladislav Mikhailov and Alena Fenogenova},
      year={2023},
      eprint={2309.10931},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```