---
language:
- ru
license: apache-2.0
---

# FRED-T5 1.7B (Full-scale Russian Enhanced Denoisers T5) 

The architecture is based on T5.

It has 24 layers and a hidden size of 1536.
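
For reference, a T5 configuration with these dimensions might look like the sketch below, using the Hugging Face `T5Config`; the feed-forward size and head count are illustrative assumptions, not published values:

```python
from transformers import T5Config

# Sketch of a config matching the stated dimensions.
# d_ff and num_heads are assumptions for illustration only.
config = T5Config(
    d_model=1536,           # hidden size
    num_layers=24,          # encoder layers
    num_decoder_layers=24,  # decoder layers
    d_ff=4096,              # assumed feed-forward size
    num_heads=24,           # assumed number of attention heads
)
```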

The model was trained on a mixture of 7 denoisers, similar to UL2 (https://arxiv.org/abs/2205.05131), with several differences.
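
As a rough illustration of a single denoising objective of this kind, here is a minimal T5/UL2-style span-corruption sketch; the sentinel format and rates are generic conventions, not the exact recipe used for this model:

```python
import random

def span_corrupt(tokens, mask_rate=0.15, mean_span=3):
    """Replace random spans with sentinels, T5/UL2 style.

    Returns (corrupted_input, target); the target lists each
    sentinel followed by the tokens it replaced.
    """
    corrupted, target = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if random.random() < mask_rate / mean_span:
            span = tokens[i:i + mean_span]
            marker = f"<extra_id_{sentinel}>"
            corrupted.append(marker)   # sentinel stands in for the span
            target.append(marker)
            target.extend(span)        # model must reconstruct the span
            sentinel += 1
            i += len(span)
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

tokens = "мама мыла раму рано утром в субботу".split()
print(span_corrupt(tokens))
```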

It was trained on a 300GB Russian-language corpus. The dataset is the same as for the ruT5 models.

It uses a BBPE (byte-level BPE) tokenizer.
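
Byte-level BPE builds its vocabulary over bytes, so any Unicode text can be tokenized without out-of-vocabulary symbols. A quick way to see the mechanism, using GPT-2's byte-level BPE tokenizer purely as a stand-in (this model's vocabulary is its own):

```python
from transformers import GPT2TokenizerFast

# GPT-2's tokenizer is byte-level BPE; used here only to illustrate
# the mechanism, not this model's actual vocabulary.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok.tokenize("Привет, мир!"))  # Cyrillic falls back to byte pieces
```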

For the first half of training, the model was trained on a small part (1%, 3GB) of the full dataset, without task prefixes.

For RSG (Russian SuperGLUE) we trained as described in the T5 paper: first, we trained the model in a multitask setting on all tasks; then we took the best checkpoint for each task and fine-tuned it further, as sketched below.
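
Schematically, with `train` as a placeholder rather than a real training API:

```python
# Pseudocode sketch of the two-stage RSG procedure described above.
# train() is a placeholder, not a real API; task names follow Russian SuperGLUE.
RSG_TASKS = ["LiDiRus", "RCB", "PARus", "MuSeRC", "TERRa",
             "RUSSE", "RWSD", "DaNetQA", "RuCoS"]

def train(model, tasks):
    """Placeholder for the actual training loop."""
    ...

# Stage 1: multitask training on all tasks at once.
multitask_ckpt = train(model="FRED-T5-1.7B", tasks=RSG_TASKS)

# Stage 2: continue from the best multitask checkpoint,
# fine-tuning separately for each task.
per_task_models = {task: train(model=multitask_ckpt, tasks=[task])
                   for task in RSG_TASKS}
```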

Total training time was around 45 days on 112 A100 GPUs.

Training loss:
![Screenshot 2023-01-21 at 11.36.52.png](https://s3.amazonaws.com/moonup/production/uploads/1674290304538-5f91b1208a61a359f44e1851.png)

We continue to experiment... 

We'll share more details and release the checkpoint to the public soon.
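
Once the checkpoint is public, loading it should follow the standard `transformers` pattern; the repository name and sentinel usage below are assumptions for illustration:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Repository name is an assumption; replace with the actual Hub id.
model_name = "ai-forever/FRED-T5-1.7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Denoiser-style input: a T5 sentinel marks the span to reconstruct.
inputs = tokenizer("Мама <extra_id_0> раму.", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```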