---
language:
- ru
license: apache-2.0
---

# FRED-T5 1.7B (Full-scale Russian Enhanced Denoisers T5)

The architecture is based on T5. The model has 24 layers and a hidden size of 1536.

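As a rough sanity check on the "1.7B" in the model name, the sketch below estimates the parameter count of a T5-style encoder-decoder from the two figures given above (24 layers, hidden size 1536). Everything else is an assumption made for illustration and is not stated in this card: that "24 layers" applies to both the encoder and the decoder, a feed-forward size of 4096, a gated feed-forward block with three weight matrices, an attention inner dimension equal to the hidden size, a vocabulary of 50,000 tokens, and untied input and output embeddings. Layer norms and relative position biases are ignored.

```python
def estimate_t5_params(d_model, n_layers, d_ff, vocab_size,
                       gated_ffn=True, tied_embeddings=False):
    """Rough parameter count for a T5-style encoder-decoder (weights only)."""
    # Q, K, V and output projections, assuming inner dim == d_model.
    attn = 4 * d_model * d_model
    # A gated feed-forward block has three matrices, a plain one has two.
    ffn = (3 if gated_ffn else 2) * d_model * d_ff
    # Each encoder layer: self-attention + feed-forward.
    encoder = n_layers * (attn + ffn)
    # Each decoder layer: self-attention + cross-attention + feed-forward.
    decoder = n_layers * (2 * attn + ffn)
    # Input embedding, plus a separate output projection if not tied.
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return encoder + decoder + embeddings


# d_model and n_layers come from this card; d_ff and vocab_size are assumed.
total = estimate_t5_params(d_model=1536, n_layers=24, d_ff=4096, vocab_size=50_000)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes out at roughly 1.7 billion parameters. Different values for the assumed settings would shift the total.
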
The model was trained on a mixture of 7 denoisers, similar to UL2 but with several differences (https://arxiv.org/abs/2205.05131).

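To make the idea of a denoiser concrete, here is a minimal sketch of generic T5-style span corruption, the building block that UL2-like mixtures vary by changing the corruption rate and span length. It is not the training code for this model: the function name `span_corrupt`, its parameters, and the `<extra_id_N>` sentinel format are illustrative assumptions, and the 7 denoiser configurations actually used are not given in this card.

```python
import random


def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Mask random spans of `tokens` and return (inputs, targets)."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * corruption_rate))
    masked = [False] * len(tokens)
    n_masked, attempts = 0, 0
    while n_masked < n_to_mask and attempts < 10 * len(tokens):
        attempts += 1
        # Draw a span length around the mean, capped by the remaining budget.
        span_len = max(1, round(rng.expovariate(1 / mean_span_len)))
        span_len = min(span_len, n_to_mask - n_masked)
        start = rng.randrange(0, len(tokens) - span_len + 1)
        # Reject spans that overlap or touch an already masked span.
        lo, hi = max(0, start - 1), min(len(tokens), start + span_len + 1)
        if any(masked[lo:hi]):
            continue
        for i in range(start, start + span_len):
            masked[i] = True
        n_masked += span_len

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if masked[i]:
            # One sentinel replaces the span in the input and labels it in the target.
            tag = f"<extra_id_{sentinel}>"
            sentinel += 1
            inputs.append(tag)
            targets.append(tag)
            while i < len(tokens) and masked[i]:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


inputs, targets = span_corrupt("мама мыла раму а папа читал газету".split(),
                               corruption_rate=0.3, mean_span_len=2, seed=1)
print(" ".join(inputs))
print(" ".join(targets))
```

A mixture of denoisers would call a function like this with different `corruption_rate` and `mean_span_len` settings per example, and could also include a prefix-language-modeling variant that masks only the tail of the sequence.
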
It was trained on a 300 GB Russian-language corpus. The dataset is the same as for the ruT5 models.

The tokenizer is byte-level BPE (BBPE).

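For readers unfamiliar with byte-level BPE, the toy sketch below shows the core idea: text is first split into UTF-8 bytes (so each Cyrillic letter starts as two symbols), and the most frequent adjacent pair is merged repeatedly. The function `train_bbpe` is a hypothetical helper written for this illustration; it is not the tokenizer shipped with the model and it omits pre-tokenization and other details of real implementations.

```python
from collections import Counter


def train_bbpe(text, num_merges):
    """Learn up to `num_merges` byte-pair merges; return (merges, final sequence)."""
    # Start from single bytes, so any Unicode text is representable.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not help
        merges.append((a, b))
        # Replace every occurrence of the pair (a, b) with the merged symbol.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq


text = "мама мыла раму"
merges, seq = train_bbpe(text, num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(seq), "symbols")
print([s.decode("utf-8", errors="replace") for s in seq])
```

Because the base alphabet is the 256 possible byte values, this kind of tokenizer never needs an unknown-token fallback, at the cost of longer initial sequences for non-Latin scripts.
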
For the first half of training, the model was trained on a small part of the full dataset (1%, 3 GB) and without prefixes for each task.

For RSG (Russian SuperGLUE), we trained as described in the T5 paper. First, we trained a multitask model on all tasks. Then we took the best checkpoint for each task and trained it further.

Total training time was around 45 days on 112 A100 GPUs.

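The training budget can be restated in GPU-days and GPU-hours with a line of arithmetic. Since the duration is given only as "around 45 days", the results are approximate, and the sketch assumes all 112 GPUs were in use for the whole period.

```python
days = 45   # approximate training duration from this card
gpus = 112  # number of A100 GPUs from this card

gpu_days = days * gpus
gpu_hours = gpu_days * 24
print(f"{gpu_days} GPU-days, {gpu_hours} GPU-hours")
```
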
Training loss:

![Training loss curve](https://s3.amazonaws.com/moonup/production/uploads/1674290304538-5f91b1208a61a359f44e1851.png)

We continue to experiment. We will share more details and release the checkpoint to the public soon.