---
language:
- ru
license: apache-2.0
---
# Russian text (number) normalization
A fine-tuned version of FRED-T5-large (820M).
Code repo.
Trained on sentences from ficbook, librusec, and pikabu, inverse text normalized with NeMo Text Processing. Only number normalization has been trained so far.
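As an illustration of the data preparation step, here is a minimal sketch of producing a written-form target with NeMo Text Processing. It assumes NeMo's `InverseNormalizer` entry point and is not the exact pipeline used for this model:

```python
# Sketch only: turn a spoken-form Russian sentence into its written form
# with NeMo ITN. Assumes nemo_text_processing is installed; the actual
# training pipeline for this model is not published here.
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

itn = InverseNormalizer(lang='ru')

spoken = 'Было у отца три сына'
written = itn.inverse_normalize(spoken, verbose=False)
print(written)  # expected something like: 'Было у отца 3 сына'
```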
## Usage
```python
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

device = 'cuda'
tokenizer = GPT2Tokenizer.from_pretrained('saarus72/russian_text_normalizer', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('saarus72/russian_text_normalizer').to(device)

# Each span to normalize is written in digits inside square brackets and
# followed by an <extra_id_N> placeholder; the prompt starts with <SC1>.
lm_text = '<SC1>Было у отца [3]<extra_id_0> сына, но не было даже [2-3]<extra_id_1> пиджаков с блёстками за [142 990 руб]<extra_id_2>.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))
# <extra_id_0> три <extra_id_1> двух-трех <extra_id_2> сто сорок две тысячи девятьсот рублей </s>
```
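The model returns only the fillings for each `<extra_id_N>` slot, so they still have to be substituted back into the prompt. Continuing the snippet above, a minimal post-processing sketch (the bracket and slot handling is an assumption based on the format shown here, not part of the released code):

```python
import re

def merge_normalized(lm_text: str, decoded: str) -> str:
    """Substitute the generated <extra_id_N> fillings back into the prompt.

    Assumes every span to normalize is marked as "[raw]<extra_id_N>" and the
    decoded string looks like "<extra_id_0> ... <extra_id_1> ... </s>".
    """
    decoded = decoded.replace('</s>', '')
    # Split the decoded string into {slot_number: filling} pairs.
    parts = re.split(r'<extra_id_(\d+)>', decoded)
    fillings = {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts), 2)}
    # Replace each "[raw]<extra_id_N>" span with its normalized filling,
    # falling back to the raw text if a slot is missing from the output.
    def repl(m: re.Match) -> str:
        return fillings.get(int(m.group(2)), m.group(1))
    text = re.sub(r'\[([^\]]*)\]<extra_id_(\d+)>', repl, lm_text)
    return text.replace('<SC1>', '').strip()

print(merge_normalized(lm_text, tokenizer.decode(outputs[0][1:])))
# Было у отца три сына, но не было даже двух-трех пиджаков с блёстками за сто сорок две тысячи девятьсот рублей.
```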