Model Card for mbart-large-50-verbalization
Model Description
mbart-large-50-verbalization is a fine-tuned version of the facebook/mbart-large-50 model, designed to verbalize Ukrainian text in preparation for Text-to-Speech (TTS) systems. It transforms structured data such as numbers and dates into their fully expanded textual representations in Ukrainian.
Architecture
This model is based on the facebook/mbart-large-50 architecture, a multilingual sequence-to-sequence model used for translation and text generation across 50 languages.
Training Data
The model was fine-tuned on a subset of 457,610 sentences from the Ubertext dataset, focusing on news content. The verbalized equivalents were generated with Google Gemini Pro. The resulting dataset is available as skypro1111/ubertext-2-news-verbalized.
Training Procedure
The model underwent 410,000 training steps (1 epoch).
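As a rough sanity check on the step count, the sketch below estimates the number of optimizer steps from the figures given in this card: 457,610 sentences, a 10% held-out split, and a per-device batch size of 2. It assumes a single device, no gradient accumulation, and a test split rounded up; none of these are stated in the card, and `estimate_steps` is a hypothetical helper, not part of any library.

```python
import math

def estimate_steps(total_examples, test_pct, batch_size, epochs):
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    # Hold out the test split (rounded up, an assumption) and train on the rest.
    test_examples = math.ceil(total_examples * test_pct / 100)
    train_examples = total_examples - test_examples
    # The last partial batch still counts as one step.
    steps_per_epoch = math.ceil(train_examples / batch_size)
    return train_examples, steps_per_epoch, steps_per_epoch * epochs

train_examples, per_epoch, total = estimate_steps(457_610, 10, 2, 2)
print(train_examples, per_epoch, total)  # 411849 205925 411850
```

Under those assumptions, two epochs come to roughly 412k steps, so the reported 410,000 steps is close to the `num_train_epochs=2` setting in the training script. A different device count or gradient accumulation would change the figure.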
from transformers import MBartForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import torch
model_name = "facebook/mbart-large-50"
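# train_test_split is a Dataset method, so load a single split rather than a DatasetDict.
# Assumes the dataset exposes a "train" split.
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized", split="train")
dataset = dataset.train_test_split(test_size=0.1)
datasets = DatasetDict({
    'train': dataset['train'],
    'test': dataset['test']
})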
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"
def preprocess_data(examples):
    # Tokenize the source sentences and their verbalized targets in one call;
    # text_target replaces the deprecated as_target_tokenizer() context manager
    # and stores the target token ids under "labels".
    model_inputs = tokenizer(
        examples["inputs"],
        text_target=examples["labels"],
        max_length=1024,
        truncation=True,
        padding="max_length",
    )
    return model_inputs
datasets = datasets.map(preprocess_data, batched=True)
model = MBartForConditionalGeneration.from_pretrained(model_name)
training_args = TrainingArguments(
    output_dir=f"./results/{model_name}-verbalization",
    evaluation_strategy="steps",
    eval_steps=5000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=40,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)
trainer.train()
trainer.save_model(f"./saved_models/{model_name}-verbalization")
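The preprocessing step pads labels to `max_length`, so padded positions contribute to the loss unless they are masked. A common convention is to replace padding ids in the labels with -100, the default ignore index of PyTorch's cross-entropy loss. The card does not say whether the original run did this, so the following is only a minimal sketch; `mask_padding` is a hypothetical helper and the token ids are toy values.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy ids for illustration only; pad_token_id=1 stands in for tokenizer.pad_token_id.
masked = mask_padding([[250048, 47, 2, 1, 1]], pad_token_id=1)
print(masked)  # [[250048, 47, 2, -100, -100]]
```

In the training script, this could be applied to the `labels` field produced by `preprocess_data`, passing `tokenizer.pad_token_id` as the padding id.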
Usage
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "skypro1111/mbart-large-50-verbalization"
# Load with the mBART class imported above (the checkpoint is mBART, not T5).
model = MBartForConditionalGeneration.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    device_map=device,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"
input_text = "<verbalization>:Цей додаток вийде 15.06.2025."
encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
output_ids = model.generate(**encoded_input, max_length=1024, num_beams=5, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
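The usage example prefixes its input with `<verbalization>:`. When verbalizing several sentences at once, it may help to build the prefixed inputs in one place. The sketch below does only that; `build_inputs` is a hypothetical helper, the second sentence is an invented example, and whether the prefix is strictly required is not stated in this card.

```python
PREFIX = "<verbalization>:"  # task prefix taken from the usage example

def build_inputs(sentences):
    """Strip whitespace, drop empty strings, and prepend the task prefix."""
    return [PREFIX + s.strip() for s in sentences if s.strip()]

batch = build_inputs(["Цей додаток вийде 15.06.2025.", "  ", "Ціна становить 250 грн."])
for text in batch:
    print(text)
```

The resulting list can be passed to the tokenizer in place of `input_text`, with `padding=True` so the sentences are batched together, and the generated ids decoded with `tokenizer.batch_decode(output_ids, skip_special_tokens=True)`.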
Performance
No quantitative evaluation metrics were computed for this model. Its performance has been assessed only through its use in making TTS output sound more natural.
Limitations and Ethical Considerations
Users should be aware of the model's potential limitations in understanding highly nuanced or domain-specific content. Ethical considerations, including fairness and bias, are also crucial when deploying this model in real-world applications.
Citation
Ubertext 2.0
@inproceedings{chaplynskyi-2023-introducing,
title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
author = "Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.1",
pages = "1--10",
}
mBart-large-50
@article{tang2020multilingual,
title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
year={2020},
eprint={2008.00401},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License
This model is released under the MIT License, in line with the base mbart-large-50 model.