---
library_name: transformers
license: mit
datasets:
- skypro1111/ubertext-2-news-verbalized
language:
- uk
---
# Model Card for mbart-large-50-verbalization
## Model Description
`mbart-large-50-verbalization` is a fine-tuned version of the [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) model, designed to verbalize Ukrainian text in preparation for Text-to-Speech (TTS) systems: it expands structured items such as numbers and dates into their fully written-out Ukrainian forms.
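The pair below illustrates the kind of transformation the model targets; the expanded form is a hand-written example given for clarity, not a guaranteed model output.
```python
# Illustrative before/after pair for the verbalization task
# (the verbalized string is a hypothetical rendering, not actual model output):
raw_text = "Цей додаток вийде 15.06.2025."
verbalized = "Цей додаток вийде п'ятнадцятого червня дві тисячі двадцять п'ятого року."
```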
## Architecture
This model is based on the [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) architecture, renowned for its effectiveness in translation and text generation tasks across numerous languages.
## Training Data
The model was fine-tuned on a subset of 96,780 sentences from the UberText 2.0 corpus, focusing on news content. The verbalized targets were generated with Google Gemini Pro, providing a rich basis for learning the text-transformation task.
Dataset: [skypro1111/ubertext-2-news-verbalized](https://huggingface.co/datasets/skypro1111/ubertext-2-news-verbalized)
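The snippet below is a minimal sketch of how the dataset can be loaded and inspected; it assumes the default `train` split and the `inputs`/`labels` column names used in the training code further down.
```python
from datasets import load_dataset

# Load the verbalization dataset from the Hugging Face Hub.
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized", split="train")

# Each record pairs a source sentence ("inputs") with its verbalized form ("labels").
example = dataset[0]
print(example["inputs"])
print(example["labels"])
```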
## Training Procedure
The model was trained for 70,000 steps (almost 2 epochs); training beyond that point degraded the results.
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict

model_name = "facebook/mbart-large-50"

# Load the verbalization dataset and hold out 10% of it for evaluation.
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized", split="train")
dataset = dataset.train_test_split(test_size=0.1)
datasets = DatasetDict({
    "train": dataset["train"],
    "test": dataset["test"],
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"

def preprocess_data(examples):
    # Tokenize the source sentences and their verbalized targets.
    model_inputs = tokenizer(examples["inputs"], max_length=1024, truncation=True, padding="max_length")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["labels"], max_length=1024, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

datasets = datasets.map(preprocess_data, batched=True)

model = MBartForConditionalGeneration.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir=f"./results/{model_name}-verbalization",
    evaluation_strategy="steps",
    eval_steps=5000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=40,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

trainer.train()
trainer.save_model(f"./saved_models/{model_name}-verbalization")
```
## Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "skypro1111/mbart-large-50-verbalization"

model = MBartForConditionalGeneration.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    device_map=device,  # device_map requires the accelerate package
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"

# Input text is prefixed with the "<verbalization>:" task marker.
input_text = "<verbalization>:Цей додаток вийде 15.06.2025."

encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
output_ids = model.generate(**encoded_input, max_length=1024, num_beams=5, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```
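Building on the snippet above (reusing the `model`, `tokenizer`, and `device` objects), a small wrapper can prepend the task prefix and handle batches of sentences. The `verbalize` helper below is an illustrative sketch, not part of the released model's API.
```python
def verbalize(sentences, max_length=1024):
    """Verbalize a batch of Ukrainian sentences (illustrative helper, not part of the model API)."""
    # Prepend the "<verbalization>:" task prefix expected by the model card's usage example.
    prefixed = [f"<verbalization>:{s}" for s in sentences]
    encoded = tokenizer(prefixed, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length).to(device)
    with torch.no_grad():
        output_ids = model.generate(**encoded, max_length=max_length, num_beams=5, early_stopping=True)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

print(verbalize(["Зустріч заплановано на 10.03.2024.", "Квиток коштує 250 грн."]))
```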
## Performance
No formal evaluation metrics were computed for this model; its quality has mainly been assessed through its effect on the naturalness of downstream TTS output.
## Limitations and Ethical Considerations
Users should be aware of the model's potential limitations in understanding highly nuanced or domain-specific content. Ethical considerations, including fairness and bias, are also crucial when deploying this model in real-world applications.
## Citation
Ubertext 2.0
```
@inproceedings{chaplynskyi-2023-introducing,
title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
author = "Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.1",
pages = "1--10",
}
```
mBART-large-50
```
@article{tang2020multilingual,
title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
year={2020},
eprint={2008.00401},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## License
This model is released under the MIT License, in line with the base mbart-large-50 model.