---
library_name: transformers
license: mit
datasets:
- skypro1111/ubertext-2-news-verbalized
language:
- uk
---

# Model Card for mbart-large-50-verbalization

## Model Description
`mbart-large-50-verbalization` is a fine-tuned version of [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) for verbalizing Ukrainian text before it is fed to Text-to-Speech (TTS) systems: it expands structured elements such as numbers and dates into their fully written-out Ukrainian forms.

## Architecture
This model is based on the [facebook/mbart-large-50](https://huggingface.co/facebook/mbart-large-50) architecture, a multilingual sequence-to-sequence transformer pretrained on 50 languages and widely used for translation and text generation tasks.

## Training Data
The model was fine-tuned on a subset of 96,780 sentences from the Ubertext dataset, focusing on news content. The verbalized targets were generated with Google Gemini Pro, providing a rich basis for learning the text transformation task.
Dataset: [skypro1111/ubertext-2-news-verbalized](https://huggingface.co/datasets/skypro1111/ubertext-2-news-verbalized)
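
A quick way to inspect the data (the `inputs`/`labels` column names below match the preprocessing code in the training script):

```python
from datasets import load_dataset

# Load the verbalization dataset; "inputs" holds raw sentences,
# "labels" their fully verbalized Ukrainian equivalents.
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized")
example = dataset["train"][0]
print(example["inputs"])
print(example["labels"])
```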

## Training Procedure
The model was trained for 70,000 steps (almost 2 epochs); with further training the results degraded, so training was stopped at that point.

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import torch

model_name = "facebook/mbart-large-50"

dataset = load_dataset("skypro1111/ubertext-2-news-verbalized")
dataset = dataset.train_test_split(test_size=0.1)
datasets = DatasetDict({
    'train': dataset['train'],
    'test': dataset['test']
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"


def preprocess_data(examples):
    # "inputs" holds the raw sentences, "labels" their verbalized equivalents.
    model_inputs = tokenizer(examples["inputs"], max_length=1024, truncation=True, padding="max_length")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["labels"], max_length=1024, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

datasets = datasets.map(preprocess_data, batched=True)

model = MBartForConditionalGeneration.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir=f"./results/{model_name}-verbalization",
    evaluation_strategy="steps",
    eval_steps=5000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=40,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

trainer.train()
trainer.save_model(f"./saved_models/{model_name}-verbalization")
```

## Usage
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "skypro1111/mbart-large-50-verbalization"

model = MBartForConditionalGeneration.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        device_map=device,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"

input_text = "<verbalization>:Цей додаток вийде 15.06.2025."

encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
output_ids = model.generate(**encoded_input, max_length=1024, num_beams=5, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
```
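
Building on the snippet above (reusing `model`, `tokenizer`, and `device`), a batched variant can verbalize several sentences at once before handing them to a TTS front end; the example sentences are illustrative:

```python
sentences = [
    "<verbalization>:Зустріч почнеться о 10:30.",
    "<verbalization>:Бюджет проєкту складає 2500000 гривень.",
]
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
with torch.no_grad():
    output_ids = model.generate(**batch, max_length=1024, num_beams=5, early_stopping=True)
for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
```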

## Performance
No quantitative evaluation metrics are reported for this model; its quality has so far been judged by how much it improves the naturalness of downstream TTS output.
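
As a rough, do-it-yourself sanity check (not an evaluation reported for this model), one could compare model outputs against the reference verbalizations in the test split using a simple character-level similarity from the Python standard library:

```python
import difflib

def char_similarity(prediction: str, reference: str) -> float:
    # Ratio in [0, 1]; 1.0 means an exact character-level match.
    return difflib.SequenceMatcher(None, prediction, reference).ratio()

# Illustrative strings; in practice, compare model outputs against the
# "labels" column of the held-out test split.
prediction = "Цей додаток вийде п'ятнадцятого червня дві тисячі двадцять п'ятого року"
reference = "Цей додаток вийде п'ятнадцятого червня дві тисячі двадцять п'ятого року."
print(f"character similarity: {char_similarity(prediction, reference):.3f}")
```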

## Limitations and Ethical Considerations
Users should be aware of the model's potential limitations in understanding highly nuanced or domain-specific content. Ethical considerations, including fairness and bias, are also crucial when deploying this model in real-world applications.

## Citation
Ubertext 2.0
```bibtex
@inproceedings{chaplynskyi-2023-introducing,
  title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
  author = "Chaplynskyi, Dmytro",
  booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.unlp-1.1",
  pages = "1--10",
}
```
mBart-large-50
```bibtex
@article{tang2020multilingual,
    title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
    author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
    year={2020},
    eprint={2008.00401},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

## License
This model is released under the MIT License, in line with the base mbart-large-50 model.