nijatzeynalov committed on
Commit
da86749
1 Parent(s): 53c3ce9

Update README.md

  1. README.md +55 -2
README.md CHANGED
@@ -32,7 +32,7 @@ pipeline_tag: summarization

  # mT5-small based Azerbaijani Summarization

- In this project, [Google's Multilingual T5-small](https://github.com/google-research/multilingual-t5) is fine-tuned on the [Azerbaijani News Summary Dataset](https://huggingface.co/datasets/nijatzeynalov/azerbaijani-multi-news) for the **summarization** downstream task. The model was trained for 3 epochs with a batch size of 64 and a learning rate of 10e-4. Training took almost 12 hours on a GPU instance with an Ubuntu Server 20.04 LTS image in Microsoft Azure. The maximum news length is 2048 and the maximum summary length is 128.


  mT5 is a multilingual variant of __T5__ and is pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual)
@@ -64,4 +64,57 @@ The T5 model, pre-trained on C4, achieves state-of-the-art results on many NLP b

  mT5 is pre-trained only in an unsupervised manner on multiple languages; it is not trained for specific downstream tasks. Put plainly, the pre-trained model can generate correct text in Azerbaijani, but it cannot perform specific tasks such as summarization, correction, or machine translation.

- Therefore, I fine-tuned this model for summarization in Azerbaijani using the [Azerbaijani News Summary Dataset](https://huggingface.co/datasets/nijatzeynalov/azerbaijani-multi-news).

  # mT5-small based Azerbaijani Summarization

+ In this model, [Google's Multilingual T5-small](https://github.com/google-research/multilingual-t5) is fine-tuned on the [Azerbaijani News Summary Dataset](https://huggingface.co/datasets/nijatzeynalov/azerbaijani-multi-news) for the **summarization** downstream task. The model was trained for 3 epochs with a batch size of 64 and a learning rate of 10e-4. Training took almost 12 hours on a GPU instance with an Ubuntu Server 20.04 LTS image in Microsoft Azure. The maximum news length is 2048 and the maximum summary length is 128.


  mT5 is a multilingual variant of __T5__ and is pre-trained only on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual)
 

  mT5 is pre-trained only in an unsupervised manner on multiple languages; it is not trained for specific downstream tasks. Put plainly, the pre-trained model can generate correct text in Azerbaijani, but it cannot perform specific tasks such as summarization, correction, or machine translation.

+ Several sizes of mT5 are available on Hugging Face, and here I used the small one (google/mt5-small). I therefore fine-tuned this model for summarization in Azerbaijani using the [Azerbaijani News Summary Dataset](https://huggingface.co/datasets/nijatzeynalov/azerbaijani-multi-news).
+
+
+ ## Training hyperparameters
+
+ Training the __mT5-based-azerbaijani-summarize__ model took almost 12 hours on a GPU instance with an Ubuntu Server 20.04 LTS image in Microsoft Azure. The following hyperparameters were used during training:
+
+ - learning_rate: 0.0005
+ - train_batch_size: 2
+ - eval_batch_size: 1
+ - seed: 42
+ - gradient_accumulation_steps: 16
+ - total_train_batch_size: 64
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 90
+ - num_epochs: 10
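
The batch-size and scheduler entries above can be related with a little arithmetic. The following is a minimal sketch, not the training script: it assumes that `total_train_batch_size` is the product of the per-device batch size, the gradient accumulation steps, and the number of devices (the device count is not stated in the README), that the linear scheduler warms up from zero and then decays to zero, and that the train split has 100,413 examples as reported in the Dataset section. The function names are hypothetical.

```python
import math

# Values copied from the hyperparameter list above.
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 16
TOTAL_BATCH = 64
BASE_LR = 5e-4
WARMUP_STEPS = 90
NUM_EPOCHS = 10

# Assumption: train-split size as reported in the Dataset section.
TRAIN_EXAMPLES = 100_413


def implied_num_devices(per_device: int, accum: int, total: int) -> int:
    """Device count implied by total = per_device * accum * devices."""
    devices, remainder = divmod(total, per_device * accum)
    if remainder:
        raise ValueError("total batch size is not a multiple of per_device * accum")
    return devices


def linear_schedule_lr(step: int, base_lr: float, warmup: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup))


devices = implied_num_devices(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, TOTAL_BATCH)

# Rough estimate of optimizer steps; a real trainer may round differently.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / TOTAL_BATCH)
total_steps = steps_per_epoch * NUM_EPOCHS

print("implied devices:", devices)
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)
for step in (0, 45, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, total_steps))
```

Under these assumptions the listed values imply 2 devices and about 1,569 optimizer steps per epoch, so the 90 warmup steps cover only a small fraction of the first epoch.
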
+
+ ## Dataset
+
+ The model was trained on the [__az-news-summary__ dataset](https://huggingface.co/datasets/nijatzeynalov/azerbaijani-multi-news), a comprehensive and diverse dataset comprising 143k (143,448) Azerbaijani news articles extracted using a set of carefully designed heuristics.
+
+ The dataset covers common news topics, including war, government, politics, education, health, the environment, economy, business, fashion, entertainment, and sport, as well as quirky or unusual events.
+
+ This dataset has 3 splits: _train_, _validation_, and _test_. \
+ Token counts are whitespace-based.
+
+ | Dataset Split | Number of Instances | Size (MB) |
+ | ------------- | --------------------|:----------------------|
+ | Train | 100,413 | 150 |
+ | Validation | 14,344 | 21.3 |
+ | Test | 28,691 | 42.8 |
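
As a quick consistency check on the table above, the split sizes can be summed and turned into proportions. This is only an illustrative sketch; the counts are copied from the table and nothing is loaded from the dataset itself.

```python
# Split sizes copied from the table above.
splits = {"train": 100_413, "validation": 14_344, "test": 28_691}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print("total instances:", total)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The three splits add up to the 143,448 articles quoted for the dataset and correspond to roughly a 70/10/20 train/validation/test division.
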
+
+
+ ## Training results with comparison
+
+ The model achieves the following results on the test set:
+
+ - Rouge1: 39.4222
+ - Rouge2: 24.8624
+ - RougeL: 32.2487
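
For readers unfamiliar with the metrics, the sketch below shows what ROUGE-1, ROUGE-2, and ROUGE-L measure. It is a simplified illustration, assuming whitespace tokenization and F-measure scoring; it is not the evaluation code used for the scores above, and standard ROUGE packages may apply extra normalization and therefore produce different numbers. The function names and the English example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """F-measure from an overlap count and the two totals."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F-measure based on clipped n-gram overlap."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F-measure based on the longest common subsequence."""
    pred, ref = prediction.split(), reference.split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(prediction, reference, 1))
print(rouge_n(prediction, reference, 2))
print(rouge_l(prediction, reference))
```

These functions return values between 0 and 1, whereas the scores listed above are on a 0 to 100 scale.
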
+
+ For the __Azerbaijani text summarization downstream task__, mT5_multilingual_XLSum has also been developed on the 45 languages of the [XL-Sum](https://huggingface.co/datasets/csebuetnlp/xlsum) dataset. For fine-tuning details and scripts,
+ see the [paper](https://aclanthology.org/2021.findings-acl.413/) and the [official repository](https://github.com/csebuetnlp/xl-sum).
+
+ Scores on the XL-Sum test sets for Azerbaijani are as follows:
+
+ - Rouge1: 21.4227
+ - Rouge2: 9.5214
+ - RougeL: 19.3331
+
+ As the numbers show, the __mT5-based-azerbaijani-summarize__ model scores substantially higher than __mT5_multilingual_XLSum__ on all three metrics; note that the latter's scores are reported on the XL-Sum test set.
+
+ ## Using this model in transformers