mBART-TeSum
This model is a fine-tuned version of facebook/mbart-large-50 on the TeSum dataset. More details about the training and analysis are given in the paper.
Model description
mBART-50 is a multilingual sequence-to-sequence model. It was introduced to show that multilingual translation models can be created through multilingual fine-tuning. Instead of fine-tuning on one direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 is built from the original mBART model and extended with 25 additional languages, so that it supports multilingual machine translation across 50 languages. The pre-training objective is explained below.
Multilingual Denoising Pretraining: The model incorporates N languages by concatenating data:
D = {D_1, ..., D_N}
where each D_i is a collection of monolingual documents in language i.
The source documents are noised using two schemes: first, randomly shuffling the order of the original sentences, and second, a novel in-filling scheme in which spans of text are replaced with a single mask token. The model is then tasked with reconstructing the original text.
35% of each instance's words are masked by randomly sampling a span length according to a Poisson distribution (λ = 3.5).
The decoder input is the original text with one position offset. A language id symbol LID is used as the initial token to predict the sentence.
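The two noising schemes described above can be sketched in plain Python. This is a minimal illustration, not the original pre-training code: the helper names `sample_poisson` and `add_noise` are hypothetical, words are split on whitespace rather than with the model's subword tokenizer, zero-length spans are treated as length one, and overlapping spans are merged into a single mask token.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_noise(sentences, rng, mask_ratio=0.35, lam=3.5, mask_token="<mask>"):
    """Shuffle sentence order, then replace sampled word spans with one mask token each."""
    # Scheme 1: permute the sentences of the document.
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    words = " ".join(shuffled).split()

    # Scheme 2: text in-filling. Mark spans until mask_ratio of the words are covered.
    budget = int(round(mask_ratio * len(words)))
    masked = [False] * len(words)
    covered = 0
    while covered < budget:
        # Simplification (assumption): a zero-length span is treated as length one,
        # and the span is clipped so the budget is never exceeded.
        span = min(max(1, sample_poisson(lam, rng)), budget - covered)
        start = rng.randrange(0, len(words) - span + 1)
        for i in range(start, start + span):
            if not masked[i]:
                masked[i] = True
                covered += 1

    # Each run of consecutive masked words collapses into a single mask token.
    noised = []
    for word, is_masked in zip(words, masked):
        if not is_masked:
            noised.append(word)
        elif not noised or noised[-1] != mask_token:
            noised.append(mask_token)
    return " ".join(noised)


document = ["the cat sat on the mat .", "it was warm .", "the dog slept nearby ."]
print(add_noise(document, random.Random(0)))
```

The model would be trained to map the printed noised string back to the original, unshuffled document.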
Intended uses & limitations
mbart-large-50 is a pre-trained model primarily aimed at being fine-tuned on translation tasks. It can also be fine-tuned on other multilingual sequence-to-sequence tasks. See the model hub to look for fine-tuned versions.
Training
As the model is multilingual, it expects sequences in a specific format: a special language id token is used as a prefix in both the source and target text.
The text format is [lang_code] X [eos], where X is the source or target text and lang_code is source_lang_code for source text and tgt_lang_code for target text. bos is never used. Once the examples are prepared in this format, the model can be trained like any other sequence-to-sequence model.
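The [lang_code] X [eos] layout can be illustrated on plain token lists. This is only a sketch: `format_example` and `shift_right` are hypothetical helpers, whitespace splitting stands in for the real subword tokenizer, and the wrap-around shift is an assumption about how a one-position-offset decoder input can be built when no bos token is used. Check the Transformers source before relying on it.

```python
EOS = "</s>"  # end-of-sequence token string used by mBART tokenizers


def format_example(text, lang_code):
    """Return the token layout [lang_code] X [eos] as a list of strings."""
    return [lang_code] + text.split() + [EOS]


def shift_right(tokens):
    """Offset the sequence by one position by moving the final token to the front."""
    return tokens[-1:] + tokens[:-1]


# Summarization keeps source and target in the same language, so both use te_IN.
source = format_example("article text here", "te_IN")
target = format_example("summary here", "te_IN")
decoder_input = shift_right(target)

print(source)
print(target)
print(decoder_input)
```

In practice the tokenizer adds the language code and eos token itself, so this layout never has to be assembled by hand.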
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("ashokurlana/mBART-TeSum")
tokenizer = MBart50TokenizerFast.from_pretrained("ashokurlana/mBART-TeSum", src_lang="te_IN", tgt_lang="te_IN")
src_text = "తెలంగాణలో సచలనం సృష్టించిన టీఎస్పీఎస్సీ పేపర్ లీకేజీ వ్యవహారంపై ప్రభుత్వం తరపున మంత్రి కేటీఆర్ తొలిసారి స్పందించారు. ఇది వ్యవస్థ వైఫల్యం కాదని.., ఇద్దరు వ్యక్తులు చేసిన తప్పు అని కేటీఆర్ వెల్లడించారు. ఈ వ్యవహారం వెనుక ఏ పార్టీకి చెందిన వారున్నా.., ఎంతటి వారైనా కఠినంగా శిక్షిస్తామని చెప్పారు. నిరుద్యోగుల్లో ఆందోళనలు రేకెత్తించేలా ప్రతిపక్షాలు మాట్లాడటం సరికాదని హితవు పలికారు."
tgt_text = "తెలంగాణలో సచలనం సృష్టించిన టీఎస్ పీఎస్సీ పేపర్ లీకేజీ వ్యవహారంపై ప్రభుత్వం తరపున మంత్రి కేటీఆర్ స్పందించారు. ఇది వ్యవస్థ వైఫల్యం కాదని, ఇద్దరు వ్యక్తులు చేసిన తప్పు అని, ఈ వ్యవహారం వెనుక ఏ పార్టీకి చెందిన వారున్నా కఠినంగా శిక్షిస్తామని చెప్పారు."
model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids
model(**model_inputs, labels=labels) # forward pass
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2.0
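To make the `lr_scheduler_type: linear` setting concrete, the sketch below computes the learning rate at a given step under linear decay from the listed base rate of 5e-05. The function name `linear_lr` is hypothetical, the total of 1000 steps is purely illustrative, and zero warmup steps is an assumption, since the list above does not state either value.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # illustrative value, not taken from the actual training run
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps))
```

With no warmup, the rate starts at the base value and reaches zero at the final step.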
Evaluation results
It achieves the following results on the evaluation set:
- Loss: 1.4009
- Rouge1: 32.8603
- Rouge2: 12.2822
- RougeL: 31.7473
- RougeLsum: 32.505
- Gen Len: 117.6326
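For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference. The sketch below is a simplified illustration only: `rouge1_f1` is a hypothetical helper that splits on whitespace and applies no stemming, and it is not the evaluation script that produced the numbers above.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two whitespace-tokenized strings, on a 0-100 scale."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 100 * 2 * precision * recall / (precision + recall)


print(rouge1_f1("the minister responded to the leak", "the minister responded"))
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence.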
Framework versions
- Transformers 4.19.0.dev0
- Pytorch 1.11.0
- Datasets 2.1.0
- Tokenizers 0.12.1
BibTeX entry and citation info
@inproceedings{urlana-etal-2022-tesum,
title = "{T}e{S}um: Human-Generated Abstractive Summarization Corpus for {T}elugu",
author = "Urlana, Ashok and
Surange, Nirmal and
Baswani, Pavan and
Ravva, Priyanka and
Shrivastava, Manish",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.614",
pages = "5712--5722",
abstract = "Expert human annotation for summarization is definitely an expensive task, and can not be done on huge scales. But with this work, we show that even with a crowd sourced summary generation approach, quality can be controlled by aggressive expert informed filtering and sampling-based human evaluation. We propose a pipeline that crowd-sources summarization data and then aggressively filters the content via: automatic and partial expert evaluation. Using this pipeline we create a high-quality Telugu Abstractive Summarization dataset (TeSum) which we validate with sampling-based human evaluation. We also provide baseline numbers for various models commonly used for summarization. A number of recently released datasets for summarization, scraped the web-content relying on the assumption that summary is made available with the article by the publishers. While this assumption holds for multiple resources (or news-sites) in English, it should not be generalised across languages without thorough analysis and verification. Our analysis clearly shows that this assumption does not hold true for most Indian language news resources. We show that our proposed filtration pipeline can even be applied to these large-scale scraped datasets to extract better quality article-summary pairs.",
}