mBART XL-Sum 2
mbart-xlsum-2 is a multilingual abstractive news summarization model fine-tuned from facebook/mbart-large-50-many-to-many-mmt on 15 language subsets of csebuetnlp/xlsum. It was developed as part of a multilingual news aggregation and summarization project.
This is the project's quality-oriented second mBART-50 training variant. It uses a shorter 64-token training target, generation-based validation, ROUGE-L checkpoint selection, and alpha-smoothed language resampling to increase the visibility of lower-resource languages. In the final evaluation it achieved small gains over the first variant in ROUGE, BLEU, METEOR-lite, latency, completeness, and bigram novelty, while showing lower source coverage in the project's automatic diagnostics.
Model Details
| Property | Value |
|---|---|
| Task | Multilingual abstractive news summarization |
| Base model | facebook/mbart-large-50-many-to-many-mmt |
| Architecture | MBartForConditionalGeneration |
| Encoder / decoder layers | 12 / 12 |
| Attention heads | 16 |
| Hidden size | 1,024 |
| Vocabulary size | 250,054 |
| Maximum model positions | 1,024 |
| Fine-tuning source length | 512 tokens |
| Fine-tuning target length | 64 tokens |
| Framework | PyTorch / Transformers |
Languages
The fine-tuning mixture contains the following 15 languages:
| Language | mBART code | Language | mBART code |
|---|---|---|---|
| Arabic | ar_AR | Chinese (Simplified) | zh_CN |
| English | en_XX | French | fr_XX |
| Hindi | hi_IN | Indonesian | id_ID |
| Japanese | ja_XX | Korean | ko_KR |
| Persian | fa_IR | Portuguese | pt_XX |
| Russian | ru_RU | Spanish | es_XX |
| Turkish | tr_TR | Ukrainian | uk_UA |
| Vietnamese | vi_VN | | |
Quantitative evaluation was completed for 11 of these languages: Arabic, Chinese (Simplified), English, French, Hindi, Japanese, Korean, Russian, Spanish, Turkish, and Vietnamese. Results for Indonesian, Persian, Portuguese, and Ukrainian are not reported in the final common evaluation, so performance claims should not be extrapolated to them.
Intended Use
The model is intended for:
- summarizing news articles in a supported source language;
- research on multilingual abstractive summarization;
- experiments involving low-resource language balancing;
- integration into news-processing pipelines with human review.
The model is not intended to be a factual authority, to make editorial decisions autonomously, or to summarize high-stakes medical, legal, financial, or emergency information without verification against the source.
Training Data
The checkpoint was trained on csebuetnlp/xlsum, a multilingual collection of BBC news articles paired with short reference summaries. Examples from the 15 selected subsets were annotated with their corresponding mBART language codes.
| Split | Examples used |
|---|---|
| Train | 818,123 after empty-example filtering and resampling |
| Validation | 7,500 (up to 500 per language) |
| Test | 7,500 (up to 500 per language) |
Language Resampling
To reduce domination by high-resource languages, training examples were resampled with:
p(language) proportional to n(language)^alpha, where alpha = 0.5
The total training-set size was kept approximately constant. High-resource languages were downsampled while lower-resource languages were oversampled with replacement. This improves exposure balance but can also increase repetition of examples from the smallest subsets.
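The resampling rule can be illustrated with a minimal sketch. The function names and the subset sizes below are hypothetical and chosen only for illustration; they are not the real XL-Sum counts, and the project's actual resampling code may differ in how it rounds or draws examples.

```python
def resampling_probabilities(counts, alpha=0.5):
    """Return p(language) proportional to n(language) ** alpha, normalized to sum to 1."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


def resampled_sizes(counts, alpha=0.5):
    """Target examples per language, keeping the total size approximately constant."""
    total_examples = sum(counts.values())
    probs = resampling_probabilities(counts, alpha)
    return {lang: round(p * total_examples) for lang, p in probs.items()}


# Hypothetical subset sizes, for illustration only (not the real XL-Sum counts).
counts = {"en_XX": 300_000, "tr_TR": 27_000, "ko_KR": 4_000}

for lang, target in resampled_sizes(counts).items():
    # A target above the original count can only be met by sampling with replacement.
    action = "oversample with replacement" if target > counts[lang] else "downsample"
    print(f"{lang}: {counts[lang]} -> {target} ({action})")
```

With alpha = 1 the original proportions are preserved, and with alpha = 0 every language receives an equal share; alpha = 0.5 sits between the two.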
Training Procedure
| Hyperparameter | Value |
|---|---|
| Maximum optimization steps | 35,000 |
| Completed epochs | approximately 1.369 |
| Learning rate | 1e-4 |
| Optimizer | Adafactor |
| Scheduler | inverse square root |
| Warmup steps | 5,000 |
| Weight decay | 0.0 |
| Per-device training batch size | 8 |
| Gradient accumulation | 4 |
| Effective batch size | 32 |
| Evaluation / save interval | 5,000 steps |
| Best checkpoint metric | validation ROUGE-L |
| Maximum source length | 512 |
| Maximum target length | 64 |
| Training precision | float16 |
| Seed | 42 |
Gradient checkpointing and generation-based validation were enabled. Training ran on an NVIDIA A100-SXM4-40GB GPU and completed with a reported training loss of 2.1509 in approximately 12 hours 7 minutes.
The selected checkpoint reported the following internal held-out results on 7,500 examples: validation ROUGE-L 0.1860 and test ROUGE-L 0.1851. These results used the training notebook's evaluation path and are not directly interchangeable with the common final evaluation below.
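The effective batch size and the epoch count in the hyperparameter table can be cross-checked from the other reported values. The short calculation below assumes a single training device, which is consistent with the single A100 reported above; the variable names are illustrative.

```python
per_device_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 35_000
train_examples = 818_123  # training examples after filtering and resampling

# Assumes one training device, so no multiplication by a device count.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
examples_seen = max_steps * effective_batch_size
epochs = examples_seen / train_examples

print(effective_batch_size)  # 32
print(examples_seen)         # 1120000
print(round(epochs, 3))      # 1.369
```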
Evaluation
Protocol
The final evaluation used the complete available test subsets of csebuetnlp/xlsum for 11 languages, totaling 52,219 examples per model. No per-language sample cap was applied. The overall scores below are macro-averaged across languages, so each language contributes equally regardless of its test-set size. All scores are reported on a 0-1 scale.
Articles were truncated to 512 tokens. Evaluation used batch size 8 and mixed precision on CUDA. The common mBART decoding configuration was:
max_length=90
min_length=24
num_beams=5
length_penalty=1.1
no_repeat_ngram_size=3
repetition_penalty=1.05
early_stopping=True
Turkish used a language-specific configuration with max_length=84, min_length=26, 6 beams, length penalty 1.3, 4-gram repetition blocking, and repetition penalty 1.2. The target language token was forced for every language.
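One way to keep these settings in one place is a small lookup that merges the common configuration with a per-language override. This is a sketch only: the names COMMON_GENERATION_KWARGS, LANGUAGE_OVERRIDES, and generation_kwargs_for are hypothetical and do not come from the project's evaluation code. It also assumes early_stopping=True applies to Turkish, as in the usage example later in this card.

```python
# Common mBART decoding configuration from the evaluation protocol.
COMMON_GENERATION_KWARGS = {
    "max_length": 90,
    "min_length": 24,
    "num_beams": 5,
    "length_penalty": 1.1,
    "no_repeat_ngram_size": 3,
    "repetition_penalty": 1.05,
    "early_stopping": True,
}

# Language-specific overrides; only Turkish is documented in this card.
LANGUAGE_OVERRIDES = {
    "tr_TR": {
        "max_length": 84,
        "min_length": 26,
        "num_beams": 6,
        "length_penalty": 1.3,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.2,
    },
}


def generation_kwargs_for(language_code):
    """Merge the common settings with any override for the given mBART code."""
    return {**COMMON_GENERATION_KWARGS, **LANGUAGE_OVERRIDES.get(language_code, {})}


print(generation_kwargs_for("tr_TR"))
print(generation_kwargs_for("en_XX"))
```

The resulting dictionary can be unpacked into model.generate alongside forced_bos_token_id, which still has to be set separately to the target language token.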
Overall Results
| Metric | Score |
|---|---|
| ROUGE-1 F1 | 0.3574 |
| ROUGE-2 F1 | 0.1746 |
| ROUGE-L F1 | 0.2792 |
| BLEU | 0.1228 |
| METEOR-lite | 0.3181 |
METEOR-lite is the project's lightweight exact-token implementation and must not be interpreted as the canonical METEOR package score.
Results by Language
| Language | Samples | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | METEOR-lite |
|---|---|---|---|---|---|---|
| Arabic | 4,689 | 0.2660 | 0.1101 | 0.2261 | 0.0743 | 0.2260 |
| English | 11,535 | 0.3448 | 0.1285 | 0.2724 | 0.0930 | 0.2990 |
| Spanish | 4,763 | 0.3010 | 0.1021 | 0.2251 | 0.0703 | 0.2500 |
| French | 1,086 | 0.3117 | 0.1331 | 0.2500 | 0.0958 | 0.2681 |
| Hindi | 8,847 | 0.5695 | 0.2945 | 0.4089 | 0.2157 | 0.5293 |
| Japanese | 889 | 0.5078 | 0.3123 | 0.3689 | 0.2234 | 0.4557 |
| Korean | 550 | 0.4351 | 0.2506 | 0.3436 | 0.1886 | 0.4185 |
| Russian | 7,780 | 0.2425 | 0.0943 | 0.2017 | 0.0645 | 0.1975 |
| Turkish | 3,397 | 0.2332 | 0.1014 | 0.2092 | 0.0728 | 0.2026 |
| Vietnamese | 4,013 | 0.3308 | 0.1594 | 0.2526 | 0.0960 | 0.2907 |
| Chinese (Simplified) | 4,670 | 0.3890 | 0.2342 | 0.3128 | 0.1558 | 0.3613 |
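The overall ROUGE scores and the total example count can be reproduced from the per-language rows. The snippet below copies the sample counts and ROUGE columns from the table and takes an unweighted mean over the 11 languages; the variable and function names are illustrative.

```python
# (samples, ROUGE-1, ROUGE-2, ROUGE-L) copied from the per-language table.
results = {
    "Arabic": (4689, 0.2660, 0.1101, 0.2261),
    "English": (11535, 0.3448, 0.1285, 0.2724),
    "Spanish": (4763, 0.3010, 0.1021, 0.2251),
    "French": (1086, 0.3117, 0.1331, 0.2500),
    "Hindi": (8847, 0.5695, 0.2945, 0.4089),
    "Japanese": (889, 0.5078, 0.3123, 0.3689),
    "Korean": (550, 0.4351, 0.2506, 0.3436),
    "Russian": (7780, 0.2425, 0.0943, 0.2017),
    "Turkish": (3397, 0.2332, 0.1014, 0.2092),
    "Vietnamese": (4013, 0.3308, 0.1594, 0.2526),
    "Chinese (Simplified)": (4670, 0.3890, 0.2342, 0.3128),
}


def macro_average(column):
    """Unweighted mean of one column across languages."""
    return sum(row[column] for row in results.values()) / len(results)


total_samples = sum(row[0] for row in results.values())
print(total_samples)                 # 52219
print(round(macro_average(1), 4))    # 0.3574 (ROUGE-1)
print(round(macro_average(2), 4))    # 0.1746 (ROUGE-2)
print(round(macro_average(3), 4))    # 0.2792 (ROUGE-L)
```

Because the mean is unweighted, Korean (550 examples) counts as much as English (11,535 examples); a sample-weighted average would give different numbers.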
Supplementary Diagnostics
These diagnostics are included to describe model behavior, not as established factuality or human-quality measures.
| Diagnostic | Value |
|---|---|
| Mean latency per example | 0.1840 s |
| Mean compression ratio | 23.1007 |
| Mean source coverage | 0.8203 |
| Mean novel bigram ratio | 0.4712 |
| Mean repeated trigram ratio | 0.0036 |
| Mean generated summary length | 31.22 tokens |
| Automatic completeness proxy | 0.4576 |
| Automatic factuality proxy | 0.8249 |
Latency was measured in one CUDA evaluation environment and is not portable across hardware, software versions, or batch configurations. The completeness and factuality values are project-specific automatic proxies; they do not prove semantic completeness or factual correctness.
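This card does not specify how the diagnostics are computed. As a rough illustration of what such ratios usually measure, the sketch below uses common textbook definitions over whitespace tokens. The function names and the definitions are assumptions and may not match the project's implementation; whitespace splitting is also unsuitable for languages such as Chinese and Japanese, where a proper tokenizer would be needed.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_bigram_ratio(source, summary):
    """Fraction of summary bigrams that never occur in the source."""
    source_bigrams = set(ngrams(source.split(), 2))
    summary_bigrams = ngrams(summary.split(), 2)
    if not summary_bigrams:
        return 0.0
    novel = sum(1 for bigram in summary_bigrams if bigram not in source_bigrams)
    return novel / len(summary_bigrams)


def repeated_trigram_ratio(summary):
    """Fraction of summary trigrams that duplicate an earlier trigram."""
    trigrams = ngrams(summary.split(), 3)
    if not trigrams:
        return 0.0
    return 1 - len(set(trigrams)) / len(trigrams)


def compression_ratio(source, summary):
    """Source length divided by summary length, in whitespace tokens."""
    summary_length = len(summary.split())
    return len(source.split()) / summary_length if summary_length else 0.0


source = "the central bank raised interest rates on tuesday to curb inflation"
summary = "the central bank raised rates"

print(novel_bigram_ratio(source, summary))   # 0.25: only ("raised", "rates") is new
print(repeated_trigram_ratio(summary))       # 0.0: no trigram repeats
print(compression_ratio(source, summary))    # 2.2: 11 source tokens / 5 summary tokens
```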
Usage
Replace YOUR_USERNAME with the Hugging Face account or organization that hosts the model.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_id = "YOUR_USERNAME/mbart-xlsum-2"
language_code = "tr_TR"  # mBART code of the article language

tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

article = "Özetlenecek Türkçe haber metni buraya yazılır."

# Set the source language, then truncate to the 512-token fine-tuning length.
tokenizer.src_lang = language_code
inputs = tokenizer(
    article,
    return_tensors="pt",
    max_length=512,
    truncation=True,
).to(device)

# Force the target language token and apply the Turkish decoding configuration.
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[language_code],
        max_length=84,
        min_length=26,
        num_beams=6,
        length_penalty=1.3,
        no_repeat_ngram_size=4,
        repetition_penalty=1.2,
        early_stopping=True,
    )

summary = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(summary)
For languages other than Turkish, the evaluation configuration used max_length=90, min_length=24, 5 beams, length penalty 1.1, 3-gram repetition blocking, and repetition penalty 1.05.
Limitations and Risks
- The model can hallucinate facts, names, dates, quantities, or causal relationships.
- Inputs longer than 512 tokens were truncated during fine-tuning and evaluation; important information near the end of long articles may be omitted.
- XL-Sum is based on BBC news. Performance may decline on other publishers, informal writing, specialist domains, or structurally noisy web text.
- Automatic overlap metrics reward lexical similarity and do not establish factual consistency or summary usefulness.
- Evaluation coverage is uneven across languages and includes only 11 languages in the final common benchmark.
- Alpha-smoothed resampling duplicates examples from smaller language subsets and may increase memorization or language-specific overfitting.
- Higher bigram novelty indicates more reformulation, but may also increase the risk of departing from the source.
- The model may reproduce social, geographic, and editorial biases present in its pretrained parameters and news data.
- Always verify generated summaries against the original article before publication or high-stakes use.
License
XL-Sum is distributed under CC BY-NC-SA 4.0. This model card therefore declares the same license for the fine-tuned checkpoint. Users are responsible for checking the terms of the base model and dataset and for ensuring that their intended use is permitted. In particular, the dataset license restricts commercial use.
Citation
If you use the model, cite the XL-Sum dataset and mBART-50:
- XL-Sum: arXiv:2106.13822
- mBART-50: arXiv:2008.00401
@inproceedings{hasan-etal-2021-xl,
title = {{XL-Sum}: Large-Scale Multilingual Abstractive Summarization for 44 Languages},
author = {Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Samin, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat},
booktitle = {Findings of ACL-IJCNLP 2021},
year = {2021},
pages = {4693--4703},
eprint = {2106.13822},
archivePrefix = {arXiv}
}
@article{tang2020multilingual,
title = {Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author = {Tang, Yuqing and Tran, Chau and Li, Xian and Chen, Peng-Jen and Goyal, Naman and Chaudhary, Vishrav and Gu, Jiatao and Fan, Angela},
journal = {arXiv preprint arXiv:2008.00401},
year = {2020},
eprint = {2008.00401},
archivePrefix = {arXiv}
}
Model Card Author
Mert Samet Kayacıoğlu