# Arabic News Article Summarization with mT5

This project fine-tunes the `google/mt5-small` model on the BBC Arabic news dataset to summarize news articles into concise summaries. Building on the Transformer architecture's strong performance in natural language understanding and generation, the project addresses the linguistic nuances of Arabic.

## Introduction

This project leverages the multilingual capabilities of the `google/mt5-small` model for Arabic text summarization. By fine-tuning the model on the BBC Arabic news dataset, we improve its ability to generate accurate and concise summaries of Arabic news articles. Training uses the Transformers library, and summary quality is measured with ROUGE scores (a scoring sketch appears at the end of this README). You can replicate this model by following the [Training Repo](https://github.com/yalsaffar/mt5-small-Arabic-Summarization).

## Dataset

The dataset consists of news articles from BBC Arabic, split into 32,000 training rows, 4,000 validation rows, and 4,000 test rows.

- **Dataset Source:** [BBC Arabic News Data](https://www.kaggle.com/datasets/fadyelkbeer/arabic-summarization-bbc-news)

## Model

`google/mt5-small` is the small variant of mT5, the multilingual extension of the T5 family covering 101 languages, including Arabic. This project fine-tunes it for Arabic news summarization (a minimal fine-tuning sketch appears at the end of this README).

- **Pretrained Model:** [google/mt5-small](https://huggingface.co/google/mt5-small)

## Usage

To summarize Arabic news articles with this model, follow the steps below:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and a config carrying the generation defaults
tokenizer = AutoTokenizer.from_pretrained("yalsaffar/mt5-small-Arabic-Summarization")
config = AutoConfig.from_pretrained(
    "yalsaffar/mt5-small-Arabic-Summarization",
    max_length=128,
    length_penalty=0.6,
    no_repeat_ngram_size=2,
    num_beams=15,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "yalsaffar/mt5-small-Arabic-Summarization", config=config
).to(device)

# Prepare the input article
input_text = "الأخبار ...."  # "the news ..."
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate the summary with beam search
with torch.no_grad():
    preds = model.generate(
        input_ids,
        num_beams=15,
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # kept consistent with the config above
        remove_invalid_values=True,
        max_length=128,
    )

# Decode the generated token ids back to text
summary = tokenizer.batch_decode(preds, skip_special_tokens=True)

print("***** Original Text *****")
print(input_text)
print("***** Generated Summary *****")
print(summary[0])
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
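## Fine-Tuning Sketch

For reference, here is a minimal fine-tuning sketch using the Transformers `Seq2SeqTrainer`. It is not the training script from the repository linked above; the `text`/`summary` column names, the tiny in-memory dataset, and the hyperparameters are illustrative assumptions.

```python
# A minimal fine-tuning sketch, NOT the repository's actual training script.
# The "text"/"summary" column names, the placeholder dataset, and the
# hyperparameters below are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Placeholder data; in practice, load the BBC Arabic splits instead.
raw = Dataset.from_dict({
    "text": ["نص المقال الكامل ..."],   # "the full article text ..."
    "summary": ["الملخص القصير ..."],   # "the short summary ..."
})

def preprocess(batch):
    # Tokenize articles as inputs and summaries as labels
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-arabic-summarization",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```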
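## Evaluation Sketch

The introduction mentions ROUGE as the evaluation metric. Below is a minimal sketch of scoring generated summaries against references with the Hugging Face `evaluate` package; this is not part of the original repository, and the prediction/reference strings are placeholders.

```python
# A minimal ROUGE scoring sketch using the Hugging Face `evaluate` package.
# The prediction/reference strings are placeholders, not real model outputs.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["ملخص تم توليده بواسطة النموذج"]    # "a model-generated summary"
references = ["الملخص المرجعي من مجموعة البيانات"]  # "the gold reference summary"

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```

Note that the default ROUGE tokenization targets English; for Arabic text it may be worth passing a custom `tokenizer` callable to `rouge.compute` so that words are split appropriately.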