m-nagoudi committed on
Commit 412f1b3
1 Parent(s): 86a3ee4

Update README.md

Files changed (1): README.md +33 -53
README.md CHANGED
@@ -1,62 +1,39 @@
  # AraT5-base-title-generation
- <img src="https://raw.githubusercontent.com/UBC-NLP/araT5/main/AraT5_logo.jpg" alt="drawing" width="30%" height="30%" align="right"/>

- We introduce News Title Generation (NTG) as a new task for Arabic language generation: given an article, a title-generation model must output a short, grammatical sequence of words suited to the article's content. For NTG, we create a novel dataset, **ARGEN<sub>NTG</sub>**, from an existing news corpus: we extract 120K articles along with their titles from [AraNews (Nagoudi et al., 2020)](https://arxiv.org/abs/2011.03092), keeping only titles with at least three words. We split the ARGEN<sub>NTG</sub> data into 80% (93.3K), 10% (11.7K), and 10% (11.7K) for training, development, and test, respectively.
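The 80/10/10 split is straightforward to reproduce. Below is a minimal sketch, assuming the 120K article/title pairs sit in a tab-separated file with `document` and `title` columns (the file path and column names are illustrative, not the released dataset's):

```python
# Minimal sketch of the 80/10/10 split described above.
import pandas as pd

df = pd.read_csv("aranews_titles.tsv", sep="\t")                  # hypothetical path
df = df[df["title"].str.split().str.len() >= 3]                   # keep titles with >= 3 words
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle before splitting

n = len(df)
train = df.iloc[: int(0.8 * n)]              # 80% training
dev = df.iloc[int(0.8 * n) : int(0.9 * n)]   # 10% development
test = df.iloc[int(0.9 * n) :]               # 10% test
```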

- We fine-tune [**AraT5-base**](https://huggingface.co/UBC-NLP/AraT5-base) on ARGEN<sub>NTG</sub>; more details are described in our paper [**AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation**](https://arxiv.org/abs/2109.12068).
- # How to use
  ``` bash
- !pip install transformers sentencepiece
  ```

- ```python
- from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
-
- tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")
- model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")
-
- document = "تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة ."
-
- # Tokenize the article, padding/truncating to the model's maximum input length.
- encoding = tokenizer(document, padding="max_length", truncation=True, return_tensors="pt")
- input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
-
- # Sample five candidate titles with top-k / nucleus (top-p) sampling.
- outputs = model.generate(
-     input_ids=input_ids,
-     attention_mask=attention_mask,
-     max_length=256,
-     do_sample=True,
-     top_k=120,
-     top_p=0.95,
-     num_return_sequences=5,
- )
-
- for i, output in enumerate(outputs):
-     title = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
-     print("title#" + str(i), title)
- ```
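Because `do_sample=True` enables top-k / nucleus sampling, the five sequences requested via `num_return_sequences=5` will differ from run to run; call `transformers.set_seed(...)` beforehand if you need reproducible titles.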
- **The input news document**
-
- <div style="white-space : pre-wrap !important;word-break: break-word; direction:rtl; text-align: right">
- تحت رعاية صاحب السمو الملكي الأمير سعود بن نايف بن عبدالعزيز أمير المنطقة الشرقية اختتمت غرفة الشرقية مؤخرا، الثاني من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة ضمن مبادرتها المجانية للعام 2019 حيث قدمت 6 برامج تدريبية نوعية. وثمن رئيس مجلس إدارة الغرفة، عبدالحكيم العمار الخالدي، رعاية سمو أمير المنطقة الشرقية للمبادرة، مؤكدا أن دعم سموه لجميع أنشطة .
- <br>
- </div>
- *(English gloss: under the patronage of HRH Prince Saud bin Nayef bin Abdulaziz, Emir of the Eastern Province, the Asharqia Chamber recently concluded the second round of its free 2019 initiative to qualify and train the Kingdom's sons and daughters, offering six specialized training programs; the Chamber's chairman, Abdulhakim Al-Ammar Al-Khalidi, praised the Emir's patronage of the initiative.)*
-
- **The generated titles**
- ```
- title#0 غرفة الشرقية تختتم المرحلة الثانية من مبادرتها لتأهيل وتدريب أبناء وبنات المملكة
- title#1 غرفة الشرقية تختتم الثاني من مبادرة تأهيل وتأهيل أبناء وبناتنا
- title#2 سعود بن نايف يختتم ثانى مبادراتها لتأهيل وتدريب أبناء وبنات المملكة
- title#3 أمير الشرقية يرعى اختتام برنامج برنامج تدريب أبناء وبنات المملكة
- title#4 سعود بن نايف يرعى اختتام مبادرة تأهيل وتدريب أبناء وبنات المملكة
- ```
- *(English gloss: all five candidates are variants of "the Asharqia Chamber / Saud bin Nayef concludes, or patronizes the conclusion of, the initiative to qualify and train the Kingdom's sons and daughters".)*
-
- # How to use AraT5 models
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GFOGolWPIfDvYdSNdGFrOXwu3Gu28k2b?usp=sharing) This is an example of fine-tuning **AraT5-base** for News Title Generation on the AraNews dataset.
-
- For more details, please visit our [GitHub repository](https://github.com/UBC-NLP/araT5).
  # AraT5 Models Checkpoints
@@ -75,15 +52,18 @@ AraT5 Pytorch and TensorFlow checkpoints are available on the Huggingface websit
  If you use our models (AraT5-base, AraT5-msa-base, AraT5-tweet-base, AraT5-msa-small, or AraT5-tweet-small) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
  ```bibtex
- @inproceedings{araT5-2021,
-   title = "{AraT5}: Text-to-Text Transformers for Arabic Language Understanding and Generation",
    author = "Nagoudi, El Moatez Billah and
      Elmadany, AbdelRahim and
      Abdul-Mageed, Muhammad",
-   booktitle = "https://arxiv.org/abs/2109.12068",
-   month = aug,
-   year = "2021"
- }
  ```

  ## Acknowledgments
- We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, [Compute Canada](https://www.computecanada.ca), and [UBC ARC-Sockeye](https://doi.org/10.14288/SOCKEYE). We also thank the [Google TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) program for providing us with free TPU access.

  # AraT5-base-title-generation
+ # AraT5: Text-to-Text Transformers for Arabic Language Generation

+ <img src="https://huggingface.co/UBC-NLP/AraT5-base/resolve/main/AraT5_CR_new.png" alt="AraT5" width="45%" height="35%" align="right"/>

+ This is the repository accompanying our paper [AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation](https://arxiv.org/abs/2109.12068). In this repository we:
+ * Introduce **AraT5<sub>MSA</sub>**, **AraT5<sub>Tweet</sub>**, and **AraT5**: three powerful Arabic-specific text-to-text Transformer-based models;
+ * Introduce **ARGEN**: a new benchmark for Arabic language generation and evaluation covering seven Arabic NLP tasks, namely ```machine translation```, ```summarization```, ```news title generation```, ```question generation```, ```paraphrasing```, ```transliteration```, and ```code-switched translation```;
+ * Evaluate ```AraT5``` models on ```ARGEN``` and compare them against available language models.

+ ---
+ # How to use AraT5 models
+ Below is an example of fine-tuning **AraT5-base** for News Title Generation on the AraNews dataset:
  ``` bash
+ !python run_trainier_seq2seq_huggingface.py \
+   --learning_rate 5e-5 \
+   --max_target_length 128 --max_source_length 128 \
+   --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
+   --model_name_or_path "UBC-NLP/AraT5-base" \
+   --output_dir "/content/AraT5_FT_title_generation" --overwrite_output_dir \
+   --num_train_epochs 3 \
+   --train_file "/content/ARGEn_title_genration_sample_train.tsv" \
+   --validation_file "/content/ARGEn_title_genration_sample_valid.tsv" \
+   --task "title_generation" --text_column "document" --summary_column "title" \
+   --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
+   --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
+   --do_train --do_eval
  ```
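The `--train_file` and `--validation_file` arguments expect tab-separated files whose header matches `--text_column` ("document") and `--summary_column` ("title"). A minimal sketch of writing such a file, assuming the script follows the standard Hugging Face seq2seq data format (the two rows are placeholders, not real dataset entries):

```python
# Write a tiny TSV in the layout the command above expects:
# a "document" and a "title" column, tab-separated, one example per row.
import csv

rows = [
    {"document": "نص المقال الأول ...", "title": "عنوان المقال الأول"},    # placeholder rows
    {"document": "نص المقال الثاني ...", "title": "عنوان المقال الثاني"},
]

with open("ARGEn_title_genration_sample_train.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["document", "title"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```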
+ For more details about the fine-tuning example, please read this notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/UBC-NLP/araT5/blob/main/examples/Fine_tuning_AraT5.ipynb)

+ In addition, we release the fine-tuned News Title Generation (NTG) checkpoint described in the paper. The model is available on Hugging Face: [UBC-NLP/AraT5-base-title-generation](https://huggingface.co/UBC-NLP/AraT5-base-title-generation).
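Loading the released checkpoint follows the standard `transformers` seq2seq pattern. Here is a minimal sketch using greedy decoding (the sampling-based decoding shown earlier in this diff also applies; the article string is a placeholder):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")

article = "..."  # any Arabic news article (placeholder)
inputs = tokenizer(article, truncation=True, return_tensors="pt")
output = model.generate(**inputs, max_length=128)  # greedy decoding for one title
print(tokenizer.decode(output[0], skip_special_tokens=True))
```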

+ For more details, please visit our [GitHub repository](https://github.com/UBC-NLP/araT5).
  # AraT5 Models Checkpoints
  If you use our models (AraT5-base, AraT5-msa-base, AraT5-tweet-base, AraT5-msa-small, or AraT5-tweet-small) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
  ```bibtex
+ @inproceedings{nagoudi-2022-arat5,
+   title = "{AraT5}: Text-to-Text Transformers for Arabic Language Generation",
    author = "Nagoudi, El Moatez Billah and
      Elmadany, AbdelRahim and
      Abdul-Mageed, Muhammad",
+   booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
+   month = may,
+   year = "2022",
+   address = "Online",
+   publisher = "Association for Computational Linguistics",
+ }
  ```

  ## Acknowledgments
+ We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, [Compute Canada](https://www.computecanada.ca), and [UBC ARC-Sockeye](https://doi.org/10.14288/SOCKEYE). We also thank the [Google TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) program for providing us with free TPU access.