# AraGPT2
You can find more information in our paper [AraGPT2](https://arxiv.org/abs/2012.15520).
The code in this repository was used to train all GPT2 variants. The code supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the `gpt2` folder and can train models from the [minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) repository. These models were trained with the `lamb` optimizer, follow the same architecture as `gpt2`, and are fully compatible with the `transformers` library.
GPT2-large and GPT2-mega were trained using the [imcaspar/gpt2-ml](https://github.com/imcaspar/gpt2-ml/) library and follow the `grover` architecture. You can use the PyTorch classes found in `grover/modeling_gpt2.py` as a direct replacement for the corresponding classes in the `transformers` library (they should support `transformers` `v4.x`).
Both models were trained with the `adafactor` optimizer, since with the `adam` and `lamb` optimizers the model could not fit even a single batch on a TPU core.
AraGPT2 is trained on the same large Arabic dataset as AraBERTv2.
# AraGPT2 Detector
Machine-generated text detection model from the [AraGPT2: Pre-Trained Transformer for Arabic Language Generation](https://arxiv.org/abs/2012.15520) paper.
This model is trained on long text passages and achieves a 99.4% F1-score.
# Usage
## Testing the model using `transformers`:
```python
from transformers import GPT2TokenizerFast, pipeline
# for base and medium:
from transformers import GPT2LMHeadModel
# for large and mega, use the grover-based class instead:
# from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = 'aubmindlab/aragpt2-base'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

text = ""  # your Arabic prompt here
text_clean = arabert_prep.preprocess(text)

# feel free to try different decoding settings
generation_pipeline(
    text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3)[0]['generated_text']
```
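Beam search with a high repetition penalty (as above) tends to give conservative output. A sampling-based configuration is another option. The snippet below is illustrative and not part of the original example: it reuses `generation_pipeline`, `text_clean`, and `tokenizer` from the block above, and the parameter values are only starting points.
```python
# Illustrative alternative: nucleus sampling instead of beam search.
# Reuses generation_pipeline, text_clean and tokenizer from the previous block.
generation_pipeline(
    text_clean,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,        # sample instead of beam search
    top_p=0.95,            # nucleus sampling threshold
    temperature=0.8,
    max_length=200,
    no_repeat_ngram_size=3)[0]['generated_text']
```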
## How to use the detector:
```python
from transformers import pipeline
from arabert.preprocess import ArabertPreprocessor
# preprocessing uses the AraELECTRA discriminator settings
processor = ArabertPreprocessor(model_name="aubmindlab/araelectra-base-discriminator")
pipe = pipeline("sentiment-analysis", model="aubmindlab/aragpt2-mega-detector-long")
text = " "  # the passage to check
text_prep = processor.preprocess(text)
result = pipe(text_prep)
# [{'label': 'machine-generated', 'score': 0.9977743625640869}]
```
## Finetuning using `transformers`:
Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed), or see the minimal sketch below.
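If you prefer a self-contained script over the linked guide, the sketch below fine-tunes the base model with the Hugging Face `Trainer`. It is a minimal example rather than the official training setup: `my_arabic_corpus.txt` is a placeholder for your own (already preprocessed) data and the hyperparameters are illustrative; for the large and mega models, swap in the grover `GPT2LMHeadModel` as shown earlier.
```python
from datasets import load_dataset
from transformers import (GPT2TokenizerFast, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL_NAME = "aubmindlab/aragpt2-base"
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)

# "my_arabic_corpus.txt" is a placeholder: one document per line, already preprocessed
raw = load_dataset("text", data_files={"train": "my_arabic_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# causal LM objective: labels are the input ids, no masking
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="aragpt2-base-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    save_steps=1000,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```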
## Finetuning using our code with TF 1.15.4:
- Create the Training TFRecords:
```bash
python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/articles separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
```
- Finetuning:
```bash
python3 run_pretraining.py \
--input_file="gs://<GS_BUCKET>/pretraining_data/*" \
--output_dir="gs://<GS_BUCKET>/pretraining_model/" \
--config_file="config/small_hparams.json" \
--batch_size=128 \
--eval_batch_size=8 \
--num_train_steps= \
--num_warmup_steps= \
--learning_rate= \
--save_checkpoints_steps= \
--max_seq_length=1024 \
--max_eval_steps= \
--optimizer="lamb" \
--iterations_per_loop=5000 \
--keep_checkpoint_max=10 \
--use_tpu=True \
--tpu_name=<TPU NAME> \
--do_train=True \
--do_eval=False
```
# Model Sizes
Model | Optimizer | Context size | Embedding Size | Num of heads | Num of layers | Model Size / Num of Params |
---|:---:|:---:|:---:|:---:|:---:|:---:
AraGPT2-base | `lamb` | 1024 | 768 | 12 | 12 | 527MB / 135M |
AraGPT2-medium | `lamb` | 1024 | 1024 | 16 | 24 | 1.4GB / 369M |
AraGPT2-large | `adafactor` | 1024 | 1280 | 20 | 36 | 2.98GB / 792M |
AraGPT2-mega | `adafactor` | 1024 | 1536 | 24 | 48 | 5.5GB / 1.46B |
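As a quick sanity check, the hyperparameters in the table can be read from the published config of the base (or medium) checkpoint, which uses the standard `transformers` GPT-2 layout; the large and mega checkpoints follow the grover layout, so their configs may be organized differently. The snippet is illustrative and not part of the original documentation.
```python
from transformers import GPT2Config

# Read the architecture hyperparameters straight from the hosted config.
config = GPT2Config.from_pretrained("aubmindlab/aragpt2-base")
print(config.n_ctx, config.n_embd, config.n_head, config.n_layer)
# Should match the table above: 1024 768 12 12
```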
## Compute
Model | Hardware | Num of Examples (seq len = 1024) | Batch Size | Num of Steps | Time (in days)
---|:---:|:---:|:---:|:---:|:---:
AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5
AraGPT2-medium | TPUv3-8 | 9.7M | 80 | 1M | 15
AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220K | 3
AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9
# Results
The results shown in the table below are perplexity values on Wikipedia articles that were not in the training data.
Model | PPL |
---|:---:
AraGPT2-base | 55.8
AraGPT2-medium | 45.7
AraGPT2-large | 36.6
AraGPT2-mega | 29.8
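The evaluation articles themselves are not distributed here. To compute a comparable perplexity on your own held-out Arabic text with the base model, a minimal sketch looks like the following (illustrative only; the paper's exact evaluation protocol, e.g. chunking and preprocessing, may differ):
```python
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

MODEL_NAME = "aubmindlab/aragpt2-base"
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
model.eval()

text = "..."  # replace with a held-out Arabic passage
encodings = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    # With labels == input_ids the model returns the mean next-token
    # cross-entropy loss; perplexity is its exponential.
    outputs = model(encodings.input_ids, labels=encodings.input_ids)

print("perplexity:", torch.exp(outputs.loss).item())
```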
# Disclaimer
The text generated by AraGPT2 is automatically produced by a neural network model trained on a large amount of Arabic text, and it does not represent the authors' or their institutes' official attitudes and preferences. The text generated by AraGPT2 should only be used for research and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it.
# If you used this model please cite us as:
```
@inproceedings{antoun-etal-2021-aragpt2,
title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
author = "Antoun, Wissam and
Baly, Fady and
Hajj, Hazem",
booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
month = apr,
year = "2021",
address = "Kyiv, Ukraine (Virtual)",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
pages = "196--207",
}
```
# Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for the continuous support, and to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks goes to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.
# Contacts
**Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>