---
language: ar
license: other
license_name: custom
license_link: https://github.com/aub-mind/arabert/blob/master/aragpt2/LICENSE
datasets:
- wikipedia
- Osian
- arabic-billion-words
- oscar
- Assafir-private
inference: false
widget:
- text: "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره مقابل مبلغ كبير من المال"
- text: "القدس مدينة تاريخية، بناها الكنعانيون في"
- text: "كان يا ما كان في قديم الزمان"
---
# Arabic GPT2
You can find more information in our paper [AraGPT2](https://arxiv.org/abs/2012.15520).
The code in this repository was used to train all GPT2 variants. It supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the `gpt2` folder and can train models from the [minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) repository.
These models were trained with the `lamb` optimizer, follow the same architecture as `gpt2`, and are fully compatible with the `transformers` library.
GPT2-large and GPT2-mega were trained using the [imcaspar/gpt2-ml](https://github.com/imcaspar/gpt2-ml/) library and follow the `grover` architecture. You can use the PyTorch classes found in `grover/modeling_gpt2.py` as a direct replacement for the corresponding classes in the `transformers` library (compatible with `transformers` `v4.x`).
Both of these models were trained with the `adafactor` optimizer, since the `adam` and `lamb` optimizers use too much memory, so much so that not even a single batch fits on a TPU core.
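As a hedged example, assuming the `arabert` pip package exposes the repository's `grover/modeling_gpt2.py` under `arabert.aragpt2.grover`, the large/mega variants could be loaded directly like this (the `transformers` route with `trust_remote_code` shown further below works as well):
```python
# Sketch: loading a Grover-architecture variant (large/mega) via the classes
# in grover/modeling_gpt2.py. The import path assumes the layout of the
# `arabert` pip package; adjust it if you cloned the repository instead.
from transformers import GPT2TokenizerFast
from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("aubmindlab/aragpt2-mega")
tokenizer = GPT2TokenizerFast.from_pretrained("aubmindlab/aragpt2-mega")
```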
AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
# NOTE: The model expects the input to be preprocessed using the `arabert` library.
Otherwise, the model will not generate the correct output.
## Testing the model using `transformers`:
The model code is now hosted on HuggingFace, so you need to pass the `trust_remote_code` flag. The model can be used as follows:
```python
from transformers import AutoModelForCausalLM, GPT2TokenizerFast, pipeline
from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = 'aubmindlab/aragpt2-mega'

# The input must be preprocessed with the arabert preprocessor (see the note above)
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)
text = "يحكى أن مزارعا مخادعا قام ببيع بئر الماء الموجود في أرضه لجاره مقابل مبلغ كبير من المال"
text_clean = arabert_prep.preprocess(text)

# The model code is hosted on HuggingFace, hence trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)

generation_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer
)

# Feel free to try different decoding settings
generation_pipeline(text_clean,
                    pad_token_id=tokenizer.eos_token_id,
                    num_beams=10,
                    max_length=200,
                    top_p=0.9,
                    repetition_penalty=3.0,
                    no_repeat_ngram_size=3)[0]['generated_text']
```
## Finetuning using `transformers`:
Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed).
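Alternatively, the sketch below shows one way to fine-tune AraGPT2-base with the standard `transformers` Trainer API. The file path, block size, and hyperparameters are illustrative placeholders, not recommended values, and the training text should be preprocessed with `arabert` first.
```python
# Minimal fine-tuning sketch using the `transformers` Trainer API.
# File path and hyperparameters are placeholders.
from transformers import (
    AutoModelForCausalLM,
    GPT2TokenizerFast,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "aubmindlab/aragpt2-base"

tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Plain-text file, already cleaned with the arabert preprocessor.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="aragpt2-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_steps=1000,
    ),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```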
## Finetuning using our code with TF 1.15.4:
Create the Training TFRecords:
```bash
python create_pretraining_data.py \
 --input_file= \
 --output_file=