---
language:
- ar
tags:
- text-generation
license: apache-2.0
datasets:
- Arabic Poem Comprehensive Dataset (APCD)
widget:
- text: 'عمرو بنِ قُمَيئَة: خَليلَيَّ لا تَستَعجِلا أَن'
---
# GPTPoet: Pre-training GPT2 for Arabic Poetry Language Understanding
GPTPoet is an Arabic pretrained language model based on the OpenAI GPT2 architecture. We use the same GPT2-Base config. More details are available in the Google Colab [].
To save computation time, the model was initialized from the pretrained weights of another model rather than trained from scratch. This allowed us to fine-tune it on our specific dataset, which to our knowledge had never been used in an NLP task before.
This is a poem generator that creates poems in the style of a target poet. The model was trained on a range of poets and their respective poems; its input is a poet's name followed by a prompt, from which the model strives to generate a continuation that imitates that specific poet's style.
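As a minimal sketch of how generation works, the snippet below feeds the widget prompt above to a `text-generation` pipeline. The repo id `usama98/GPTPoet` is an assumed placeholder for illustration; substitute the actual checkpoint name from the usama98 model page.

```python
# Minimal generation sketch. "usama98/GPTPoet" is an assumed placeholder,
# not a confirmed checkpoint name.
from transformers import pipeline

generator = pipeline("text-generation", model="usama98/GPTPoet")

# Input format: "<poet name>: <opening of a verse>", as in the widget example.
prompt = "عمرو بنِ قُمَيئَة: خَليلَيَّ لا تَستَعجِلا أَن"
outputs = generator(prompt, max_length=64, do_sample=True, top_p=0.95)
print(outputs[0]["generated_text"])
```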
## What's New!
All models are available on the HuggingFace model page under the usama98 name. Checkpoints are available in PyTorch.
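For example, a PyTorch checkpoint can be loaded with the `transformers` Auto classes; again, `usama98/GPTPoet` stands in for the actual repo id:

```python
# Loading a PyTorch checkpoint from the Hub. "usama98/GPTPoet" is an
# assumed placeholder repo id.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("usama98/GPTPoet")
```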
## Dataset
The dataset consists of content scraped mainly from الموسوعة الشعرية and الديوان. After merging both sources, the total comes to 1,831,770 poetic verses. Each verse is labeled with its meter, the poet who wrote it, and the era in which it was written. There are 22 meters, 3,701 poets and 11 eras: Pre-Islamic, Islamic, Umayyad, Mamluk, Abbasid, Ayyubid, Ottoman, Andalusian, the era between Umayyad and Abbasid, Fatimid, and finally the modern age. We are only interested in the 16 classic meters attributed to Al-Farahidi; they comprise the majority of the dataset, with around 1.7M verses in total. It is important to note that the diacritization of the verses is not consistent: a verse can carry full diacritics, partial diacritics, or none at all.
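As an illustration of the filtering described above, the sketch below keeps only classic-meter verses and formats them as `poet: verse` training strings. The column names and the partial meter list are assumptions for illustration, not the actual APCD schema or our preprocessing code.

```python
# Hypothetical filtering sketch for an APCD-style table. Column names
# ("meter", "poet", "verse") are assumptions about the schema.
import pandas as pd

# A few of the 16 classic meters attributed to Al-Farahidi; the real
# filter would list all 16.
CLASSIC_METERS = {"الطويل", "البسيط", "الكامل", "الوافر"}

def build_examples(df: pd.DataFrame) -> list[str]:
    """Keep classic-meter verses and format them as 'poet: verse'."""
    classic = df[df["meter"].isin(CLASSIC_METERS)]
    return [f"{row.poet}: {row.verse}" for row in classic.itertuples()]
```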
## Preprocessing
It is recommended to apply our preprocessing tokenizer before training or testing on any dataset.
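A minimal sketch of applying the tokenizer, assuming it ships with the (hypothetical) `usama98/GPTPoet` checkpoint:

```python
# Apply the model's tokenizer before training/testing. The repo id is an
# assumed placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("usama98/GPTPoet")
encoded = tokenizer("خَليلَيَّ لا تَستَعجِلا أَن", return_tensors="pt")
print(encoded["input_ids"])
```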
## Contacts
Usama Zidan: LinkedIn | GitHub | usama.zidan@bcu.ac.uk | osama.zadan@gmail.com