GPTPoet: Pre-training GPT2 for Arabic Poetry Language Understanding

GPTPoet is an Arabic pretrained language model based on the OpenAI GPT2 architecture. We use the same configuration as GPT2-Base. More details are available in the Google Colab [https://colab.research.google.com/drive/1kByhyhvA0JUZRKL-XCG0ZEDyAg45w8AW?usp=sharing].

To save computation time, the model was initialized with pretrained weights from another model. This allowed us to fine-tune it on our specific dataset, which to our knowledge has never been used in an NLP task before.

GPTPoet is a poem generator that creates poems in the style of a targeted poet. The model was trained on different poets and their respective poems. Its input is the poet's name and a suggestion, which the model strives to develop into text that imitates the style of that specific poet.
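As a rough illustration of this input, the sketch below joins a poet's name and a suggestion into a prompt and passes it to a causal language model through the `transformers` library. The helper names `build_prompt` and `generate_poem`, the single-space separator, and the sampling settings are assumptions made for illustration. The exact prompt layout GPTPoet expects is not stated here, and the model identifier is left as a parameter to be taken from the model page.

```python
def build_prompt(poet: str, suggestion: str, sep: str = " ") -> str:
    """Join the poet's name and the suggestion into one prompt string.

    The separator is an assumption, not GPTPoet's documented format.
    """
    return f"{poet.strip()}{sep}{suggestion.strip()}"


def generate_poem(model_id: str, poet: str, suggestion: str,
                  max_new_tokens: int = 64) -> str:
    """Generate a continuation of the prompt with a causal LM checkpoint."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(build_prompt(poet, suggestion), return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample rather than decode greedily, for varied verse
        top_p=0.95,      # illustrative value, not tuned
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_poem` downloads the checkpoint named by `model_id`, so it needs network access on first use; `build_prompt` can be tried on its own.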

What's New!

All models are available on the HuggingFace model page under the usama98 name. Checkpoints are available in PyTorch format.

Our model explores a newly tried capability for NLP models: rather than just generating text, it generates text that imitates a specific style. Our dataset contains poetry gathered from different poets, and the data was fed to the model during training with the aim of teaching it how to structure Arabic poetry. The additional step here was to add the poet's name at the beginning of each training example. This training strategy allows the model to learn not only how to write poetry but also how the written poetry relates to a specific poet and their style.
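The poet-name prefix can be pictured with a minimal sketch like the one below. The function names, the separators, and the `(poet, verses)` input shape are hypothetical choices for illustration; the format actually used to train GPTPoet may differ.

```python
from typing import Iterable, List, Tuple


def make_training_example(poet: str, verses: Iterable[str],
                          name_sep: str = " ", verse_sep: str = "\n") -> str:
    """Prefix a poem with its poet's name to form one training string.

    Both separators are assumptions, not the documented GPTPoet format.
    """
    # Drop empty verses and surrounding whitespace before joining.
    body = verse_sep.join(v.strip() for v in verses if v.strip())
    return f"{poet.strip()}{name_sep}{body}"


def build_corpus(poems: Iterable[Tuple[str, List[str]]]) -> List[str]:
    """Turn (poet, verses) pairs into a list of prefixed training strings."""
    return [make_training_example(poet, verses) for poet, verses in poems]
```

Because every example starts with the poet's name, the same name can later be placed at the start of a prompt to steer generation toward that poet's style.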

Dataset

The dataset consists of content scraped mainly from الموسوعة الشعرية and الديوان. After merging both, the total is 1,831,770 poetic verses. Each verse is labeled by its meter, the poet who wrote it, and the age in which it was written. There are 22 meters, 3701 poets and 11 ages: Pre-Islamic, Islamic, Umayyad, Mamluk, Abbasid, Ayyubid, Ottoman, Andalusian, the era between Umayyad and Abbasid, Fatimid, and finally the modern age. We are only interested in the 16 classic meters which are attributed to Al-Farahidi; they comprise the majority of the dataset, with a total of around 1.7M verses. It is important to note that the verses' diacritic states are not consistent: a verse can carry full diacritics, partial diacritics, or none at all.
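One common way to deal with inconsistent diacritics is to strip them so that every verse is in the same state. The sketch below does this with a regular expression over the Arabic tashkeel code points. It is an illustration only and is not claimed to be the preprocessing used for GPTPoet; the function names are hypothetical.

```python
import re

# Arabic tashkeel marks: fathatan through sukun (U+064B to U+0652)
# plus the superscript alef (U+0670).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(verse: str) -> str:
    """Return the verse with all tashkeel marks removed."""
    return _DIACRITICS.sub("", verse)


def has_diacritics(verse: str) -> bool:
    """Report whether the verse carries at least one tashkeel mark."""
    return _DIACRITICS.search(verse) is not None
```

Stripping loses vowel information that can matter for meter, so whether to normalize this way depends on the downstream task.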

APCD

Preprocessing

It is recommended to apply our preprocessing tokenizer before training/testing on any dataset.

Contacts

Usama Zidan: Linkedin | Github | usama.zidan@bcu.ac.uk | osama.zadan@gmail.com