---
language: ar
datasets:
- Arabic Poetry Dataset (6th - 21st century)
metrics:
- perplexity
tags:
- text-generation
- poetry
license: apache-2.0
widget:
- text: "للوهلة الأولى قرأت في عينيه"
model-index:
- name: elgeish Arabic GPT2 Medium
  results:
  - task: 
      name: Text Generation
      type: text-generation
    dataset:
      name: Arabic Poetry Dataset (6th - 21st century)
      type: poetry
      args: ar
    metrics:
       - name: Validation Perplexity
         type: perplexity
         value: 282.09
---

# GPT2-Medium-Arabic-Poetry

Fine-tuned [aubmindlab/aragpt2-medium](https://huggingface.co/aubmindlab/aragpt2-medium) on
the [Arabic Poetry Dataset (6th - 21st century)](https://www.kaggle.com/fahd09/arabic-poetry-dataset-478-2017)
using 41,922 lines of poetry as the train split and 9,007 (by poets not in the train split) for validation.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)
model_name = "elgeish/gpt2-medium-arabic-poetry"
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "للوهلة الأولى قرأت في عينيه"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
samples = model.generate(
    input_ids.to("cuda"),
    do_sample=True,
    early_stopping=True,
    max_length=32,
    min_length=16,
    num_return_sequences=3,
    pad_token_id=50256,
    repetition_penalty=1.5,
    top_k=32,
    top_p=0.95,
)

for sample in samples:
    print(tokenizer.decode(sample.tolist()))
    print("--")
```

Here's the output:

```
للوهلة الأولى قرأت في عينيه عن تلك النسم لم تذكر شيءا فلربما نامت علي كتفيها العصافير وتناثرت اوراق التوت عليها وغابت الوردة من
--
للوهلة الأولى قرأت في عينيه اية نشوة من ناره وهي تنظر الي المستقبل بعيون خلاقة ورسمت خطوطه العريضة علي جبينك العاري رسمت الخطوط الحمر فوق شعرك
--
للوهلة الأولى قرأت في عينيه كل ما كان وما سيكون غدا اذا لم تكن امراة ستكبر كثيرا علي الورق الابيض او لا تري مثلا خطوطا رفيعة فوق صفحة الماء
--
```