---
license: mit
language:
- ar
pipeline_tag: text-generation
tags:
- arabic
- text-generation
---

# Model Description

* Model Name: ArabianGPT
* Architecture: GPT-2
* Layers: 12
* Model Size: 134M
* Context Window Size: 768

> [!NOTE]
> ArabianGPT is a custom-trained version of the GPT-2 base model, tailored specifically for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for a variety of Arabic natural language processing tasks.

# Training

* Dataset: Abu Elkhiar Corpus
* Size: 15.5 GB
* Number of Words: 237,814,541
* Number of Tokens: 1,752,421,071
* Number of Parameters: 134M
* Steps: 337,500
* Loss: 3.97

> [!NOTE]
> The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus covering a wide range of topics. The training process focused on adapting the model to the nuances and complexities of the Arabic language.

# Tokenizer

* Type: Custom-trained SentencePiece tokenizer
* Vocabulary Size: 64K

> We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice was made to optimize the model's performance for the specific characteristics of the Arabic language. More information about AraNizer can be found [here](https://github.com/omarnj-lab/aranizer/tree/main).

# Usage

ArabianGPT can be used for text-generation tasks in Arabic.

### How to use

Here is how to use this model to generate Arabic text with the Transformers text-generation pipeline:

```python
from transformers import pipeline

# Load the model as a text-generation pipeline
pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base", max_new_tokens=512)

text = ""  # replace with your Arabic prompt
print(pipe(text)[0]["generated_text"])
```
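If you need finer control over decoding than the pipeline exposes, the model and tokenizer can also be loaded directly. The sketch below is illustrative only: it assumes the AraNizer tokenizer is bundled with the `riotu-lab/ArabianGPT-base` checkpoint on the Hub so that `AutoTokenizer` can resolve it, and the sampling settings are example values rather than recommendations.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the AraNizer tokenizer ships with the Hub checkpoint,
# so AutoTokenizer can load it from the same repo as the model.
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/ArabianGPT-base")
model = AutoModelForCausalLM.from_pretrained("riotu-lab/ArabianGPT-base")

prompt = "السلام عليكم"  # example Arabic prompt ("peace be upon you")
inputs = tokenizer(prompt, return_tensors="pt")

# Example sampling settings; tune these for your use case.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Loading the model this way is useful when you want to adjust decoding strategies (sampling, beam search, repetition penalties) beyond what the pipeline defaults provide.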
# Limitations

> [!WARNING]
> As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.

# Ethical Considerations

We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.

# Citation

If you use ArabianGPT in your research or application, please cite it as follows:

```
@misc{ArabianGPT2023,
  title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023}
}
```

# Acknowledgments

> We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.

# Contact

For inquiries regarding ArabianGPT, please contact onajar@psu.edu.sa.