- arabic
- text-generation
---

# Model Description

- Model Name: ArabianGPT
- Architecture: GPT-2
- Layers: 12
- Model Size: 134M
- Context Window Size: 768

ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.
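
For orientation, the architecture numbers above can be checked directly from the checkpoint. This is a minimal sketch, assuming the `riotu-lab/Ghazal` model ID from the usage example below and the standard GPT-2 config attribute names:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# read only the config to verify the quoted architecture numbers
config = AutoConfig.from_pretrained("riotu-lab/Ghazal")
print(config.n_layer)      # expected: 12 (Layers)
print(config.n_positions)  # expected: 768 (Context Window Size)

# load the weights and count parameters (roughly 134M)
model = AutoModelForCausalLM.from_pretrained("riotu-lab/Ghazal")
print(sum(p.numel() for p in model.parameters()))
```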

# Training

- Dataset: Abu Elkhiar Corpus
- Size: 15.5 GB
- Number of Words: 237,814,541
- Number of Tokens: 1,752,421,071
- Number of Parameters: 134M
- Epochs: 5.87
- Loss: 3.97

The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.

# Tokenizer

- Type: Custom-trained SentencePiece tokenizer
- Vocabulary Size: 64K

We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model with a vocabulary size of 64K, chosen to optimize the model's performance for the specific characteristics of the Arabic language.

More information about AraNizer is available [here](https://github.com/omarnj-lab/aranizer/tree/main).
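
As a quick sanity check, the tokenizer can be loaded through the standard `AutoTokenizer` API. This is a minimal sketch, assuming AraNizer ships with the `riotu-lab/Ghazal` checkpoint used in the usage example below:

```python
from transformers import AutoTokenizer

# load the AraNizer tokenizer bundled with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Ghazal")
print(len(tokenizer))  # expected: roughly 64K vocabulary entries

# split a short Arabic sentence into subword pieces
print(tokenizer.tokenize("اللغة العربية لغة جميلة"))
```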
35 |
+
|
36 |
+
|
37 |
+
# Usage
|
38 |
+
ArabianGPT can be used for text generation tasks in Arabic.
|
39 |
|
40 |
+
### How to use
|
41 |
|
42 |
+
Here is how to use this model to generate ruby function documentation using Transformers SummarizationPipeline:
|
|
|
43 |
|

```python
from transformers import pipeline

# load the ArabianGPT checkpoint as a text-generation pipeline
pipe = pipeline("text-generation", model="riotu-lab/Ghazal", max_new_tokens=512)

text = ''  # your Arabic prompt goes here

# pipelines are callable and return a list of generated sequences
print(pipe(text))
```
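
The pipeline defaults to greedy decoding, which tends to produce repetitive text. One common variation, assuming the same checkpoint and relying on the pipeline forwarding standard `generate` keyword arguments, is to enable sampling:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="riotu-lab/Ghazal", max_new_tokens=512)

# do_sample/top_p/temperature are forwarded to model.generate();
# the prompt here is just an illustrative Arabic greeting
output = pipe("السلام عليكم", do_sample=True, top_p=0.95, temperature=0.8)
print(output[0]["generated_text"])
```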

# Limitations

As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.

# Ethical Considerations

We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.

# Citation

If you use ArabianGPT in your research or application, please cite it as follows:
63 |
@misc{ArabianGPT, 2023,
|
64 |
+
title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
|
65 |
+
author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
|
66 |
+
affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
|
67 |
+
year={2023},
|
68 |
}
|

# Acknowledgments

We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.

# Contact

For inquiries regarding ArabianGPT, please contact onajar@psu.edu.sa.