Commit 140451d by system (HF staff)
1 Parent(s): 9cc7a13

Update README.md

Files changed (1)
  1. README.md +30 -0
README.md ADDED
@@ -0,0 +1,30 @@
+ ---
+ language: ar
+ ---
+
+ # Sanaa-Dialect
+ ## Finetuned Arabic GPT-2 demo
+
+ This is a small GPT-2 model, originally trained on Arabic Wikipedia circa September 2020,
+ then finetuned on part of the University of British Columbia's Arabic dialect dataset: https://github.com/UBC-NLP/aoc_id
+
+ Prefix the prompt with one of four special tokens to select a dialect: `[EGYPTIAN]`, `[GULF]`, `[LEVANTINE]`, or `[MSA]` (Modern Standard Arabic)
+
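+ ```
+ from simpletransformers.language_generation import LanguageGenerationModel
+ # Load the finetuned checkpoint as a GPT-2 model (pass use_cuda=False on a CPU-only machine)
+ model = LanguageGenerationModel("gpt2", "monsoon-nlp/sanaa-dialect")
+ # Prefix the prompt ("My city is") with a dialect token to steer the output
+ model.generate('[GULF]' + "مدينتي هي", { 'max_length': 100 })
+ ```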
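
Because the dialect is selected purely by a string prefix, a typo in the token would go unnoticed and simply be treated as ordinary text. The sketch below shows one way to guard against that. `DIALECT_TOKENS` and `dialect_prompt` are hypothetical names introduced here for illustration; they are not part of the model or of simpletransformers.

```python
# The four dialect tokens listed in the README.
DIALECT_TOKENS = ("[EGYPTIAN]", "[GULF]", "[LEVANTINE]", "[MSA]")


def dialect_prompt(dialect, text):
    """Return `text` prefixed with the special token for `dialect`.

    Hypothetical helper: accepts a case-insensitive dialect name such as
    "gulf" and raises ValueError for anything outside the four known tokens.
    """
    token = "[" + dialect.strip().upper() + "]"
    if token not in DIALECT_TOKENS:
        raise ValueError(
            f"unknown dialect {dialect!r}; expected one of {DIALECT_TOKENS}"
        )
    # No separator is added, matching the README's '[GULF]' + "..." usage.
    return token + text


prompt = dialect_prompt("gulf", "مدينتي هي")
```

The resulting `prompt` can be passed as the first argument to `model.generate` in place of the hand-built string.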
+
+ There is NO content filtering in the current version; do not use for public-facing
+ text generation!
+
+ ## Training and Finetuning details
+
+ Original model and training: https://huggingface.co/monsoon-nlp/sanaa
+
+ I inserted the new dialect tokens into the tokenizer, finetuned the model on the dialect samples, and exported the new model.
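
To make the token-insertion step concrete, here is a toy, library-free sketch of what adding tokens involves: each new token gets the next free id, and the embedding table grows by one row per token so the model has a vector to train. This is an illustration under assumptions, not the notebook's code. The vocabulary, embedding size, and the `add_tokens` function below are hypothetical stand-ins. In Hugging Face Transformers the corresponding calls are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random

DIALECT_TOKENS = ["[EGYPTIAN]", "[GULF]", "[LEVANTINE]", "[MSA]"]


def add_tokens(vocab, embeddings, new_tokens, seed=0):
    """Append unseen tokens to `vocab` and grow `embeddings` to match.

    Assumes ids in `vocab` are contiguous from 0, so len(vocab) is the
    next free id. Returns the number of tokens actually added.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    added = 0
    for token in new_tokens:
        if token in vocab:
            continue  # already known: keep its existing id and vector
        vocab[token] = len(vocab)  # assign the next free id
        # New rows start as small random vectors; finetuning then learns them.
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
        added += 1
    return added


# Toy stand-ins for a pretrained vocabulary and embedding matrix.
vocab = {"<|endoftext|>": 0, "مدينتي": 1, "هي": 2}
embeddings = [[0.0] * 4 for _ in vocab]

num_added = add_tokens(vocab, embeddings, DIALECT_TOKENS)
```

After this step the vocabulary and the embedding table have the same length again, which is the invariant that resizing the embeddings restores in a real model.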
+
+ Notebook: https://colab.research.google.com/drive/1Z_935vTuZvbseOsExCjSprrqn1MsQT57
+
+ ## Example text block