system (HF staff) committed
Commit 68100ba
Parent: a0fba48

Update README.md

Files changed (1): README.md (+33, -0)
README.md CHANGED
@@ -8,6 +8,8 @@ language: ar
This is a small GPT-2 model retrained on Arabic Wikipedia circa September 2020
(due to memory limits, the first 600,000 lines of the Wiki dump)

+ ## Training
+
Training notebook: https://colab.research.google.com/drive/1Z_935vTuZvbseOsExCjSprrqn1MsQT57

Steps to training:
@@ -24,3 +26,34 @@ Steps to training:
am = AutoModel.from_pretrained('./argpt', from_tf=True)
am.save_pretrained("./")
```
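As a quick sanity check on that conversion, a minimal sketch (not part of the notebook) that reloads the exported PyTorch weights from the same directory the snippet saves to:

```python
from transformers import AutoModel

# reload the exported PyTorch weights to confirm the conversion round-trips
reloaded = AutoModel.from_pretrained("./")
print(reloaded.config.model_type, reloaded.num_parameters())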
+
+ ## Generating text in SimpleTransformers
+
+ Finetuning notebook: https://colab.research.google.com/drive/1fXFH7g4nfbxBo42icI4ZMy-0TAGAxc2i
+
+ ```python
+ from simpletransformers.language_generation import LanguageGenerationModel
+ model = LanguageGenerationModel("gpt2", "monsoon-nlp/sanaa")
+ model.generate("مدرستي")
+ ```
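The same generation can also be done with plain `transformers` if SimpleTransformers is not installed. This is a sketch rather than part of the original card; it assumes the Hub repo ships its tokenizer files, and the sampling settings are illustrative:

```python
from transformers import pipeline

# text-generation pipeline around the published checkpoint
generator = pipeline("text-generation", model="monsoon-nlp/sanaa")

# illustrative settings; tune max_length / sampling as needed
print(generator("مدرستي", max_length=40, do_sample=True)[0]["generated_text"])
```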
+
+ ## Finetuning dialects in SimpleTransformers
+
+ I finetuned this model on different Arabic dialects to create a new model (monsoon-nlp/sanaa-dialect on HuggingFace) with additional dialect control tokens.
+
+ Finetuning notebook: https://colab.research.google.com/drive/1fXFH7g4nfbxBo42icI4ZMy-0TAGAxc2i
+
+ ```python
+ from simpletransformers.language_modeling import LanguageModelingModel
+
+ # train_args holds the SimpleTransformers training options (not shown here);
+ # it presumably sets output_dir to "./dialects", the path used below
+ ft_model = LanguageModelingModel('gpt2', 'monsoon-nlp/sanaa', args=train_args)
+
+ # add the dialect control tokens and resize the embeddings to match
+ ft_model.tokenizer.add_tokens(["[EGYPTIAN]", "[MSA]", "[LEVANTINE]", "[GULF]"])
+ ft_model.model.resize_token_embeddings(len(ft_model.tokenizer))
+ ft_model.train_model("./train.txt", eval_file="./test.txt")
+
+ # generate with the exported model
+ from simpletransformers.language_generation import LanguageGenerationModel
+ model = LanguageGenerationModel("gpt2", "./dialects")
+ model.generate('[EGYPTIAN]' + "مدرستي")
+ ```
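Once the finetuned checkpoint is published, the control tokens can be used to steer generation toward a dialect. A minimal usage sketch, assuming the export was uploaded as monsoon-nlp/sanaa-dialect (named above); the prompt and token choices are just examples:

```python
from simpletransformers.language_generation import LanguageGenerationModel

# load the published dialect-control checkpoint instead of the local export
model = LanguageGenerationModel("gpt2", "monsoon-nlp/sanaa-dialect")

# prepend a control token to steer the dialect of the generated text
for token in ["[MSA]", "[EGYPTIAN]", "[LEVANTINE]", "[GULF]"]:
    print(token, model.generate(token + "مدرستي"))
```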