m3hrdadfi committed on
Commit 1c131ee
1 parent: c5a9149

Update readme

Files changed (1)
  1. README.md +21 -35
README.md CHANGED
@@ -1,44 +1,30 @@
- # GPT2 - Persian


- ## Scripts

- ### Normalizer

- ```python
- from src.normalizer import normalize

- input_text = "ὑ蕉Ұ제ṅ尘̲改座◦花芝秀黄天자埃澤ಿ ˈazbab اینجا ایران خانه‌شما است؟!۱۲۳۱۲۳۱۳۱۲ اَلْحُرُوفُ ٱلْعَرَبِیَّة"
- print(normalize(input_text))
- ```

- Output:
- ```text
- azbab اینجا ایران خانه‌شما است ؟ ! 1231231312 الحروف لعربیه
- ```

- ### Training tokenizer

- ```bash
- python train_tokenizer.py --dataset_name oscar --dataset_config_name unshuffled_deduplicated_als --vocab_size 42000
- ```

- ### Configuration
-
- ```bash
- python create_config.py --name_or_path gpt2-medium --params '{"vocab_size": 42000}'
- ```
-
- ### Normalization steps
-
- Steps:
-
- - [x] Remove stretched words such as ســــــــــلام
-
- - [x] Remove links, user-mentioning (such as @jane_doe)
-
- - [ ] Remove Telegram, Instagram advertisements, or posts (a whole record)
-
- - [ ] Remove advertisement records
-
- - [ ] Remove separated words (or the whole record) which are showing up as an individual record, while they are just the tags at the end of the post (such as بلاب ... بلاب ... ورزشی، خبری، سیاسی، اجتماعی، خانوده)
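
The two checklist items marked done in the removed README (stretched words, links and user mentions) can be illustrated with a minimal sketch. This is not the repository's `src.normalizer`; the function name `clean_record`, the regular expressions, and the reading of "remove stretched words" as stripping the tatweel character (U+0640) rather than dropping the whole word are all assumptions made for illustration.

```python
import re

# Assumption: "stretched" words are elongated with the tatweel/kashida
# character (U+0640), so stripping it restores the plain word.
TATWEEL = "\u0640"

# Assumption: links start with http(s):// or www. and run to the next space.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Assumption: a mention is "@" plus word characters, not preceded by a
# word character (so e-mail addresses such as a@b.com are left alone).
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def clean_record(text: str) -> str:
    """Hypothetical helper: un-stretch words, drop links and mentions."""
    text = text.replace(TATWEEL, "")
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_record("hello @jane_doe see https://example.com/x now"))
    # prints: hello see now
```

The remaining unchecked items (advertisement records, trailing tag lists) need record-level heuristics and are not covered by this sketch.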
 
+ ---
+ language: fa
+ tags:
+ - text-generation
+ widget:
+ - text: "در یک اتفاق شگفت انگیز، پژوهشگران"
+ - text: "گرفتگی بینی در کودکان و به‌خصوص نوزادان باعث می‌شود"
+ - text: "امیدواریم نوروز امسال سالی"
+ ---

+ # GPT2 Medium 4 Persian
+ > This is part of the
+ [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-gpt2-from-scratch-in-persian/7560), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.


+ ## Team Members
+ - FirstName LastName ([hf_user](https://huggingface.co/hf_user))
+ ... SOON

+ ## Dataset

+ ... SOON

+ ## How To Use

+ ... SOON

+ ## Evaluation

+ ... SOON