# GPT2 - Persian
## Scripts
### Normalizer
```python
from src.normalizer import normalize
input_text = "ὑ蕉Ұ제ṅ尘̲改座◦花芝秀黄天자埃澤ಿ ˈazbab اینجا ایران خانهشما است؟!۱۲۳۱۲۳۱۳۱۲ اَلْحُرُوفُ ٱلْعَرَبِیَّة"
print(normalize(input_text))
```
Output:
```text
azbab اینجا ایران خانهشما است ؟ ! 1231231312 الحروف لعربیه
```
### Training tokenizer
```bash
python train_tokenizer.py --dataset_name oscar --dataset_config_name unshuffled_deduplicated_als --vocab_size 42000
```
### Configuration
```bash
python create_config.py --name_or_path gpt2-medium --params '{"vocab_size": 42000}'
```
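The `--params` flag above takes a JSON object as a string. As a rough illustration of how such a flag could be applied, the sketch below parses the string and overlays it on a base configuration. It is not the contents of `create_config.py`: the helper names `parse_args` and `build_config`, and the stand-in base dictionary, are assumptions made for this example.

```python
import argparse
import json


def parse_args(argv=None):
    # Hypothetical parser that mirrors the two flags used in the command above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--name_or_path", required=True)
    parser.add_argument("--params", default="{}")
    return parser.parse_args(argv)


def build_config(base: dict, params_json: str) -> dict:
    # Decode the JSON string and let its keys override the base values.
    overrides = json.loads(params_json)
    if not isinstance(overrides, dict):
        raise ValueError("--params must be a JSON object")
    return {**base, **overrides}


if __name__ == "__main__":
    args = parse_args(
        ["--name_or_path", "gpt2-medium", "--params", '{"vocab_size": 42000}']
    )
    # Assumption: a stand-in for whatever base config the real script loads.
    base = {"vocab_size": 50257, "n_layer": 24}
    print(build_config(base, args.params))
```

Keeping the overrides in a single JSON argument means any config key can be changed without adding a new command-line flag for each one.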
### Normalization steps
Steps:
- [x] Remove stretched words such as ســــــــــلام
- [x] Remove links and user mentions (such as @jane_doe)
- [ ] Remove Telegram and Instagram advertisements or posts (the whole record)
- [ ] Remove advertisement records
- [ ] Remove tag words (or the whole record) that appear as a separate record but are only the tags at the end of a post (such as بلاب ... بلاب ... ورزشی، خبری، سیاسی، اجتماعی، خانوده)
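
The two checked steps can be sketched with a few regular expressions. This is a minimal illustration, not the code in `src/normalizer`: the helper name `clean_text` and all patterns are assumptions. It also un-stretches words by deleting the tatweel character (U+0640) rather than dropping the whole word, which is only one possible reading of the first step.

```python
import re

# Assumption: illustrative patterns, not the ones used by the project.
TATWEEL = "\u0640"  # Arabic tatweel (kashida), the character used to stretch words
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Hypothetical helper: strip links, user mentions, and word stretching."""
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop mentions such as @jane_doe
    text = text.replace(TATWEEL, "")   # un-stretch words by deleting tatweel
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    stretched = "س" + TATWEEL * 10 + "لام"
    print(clean_text(stretched + " @jane_doe https://example.com"))
```

Note that the mention pattern also matches the domain part of an email address, so a real pipeline would likely need a stricter rule.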