khashei committed on
Commit
fd10943
1 Parent(s): 8bf8807

added details on special tokens

Files changed (1): README.md (+23 −4)

README.md CHANGED
@@ -6,7 +6,7 @@ tags:
 - persian
 ---
 # GPT2-Persian
-Bolbol-zaban/gpt2-persian is gpt2 language model that is trained with hyper parameters similar to standard gpt2-medium with two differences:
+bolbolzaban/gpt2-persian is a GPT-2 language model trained with hyperparameters similar to the standard gpt2-medium, with three differences:
 1. The context size is reduced from 1024 to 256 subwords in order to make the training affordable.
 2. Instead of BPE, the Google SentencePiece tokenizer is used for tokenization.
 3. The training dataset includes only Persian text. All non-Persian characters are replaced with special tokens (e.g. [LAT], [URL], [NUM]).
@@ -25,12 +25,31 @@ generator = pipeline('text-generation', model, tokenizer=tokenizer, config={'max
 sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران')
 ```
 If you are using TensorFlow, import TFGPT2LMHeadModel instead of GPT2LMHeadModel.
+## Special Tokens
+gpt2-persian is trained for the purpose of research on Persian poetry. Because of that, all English words and numbers are replaced with special tokens, and only the standard Persian alphabet is used in the input text. Here is one example:
+
+Original text: اگر آیفون یا آیپد شما دارای سیستم عامل iOS 14.3 یا iPadOS 14.3 یا نسخه‌های جدیدتر باشد
+Text used in training: اگر آیفون یا آیپد شما دارای سیستم عامل [LAT] [NUM] یا [LAT] [NUM] یا نسخه‌های جدیدتر باشد
+
+Please consider normalizing your input text using [Hazm](https://github.com/sobhe/hazm) or similar libraries, and ensure only Persian characters are provided as input.
+
+If you want to use classical Persian poetry as input, use [BOM] (beginning of mesra) at the beginning of each verse (مصرع), followed by [EOS] (end of statement) at the end of each couplet (بیت).
+
+See the following links for examples:
+[[BOM] توانا بود](https://huggingface.co/bolbolzaban/gpt2-persian?text=%5BBOM%5D+%D8%AA%D9%88%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF)
+
+[[BOM] توانا بود هر که دانا بود [BOM]](https://huggingface.co/bolbolzaban/gpt2-persian?text=%5BBOM%5D+%D8%AA%D9%88%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF+%D9%87%D8%B1+%DA%A9%D9%87+%D8%AF%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF+%5BBOM%5D)
+
+[[BOM] توانا بود هر که دانا بود [BOM] ز دانش دل پیر](https://huggingface.co/bolbolzaban/gpt2-persian?text=%5BBOM%5D+%D8%AA%D9%88%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF+%D9%87%D8%B1+%DA%A9%D9%87+%D8%AF%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF+%5BBOM%5D+%D8%B2+%D8%AF%D8%A7%D9%86%D8%B4+%D8%AF%D9%84+%D9%BE%DB%8C%D8%B1)
+
+[[BOM] توانا بود هر که دانا بود [BOM] ز دانش دل پیربرنا بود [EOS]](https://huggingface.co/bolbolzaban/gpt2-persian?text=%5BBOM%5D+%D8%AA%D9%88%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF+%D9%87%D8%B1+%DA%A9%D9%87+%D8%AF%D8%A7%D9%86%D8%A7+%D8%A8%D9%88%D8%AF+%5BBOM%5D+%D8%B2+%D8%AF%D8%A7%D9%86%D8%B4+%D8%AF%D9%84+%D9%BE%DB%8C%D8%B1%D8%A8%D8%B1%D9%86%D8%A7+%D8%A8%D9%88%D8%AF++%5BEOS%5D)
+
+If you would like to know more about the structure of classical Persian poetry, refer to these [blog posts](https://medium.com/@khashei).
 ## Acknowledgment
 This project is supported by Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
 ## Citation and Reference
 Please reference the "bolbolzaban.com" website if you are using gpt2-persian in your research or commercial application.
 ## Contacts
-[Bolbolzaban.com](http://bolbolzaban.com/about), [Twitter](https://twitter.com/bolbol_zaban), [Telegram](https://t.me/bolbol_zaban), [Instagram](https://www.instagram.com/bolbolzaban/), [Linkedin](https://www.linkedin.com/in/khashei/)
+Please reach out on [Linkedin](https://www.linkedin.com/in/khashei/) or [Telegram](https://t.me/khasheia) if you have any questions or need any help using the model.
+
+Follow [Bolbolzaban](http://bolbolzaban.com/about) on [Twitter](https://twitter.com/bolbol_zaban), [Telegram](https://t.me/bolbol_zaban) or [Instagram](https://www.instagram.com/bolbolzaban/)
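
The [BOM]/[EOS] convention described in the added Special Tokens section can be sketched as a small prompt builder. This is a minimal illustration only; the helper name `format_couplet` is my own, not part of the model's API:

```python
def format_couplet(mesra1, mesra2=None):
    """Build a gpt2-persian poetry prompt: [BOM] opens each hemistich (مصرع),
    and [EOS] closes a completed couplet (بیت).
    (Helper name is illustrative, not part of the model's API.)"""
    prompt = f"[BOM] {mesra1} [BOM]"
    if mesra2 is not None:
        prompt += f" {mesra2} [EOS]"
    return prompt

# First hemistich only: the model is left to continue the couplet.
print(format_couplet("توانا بود هر که دانا بود"))
# Full couplet, terminated with [EOS].
print(format_couplet("توانا بود هر که دانا بود", "ز دانش دل پیر برنا بود"))
```

The resulting string can then be passed to the `pipeline('text-generation', ...)` generator shown in the README (which requires downloading the model), e.g. `generator(format_couplet("توانا بود هر که دانا بود"))`.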