wissamantoun committed
Commit 9f6d74c
1 Parent(s): 87071b6

added citation

Files changed (1)
  1. README.md +15 -10
README.md CHANGED
@@ -63,7 +63,7 @@ Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-o
 
 ## Finetuning using our code with TF 1.15.4:
 
-- Create the Training TFRecords:
+Create the Training TFRecords:
 ```bash
 python create_pretraining_data.py
  --input_file=<RAW TEXT FILE with documents/articles separated by an empty line>
@@ -71,7 +71,7 @@ python create_pretraining_data.py
  --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
 ```
 
-- Finetuning:
+Finetuning:
 ```bash
 python3 run_pretraining.py \
  --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
@@ -119,7 +119,7 @@ The pretraining data used for the new AraGPT2 model is also used for **AraBERTv2
 
 The dataset consists of 77GB or 200,095,961 lines or 8,655,948,860 words or 82,232,988,358 chars (before applying Farasa Segmentation)
 
-For the new dataset we added the unshuffled OSCAR corpus, after we thoroughly filter it, to the dataset used in AraBERTv1 but with out the websites that we previously crawled:
+For the new dataset, we added the unshuffled OSCAR corpus, after thoroughly filtering it, to the dataset used in AraBERTv1, but without the websites that we previously crawled:
 - OSCAR unshuffled and filtered.
 - [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
 - [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
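The line, word, and character totals quoted above can be sanity-checked on any raw text corpus with standard tooling; the sketch below is illustrative only, and `corpus.txt` is a placeholder rather than a file shipped with the model.

```bash
# Report line, word, and character counts for a raw UTF-8 corpus (placeholder path).
# wc prints the counts in the order: lines, words, characters.
wc -lwm corpus.txt
```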
@@ -133,13 +133,18 @@ The text generated by AraGPT2 is automatically generated by a neural network mod
 # If you used this model please cite us as :
 
 ```
-@misc{antoun2020aragpt2,
-      title={AraGPT2: Pre-Trained Transformer for Arabic Language Generation},
-      author={Wissam Antoun and Fady Baly and Hazem Hajj},
-      year={2020},
-      eprint={2012.15520},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
+@inproceedings{antoun-etal-2021-aragpt2,
+    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
+    author = "Antoun, Wissam and
+      Baly, Fady and
+      Hajj, Hazem",
+    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
+    month = apr,
+    year = "2021",
+    address = "Kyiv, Ukraine (Virtual)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
+    pages = "196--207",
 }
 ```
 
 