uer committed on
Commit
081eba7
1 Parent(s): 4e11c2f

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -46,18 +46,18 @@ Training data contains 800,000 Chinese ancient poems which are collected by [chi
 
 ## Training procedure
 
-The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 200,000 steps with a sequence length of 128.
+The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 200,000 steps with a sequence length of 128. We use an extended vocabulary to handle out-of-vocabulary words: Chinese characters that occur at least 100 times in the poem corpus are added to the vocabulary.
 
 ```
 python3 preprocess.py --corpus_path corpora/poem.txt \
---vocab_path models/google_zh_vocab.txt \
+--vocab_path models/poem_zh_vocab.txt \
 --dataset_path poem_dataset.pt --processes_num 16 \
 --seq_length 128 --target lm
 ```
 
 ```
 python3 pretrain.py --dataset_path poem_dataset.pt \
---vocab_path models/google_zh_vocab.txt \
+--vocab_path models/poem_zh_vocab.txt \
 --output_model_path models/poem_gpt2_model.bin \
 --config_path models/gpt2/config.json \
 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
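
The vocabulary-extension rule introduced in this change (add every Chinese character that appears at least 100 times in the poem corpus to the base vocabulary) can be sketched as follows. This is a minimal illustration, not UER-py's actual tooling; `extend_vocab` and the toy corpus are hypothetical, and the demo lowers the threshold to `min_count=2` so the effect is visible on a few lines.

```python
from collections import Counter

def extend_vocab(base_vocab, corpus_lines, min_count=100):
    """Return base_vocab plus corpus characters seen at least min_count times.

    Base entries come first, then new characters in descending frequency
    order; characters already in base_vocab are not duplicated.
    """
    counts = Counter(ch for line in corpus_lines for ch in line.strip())
    known = set(base_vocab)
    extra = [ch for ch, n in counts.most_common()
             if n >= min_count and ch not in known]
    return base_vocab + extra

# Toy demonstration: "月" (3 occurrences) is already in the base vocab,
# "明" (2 occurrences) clears the threshold and is appended.
corpus = ["床前明月光", "明月几时有", "月落乌啼霜满天"]
vocab = extend_vocab(["[PAD]", "[UNK]", "月"], corpus, min_count=2)
```

The resulting list would then be written one token per line as the `models/poem_zh_vocab.txt` file that the `--vocab_path` flag above points at.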