uer committed
Commit a472bdd
1 Parent(s): b9b9ffc

Update README.md

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -36,30 +36,30 @@ Training data contains 3,000,000 ancient Chinese texts which are collected by [daizhige
 The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 500,000 steps with a sequence length of 320. We use an extended vocabulary to handle out-of-vocabulary words: Chinese characters that occur at least 100 times in the ancient Chinese corpus are added to the vocabulary.
 
 ```
-python3 preprocess.py --corpus_path corpora/ancient_chinese.txt \\
-                      --vocab_path models/google_zh_vocab.txt \\
-                      --dataset_path ancient_chinese_dataset.pt --processes_num 16 \\
+python3 preprocess.py --corpus_path corpora/ancient_chinese.txt \
+                      --vocab_path models/google_zh_vocab.txt \
+                      --dataset_path ancient_chinese_dataset.pt --processes_num 16 \
                       --seq_length 320 --target lm
 ```
 
 ```
-python3 pretrain.py --dataset_path ancient_chinese_dataset.pt \\
-                    --vocab_path models/google_zh_vocab.txt \\
-                    --config_path models/bert_base_config.json \\
-                    --output_model_path models/ancient_chinese_base_model.bin \\
-                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\
-                    --total_steps 500000 --save_checkpoint_steps 100000 --report_steps 10000 \\
-                    --learning_rate 5e-4 --batch_size 32 \\
-                    --embedding word_pos --remove_embedding_layernorm \\
-                    --encoder transformer --mask causal --layernorm_positioning pre \\
-                    --target lm --tie_weight
+python3 pretrain.py --dataset_path ancient_chinese_dataset.pt \
+                    --vocab_path models/google_zh_vocab.txt \
+                    --config_path models/bert_base_config.json \
+                    --output_model_path models/ancient_chinese_base_model.bin \
+                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                    --total_steps 500000 --save_checkpoint_steps 100000 --report_steps 10000 \
+                    --learning_rate 5e-4 --batch_size 32 \
+                    --embedding word_pos --remove_embedding_layernorm \
+                    --encoder transformer --mask causal --layernorm_positioning pre \
+                    --target lm --tie_weights
 ```
 
 Finally, we convert the pre-trained model into Huggingface's format:
 
 ```
-python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path ancient_chinese_base_model.bin-500000 \\
-                                                        --output_model_path pytorch_model.bin \\
+python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path ancient_chinese_base_model.bin-500000 \
+                                                        --output_model_path pytorch_model.bin \
                                                         --layers_num 12
 ```
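
The vocabulary extension described in the README text above happens outside the commands shown in the diff. The sketch below is one way such an extension could be built: the 100-occurrence threshold comes from the README, while the script itself, its function name, and the output file name are illustrative assumptions rather than UER-py code.

```
# Hypothetical helper: extend a BERT-style vocab file with characters that
# occur at least 100 times in the ancient Chinese corpus (threshold from the
# README). File names and this script are assumptions, not part of UER-py.
from collections import Counter

def extend_vocab(corpus_path, vocab_path, out_path, min_count=100):
    # Count every character in the corpus.
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip())
    # Load the existing vocabulary, one token per line.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [w.rstrip("\n") for w in f]
    known = set(vocab)
    # Append frequent characters that are not already in the vocabulary.
    new_chars = [ch for ch, n in counts.items()
                 if n >= min_count and ch not in known and not ch.isspace()]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab + new_chars) + "\n")

extend_vocab("corpora/ancient_chinese.txt", "models/google_zh_vocab.txt",
             "models/ancient_chinese_vocab.txt")
```

The resulting file could then be passed via --vocab_path to the preprocess and pretrain commands in place of the base vocabulary.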
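
After conversion, the checkpoint can be loaded with Hugging Face transformers. A minimal sketch, assuming the converted pytorch_model.bin is published together with a GPT-2 config.json and the vocabulary under a repo id such as uer/gpt2-chinese-ancient (an assumption here; a local directory containing the converted files works the same way):

```
# Sketch: generate ancient Chinese text from the converted checkpoint.
# The repo id and the prompt are assumptions, not taken from this commit.
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-ancient")
model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-ancient")
text_generator = TextGenerationPipeline(model, tokenizer)
print(text_generator("当是时", max_length=50, do_sample=True))
```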