uer committed on
Commit
4124b40
1 Parent(s): 8cf5bf9

Update README.md

Files changed (1)
  1. README.md +5 -2
README.md CHANGED
@@ -28,11 +28,12 @@ You can use the model directly with a pipeline for text generation:
 
 ## Training data
 
-Training data contains 3,000,000 ancient Chinese which are collected by [daizhigev20](https://github.com/garychowcmu/daizhigev20).
+Training data contains 3,000,000 ancient Chinese which are collected by [daizhigev20](https://github.com/garychowcmu/daizhigev20). Since part of ancient corpus has no punctuation, we used the [ancient Chinese punctuation system](https://seg.shenshen.wiki) developed by [BNU ICIP lab](http://icip.bnu.edu.cn/).
+
 
 ## Training procedure
 
-The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 500,000 steps with a sequence length of 320. We use extended vocabulary to handle out-of-vocabulary words. The Chinese character that occurs greater than or equal to 100 in ancient chinese corpus is added to the vocabulary.
+The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 500,000 steps with a sequence length of 320. We use extended vocabulary to handle out-of-vocabulary words. The Chinese character that occurs greater than or equal to 100 in ancient Chinese corpus is added to the vocabulary.
 
 ```
 python3 preprocess.py --corpus_path corpora/ancient_chinese.txt \
@@ -72,4 +73,6 @@ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path ancie
 pages={241},
 year={2019}
 }
+
+胡韧奋,李绅,诸雨辰.基于深层语言模型的古汉语知识表示及自动断句研究[C].第十八届中国计算语言学大会(CCL 2019).
 ```
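The vocabulary-extension rule described in the updated README (any Chinese character occurring at least 100 times in the ancient-Chinese corpus is added to the vocabulary) can be sketched as below. This is a minimal illustration, not UER-py's actual implementation: the function name `extend_vocab`, the `base_vocab` argument, and the toy threshold of 2 are all invented for the example.

```python
from collections import Counter

def extend_vocab(base_vocab, corpus_lines, min_count=100):
    """Return base_vocab plus every character whose corpus frequency
    reaches min_count (the >= 100 rule described in the README)."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line)  # count individual characters
    base = set(base_vocab)
    extra = [ch for ch, n in counts.items() if n >= min_count and ch not in base]
    return list(base_vocab) + sorted(extra)

# Toy demonstration with min_count=2 instead of 100:
# 曰 occurs three times, 子 and 仁 only once each.
vocab = extend_vocab(["[PAD]", "[UNK]"], ["子曰", "曰仁", "曰"], min_count=2)
# vocab == ["[PAD]", "[UNK]", "曰"]
```

On real data the corpus would be streamed line by line from `corpora/ancient_chinese.txt` and the resulting characters appended to the pre-trained model's vocabulary file.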