## Training data

Training data contains 3,000,000 ancient Chinese texts collected by [daizhigev20](https://github.com/garychowcmu/daizhigev20). Since part of the ancient corpus has no punctuation, we punctuated it with the [ancient Chinese punctuation system](https://seg.shenshen.wiki) developed by the [BNU ICIP lab](http://icip.bnu.edu.cn/).

## Training procedure

The model is pre-trained with [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 500,000 steps with a sequence length of 320. We use an extended vocabulary to handle out-of-vocabulary words: every Chinese character that occurs at least 100 times in the ancient Chinese corpus is added to the vocabulary.

```
python3 preprocess.py --corpus_path corpora/ancient_chinese.txt \
```
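The frequency-based vocabulary extension described above can be sketched as follows. This is a minimal illustration of the idea, not the UER-py implementation; the function name and the toy data are assumptions, and only the threshold of 100 occurrences comes from the text.

```python
from collections import Counter

def extend_vocab(base_vocab, corpus_lines, min_count=100):
    """Return a copy of base_vocab with every character that occurs
    at least min_count times in the corpus appended to it."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line)  # count individual characters
    vocab = list(base_vocab)
    seen = set(base_vocab)
    for ch, n in counts.items():
        if n >= min_count and ch not in seen:
            vocab.append(ch)
            seen.add(ch)
    return vocab

# Toy usage: the threshold is lowered to 2 so the tiny sample triggers it.
base = ["[PAD]", "[UNK]"]
sample = ["天地玄黄", "天高地厚"]
extended = extend_vocab(base, sample, min_count=2)
# "天" and "地" each occur twice, so they are appended to the vocabulary.
```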
pages={241},
year={2019}
}

Hu Renfen, Li Shen, Zhu Yuchen. Knowledge representation and automatic sentence segmentation of ancient Chinese based on deep language models [C]. Proceedings of the 18th China National Conference on Computational Linguistics (CCL 2019).
```