---
language: Chinese
  widget:
  - text: "北京上个月召开了两会"
---

# Chinese RoBERTa-Base Models for Text Classification

## Model description

This is the set of 5 Chinese RoBERTa-Base classification models fine-tuned by [UER-py](https://arxiv.org/abs/1909.05658). You can download the 5 Chinese RoBERTa-Base classification models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) (in UER-py format), or via HuggingFace from the links below:
 
  | corpus | Link |
| :-----------: | :-------------------------------------------------------: |
| **JD full** | [**roberta-base-finetuned-jd-full-chinese**][jd_full] |
| **JD binary** | [**roberta-base-finetuned-jd-binary-chinese**][jd_binary] |
| **Dianping** | [**roberta-base-finetuned-dianping-chinese**][dianping] |
| **Ifeng** | [**roberta-base-finetuned-ifeng-chinese**][ifeng] |
| **Chinanews** | [**roberta-base-finetuned-chinanews-chinese**][chinanews] |

  ## How to use

You can use this model directly with a pipeline for text classification (take the case of roberta-base-finetuned-chinanews-chinese).
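A minimal sketch, assuming the standard `transformers` pipeline API and this card's model id:

```
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
>>> # load the fine-tuned Chinanews classifier and its tokenizer from the Hub
>>> model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
>>> tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
>>> text_classification = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
>>> text_classification("北京上个月召开了两会")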
[{'label': 'mainland China politics', 'score': 0.7211663722991943}]
  ```

## Training data

We use 5 Chinese text classification datasets collected by the [Glyph](https://github.com/zhangxiangxiao/glyph) project.
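Before fine-tuning, each dataset is converted to UER-py's classification format; a minimal sketch of that layout, assuming the usual `label`/`text_a` TSV header (the example row is illustrative, not taken from the corpus):

```
label	text_a
0	北京上个月召开了两会
```

The `--train_path` and `--dev_path` arguments in the next section point at files in this layout.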

## Training procedure

Models are fine-tuned by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We fine-tune for three epochs with a sequence length of 512 on the basis of the pre-trained model [chinese_roberta_L-12_H-768](https://huggingface.co/uer/chinese_roberta_L-12_H-768). At the end of each epoch, the model is saved when the best performance on the development set is achieved. We use the same hyper-parameters for all five models.

Taking the case of roberta-base-finetuned-chinanews-chinese:

```
python3 run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
                          --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/glyph/chinanews/train.tsv \
                          --dev_path datasets/glyph/chinanews/dev.tsv \
                          --output_model_path models/chinanews_classifier_model.bin \
                          --learning_rate 3e-5 --batch_size 32 --epochs_num 3 --seq_length 512 \
                          --embedding word_pos_seg --encoder transformer --mask fully_visible
```

Finally, we convert the fine-tuned model into Huggingface's format:

```
python3 scripts/convert_bert_text_classification_from_uer_to_huggingface.py --input_model_path models/chinanews_classifier_model.bin \
                                                                            --output_model_path pytorch_model.bin \
                                                                            --layers_num 12
```
 
### BibTeX entry and citation info

```
@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```

[jd_full]:https://huggingface.co/uer/roberta-base-finetuned-jd-full-chinese
[jd_binary]:https://huggingface.co/uer/roberta-base-finetuned-jd-binary-chinese
[dianping]:https://huggingface.co/uer/roberta-base-finetuned-dianping-chinese
[ifeng]:https://huggingface.co/uer/roberta-base-finetuned-ifeng-chinese
[chinanews]:https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese