# ChineseBERT-base

This repository contains the code, model, and dataset for **ChineseBERT** (ACL 2021).

Paper:
**[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/abs/2106.16038)**
*Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li*

Code:
[ChineseBERT github link](https://github.com/ShannonAI/ChineseBert)

## Model description

We propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining.

First, for each Chinese character, we obtain three kinds of embeddings:
- **Char Embedding:** the same as the original BERT token embedding.
- **Glyph Embedding:** captures visual features based on different fonts of a Chinese character.
- **Pinyin Embedding:** captures phonetic features from the pinyin sequence of a Chinese character.
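
To make the three embedding types concrete, here is a minimal, illustrative PyTorch sketch. The module names, default sizes (vocabulary, number of fonts, bitmap resolution, pinyin length), and the flattened-bitmap and pinyin-ID inputs are assumptions for illustration only; the actual implementation is in the [ChineseBERT github repository](https://github.com/ShannonAI/ChineseBert).

```python
import torch
import torch.nn as nn


class CharEmbedding(nn.Module):
    """Standard BERT-style token embedding lookup."""
    def __init__(self, vocab_size=23236, dim=768):  # illustrative sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, char_ids):          # (batch, seq_len)
        return self.embed(char_ids)       # (batch, seq_len, dim)


class GlyphEmbedding(nn.Module):
    """Toy glyph encoder: projects flattened character bitmaps rendered in
    several fonts to a dense vector."""
    def __init__(self, num_fonts=3, image_size=24, dim=768):
        super().__init__()
        self.proj = nn.Linear(num_fonts * image_size * image_size, dim)

    def forward(self, glyph_pixels):      # (batch, seq_len, num_fonts*image_size**2)
        return self.proj(glyph_pixels)    # (batch, seq_len, dim)


class PinyinEmbedding(nn.Module):
    """Toy pinyin encoder: embeds the romanized pinyin characters of each
    Chinese character, then applies a width-2 convolution and max pooling."""
    def __init__(self, pinyin_vocab=32, dim=768):
        super().__init__()
        self.char_embed = nn.Embedding(pinyin_vocab, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=2)

    def forward(self, pinyin_ids):        # (batch, seq_len, pinyin_len)
        b, s, p = pinyin_ids.shape
        x = self.char_embed(pinyin_ids.view(b * s, p))   # (b*s, p, dim)
        x = self.conv(x.transpose(1, 2))                 # (b*s, dim, p-1)
        x = x.max(dim=-1).values                         # (b*s, dim)
        return x.view(b, s, -1)                          # (batch, seq_len, dim)
```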

Then the char embedding, glyph embedding, and pinyin embedding are concatenated and mapped to a D-dimensional fusion embedding through a fully connected layer.
Finally, the fusion embedding is added to the position embedding and fed as input to the BERT model.
The following image shows an overview of the ChineseBERT architecture.

![MODEL](https://raw.githubusercontent.com/ShannonAI/ChineseBert/main/images/ChineseBERT.png)
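
As a rough sketch of this fusion step (reusing the toy modules above; the layer names and sizes are assumptions, not the repository's exact code):

```python
class FusionEmbedding(nn.Module):
    """Concatenates char, glyph, and pinyin embeddings, maps them back to a
    D-dimensional fusion embedding with a fully connected layer, and adds
    position embeddings before the BERT encoder."""
    def __init__(self, dim=768, max_len=512):
        super().__init__()
        self.char = CharEmbedding(dim=dim)
        self.glyph = GlyphEmbedding(dim=dim)
        self.pinyin = PinyinEmbedding(dim=dim)
        self.fuse = nn.Linear(3 * dim, dim)      # fully connected fusion layer
        self.position = nn.Embedding(max_len, dim)

    def forward(self, char_ids, glyph_pixels, pinyin_ids):
        fused = self.fuse(torch.cat(
            [self.char(char_ids), self.glyph(glyph_pixels), self.pinyin(pinyin_ids)],
            dim=-1))                             # (batch, seq_len, dim)
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        return fused + self.position(positions)  # fed into the BERT model


# Toy usage with random inputs:
emb = FusionEmbedding()
out = emb(torch.randint(0, 23236, (2, 16)),      # char ids
          torch.rand(2, 16, 3 * 24 * 24),        # flattened glyph bitmaps
          torch.randint(0, 32, (2, 16, 8)))      # pinyin character ids
print(out.shape)                                 # torch.Size([2, 16, 768])
```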

ChineseBERT leverages the glyph and pinyin information of Chinese characters to enhance the model's ability to capture context semantics from surface character forms and to disambiguate polyphonic characters in Chinese.