---
language:
- zh
- ja
tags:
- crosslingual
license: apache-2.0
datasets:
- Wikipedia
---

# UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database

## Model description

Chinese and Japanese share many characters with similar surface morphology. To better utilize the shared knowledge across the languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts, to first merge morphologically similar characters into clusters. The resulting clusters are used to replace the original characters in sentences for the coarse-grained pretraining of the MLM. Then, we restore the clusters back to the original characters in sentences for the fine-grained pretraining to learn the representation of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that our proposed UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages. [Paper](https://www.aclweb.org/anthology/2020.aacl-main.24/)
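
As a rough illustration, the coarse-grained stage can be pictured as substituting every character with a representative of its Unihan variant cluster before masking. The sketch below uses a tiny hypothetical mapping (`CHAR_TO_CLUSTER`) as a stand-in for the clusters actually derived from Unihan; it is not the training code from the paper.

```python
# Toy character -> cluster-representative mapping. In UnihanLM the clusters are
# built from variant relations in the Unihan database; this dictionary is a
# hypothetical stand-in for illustration only.
CHAR_TO_CLUSTER = {
    "学": "學", "學": "學",  # "study": simplified/shinjitai vs. traditional
    "国": "國", "國": "國",  # "country"
}

def coarsen(sentence: str) -> str:
    """Replace each character with its cluster representative (coarse-grained stage)."""
    return "".join(CHAR_TO_CLUSTER.get(ch, ch) for ch in sentence)

# After coarsening, shared characters in Chinese and Japanese text collapse to
# the same cluster symbols, so the two languages share more of the vocabulary.
print(coarsen("中国的大学"))  # Chinese  -> 中國的大學
print(coarsen("中国の大学"))  # Japanese -> 中國の大學
```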

## Intended uses & limitations

#### How to use

Use it the same way you use XLM :) A minimal loading example is sketched below.
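
The snippet assumes the checkpoint can be loaded through the Hugging Face Transformers Auto classes like an XLM checkpoint; the model ID is a placeholder, so substitute this repository's actual ID on the Hub.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "microsoft/unihanlm-base"  # placeholder: use this repo's actual Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Score a sentence with the masked language model head.
inputs = tokenizer("東京は日本の首都です。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```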

#### Limitations and bias

The training corpus comes solely from Wikipedia, so the model may perform worse on informal text. Be careful with English words! The tokenizer will split them into individual characters.

## Training data

We use Chinese and Japanese Wikipedia to train the model.

## Training procedure

Please refer to our paper: https://www.aclweb.org/anthology/2020.aacl-main.24/

## Eval results

Please refer to our paper: https://www.aclweb.org/anthology/2020.aacl-main.24/

### BibTeX entry and citation info

```bibtex
@inproceedings{xu-etal-2020-unihanlm,
    title = "{U}nihan{LM}: Coarse-to-Fine {C}hinese-{J}apanese Language Model Pretraining with the Unihan Database",
    author = "Xu, Canwen and
      Ge, Tao and
      Li, Chenliang and
      Wei, Furu",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.aacl-main.24",
    pages = "201--211"
}
```