---
language:
- zh
thumbnail: https://ckip.iis.sinica.edu.tw/files/ckip_logo.png
tags:
- pytorch
- lm-head
- bert
- zh
license: gpl-3.0
---
# CKIP BERT Base Han Chinese

A model pretrained on Ancient Chinese text using a masked language modeling (MLM) objective.

## Homepage

* [ckiplab/han-transformers](https://github.com/ckiplab/han-transformers)

## Training Datasets

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

* [中央研究院上古漢語標記語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh) (Academia Sinica Tagged Corpus of Old Chinese)
* [中央研究院中古漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh) (Academia Sinica Middle Chinese Corpus)
* [中央研究院近代漢語語料庫](http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh) (Academia Sinica Early Mandarin Chinese Corpus)
* [中央研究院現代漢語語料庫](http://asbc.iis.sinica.edu.tw) (Academia Sinica Balanced Corpus of Modern Chinese)

## Contributors

* Chin-Tung Lin at [CKIP](https://ckip.iis.sinica.edu.tw)

## Usage

* Using our model in your script

```python
from transformers import (
    AutoTokenizer,
    AutoModel,
)

tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")
```

* Using our model for inference

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')
>>> unmasker("黎[MASK]於變時雍。")

[{'sequence': '黎 民 於 變 時 雍 。',
  'score': 0.14885780215263367,
  'token': 3696,
  'token_str': '民'},
 {'sequence': '黎 庶 於 變 時 雍 。',
  'score': 0.0859643816947937,
  'token': 2433,
  'token_str': '庶'},
 {'sequence': '黎 氏 於 變 時 雍 。',
  'score': 0.027848130092024803,
  'token': 3694,
  'token_str': '氏'},
 {'sequence': '黎 人 於 變 時 雍 。',
  'score': 0.023678112775087357,
  'token': 782,
  'token_str': '人'},
 {'sequence': '黎 生 於 變 時 雍 。',
  'score': 0.018718384206295013,
  'token': 4495,
  'token_str': '生'}]
```
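
* Post-processing the predictions (illustrative sketch)

The `fill-mask` pipeline returns a list of dictionaries with `sequence`, `score`, `token`, and `token_str` keys, as in the output above. The sketch below is not part of the CKIP tooling: `top_candidates` is a hypothetical helper, assumed here only for illustration, that keeps the candidate character and a rounded score. It is run on two entries copied from the output above, so it does not need to download the model.

```python
from typing import Dict, List, Tuple


def top_candidates(predictions: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (token_str, score) pairs, highest score first.

    Hypothetical helper for illustration only. It assumes each prediction
    carries the 'token_str' and 'score' keys seen in the pipeline output.
    """
    # Sort defensively instead of relying on the input order.
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


# Two entries copied from the example output, deliberately out of order.
sample = [
    {"sequence": "黎 庶 於 變 時 雍 。", "score": 0.0859643816947937,
     "token": 2433, "token_str": "庶"},
    {"sequence": "黎 民 於 變 時 雍 。", "score": 0.14885780215263367,
     "token": 3696, "token_str": "民"},
]

for token_str, score in top_candidates(sample, k=2):
    print(token_str, score)
```

With a live pipeline, the same helper could be applied directly to the list returned by `unmasker(...)`.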