CKIP BERT Base Han Chinese WS

This model performs word segmentation (WS) on ancient Chinese text. Its training data covers four eras of the Chinese language.

Homepage

ckiplab/han-transformers

Training Datasets

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

Contributors

Chin-Tung Lin at CKIP

Usage

Using our model in your script

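from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
)

# Load the tokenizer and the model together with its token-classification head,
# which is what produces the B/I segmentation tags.
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws")
model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-han-chinese-ws")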

Using our model for inference

>>> from transformers import pipeline
>>> classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
>>> classifier("帝堯曰放勳")

# output
[{'entity': 'B',
'score': 0.9999793,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'I',
'score': 0.9915047,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'B',
'score': 0.99992275,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'B',
'score': 0.99905187,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'I',
'score': 0.96299917,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
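
Turning the tags into words

The pipeline returns one B/I tag per character rather than a list of words. The sketch below shows one way to group those tags into words, assuming that "B" marks the first character of a word and "I" marks a continuation, as the output above suggests. The helper name merge_bi_tags is hypothetical and is not part of transformers or of this model's repository. The input is a shortened copy of the output shown above, so the snippet runs without downloading the model.

```python
def merge_bi_tags(predictions):
    """Group per-character B/I predictions into a list of words.

    Hypothetical helper: assumes "B" begins a word and "I" continues it.
    """
    words = []
    for item in predictions:
        # Start a new word on "B", or when an "I" appears with nothing to extend.
        if item["entity"] == "B" or not words:
            words.append(item["word"])
        else:
            words[-1] += item["word"]
    return words


# Shortened copy of the pipeline output shown above (scores and offsets dropped).
predictions = [
    {"entity": "B", "word": "帝"},
    {"entity": "I", "word": "堯"},
    {"entity": "B", "word": "曰"},
    {"entity": "B", "word": "放"},
    {"entity": "I", "word": "勳"},
]

print(merge_bi_tags(predictions))  # ['帝堯', '曰', '放勳']
```

With a loaded pipeline, the same helper can be applied directly to the result of classifier("帝堯曰放勳"), since only the 'entity' and 'word' keys are read.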