roberta-classical-chinese-base-sentence-segmentation

Model Description

This is a RoBERTa model pre-trained on Classical Chinese texts for sentence segmentation, derived from roberta-classical-chinese-base-char. Each character is assigned a token class: every segmented sentence begins with "B" and ends with "E", except that a single-character sentence is tagged "S".
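To make the tagging scheme concrete, here is a minimal sketch of how per-character tags could be turned into a list of sentences. The helper name `split_by_tags` is hypothetical, the tags are written by hand rather than produced by the model, and the interior tag "M" is an assumption used only for illustration; the sketch treats any tag other than "E" or "S" as not ending a sentence.

```python
def split_by_tags(chars, tags):
    """Group characters into sentences: a sentence closes on tag "E" or "S"."""
    sentences, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):
            # "E" ends a multi-character sentence, "S" is a one-character sentence
            sentences.append("".join(current))
            current = []
    if current:
        # Keep trailing characters whose sentence was never closed
        sentences.append("".join(current))
    return sentences

# Hand-written tags for illustration only ("M" is an assumed interior tag)
text = "子曰學而時習之"
tags = ["B", "E", "B", "M", "M", "M", "E"]
print(split_by_tags(text, tags))
```

Returning a list rather than inserting punctuation keeps the choice of delimiter up to the caller.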

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
# Load the tokenizer and the token-classification model from the Hugging Face Hub
name = "KoichiYasuoka/roberta-classical-chinese-base-sentence-segmentation"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)
s = "子曰學而時習之不亦説乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
# Predict one label per character; [1:-1] drops the special tokens at both ends
with torch.no_grad():
    logits = model(tokenizer.encode(s, return_tensors="pt"))["logits"]
p = [model.config.id2label[q] for q in torch.argmax(logits, dim=2)[0].tolist()[1:-1]]
# Insert "。" after every character labeled "E" or "S"
print("".join(c + "。" if q == "E" or q == "S" else c for c, q in zip(s, p)))

Reference

Koichi Yasuoka: Sentence Segmentation of Classical Chinese Texts Using Transformers and BERT/RoBERTa Models, IPSJ Symposium Series, Vol.2021, No.1 (December 2021), pp.104-109.
