Model description

  • This model was pretrained on the Chinese (ZH), Japanese (JA), and Korean (KO) Wikipedia corpora for 5 epochs.

How to use

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
  • You don't need to apply any text segmentation before fine-tuning on downstream tasks.
  • (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning; the tools we used are listed below.)
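For a quick sanity check of the masked-LM head, the model can also be loaded through the fill-mask pipeline. This is a minimal sketch; the Japanese example sentence and the use of the standard BERT [MASK] token are our illustration, not from the original card:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="conan1024hao/cjkbert-small",
)

# Predict the masked character (standard BERT mask token assumed).
print(fill_mask("東京は日本の[MASK]都です。"))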

Morphological analysis tools

  • ZH: LTP
  • JA: Juman++
  • KO: KoNLPy (Kkma class)
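As an illustration of the optional pre-segmentation step, here is a minimal sketch for Korean using KoNLPy's Kkma class. Inserting spaces at morpheme boundaries is our assumed way of applying the analysis (the card does not specify), the example sentence is made up, and KoNLPy requires a Java runtime:

from konlpy.tag import Kkma

kkma = Kkma()
text = "한국어 형태소 분석의 간단한 예시입니다."
# Assumption: apply the analysis by inserting spaces at morpheme boundaries.
print(" ".join(kkma.morphs(text)))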

Tokenization

  • We use character-based tokenization with a whole-word-masking strategy.
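As a quick illustration (the Chinese example sentence is ours, not from the original card), tokenizing a CJK string with this tokenizer should yield one token per character:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
# Character-based tokenization: each CJK character maps to its own token.
print(tokenizer.tokenize("我爱自然语言处理"))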

Model size

  • vocab_size: 15015
  • num_hidden_layers: 4
  • hidden_size: 512
  • num_attention_heads: 8
  • param_num: 25M
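These values can be double-checked against the published checkpoint, for example as in this sketch (num_parameters() counts all weights, so the total may differ slightly from the rounded 25M):

from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("conan1024hao/cjkbert-small")
print(config.vocab_size, config.num_hidden_layers,
      config.hidden_size, config.num_attention_heads)

model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
print(f"{model.num_parameters():,} parameters")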