
Model description

  • This model was trained on the Chinese (ZH), Japanese (JA), and Korean (KO) Wikipedia corpora for 5 epochs.

How to use

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
  • You don't need to perform any text segmentation before fine-tuning on downstream tasks.
  • (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning.)
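
For a quick sanity check of the masked-language-modeling head, you can use the fill-mask pipeline. The snippet below is a minimal sketch; the example sentence is our own illustration, not from the original card.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="conan1024hao/cjkbert-small")
# With character-based tokenization, the mask token stands in for a
# single character (example sentence is illustrative only).
masked = f"日本の首都は{fill_mask.tokenizer.mask_token}京です。"
print(fill_mask(masked))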

Morphological analysis tools

  • ZH: LTP
  • JA: Juman++
  • KO: KoNLPy (Kkma class)
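
If you choose to pre-segment your data, a minimal sketch for Korean with KoNLPy's Kkma class (the tool named above) might look like this; the sample sentence and the space-joining convention are our assumptions.

from konlpy.tag import Kkma

kkma = Kkma()
# Split a raw sentence into morphemes and re-join them with spaces
# (sample text and joining convention are illustrative assumptions).
text = "한국어 형태소 분석 예시입니다."
print(" ".join(kkma.morphs(text)))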

Tokenization

  • We use character-based tokenization with a whole-word-masking strategy.
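
To see the character-level splitting, you can tokenize a short string; this is a sketch assuming the tokenizer behaves as described (the sample string is ours).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
# Each CJK character should come out as its own token.
print(tokenizer.tokenize("東京大学"))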

Model size

  • vocab_size: 15015
  • num_hidden_layers: 4
  • hidden_size: 512
  • num_attention_heads: 8
  • param_num: 25M
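
The 25M parameter figure can be checked from the loaded model; a quick sketch:

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
# Sum the sizes of all parameter tensors; this should be roughly 25M.
print(sum(p.numel() for p in model.parameters()))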