Model description

  • This model was pretrained on the Chinese (ZH), Japanese (JA), and Korean (KO) Wikipedia corpora for 5 epochs.

How to use

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
  • You don't need to apply any text segmentation before fine-tuning on downstream tasks.
  • (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning; the tools we used are listed below.)
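For a quick sanity check of the masked-LM head, the model can also be loaded through the fill-mask pipeline. This is a minimal sketch; the Japanese example sentence and the use of the standard BERT [MASK] token are our illustration, not from the original card:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="conan1024hao/cjkbert-small",
)

# Predict the masked character (standard BERT mask token assumed).
print(fill_mask("東京は日本の[MASK]都です。"))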

Morphological analysis tools

  • ZH: LTP
  • JA: Juman++
  • KO: KoNLPy (Kkma class)
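As an illustration of the optional pre-segmentation step, here is a minimal sketch for Korean using KoNLPy's Kkma class. Inserting spaces at morpheme boundaries is our assumed way of applying the analysis (the card does not specify), the example sentence is made up, and KoNLPy requires a Java runtime:

from konlpy.tag import Kkma

kkma = Kkma()
text = "한국어 형태소 분석의 간단한 예시입니다."
# Assumption: apply the analysis by inserting spaces at morpheme boundaries.
print(" ".join(kkma.morphs(text)))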

Tokenization

  • We use character-based tokenization with a whole-word-masking strategy.
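As a quick illustration (the Chinese example sentence is ours, not from the original card), tokenizing a CJK string with this tokenizer should yield one token per character:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
# Character-based tokenization: each CJK character maps to its own token.
print(tokenizer.tokenize("我爱自然语言处理"))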

Model size

  • vocab_size: 15015
  • num_hidden_layers: 4
  • hidden_size: 512
  • num_attention_heads: 8
  • param_num: 25M
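These values can be double-checked against the published checkpoint, for example as in this sketch (num_parameters() counts all weights, so the total may differ slightly from the rounded 25M):

from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("conan1024hao/cjkbert-small")
print(config.vocab_size, config.num_hidden_layers,
      config.hidden_size, config.num_attention_heads)

model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
print(f"{model.num_parameters():,} parameters")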