atsuki-yamaguchi's picture
Update README.md
e2eb659
metadata
license: cc-by-nc-sa-4.0
datasets:
  - wikipedia
  - cc100
language:
  - ja
library_name: transformers
pipeline_tag: fill-mask

Japanese BERT-base (Nothing + Unigram)

How to load the tokenizer

Please download the dictionary file for Nothing + Unigram from our GitHub repository. Then you can load the tokenizer by specifying the path of the dictionary file to dict_path.

from typing import Optional

from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast

# load a tokenizer
dict_path = /path/to/nothing_unigram.json
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)
# Test
test_str = "γ“γ‚“γ«γ‘γ―γ€‚η§γ―ε½’ζ…‹η΄ θ§£ζžε™¨γ«γ€γ„γ¦η ”η©Άγ‚’γ—γ¦γ„γΎγ™γ€‚"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こん','に','け','は','。','私','は','ε½’ζ…‹','η΄ ','解析','器','に぀いて','η ”η©Ά','をして','います','。','[SEP]']

How to load the model

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-unigram")

See our repository for more details!