---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

Japanese BERT-base (Nothing + WordPiece)
===

## How to load the tokenizer

Please download the dictionary file for Nothing + WordPiece from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/nothing_wordpiece.json).
Then you can load the tokenizer by passing the path of the downloaded dictionary file as `dict_path`.

```python
from tokenizers import Tokenizer
from tokenizers.processors import BertProcessing
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the downloaded dictionary file.
dict_path = "/path/to/nothing_wordpiece.json"
tokenizer = Tokenizer.from_file(dict_path)

# Add [CLS]/[SEP] handling for BERT-style inputs.
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# Convert to PreTrainedTokenizerFast for use with transformers.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)
```

```python
# Test
test_str = "こんにちは。私は形態素解析器について研究をしています。"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こ','##ん','##に','##ち','##は','##。','##私','##は','##形','##態','##素','##解','##析','##器','##に','##つ','##い','##て','##研','##究','##を','##し','##て','##い','##ま','##す','##。','[SEP]']
```

## How to load the model

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-wordpiece")
```

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**
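
## Example: masked-token prediction

Since the model is a masked language model (`pipeline_tag: fill-mask`), the snippet below is a minimal sketch (not from the original card) of predicting a masked token with the `tokenizer` and `model` objects loaded above. The example sentence is only illustrative.

```python
import torch

# NOTE: assumes `tokenizer` and `model` were created as shown in the sections above.
text = "こんにちは。私は[MASK]について研究をしています。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position(s) and take the highest-scoring token id for each.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```

Because this tokenizer applies no pre-tokenization, each predicted token is a single character (or a `##`-prefixed continuation piece).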