---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

Japanese BERT-base (Juman++ + BPE)
===

## How to load the tokenizer
First, download the dictionary file for Juman++ + BPE from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/jumanpp_bpe.json). You can then load the tokenizer by setting `dict_path` to the path of the downloaded dictionary file.

```python
import traceback

from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast
from pyknp import Juman
import mojimoji
import textspan


class JumanPreTokenizer:
    def __init__(self):
        self.juman = Juman("jumanpp", multithreading=True)

    def tokenize(self, sequence: str) -> list[str]:
        # Juman++ expects full-width characters, so normalize first.
        text = mojimoji.han_to_zen(sequence).rstrip()
        try:
            result = self.juman.analysis(text)
        except Exception:
            traceback.print_exc()
            text = ""
            result = self.juman.analysis(text)
        return [mrph.midasi for mrph in result.mrph_list()]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        # Map the morphemes back to character spans in the original string.
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


# load a pre-tokenizer
pre_tokenizer = JumanPreTokenizer()

# load a tokenizer
dict_path = "/path/to/jumanpp_bpe.json"
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)

# set a pre-tokenizer
tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)
```

```python
# Test
test_str = "こんにちは。私は形態素解析器について研究をしています。"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]', 'こ', 'んに', 'ち', 'は', '。', '私', 'は', '形態', '素', '解析', '器', 'に', 'ついて', '研究', 'を', 'して', 'い', 'ます', '。', '[SEP]']
```

## How to load the model

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_jumanpp-bpe")
```

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**
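
## Example: masked token prediction
As a quick end-to-end check, below is a minimal sketch of masked token prediction using the tokenizer and model loaded above. It is not part of the original card: the sentence, the choice of masked token, and the top-5 readout are illustrative, and it assumes `torch` is installed and that the test sentence segments as shown in the test snippet.

```python
import torch

# Illustrative sketch (not from the original card): mask one token of the
# test sentence by id and ask the model for its top-5 replacements.
# Assumes `tokenizer` and `model` were built as in the snippets above.
text = "こんにちは。私は形態素解析器について研究をしています。"
inputs = tokenizer(text, return_tensors="pt")

# Masking by token id sidesteps the Juman++ pre-tokenizer, which would
# otherwise normalize and split a literal "[MASK]" string in the input.
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
mask_position = tokens.index("研究")  # assumes the segmentation shown above
inputs["input_ids"][0, mask_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Read out the five most likely tokens at the masked position.
top5 = logits[0, mask_position].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))
```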