---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

Japanese BERT-base (Sudachi + WordPiece)
===

## How to load the tokenizer

Please download the dictionary file for Sudachi + WordPiece from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/sudachi_wordpiece.json). You can then load the tokenizer by setting `dict_path` to the path of the downloaded dictionary file.

```python
from typing import Optional

import textspan
from sudachipy import dictionary
from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer
from tokenizers.processors import BertProcessing
from transformers import PreTrainedTokenizerFast


class SudachiPreTokenizer:
    def __init__(self, mecab_dict_path: Optional[str] = None):
        # Sudachi morphological analyser used for pre-tokenization
        self.sudachi = dictionary.Dictionary().create()

    def tokenize(self, sequence: str) -> list[str]:
        return [token.surface() for token in self.sudachi.tokenize(sequence)]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        # map each Sudachi token back to its character span in the original text
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


# load a pre-tokenizer
pre_tokenizer = SudachiPreTokenizer()

# load a tokenizer
dict_path = "/path/to/sudachi_wordpiece.json"
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id("[CLS]")),
    sep=("[SEP]", tokenizer.token_to_id("[SEP]"))
)

# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]"
)

# set a pre-tokenizer
tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)
```

```python
# Test
test_str = "こんにちは。私は形態素解析器について研究をしています。"

tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こ','##ん','##に','##ち','##は','。','私','は','形態','##素','解析','器','に','つい','て','研究','を','し','て','い','ます','。','[SEP]']
```

## How to load the model

```python
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_sudachi-wordpiece")
```

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**
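
## Example: masked-token prediction

The following is a minimal sketch of how the tokenizer and model loaded above can be used together for fill-mask inference. The example sentence, the `[MASK]` placement, and the top-5 cutoff are illustrative assumptions, not part of the original card; it assumes `[MASK]` is registered as a special token and therefore passes through the custom Sudachi pre-tokenizer intact.

```python
import torch

# illustrative example sentence (assumption): predict the token hidden by [MASK]
masked_text = "私は[MASK]について研究をしています。"

inputs = tokenizer(masked_text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and take the 5 highest-scoring vocabulary ids
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```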