---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

Japanese BERT-base (Nothing + WordPiece)
===

## How to load the tokenizer
Download the dictionary file for Nothing + WordPiece from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/nothing_wordpiece.json).
Then load the tokenizer by setting `dict_path` to the path of the downloaded file.

```python
from tokenizers import Tokenizer
from tokenizers.processors import BertProcessing
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the downloaded dictionary file
dict_path = "/path/to/nothing_wordpiece.json"
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)
```

```python
# Test
test_str = "こんにちは。私は形態素解析器について研究をしています。"
print(tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids))
# -> ['[CLS]','こ','##ん','##に','##ち','##は','##。','##私','##は','##形','##態','##素','##解','##析','##器','##に','##つ','##い','##て','##研','##究','##を','##し','##て','##い','##ま','##す','##。','[SEP]']
```
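
In the test output above, only the first token lacks the `##` prefix, which is consistent with the whole sentence being handled as a single word when no pre-tokenizer is applied. The following is a minimal, standard-library-only sketch of greedy longest-match WordPiece splitting that reproduces this pattern. The `wordpiece` helper and the tiny `toy_vocab` are hypothetical and exist only for illustration; they are not the real dictionary or the actual implementation inside `tokenizers`.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of `word` into WordPiece tokens.

    Illustrative helper only (an assumption, not the library's code).
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one is in nothing_wordpiece.json.
toy_vocab = {"こ", "##ん", "##に", "##ち", "##は", "##。"}

# Without pre-tokenization, the entire sentence is passed as one "word".
tokens = ["[CLS]"] + wordpiece("こんにちは。", toy_vocab) + ["[SEP]"]
print(tokens)
# -> ['[CLS]', 'こ', '##ん', '##に', '##ち', '##は', '##。', '[SEP]']
```

Because the sentence is never split beforehand, every piece after the first is a continuation piece, which matches the shape of the tokenizer output shown above.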

## How to load the model
```python
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-wordpiece")
```


**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**