---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

Japanese BERT-base (Nothing + Unigram)
===

## How to load the tokenizer

Download the dictionary file for Nothing + Unigram from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/nothing_unigram.json).
You can then load the tokenizer by setting `dict_path` to the path of the downloaded file.
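
If you would rather fetch the dictionary from a script than from the browser, the following is a minimal sketch using only the standard library. It assumes GitHub's usual mapping from a `blob` page URL to a `raw.githubusercontent.com` URL. The helper names `to_raw_url` and `download_dict` and the default local file name are illustrative and not part of the repository.

```python
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Link to the dictionary file as given above
BLOB_URL = (
    "https://github.com/hitachi-nlp/compare-ja-tokenizer"
    "/blob/public/data/dict/nothing_unigram.json"
)


def to_raw_url(blob_url: str) -> str:
    """Map a github.com '/blob/' page URL to its raw-content URL (assumed convention)."""
    parts = urlparse(blob_url).path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / ref / path...
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _, ref, *file_path = parts
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, ref, *file_path])


def download_dict(dest: str = "nothing_unigram.json") -> str:
    """Download the dictionary file to `dest` and return the local path."""
    urlretrieve(to_raw_url(BLOB_URL), dest)
    return dest
```

Calling `download_dict()` requires network access; the returned path can be used as `dict_path`.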

```python
from tokenizers import Tokenizer
from tokenizers.processors import BertProcessing
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the downloaded dictionary file
dict_path = "/path/to/nothing_unigram.json"
tokenizer = Tokenizer.from_file(dict_path)
# Wrap every encoded sequence as [CLS] ... [SEP]
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id("[CLS]")),
    sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
)

# Convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
```

```python
# Test
test_str = "こんにちは。私は形態素解析器について研究をしています。"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こん','に','ち','は','。','私','は','形態','素','解析','器','について','研究','をして','います','。','[SEP]']
```

## How to load the model

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-unigram")
```
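
Since this is a fill-mask model, a short usage sketch may help. The helpers below are illustrative and not part of the repository: `predict_mask` assumes that `model` and `tokenizer` were loaded as shown above, that PyTorch is installed, and that a literal `[MASK]` in the input text is mapped to the tokenizer's mask token id.

```python
from typing import List


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def predict_mask(model, tokenizer, text: str, k: int = 5) -> List[str]:
    """Return the k most likely tokens for the first [MASK] in `text`."""
    import torch  # imported lazily; only needed when the model is run

    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"][0].tolist()
    # Raises ValueError if the text contains no mask token
    mask_pos = input_ids.index(tokenizer.mask_token_id)

    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    scores = logits[0, mask_pos].tolist()
    return tokenizer.convert_ids_to_tokens(top_k_indices(scores, k))


# Example call (hypothetical input sentence):
# predict_mask(model, tokenizer, "私は[MASK]について研究をしています。")
```

The function returns token strings only; apply a softmax to the logits at the mask position if you also need probabilities.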

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**