---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

BERT-base (Nothing + WordPiece)
===

## How to load the tokenizer
Please download the dictionary file for Nothing + WordPiece from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/nothing_wordpiece.json).
You can then load the tokenizer by setting `dict_path` to the path of the downloaded dictionary file.

```python
from tokenizers import Tokenizer
from tokenizers.processors import BertProcessing
from transformers import PreTrainedTokenizerFast

# Load the trained tokenizer from the dictionary file
dict_path = "/path/to/nothing_wordpiece.json"
tokenizer = Tokenizer.from_file(dict_path)

# Append [CLS]/[SEP] special tokens to each encoded sequence
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# Wrap it as a PreTrainedTokenizerFast for use with transformers
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)
```

```python
# Quick test: tokenize a Japanese sentence
# ("Hello. I am doing research on morphological analyzers.")
test_str = "γ“γ‚“γ«γ‘γ―γ€‚η§γ―ε½’ζ…‹η΄ θ§£ζžε™¨γ«γ€γ„γ¦η ”η©Άγ‚’γ—γ¦γ„γΎγ™γ€‚"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こ','##γ‚“','##に','##け','##は','##。','##私','##は','##ε½’','##ζ…‹','##η΄ ','##解','##析','##器','##に','##぀','##い','##て','##η ”','##η©Ά','##γ‚’','##し','##て','##い','##ま','##す','##。','[SEP]']
```

## How to load the model
```python
from transformers import AutoModelForMaskedLM

# Load the pretrained BERT-base masked language model from the Hugging Face Hub
model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-wordpiece")
```

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**