---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

BERT-base (Juman++ + BPE)
===

## How to load the tokenizer
Please download the dictionary file for Juman++ + BPE from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/jumanpp_bpe.json).
You can then load the tokenizer by passing the path of the dictionary file as `dict_path`. The snippet below assumes that the `pyknp`, `mojimoji`, and `textspan` packages are installed and that the `jumanpp` binary is available on your `PATH`.

```python
import traceback

from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast

from pyknp import Juman
import mojimoji
import textspan


class JumanPreTokenizer:
    def __init__(self):
        self.juman = Juman("jumanpp", multithreading=True)

    def tokenize(self, sequence: str) -> list[str]:
        # Juman++ expects full-width characters, so normalize first.
        text = mojimoji.han_to_zen(sequence).rstrip()
        try:
            result = self.juman.analysis(text)
        except Exception:
            traceback.print_exc()
            text = ""
            result = self.juman.analysis(text)
        return [mrph.midasi for mrph in result.mrph_list()]

    def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
        text = str(normalized_string)
        tokens = self.tokenize(text)
        # Map the morphemes back onto character spans of the original string.
        tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.custom_split)


# load a pre-tokenizer
pre_tokenizer = JumanPreTokenizer()

# load a tokenizer
dict_path = "/path/to/jumanpp_bpe.json"
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)

# set a pre-tokenizer
tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)
```

```python
# Test
test_str = "γ“γ‚“γ«γ‘γ―γ€‚η§γ―ε½’ζ…‹η΄ θ§£ζžε™¨γ«γ€γ„γ¦η ”η©Άγ‚’γ—γ¦γ„γΎγ™γ€‚"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こ','んに','け','は','。','私','は','ε½’ζ…‹','η΄ ','解析','器','に','぀いて','η ”η©Ά','γ‚’','して','い','ます','。','[SEP]']
```

## How to load the model
```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_jumanpp-bpe")
```
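
As a quick end-to-end check, the snippet below is a minimal fill-mask sketch (not part of the original card): it combines the `tokenizer` and `model` built above, masks one token of the test sentence, and predicts it. The masked position (index 8, "ε½’ζ…‹") is specific to this sentence and this tokenizer.

```python
import torch

# Illustrative only: assumes `tokenizer` and `model` from the snippets above.
text = "γ“γ‚“γ«γ‘γ―γ€‚η§γ―ε½’ζ…‹η΄ θ§£ζžε™¨γ«γ€γ„γ¦η ”η©Άγ‚’γ—γ¦γ„γΎγ™γ€‚"
inputs = tokenizer(text, return_tensors="pt")

# Replace the token at position 8 ("ε½’ζ…‹" in the tokenization shown above) with [MASK].
masked_ids = inputs.input_ids.clone()
masked_ids[0, 8] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked_ids, attention_mask=inputs.attention_mask).logits

# The model should recover the original token with high probability.
predicted_id = int(logits[0, 8].argmax(dim=-1))
print(tokenizer.convert_ids_to_tokens(predicted_id))
```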

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**