add-custom-tokenizer
I have implemented DebertaV2JumanppTokenizer (and its Fast counterpart). This tokenizer pre-segments text with Juman++ inside the Transformers tokenizer, which provides the following advantages:
- Raw text can be tokenized directly, without running Juman++ as a separate preprocessing step.
- Functions of the Fast Tokenizer, such as offset_mapping, can be used more easily.
- The Transformers pipeline can be used more easily. This is particularly effective for the TokenClassificationPipeline.
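The two-stage flow behind these advantages can be sketched in plain Python. This is an illustrative stand-in only: presegment below is a whitespace splitter standing in for Juman++ (the real tokenizer invokes the Juman++ morphological analyzer), and the offset tracking mirrors what a Fast Tokenizer's offset_mapping provides.

```python
# Hedged sketch of the tokenizer's two-stage flow:
# (1) pre-segment raw text into words, (2) walk the words while
# tracking character offsets into the original string, which is
# what offset_mapping exposes in a Fast Tokenizer.
# NOTE: presegment is a whitespace stand-in, NOT real Juman++.

def presegment(text: str) -> list[str]:
    """Stand-in for Juman++ morphological segmentation."""
    return text.split()

def tokenize_with_offsets(text: str) -> list[tuple[str, tuple[int, int]]]:
    """Return (token, (start, end)) pairs over the original text."""
    tokens = []
    cursor = 0
    for word in presegment(text):
        start = text.index(word, cursor)  # locate word in the raw text
        end = start + len(word)
        tokens.append((word, (start, end)))
        cursor = end
    return tokens

print(tokenize_with_offsets("I love tokenizers"))
# → [('I', (0, 1)), ('love', (2, 6)), ('tokenizers', (7, 17))]
```

Because the offsets index into the original text, a TokenClassificationPipeline built on such a tokenizer can map predicted labels back to character spans, which is why the Fast variant is especially useful there.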
Additionally, and unrelated to this PR: the Slow Tokenizer and the Fast Tokenizer may produce different tokenization results. It might be better to deprecate the Slow Tokenizer.
The automatic tokenization by Juman++ is very useful, but, on the other hand, it does not preserve backward compatibility for ku-nlp/deberta-v2-base-japanese.
It would be better to provide separate models, as was done for nlp-waseda/roberta-large-japanese-seq512 and nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp.
I have created ku-nlp/deberta-v2-base-japanese-with-auto-jumanpp, a model reflecting this PR, so I'm closing this PR.
Please use the new model for pre-segmentation using Juman++.
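A minimal sketch of loading the new model with the standard Auto classes follows. Assumptions: the environment has transformers and the Juman++ dependencies installed, the model is fetched from the Hugging Face Hub on first use, and (since the custom tokenizer class was not merged into Transformers) the model may ship its tokenizer as remote code, hence the hedged trust_remote_code flag.

```python
# Hedged usage sketch, not a definitive recipe: requires transformers,
# a working Juman++ installation, and network access to the Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ku-nlp/deberta-v2-base-japanese-with-auto-jumanpp",
    trust_remote_code=True,  # assumption: tokenizer may live in the repo
)

# Raw Japanese text can be passed directly; Juman++ pre-segmentation
# happens inside the tokenizer.
enc = tokenizer("外国人参政権", return_offsets_mapping=True)
print(enc["input_ids"])
print(enc["offset_mapping"])
```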