I have implemented DebertaV2JumanppTokenizer(Fast). This tokenizer performs text pre-segmentation with Juman++ inside the Transformers tokenizer itself, which provides the following advantages:

  • Raw text can be tokenized directly, with no manual Juman++ pre-segmentation step.
  • Fast Tokenizer features such as offset_mapping become available.
  • The Transformers pipeline API works out of the box; this is particularly useful for TokenClassificationPipeline (see the sketch after this list).
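
A minimal sketch of what this enables, assuming the PR's tokenizer is what AutoTokenizer returns for the repository below (the repository name and example text are illustrative, and Juman++ must be installed for the pre-segmentation step):

```python
# Minimal sketch, assuming the tokenizer from this PR loads via
# AutoTokenizer for this repository (illustrative name) and that
# Juman++ is installed for pre-segmentation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/deberta-v2-base-japanese")

# Raw text goes in directly; Juman++ pre-segmentation happens inside.
encoding = tokenizer("外国人参政権", return_offsets_mapping=True)

# offset_mapping ties each token to a character span in the raw text,
# which is exactly what TokenClassificationPipeline needs to map
# predicted labels back onto the original string.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
for token, span in zip(tokens, encoding["offset_mapping"]):
    print(token, span)
```

Note that return_offsets_mapping is only supported by Fast tokenizers, which is why exposing a Fast variant matters here.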

Additionally, and unrelated to this PR, the Slow Tokenizer and the Fast Tokenizer can produce different tokenization results for the same input. It might be better to deprecate the Slow Tokenizer.
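
One way to check for such discrepancies, as a sketch (the repository name and sample sentences are assumptions for illustration):

```python
# Sketch for spotting slow/fast tokenization divergence; the repo
# name and sample sentences are assumptions for illustration.
from transformers import AutoTokenizer

repo = "ku-nlp/deberta-v2-base-japanese"
slow = AutoTokenizer.from_pretrained(repo, use_fast=False)
fast = AutoTokenizer.from_pretrained(repo, use_fast=True)

for text in ["外国人参政権", "京都大学で自然言語処理を学ぶ"]:
    s, f = slow.tokenize(text), fast.tokenize(text)
    if s != f:
        print(f"mismatch on {text!r}")
        print("  slow:", s)
        print("  fast:", f)
```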

tealgreen0503 changed pull request status to open
Language Media Processing Lab at Kyoto University org

The automatic tokenization with Juman++ is very useful, but on the other hand it breaks backward compatibility for ku-nlp/deberta-v2-base-japanese.
It would be better to provide separate models, as was done with nlp-waseda/roberta-large-japanese-seq512 and nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp.

Language Media Processing Lab at Kyoto University org

I have created the model ku-nlp/deberta-v2-base-japanese-with-auto-jumanpp reflecting this PR, so I'm closing it.
Please use the new model for pre-segmentation with Juman++ (a usage sketch follows).
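
A minimal usage sketch for the new model, assuming Juman++ is installed locally and that the tokenizer loads through AutoTokenizer (the example sentence is illustrative):

```python
# Minimal sketch for the replacement model. Assumes Juman++ is
# installed on the system; the tokenizer runs it automatically.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ku-nlp/deberta-v2-base-japanese-with-auto-jumanpp"
)

# No manual pre-segmentation step: raw text is accepted directly.
print(tokenizer.tokenize("外国人参政権"))
```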

nobu-g changed pull request status to closed
