I have implemented DebertaV2JumanppTokenizer(Fast). This tokenizer performs text pre-segmentation with Juman++ inside the Transformers tokenizer itself, which provides the following advantages:

  • Raw text can be tokenized directly, with no manual Juman++ pre-segmentation step.
  • Fast Tokenizer features such as offset_mapping become available.
  • The Transformers pipeline API works out of the box; this is particularly useful for TokenClassificationPipeline (see the sketch after this list).
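
A minimal sketch of what this enables, assuming the PR's tokenizer is what AutoTokenizer returns for the repository below (the repository name and example text are illustrative, and Juman++ must be installed for the pre-segmentation step):

```python
# Minimal sketch, assuming the tokenizer from this PR loads via
# AutoTokenizer for this repository (illustrative name) and that
# Juman++ is installed for pre-segmentation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/deberta-v2-base-japanese")

# Raw text goes in directly; Juman++ pre-segmentation happens inside.
encoding = tokenizer("外国人参政権", return_offsets_mapping=True)

# offset_mapping ties each token to a character span in the raw text,
# which is exactly what TokenClassificationPipeline needs to map
# predicted labels back onto the original string.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
for token, span in zip(tokens, encoding["offset_mapping"]):
    print(token, span)
```

Note that return_offsets_mapping is only supported by Fast tokenizers, which is why exposing a Fast variant matters here.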

Additionally, and unrelated to this PR, the Slow Tokenizer and the Fast Tokenizer can produce different tokenization results for the same input. It might be better to deprecate the Slow Tokenizer.
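
One way to check for such discrepancies, as a sketch (the repository name and sample sentences are assumptions for illustration):

```python
# Sketch for spotting slow/fast tokenization divergence; the repo
# name and sample sentences are assumptions for illustration.
from transformers import AutoTokenizer

repo = "ku-nlp/deberta-v2-base-japanese"
slow = AutoTokenizer.from_pretrained(repo, use_fast=False)
fast = AutoTokenizer.from_pretrained(repo, use_fast=True)

for text in ["外国人参政権", "京都大学で自然言語処理を学ぶ"]:
    s, f = slow.tokenize(text), fast.tokenize(text)
    if s != f:
        print(f"mismatch on {text!r}")
        print("  slow:", s)
        print("  fast:", f)
```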

tealgreen0503 changed pull request status to open
Language Media Processing Lab at Kyoto University org

The automatic tokenization with Juman++ is very useful, but on the other hand it breaks backward compatibility for ku-nlp/deberta-v2-base-japanese.
It would be better to provide separate models, as was done with nlp-waseda/roberta-large-japanese-seq512 and nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp.

Language Media Processing Lab at Kyoto University org

I have created the model ku-nlp/deberta-v2-base-japanese-with-auto-jumanpp reflecting this PR, so I'm closing it.
Please use the new model for pre-segmentation with Juman++ (a usage sketch follows).
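
A minimal usage sketch for the new model, assuming Juman++ is installed locally and that the tokenizer loads through AutoTokenizer (the example sentence is illustrative):

```python
# Minimal sketch for the replacement model. Assumes Juman++ is
# installed on the system; the tokenizer runs it automatically.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ku-nlp/deberta-v2-base-japanese-with-auto-jumanpp"
)

# No manual pre-segmentation step: raw text is accepted directly.
print(tokenizer.tokenize("外国人参政権"))
```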

nobu-g changed pull request status to closed
