vocab.txt may be broken

#1
by KoichiYasuoka - opened

Duplicated lines are observed in vocab.txt, and no "##" prefixes are included. I suspect vocab.txt is broken; it can be fixed with:

# Load the sentencepiece model shipped with the checkpoint
from transformers import AlbertTokenizer
from transformers.utils import cached_file
tkz=AlbertTokenizer(cached_file("nlp-waseda/roberta-base-japanese-with-auto-jumanpp","spiece.model"))
# Regenerate vocab.txt: continuation pieces get a "##" prefix, word-initial pieces lose the
# sentencepiece word-boundary marker "\u2581", and special tokens like "[MASK]" or "<s>" stay unprefixed
with open("vocab.txt","w",encoding="utf-8") as w:
  print("\n".join(("##"+x).replace("##\u2581","").replace("##[","[").replace("##<","<") for x in tkz.convert_ids_to_tokens(list(range(len(tkz))))),file=w)
Kawahara Lab at Waseda University org

@KoichiYasuoka Thank you, Prof. Yasuoka, this is very helpful to us. We also found that with WordPiece, this model with auto jumanpp cannot perform as well as the old model on the JGLUE tasks, so we will add sentencepiece support to BertJapaneseTokenizer as soon as possible.

Thank you @conan1024hao for fixing vocab.txt. I've just confirmed that the script below works well on Google Colaboratory:

# Build and install Juman++ v2.0.0-rc3 if it is not installed yet
!test -d jumanpp-2.0.0-rc3 || curl -L https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz | tar xJf -
!test -x /usr/local/bin/jumanpp || ( mkdir jumanpp-2.0.0-rc3/build && cd jumanpp-2.0.0-rc3/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install )
!pip install transformers pyknp
from transformers import pipeline
# Fill-mask pipeline with the WordPiece-based tokenizer of the checkpoint
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))

Thank you again, and I'm looking forward to the subword_tokenizer with sentencepiece.

KoichiYasuoka changed discussion status to closed

@dkawahara san, I've also confirmed that the tentative script below works well:

# Build and install Juman++ v2.0.0-rc3 if it is not installed yet
!test -d jumanpp-2.0.0-rc3 || curl -L https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz | tar xJf -
!test -x /usr/local/bin/jumanpp || ( mkdir jumanpp-2.0.0-rc3/build && cd jumanpp-2.0.0-rc3/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install )
!pip install transformers pyknp sentencepiece
from transformers import pipeline,AlbertTokenizer
from transformers.utils import cached_file
# Load the sentencepiece model shipped with the checkpoint
spm=AlbertTokenizer(cached_file("nlp-waseda/roberta-base-japanese-with-auto-jumanpp","spiece.model"),keep_accents=True,do_lower_case=False)
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
# Replace the WordPiece subword tokenizer with a sentencepiece-based one,
# mapping each piece back to the "##"-style entries of vocab.txt
fmp.tokenizer.subword_tokenizer.tokenize=lambda x:[("##"+t).replace("##[","[").replace("##<","<").replace("##\u2581","") for t in spm.tokenize(x)]
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))

to use sentencepiece as the subword_tokenizer. But it seems too tentative and needs to be brushed up...
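As one possible brush-up of the same idea (a sketch only, reusing the spm tokenizer from the script above), the lambda can be unpacked into a named function so the piece-to-WordPiece mapping is easier to follow:

def sentencepiece_subword_tokenize(text):
  # Re-tokenize with sentencepiece, then map each piece to the "##"-style vocab.txt entries:
  # word-initial pieces lose the "\u2581" marker, continuation pieces keep the "##" prefix,
  # and special tokens such as "[MASK]" or "<s>" stay unprefixed
  return [("##"+t).replace("##[","[").replace("##<","<").replace("##\u2581","") for t in spm.tokenize(text)]

fmp.tokenizer.subword_tokenizer.tokenize=sentencepiece_subword_tokenize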

Kawahara Lab at Waseda University org

@KoichiYasuoka Hi Prof. Yasuoka, we have merged this PR: https://github.com/huggingface/transformers/pull/19769. Now sentencepiece can be used in BertJapaneseTokenizer; please give it a try.
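For readers arriving later, here is a minimal sketch of trying the merged support. It assumes the model repository's tokenizer_config.json has been updated to select the sentencepiece subword tokenizer (an assumption, not verified here), and Juman++ and pyknp from the earlier scripts are still required for word segmentation:

# Rely on the repo's tokenizer_config.json to pick the sentencepiece subword tokenizer (assumption)
from transformers import BertJapaneseTokenizer,pipeline
tokenizer=BertJapaneseTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(tokenizer.tokenize("国境の長いトンネルを抜けると雪国であった。"))
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))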
