vocab.txt may be broken

#1
by KoichiYasuoka - opened

Duplicated lines are observed in vocab.txt, and no "##" prefixes are included. I suspect vocab.txt is broken; it can be fixed with:

# Load the sentencepiece model shipped with the checkpoint
from transformers import AlbertTokenizer
from transformers.utils import cached_file
tkz=AlbertTokenizer(cached_file("nlp-waseda/roberta-base-japanese-with-auto-jumanpp","spiece.model"))
# Regenerate vocab.txt: continuation pieces get a "##" prefix, word-initial pieces lose the
# sentencepiece word-boundary marker "\u2581", and special tokens like "[MASK]" or "<s>" stay unprefixed
with open("vocab.txt","w",encoding="utf-8") as w:
  print("\n".join(("##"+x).replace("##\u2581","").replace("##[","[").replace("##<","<") for x in tkz.convert_ids_to_tokens(list(range(len(tkz))))),file=w)
Kawahara Lab at Waseda University org

@KoichiYasuoka Thank you, Prof. Yasuoka, this is very helpful to us. We also found that with WordPiece, this model with auto jumanpp cannot perform as well as the old model on the JGLUE tasks, so we will add sentencepiece support to BertJapaneseTokenizer as soon as possible.

Thank you @conan1024hao for fixing vocab.txt. I've just confirmed that the script below works well on Google Colaboratory:

# Build and install Juman++ v2.0.0-rc3 if it is not installed yet
!test -d jumanpp-2.0.0-rc3 || curl -L https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz | tar xJf -
!test -x /usr/local/bin/jumanpp || ( mkdir jumanpp-2.0.0-rc3/build && cd jumanpp-2.0.0-rc3/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install )
!pip install transformers pyknp
from transformers import pipeline
# Fill-mask pipeline with the WordPiece-based tokenizer of the checkpoint
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))

Thank you again, and I'm looking forward to the subword_tokenizer with sentencepiece.

KoichiYasuoka changed discussion status to closed

@dkawahara san, I've also confirmed that the tentative script below works well:

# Build and install Juman++ v2.0.0-rc3 if it is not installed yet
!test -d jumanpp-2.0.0-rc3 || curl -L https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz | tar xJf -
!test -x /usr/local/bin/jumanpp || ( mkdir jumanpp-2.0.0-rc3/build && cd jumanpp-2.0.0-rc3/build && cmake .. -DCMAKE_BUILD_TYPE=Release && make install )
!pip install transformers pyknp sentencepiece
from transformers import pipeline,AlbertTokenizer
from transformers.utils import cached_file
# Load the sentencepiece model shipped with the checkpoint
spm=AlbertTokenizer(cached_file("nlp-waseda/roberta-base-japanese-with-auto-jumanpp","spiece.model"),keep_accents=True,do_lower_case=False)
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
# Replace the WordPiece subword tokenizer with a sentencepiece-based one,
# mapping each piece back to the "##"-style entries of vocab.txt
fmp.tokenizer.subword_tokenizer.tokenize=lambda x:[("##"+t).replace("##[","[").replace("##<","<").replace("##\u2581","") for t in spm.tokenize(x)]
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))

to use sentencepiece as the subword_tokenizer. But it seems too tentative and needs to be brushed up...
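As one possible brush-up of the same idea (a sketch only, reusing the spm tokenizer from the script above), the lambda can be unpacked into a named function so the piece-to-WordPiece mapping is easier to follow:

def sentencepiece_subword_tokenize(text):
  # Re-tokenize with sentencepiece, then map each piece to the "##"-style vocab.txt entries:
  # word-initial pieces lose the "\u2581" marker, continuation pieces keep the "##" prefix,
  # and special tokens such as "[MASK]" or "<s>" stay unprefixed
  return [("##"+t).replace("##[","[").replace("##<","<").replace("##\u2581","") for t in spm.tokenize(text)]

fmp.tokenizer.subword_tokenizer.tokenize=sentencepiece_subword_tokenize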

Kawahara Lab at Waseda University org

@KoichiYasuoka Hi Prof. Yasuoka, we have merged this PR: https://github.com/huggingface/transformers/pull/19769. Now sentencepiece can be used in BertJapaneseTokenizer; please give it a try.
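For readers arriving later, here is a minimal sketch of trying the merged support. It assumes the model repository's tokenizer_config.json has been updated to select the sentencepiece subword tokenizer (an assumption, not verified here), and Juman++ and pyknp from the earlier scripts are still required for word segmentation:

# Rely on the repo's tokenizer_config.json to pick the sentencepiece subword tokenizer (assumption)
from transformers import BertJapaneseTokenizer,pipeline
tokenizer=BertJapaneseTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(tokenizer.tokenize("国境の長いトンネルを抜けると雪国であった。"))
fmp=pipeline("fill-mask","nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fmp("国境の[MASK]トンネルを抜けると雪国であった。"))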
