roberta-large-japanese-with-auto-jumanpp's tokenizer.batch_encode_plus() hangs during tokenization.
#2 · opened by ogis-uno
Hi, I have an issue with tokenization using "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp".
batch_encode_plus() doesn't return when given relatively long texts.
I'm wondering whether this is the correct place to discuss the issue; if it isn't, please let me know where I should go next.
Here is how to replicate the issue in Google Colab.
Install transformers, sentencepiece, and rhoknp.
!pip install "transformers==4.30.*"
!pip install "sentencepiece==0.1.*"
!pip install "rhoknp==1.3.2"
And jumanpp.
!wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz
!tar xf jumanpp-2.0.0-rc3.tar.xz
!cd jumanpp-2.0.0-rc3 && mkdir bld && cd bld && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j2
Juman++ was installed to /usr/local/bin.
!which jumanpp
# /usr/local/bin/jumanpp
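As a quick check that the Python side can also drive the binary (a minimal sketch, assuming rhoknp's Jumanpp.apply_to_sentence API; the example sentence is arbitrary):
from rhoknp import Jumanpp

jumanpp = Jumanpp()  # picks up the jumanpp binary on PATH
sentence = jumanpp.apply_to_sentence("こんにちは。")
print([m.text for m in sentence.morphemes])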
Download Livedoor News Corpus.
!wget "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
!tar zxf ldcc-20140209.tar.gz
Preprocess the articles and sort them by length.
import glob

articles = []
for filename in glob.glob("./text/*/*.txt"):
    with open(filename, "r") as f:
        article = f.read()
    # Drop the first two lines (URL and date) and join the rest into one string.
    articles.append("".join(article.split("\n")[2:]))
articles = sorted(articles, key=len)
Load tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp")
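A single short text tokenizes without trouble, so the tokenizer and its Juman++ call work at least for small inputs (a sanity-check sketch; the sentence is arbitrary and not from the corpus):
print(tokenizer.tokenize("今日はいい天気です。"))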
Encode the 100 longest samples; it returns in 373 ms.
%%time
features = tokenizer.batch_encode_plus(articles[-100:], padding="max_length", truncation=True, max_length=512)
# CPU times: user 340 ms, sys: 2.97 ms, total: 343 ms
# Wall time: 373 ms
One more try with the same data: it never returns, even after waiting over 10 minutes.
%%time
features = tokenizer.batch_encode_plus(articles[-100:], padding="max_length", truncation=True, max_length=512)
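To narrow down where the hang occurs, one thing worth trying (a diagnostic sketch, not part of the original report; it assumes rhoknp's Jumanpp.apply_to_sentence API) is to run the Juman++ segmentation step by itself on the same 100 articles:
from rhoknp import Jumanpp

jumanpp = Jumanpp()
for i, text in enumerate(articles[-100:]):
    # If this loop also stalls, the hang is on the Juman++/rhoknp side;
    # if it completes, the problem is more likely in the tokenizer wrapper.
    morphemes = jumanpp.apply_to_sentence(text).morphemes
    print(i, len(morphemes))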
We run into a similar problem when the tokenizer processes a relatively long batch of texts.