roberta-large-japanese-with-auto-jumanpp's tokenizer.batch_encode_plus() hangs during tokenization.
#2 · opened by ogis-uno
Hi, I have an issue with tokenization using "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp".
batch_encode_plus() doesn't return when given relatively long texts.
I'm wondering whether this is the correct place to discuss the issue; if it isn't, please let me know where I should go next.
Here is how to replicate the issue in Google Colab.
Install transformers, sentencepiece, and rhoknp.
!pip install "transformers==4.30.*"
!pip install "sentencepiece==0.1.*"
!pip install "rhoknp==1.3.2"
And jumanpp.
!wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz
!tar xf jumanpp-2.0.0-rc3.tar.xz
!cd jumanpp-2.0.0-rc3 && mkdir bld && cd bld && cmake .. -DCMAKE_BUILD_TYPE=Release && make install -j2
Juman++ was installed to /usr/local/bin.
!which jumanpp
# /usr/local/bin/jumanpp
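As a quick check that the Python side can also drive the binary (a minimal sketch, assuming rhoknp's Jumanpp.apply_to_sentence API; the example sentence is arbitrary):
from rhoknp import Jumanpp

jumanpp = Jumanpp()  # picks up the jumanpp binary on PATH
sentence = jumanpp.apply_to_sentence("こんにちは。")
print([m.text for m in sentence.morphemes])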
Download Livedoor News Corpus.
!wget "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
!tar zxf ldcc-20140209.tar.gz
Preprocess the articles and sort them by length.
import glob

articles = []
for filename in glob.glob("./text/*/*.txt"):
    with open(filename, "r") as f:
        article = f.read()
    # Drop the first two lines (URL and date) and join the rest into one string.
    articles.append("".join(article.split("\n")[2:]))
articles = sorted(articles, key=len)
Load tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp")
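A single short text tokenizes without trouble, so the tokenizer and its Juman++ call work at least for small inputs (a sanity-check sketch; the sentence is arbitrary and not from the corpus):
print(tokenizer.tokenize("今日はいい天気です。"))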
Encode the 100 longest samples; it returns in 373 ms.
%%time
features = tokenizer.batch_encode_plus(articles[-100:], padding="max_length", truncation=True, max_length=512)
# CPU times: user 340 ms, sys: 2.97 ms, total: 343 ms
# Wall time: 373 ms
One more try with the same data: it never returns, even after waiting over 10 minutes.
%%time
features = tokenizer.batch_encode_plus(articles[-100:], padding="max_length", truncation=True, max_length=512)
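To narrow down where the hang occurs, one thing worth trying (a diagnostic sketch, not part of the original report; it assumes rhoknp's Jumanpp.apply_to_sentence API) is to run the Juman++ segmentation step by itself on the same 100 articles:
from rhoknp import Jumanpp

jumanpp = Jumanpp()
for i, text in enumerate(articles[-100:]):
    # If this loop also stalls, the hang is on the Juman++/rhoknp side;
    # if it completes, the problem is more likely in the tokenizer wrapper.
    morphemes = jumanpp.apply_to_sentence(text).morphemes
    print(i, len(morphemes))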
We run into a similar problem when the tokenizer processes a relatively long batch of texts.