
Model description

This is a lightweight Japanese BERT model pre-trained from scratch on Japanese e-commerce data.

For pre-training, short user reviews and review titles were collected from the Rakuten and Amazon websites.

This base model is primarily intended to be fine-tuned on short-text classification use cases.

How to use

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import RobertaTokenizer, AutoModelForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

# prepare input
text = "トイザらス・ベビーザらス郡山店"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
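
Note that the masked-LM head returns token logits rather than features. If you want sentence-level features, a minimal sketch is shown below, assuming the checkpoint can also be loaded with the generic AutoModel class:

from transformers import RobertaTokenizer, AutoModel

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModel.from_pretrained('ShortText/JLBert')

text = "トイザらス・ベビーザらス郡山店"
encoded_input = tokenizer(text, return_tensors='pt')

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
features = model(**encoded_input).last_hidden_state

# a simple sentence embedding: mean-pool over the token dimension
sentence_embedding = features.mean(dim=1)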

You can use this model directly with a pipeline for masked language modeling:

from transformers import RobertaTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("こんにちは、<mask>モデルです。")

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task such as short-text classification.
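
As a rough sketch of such fine-tuning (the number of labels, the example reviews, and the label values below are placeholders, not part of this card):

import torch
from transformers import RobertaTokenizer, AutoModelForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
# num_labels is a placeholder; set it to the number of classes in your task
model = AutoModelForSequenceClassification.from_pretrained('ShortText/JLBert', num_labels=3)

texts = ["とても良い商品でした", "発送が遅くて残念"]  # hypothetical short reviews
labels = torch.tensor([1, 0])                          # hypothetical class labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug into your own training loop or the Trainer API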

Tokenization

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,522.
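
For example, you can inspect the vocabulary size and the sub-word pieces the tokenizer produces (a quick check, not output verified against the actual checkpoint):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
print(len(tokenizer))                                   # expected vocabulary size: 30,522
print(tokenizer.tokenize("トイザらス・ベビーザらス郡山店"))  # byte-level BPE sub-word pieces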

Training procedure

JLBert has 6 hidden layers, 6 attention heads, and a hidden size of 768, making it lighter than BERT-base.
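
A configuration with these dimensions would look roughly like the following; this is a sketch assuming a standard RoBERTa-style configuration, and the checkpoint's own config.json is authoritative:

from transformers import RobertaConfig

# Approximate JLBert configuration based on the figures above
config = RobertaConfig(
    vocab_size=30522,
    num_hidden_layers=6,     # vs. 12 in BERT-base
    num_attention_heads=6,   # vs. 12 in BERT-base
    hidden_size=768,
)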
