## Model description
This is a lightweight Japanese BERT model pre-trained from scratch on Japanese e-commerce data.
For pre-training, short user reviews and review titles were taken from the Rakuten and Amazon websites.
This base model is primarily designed to be fine-tuned on short-text classification use cases.
## How to use
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, AutoModelForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

# prepare input
text = "トイザらス・ベビーザらス郡山店"  # "Toys'R'Us / Babies'R'Us Koriyama store"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
```
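The masked-LM head returns vocabulary logits rather than sentence features, so a common, model-agnostic recipe for turning the encoder's last hidden state into one sentence vector is attention-mask-aware mean pooling. A minimal sketch with dummy tensors standing in for the real model outputs (the shapes are the only assumption taken from the card):

```python
import torch

# Dummy stand-ins for the encoder's outputs: (batch, seq_len, hidden_size)
hidden_states = torch.randn(1, 8, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0, 0]])  # padding masked out

mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (hidden_states * mask).sum(dim=1)    # sum over real tokens only
counts = mask.sum(dim=1).clamp(min=1e-9)      # number of real tokens
sentence_vector = summed / counts             # (batch, hidden_size)

print(sentence_vector.shape)                  # torch.Size([1, 768])
```

With the real model, `hidden_states` would come from the encoder (e.g. via `output_hidden_states=True`) and `attention_mask` from the tokenizer output.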
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import RobertaTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("こんにちは、<mask>モデルです。")  # "Hello, this is a <mask> model."
```
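Under the hood, a fill-mask pipeline reads the MLM logits at the `<mask>` position, applies a softmax, and keeps the top-scoring vocabulary ids. A rough sketch with a dummy logits vector (the real pipeline additionally maps each id back to a token string):

```python
import torch

# Dummy stand-in for the model's logits at the masked position
vocab_size = 30522
logits = torch.randn(vocab_size)

probs = torch.softmax(logits, dim=-1)         # scores over the vocabulary
top = torch.topk(probs, k=5)                  # five best candidate tokens

for score, token_id in zip(top.values, top.indices):
    # A real pipeline would decode token_id with
    # tokenizer.convert_ids_to_tokens(int(token_id))
    print(f"token_id={int(token_id)} score={float(score):.4f}")
```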
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task such as short-text classification.
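For the short-text classification fine-tuning mentioned above, a task head is typically just a linear layer over a pooled sentence vector. A minimal sketch with a dummy pooled vector; `num_labels` and the input are illustrative, not part of this model:

```python
import torch
import torch.nn as nn

# Hidden size 768 matches the model card; the label count is hypothetical
num_labels = 3
classifier = nn.Linear(768, num_labels)

sentence_vector = torch.randn(1, 768)         # stand-in for the pooled output
logits = classifier(sentence_vector)          # (1, num_labels)
predicted_label = logits.argmax(dim=-1)
print(logits.shape, int(predicted_label))
```

In practice the head and encoder are trained jointly with a cross-entropy loss on the labeled short texts.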
## Tokenization
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,522.
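For intuition, BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair. A toy illustration of one merge step in pure Python (not this tokenizer's actual merge table):

```python
from collections import Counter

# Toy corpus of words split into initial symbols
corpus = [list("lower"), list("lowest"), list("low")]

# Count every adjacent symbol pair
pairs = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += 1

best = max(pairs, key=pairs.get)              # most frequent pair
print(best)                                   # ('l', 'o'), first of the pairs tied at count 3

# Merge that pair into a single new symbol everywhere it occurs
merged = []
for word in corpus:
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == best:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    merged.append(out)
print(merged[0])                              # ['lo', 'w', 'e', 'r']
```

A byte-level variant applies the same procedure to UTF-8 bytes instead of characters, so any Japanese string can be encoded without unknown tokens.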
## Training procedure
JLBert has 6 hidden layers, 6 attention heads, and a hidden size of 768, making it lighter than BERT-base (12 layers, 12 attention heads).
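A back-of-the-envelope estimate of why halving the layer count helps, assuming a standard BERT/RoBERTa layer layout (hidden size 768, feed-forward size 4×768) and the 30,522-token vocabulary above; the maximum position count of 512 is an assumption, so the totals are rough:

```python
# Approximate parameter counts for a 6-layer vs 12-layer encoder
h, vocab, max_pos, layers = 768, 30522, 512, 6

embeddings = vocab * h + max_pos * h          # token + position embeddings
per_layer = (
    4 * (h * h + h)                           # Q, K, V, output projections
    + (h * 4 * h + 4 * h)                     # feed-forward expansion
    + (4 * h * h + h)                         # feed-forward contraction
    + 2 * 2 * h                               # two LayerNorms (weight + bias)
)
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.0f}M parameters vs ~{(embeddings + 12 * per_layer) / 1e6:.0f}M for 12 layers")
```

Most of the savings come from dropping six transformer layers (~7M parameters each); the embedding table is shared overhead either way.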