---
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
- text: "早稲田大学で自然言語処理を<mask>する。"
---

## Model description

This is a lightweight Japanese BERT model pre-trained from scratch on Japanese e-commerce data. For pre-training, short user reviews and review titles were collected from the Rakuten and Amazon websites. This base model is primarily designed to be fine-tuned on short text classification use cases.

## How to use

Here is how to run this model on a given text in PyTorch:

```python
from transformers import RobertaTokenizer, AutoModelForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

# prepare input
text = "トイザらス・ベビーザらス郡山店"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
```

You can also use this model directly with a pipeline for masked language modeling:

```python
from transformers import RobertaTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("こんにちは、<mask>モデルです。")
```

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task such as short text classification; an example fine-tuning sketch is given at the end of this card.

## Tokenization

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,522.

## Training procedure

JLBert has 6 hidden layers, 6 attention heads, and a hidden size of 768, making it lighter than the original BERT-base model (12 layers and 12 attention heads).
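
If you want to verify these dimensions, you can inspect the configuration that ships with the checkpoint. The attribute names below are the standard Hugging Face BERT/RoBERTa config fields and are assumed to be exposed by this model's config:

```python
from transformers import AutoConfig

# Load the configuration stored with the checkpoint
config = AutoConfig.from_pretrained('ShortText/JLBert')

# Standard BERT/RoBERTa-style config attributes
print(config.num_hidden_layers)    # 6 hidden layers
print(config.num_attention_heads)  # 6 attention heads
print(config.hidden_size)          # hidden size 768
print(config.vocab_size)           # 30,522 BPE tokens
```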
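
## Fine-tuning sketch

The snippet below is a minimal sketch of how fine-tuning for short text classification could start; it is not a recipe from the model authors. The `num_labels` value, the toy texts, and their labels are placeholders, and the classification head is newly initialized, so it must be trained on your own labelled data.

```python
import torch
from transformers import RobertaTokenizer, AutoModelForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')

# num_labels=2 is a placeholder; set it to the number of classes in your task.
# The classification head on top of the encoder is randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained('ShortText/JLBert', num_labels=2)

# Toy short reviews with made-up labels, only to illustrate the input format.
texts = ["とても良い商品でした", "発送が遅くて残念でした"]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**inputs, labels=labels)

outputs.loss.backward()  # a real run would loop over batches with an optimizer
print(outputs.loss.item(), outputs.logits.shape)  # loss and (batch, num_labels) logits
```

From here, you would wrap your labelled short texts in a dataset and train with an optimizer (or the `Trainer` API), as with any other Hugging Face sequence classification model.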