---
license: apache-2.0
language:
- ja
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
- text: "早稲田大学で自然言語処理を<mask>する。"
---

## Model description

This is a lightweight Japanese BERT model pre-trained from scratch on Japanese e-commerce data. For pre-training, short user reviews and review titles were collected from the Rakuten and Amazon websites. This base model is primarily designed to be fine-tuned on short text classification use cases.

## How to use

Here is how to run this model on a given text in PyTorch:

```python
from transformers import RobertaTokenizer, AutoModelForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

# prepare input
text = "トイザらス・ベビーザらス郡山店"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
```

You can also use this model directly with a pipeline for masked language modeling:

```python
from transformers import RobertaTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')
model = AutoModelForMaskedLM.from_pretrained('ShortText/JLBert')

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("こんにちは、<mask>モデルです。")
```

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task such as short text classification; an example fine-tuning sketch is given at the end of this card.

## Tokenization

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,522.

## Training procedure

JLBert has 6 hidden layers, 6 attention heads, and a hidden size of 768, making it lighter than the original BERT-base model (12 layers and 12 attention heads).
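
If you want to verify these dimensions, you can inspect the configuration that ships with the checkpoint. The attribute names below are the standard Hugging Face BERT/RoBERTa config fields and are assumed to be exposed by this model's config:

```python
from transformers import AutoConfig

# Load the configuration stored with the checkpoint
config = AutoConfig.from_pretrained('ShortText/JLBert')

# Standard BERT/RoBERTa-style config attributes
print(config.num_hidden_layers)    # 6 hidden layers
print(config.num_attention_heads)  # 6 attention heads
print(config.hidden_size)          # hidden size 768
print(config.vocab_size)           # 30,522 BPE tokens
```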
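
## Fine-tuning sketch

The snippet below is a minimal sketch of how fine-tuning for short text classification could start; it is not a recipe from the model authors. The `num_labels` value, the toy texts, and their labels are placeholders, and the classification head is newly initialized, so it must be trained on your own labelled data.

```python
import torch
from transformers import RobertaTokenizer, AutoModelForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('ShortText/JLBert')

# num_labels=2 is a placeholder; set it to the number of classes in your task.
# The classification head on top of the encoder is randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained('ShortText/JLBert', num_labels=2)

# Toy short reviews with made-up labels, only to illustrate the input format.
texts = ["とても良い商品でした", "発送が遅くて残念でした"]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**inputs, labels=labels)

outputs.loss.backward()  # a real run would loop over batches with an optimizer
print(outputs.loss.item(), outputs.logits.shape)  # loss and (batch, num_labels) logits
```

From here, you would wrap your labelled short texts in a dataset and train with an optimizer (or the `Trainer` API), as with any other Hugging Face sequence classification model.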