inference: false
language: ja
license: apache-2.0
mask_token: '[MASK]'
widget:
- text: LINE株式会社で[MASK]の研究・開発をしている。
LINE DistilBERT Japanese
This is a DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model that was built in-house at LINE. The model was trained by LINE Corporation.
For Japanese
A Japanese version of this README is available at https://github.com/line/LINE-DistilBERT-Japanese/blob/main/README_ja.md.
How to use
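from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True lets transformers run the custom tokenizer code hosted in the model repository.
tokenizer = AutoTokenizer.from_pretrained("line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
# Prints the encoder output, which holds the last hidden state for each token.
print(model(**tokenizer(sentence, return_tensors="pt")))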
Requirements
fugashi
sentencepiece
unidic-lite
Model architecture
The model architecture is the DistilBERT base model: 6 layers, 768 dimensions of hidden states, 12 attention heads, 66M parameters.
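As a rough sanity check on the size, the parameter count can be estimated from the layer shapes alone. The sketch below is plain arithmetic, not the actual model code. It takes the layer count and hidden size from the description above and the vocabulary size from the Tokenization section, and it assumes the standard DistilBERT base values for the feed-forward width (3072) and the maximum sequence length (512), neither of which is stated in this README. The helper name `count_distilbert_params` is made up for illustration. The number of attention heads does not appear because splitting the hidden size across heads does not change the number of weights.

```python
def count_distilbert_params(vocab_size, dim, n_layers, ffn_dim, max_positions):
    """Count weights and biases of a DistilBERT-style encoder (no task head)."""
    layer_norm = 2 * dim  # scale and bias

    # Embeddings: token table, learned position table, one LayerNorm.
    embeddings = vocab_size * dim + max_positions * dim + layer_norm

    # Self-attention: query, key, value and output projections,
    # each a dim x dim matrix plus a bias vector.
    attention = 4 * (dim * dim + dim)

    # Feed-forward network: dim -> ffn_dim -> dim, each with a bias.
    ffn = (dim * ffn_dim + ffn_dim) + (ffn_dim * dim + dim)

    # Each layer has two LayerNorms (after attention and after the FFN).
    per_layer = attention + ffn + 2 * layer_norm

    return embeddings + n_layers * per_layer


total = count_distilbert_params(
    vocab_size=32768,    # from the Tokenization section
    dim=768,             # hidden size stated above
    n_layers=6,          # layer count stated above
    ffn_dim=3072,        # assumption: standard DistilBERT base value
    max_positions=512,   # assumption: standard DistilBERT base value
)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# prints "68,087,808 parameters (~68M)"
```

Under these assumptions the estimate lands at about 68M, in line with the #Params column of the evaluation table. The same arithmetic with the 30,522-token vocabulary of the original English DistilBERT gives about 66M, so the difference between the two figures is explained by the size of the token embedding table.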
Evaluation
The evaluation results on JGLUE are as follows:
model name | #Params | Marc_ja (acc) | JNLI (acc) | JSTS (Pearson/Spearman) | JSQuAD (EM/F1) | JCommonSenseQA (acc) |
---|---|---|---|---|---|---|
LINE-DistilBERT | 68M | 95.6 | 88.9 | 89.2/85.1 | 87.3/93.3 | 76.1 |
Laboro-DistilBERT | 68M | 94.7 | 82.0 | 87.4/82.7 | 70.2/87.3 | 73.2 |
BandaiNamco-DistilBERT | 68M | 94.6 | 81.6 | 86.8/82.1 | 80.0/88.0 | 66.5 |
Tokenization
Text is first tokenized by MeCab with the UniDic dictionary and then split into subwords by the SentencePiece algorithm. The vocabulary size is 32768.
Licenses
The pretrained models are distributed under the terms of the Apache License, Version 2.0.
To cite this work
We haven't published any paper on this work. Please cite this GitHub repository:
@misc{LINE-DistilBERT-Japanese,
title = {LINE DistilBERT Japanese},
author = {Koga, Kobayashi and Li, Shengzhe and Nakamachi, Akifumi and Sato, Toshinori},
year = {2023},
howpublished = {\url{http://github.com/line/LINE-DistilBERT-Japanese}}
}