
LINE DistilBERT Japanese (forked by liwii)

This is a forked version of the LINE DistilBERT model, pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE. The model was trained by LINE Corporation.

The difference from the original repository is the tokenizer code. In this repository, it has been updated to work with transformers>=4.34, following a tokenizer refactoring in the library.
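
As a quick check that the installed library is new enough for this fork's tokenizer code:

import transformers
# This fork targets the refactored tokenizer API; 4.34 or newer is expected.
print(transformers.__version__)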

For Japanese

https://github.com/line/LINE-DistilBERT-Japanese/blob/main/README_ja.md is written in Japanese.

How to use

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("liwii/line-distilbert-base-japanese-fork", trust_remote_code=True)
# The model is the same as the original repository
model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
print(model(**tokenizer(sentence, return_tensors="pt")))
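
Since the example sentence contains a [MASK] token, a fill-mask variant may be more illustrative. The following is a minimal sketch, assuming the original checkpoint also provides masked-LM weights:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("liwii/line-distilbert-base-japanese-fork", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("line-corporation/line-distilbert-base-japanese")

inputs = tokenizer("LINE株式会社で[MASK]の研究・開発をしている。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate tokens for the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))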

Requirements

fugashi 
sentencepiece
unidic-lite
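
These can be installed with pip, for example:

pip install "transformers>=4.34" fugashi sentencepiece unidic-lite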

Model architecture

The model architecture is the DistilBERT base model: 6 layers, 768-dimensional hidden states, 12 attention heads, and 66M parameters.
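
As a rough sanity check of these numbers, you can inspect the configuration and count parameters (a sketch; the field names below are the standard DistilBERT config attributes):

from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("line-corporation/line-distilbert-base-japanese")
print(config.n_layers, config.dim, config.n_heads)  # expected: 6 768 12

model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")
print(sum(p.numel() for p in model.parameters()))   # on the order of 66M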

Evaluation

The evaluation results on JGLUE are as follows:

| model name | #Params | Marc_ja (acc) | JNLI (acc) | JSTS (Pearson/Spearman) | JSQuAD (EM/F1) | JCommonSenseQA (acc) |
|---|---|---|---|---|---|---|
| LINE-DistilBERT | 68M | 95.6 | 88.9 | 89.2/85.1 | 87.3/93.3 | 76.1 |
| Laboro-DistilBERT | 68M | 94.7 | 82.0 | 87.4/82.7 | 70.2/87.3 | 73.2 |
| BandaiNamco-DistilBERT | 68M | 94.6 | 81.6 | 86.8/82.1 | 80.0/88.0 | 66.5 |

Tokenization

The texts are first tokenized by MeCab with the Unidic dictionary and then split into subwords by the SentencePiece algorithm. The vocabulary size is 32768.
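
To see the two-stage tokenization in practice, you can inspect the subwords the tokenizer produces (a small sketch; the exact splits depend on the MeCab/Unidic segmentation):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("liwii/line-distilbert-base-japanese-fork", trust_remote_code=True)
print(tokenizer.tokenize("LINE株式会社で[MASK]の研究・開発をしている。"))
print(tokenizer.vocab_size)  # 32768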

Licenses

The pretrained models are distributed under the terms of the Apache License, Version 2.0.

To cite this work

We haven't published any paper on this work. Please cite this GitHub repository:

@article{LINE DistilBERT Japanese,
  title = {LINE DistilBERT Japanese},
  author = {"Koga, Kobayashi and Li, Shengzhe and Nakamachi, Akifumi and Sato, Toshinori"},
  year = {2023},
  howpublished = {\url{http://github.com/line/LINE-DistilBERT-Japanese}}
}