---
inference: false
language: ja
license: apache-2.0
mask_token: "[MASK]"
widget:
- text: "LINE株式会社で[MASK]の研究・開発をしている。"
---
# LINE DistilBERT Japanese (forked by liwii)
This is a forked version of a DistilBERT model pre-trained on 131 GB of Japanese web text.
The teacher model is a BERT-base model built in-house at LINE.
The model was trained by [LINE Corporation](https://linecorp.com/).
The difference from the [original repository](https://huggingface.co/line-corporation/line-distilbert-base-japanese) is the tokenizer code. In this repository, we updated it to work with `transformers>=4.34` after a [tokenizer refactoring](https://github.com/huggingface/transformers/pull/23909).
## For Japanese
A Japanese version of this document is available at https://github.com/line/LINE-DistilBERT-Japanese/blob/main/README_ja.md.
## How to use
```python
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer from this fork; trust_remote_code is required for the custom tokenizer code
tokenizer = AutoTokenizer.from_pretrained("liwii/line-distilbert-base-japanese-fork", trust_remote_code=True)
# The model is the same as the original repository
model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")
sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
print(model(**tokenizer(sentence, return_tensors="pt")))
```
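Since the widget sentence above is a fill-mask example, here is a minimal sketch of predicting the `[MASK]` token. It assumes `torch` is installed and uses `AutoModelForMaskedLM` instead of `AutoModel`; model and tokenizer IDs are the same as above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("liwii/line-distilbert-base-japanese-fork", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("line-corporation/line-distilbert-base-japanese")

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list the top-5 candidate tokens for it
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```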
### Requirements
```txt
fugashi
sentencepiece
unidic-lite
```
## Model architecture
The model architecture is the DistilBERT base model: 6 layers, a hidden size of 768, 12 attention heads, and 66M parameters.
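As a quick sanity check, these numbers can be read off the model config and parameters. This is a minimal sketch; the attribute names `n_layers`, `dim`, and `n_heads` assume the standard `DistilBertConfig`.

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("line-corporation/line-distilbert-base-japanese")
print(config.n_layers, config.dim, config.n_heads)  # expected: 6 768 12

model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")
print(sum(p.numel() for p in model.parameters()))  # parameter count, roughly the 66M cited above
```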
## Evaluation
The evaluation by [JGLUE](https://github.com/yahoojapan/JGLUE) is as follows:
| model name | #Params | MARC-ja | JNLI | JSTS | JSQuAD | JCommonSenseQA |
|------------------------|:-------:|:-------:|:----:|:----------------:|:---------:|:--------------:|
| | | acc | acc | Pearson/Spearman | EM/F1 | acc |
| LINE-DistilBERT | 68M | 95.6 | 88.9 | 89.2/85.1 | 87.3/93.3 | 76.1 |
| Laboro-DistilBERT | 68M | 94.7 | 82.0 | 87.4/82.7 | 70.2/87.3 | 73.2 |
| BandaiNamco-DistilBERT | 68M | 94.6 | 81.6 | 86.8/82.1 | 80.0/88.0 | 66.5 |
## Tokenization
The texts are first tokenized by MeCab with the Unidic dictionary and then split into subwords by the SentencePiece algorithm. The vocabulary size is 32768.
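For illustration, a minimal sketch of inspecting the tokenizer from this fork; the exact subword split depends on the vocabulary, so no particular output is guaranteed.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("liwii/line-distilbert-base-japanese-fork", trust_remote_code=True)

print(tokenizer.vocab_size)  # 32768
# MeCab (Unidic) pre-tokenization followed by SentencePiece subword splitting happens inside tokenize()
print(tokenizer.tokenize("LINE株式会社で[MASK]の研究・開発をしている。"))
```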
## Licenses
The pretrained models are distributed under the terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
## To cite this work
We haven't published any paper on this work. Please cite [this GitHub repository](http://github.com/line/LINE-DistilBERT-Japanese):
```
@misc{LINE-DistilBERT-Japanese,
  title = {LINE DistilBERT Japanese},
  author = {Koga, Kobayashi and Li, Shengzhe and Nakamachi, Akifumi and Sato, Toshinori},
  year = {2023},
  howpublished = {\url{http://github.com/line/LINE-DistilBERT-Japanese}}
}
```