---
license: apache-2.0
language:
  - ja
---

Model Card for japanese-spoken-language-bert

These BERT models are pre-trained on written Japanese (Wikipedia) and fine-tuned on spoken Japanese. For fine-tuning we used CSJ and the records of the Japanese Diet. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/). We provide only the model parameters; you have to download the config and vocab files separately to use these models (see the loading sketch after the list below and "How to Get Started with the Model").

We provide the following three models:

  • 1-6 layer-wise (Folder Name: models/1-6_layer-wise): only the 1st-6th layers of the encoder were fine-tuned, on CSJ.

  • TAPT512 60k (Folder Name: models/tapt512_60k): fine-tuned on CSJ.

  • DAPT128-TAPT512 (Folder Name: models/dapt128-tap512): fine-tuned on the Diet records and CSJ.
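Since only the parameters are distributed, loading them means pairing the weights with the written BERT's config and vocabulary. Below is a minimal sketch, assuming the downloaded checkpoint is a standard PyTorch state dict named pytorch_model.bin; the checkpoint path and key layout are assumptions, not part of this repository's documented interface:

import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

# Start from the written BERT to reuse its config and vocabulary
# (requires the fugashi and ipadic packages for Japanese tokenization).
base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Hypothetical checkpoint path: point this at the model folder you downloaded.
state_dict = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)  # may need key renaming if the layouts differ
model.eval()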

Table of Contents

  • Model Details
  • Training Details
  • Evaluation
  • Citation
  • More Information
  • Model Card Authors
  • Model Card Contact
  • How to Get Started with the Model

Model Details

Model Description


  • Model type: Language model

  • Language(s) (NLP): ja

  • License: Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”)

Training Details

Training Data

  • 1-6 layer-wise: CSJ
  • TAPT512 60K: CSJ
  • DAPT128-TAPT512: the Japanese Diet records and CSJ

Training Procedure

We continued training the pre-trained Japanese BERT model (cl-tohoku/bert-base-japanese-whole-word-masking; referred to as written BERT) on the spoken-language data above.

For details, see the Japanese blog post (linked under More Information) or the Japanese paper (see Citation). A sketch of this kind of continued training follows.
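As an illustration only (not the authors' actual training code): the 512-token block size and 60k steps below are inferred from the TAPT512 60k model name, while the corpus file csj_train.txt, the output directory, and the masking probability are assumptions.

from transformers import (
    BertForMaskedLM,
    BertJapaneseTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Hypothetical corpus file: one spoken-Japanese sentence per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="csj_train.txt", block_size=512
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt512_60k", max_steps=60_000),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()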

Evaluation

Testing Data, Factors & Metrics

Testing Data

We use CSJ for the evaluation.

Factors

We evaluate the following tasks on CSJ:

  • Dependency Parsing
  • Sentence Boundary Detection
  • Important Sentence Extraction

Metrics

  • Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS; illustrated below)
  • Sentence Boundary Detection: F1 score
  • Important Sentence Extraction: F1 score
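For reference, UUAS counts a predicted dependency edge as correct if it connects the right pair of tokens, ignoring direction and label. A small illustrative implementation (not the authors' evaluation code):

def uuas(gold_edges, pred_edges):
    """Fraction of gold dependency edges recovered, ignoring direction and labels."""
    gold = {frozenset(edge) for edge in gold_edges}
    pred = {frozenset(edge) for edge in pred_edges}
    return len(gold & pred) / len(gold)

# Three gold edges, two recovered (direction ignored) -> UUAS = 2/3.
print(uuas([(0, 1), (1, 2), (2, 3)], [(1, 0), (2, 1), (3, 4)]))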

Results

| Model           | Dependency Parsing (UUAS) | Sentence Boundary (F1) | Important Sentence Extraction (F1) |
|-----------------|---------------------------|------------------------|------------------------------------|
| written BERT    | 39.4                      | 61.6                   | 36.8                               |
| 1-6 layer-wise  | 44.6                      | 64.8                   | 35.4                               |
| TAPT512 60K     | -                         | -                      | 40.2                               |
| DAPT128-TAPT512 | 42.9                      | 64.0                   | 39.7                               |

Citation

BibTeX:

@inproceedings{csjbert2021,
    title = {CSJを用いた日本語話し言葉BERTの作成},
    author = {勝又智 and 坂田大直},
    booktitle = {言語処理学会第27回年次大会},
    year = {2021},
}

More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese)

Model Card Authors

Satoru Katsumata

Model Card Contact

More information needed

How to Get Started with the Model

Use the code below to get started with the model.

  1. Run download_wikipedia_bert.py to download the BERT model trained on Wikipedia.

python download_wikipedia_bert.py

This script downloads the config files and a vocab file, provided by the Inui Laboratory of Tohoku University (https://github.com/cl-tohoku/bert-japanese), from the Hugging Face Model Hub.

  2. Run sample_mlm.py to confirm that you can use our models; a fill-mask sketch follows.

python sample_mlm.py
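If you would rather sanity-check the weights directly, a fill-mask query is a quick test. A sketch, assuming model and tokenizer were loaded as in the snippet near the top of this card:

from transformers import pipeline

# `model` and `tokenizer` as loaded in the earlier sketch;
# the example sentence is arbitrary.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("今日は[MASK]へ行きました。"))  # "Today I went to [MASK]."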