
Model Card for japanese-spoken-language-bert

The Japanese README is available here.


Table of Contents

  • Model Details
  • Training Details
  • Evaluation
  • Citation
  • More Information
  • Model Card Authors
  • Model Card Contact
  • How to Get Started with the Model

Model Details

Model Description

These BERT models are pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese. For fine-tuning we used the Corpus of Spontaneous Japanese (CSJ), which is provided by NINJAL (https://www.ninjal.ac.jp/), and the minutes of the National Diet of Japan. We provide only the model parameters; to use these models you must download the config files separately (see the loading sketch after the list below).

We provide the following three models:

  • 1-6 layer-wise (Folder Name: models/1-6_layer-wise)
    Fine-tuned only the 1st-6th encoder layers on CSJ.

  • TAPT512 60K (Folder Name: models/tapt512_60k)
    Fine-tuned on CSJ.

  • DAPT128-TAPT512 (Folder Name: models/dapt128-tap512)
    Fine-tuned on the Diet minutes and CSJ.
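The snippet below is a minimal loading sketch, assuming PyTorch and the transformers library: the config and vocab come from the written-BERT release, and the local checkpoint path is a placeholder for wherever you put the provided parameters. The repository's download_wikipedia_bert.py and sample_mlm.py scripts (see "How to Get Started with the Model") are the supported route.

import torch
from transformers import BertConfig, BertForMaskedLM, BertJapaneseTokenizer

BASE = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = BertJapaneseTokenizer.from_pretrained(BASE)  # vocab from written BERT
config = BertConfig.from_pretrained(BASE)                # config from written BERT

# Build the architecture, then overwrite the weights with the provided
# spoken-language parameters (placeholder path).
model = BertForMaskedLM(config)
state_dict = torch.load("models/1-6_layer-wise/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False: saved key names may differ
model.eval()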

Model Information

  • Model type: Language model
  • Language(s) (NLP): ja
  • License: Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”)

Training Details

Training Data

  • 1-6 layer-wise: CSJ
  • TAPT512 60K: CSJ
  • DAPT128-TAPT512: the minutes of the National Diet of Japan and CSJ

Training Procedure

We continued training the pre-trained Japanese BERT model (cl-tohoku/bert-base-japanese-whole-word-masking; hereafter "written BERT") on the spoken-language data. An illustrative sketch of this continued training follows.

For details, see the Japanese blog post or the Japanese paper.
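The sketch below shows, for illustration only, what continued masked-LM training with Hugging Face transformers might look like. File paths, hyperparameters, and the layer-freezing detail are assumptions rather than the authors' exact setup; the 60,000 steps merely echo the TAPT512 60K name.

from transformers import (BertForMaskedLM, BertJapaneseTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

BASE = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(BASE)
model = BertForMaskedLM.from_pretrained(BASE)

# For the 1-6 layer-wise variant, one way to update only the 1st-6th encoder
# layers is to freeze every other parameter (an assumption, for illustration):
for name, param in model.named_parameters():
    param.requires_grad = (name.startswith("bert.encoder.layer.")
                           and int(name.split(".")[3]) < 6)

# CSJ cannot be redistributed; csj_sentences.txt is a placeholder file with
# one sentence per line.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="csj_sentences.txt",
                                block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt_out", max_steps=60_000),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()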

Evaluation

Testing Data, Factors & Metrics

Testing Data

We use CSJ for the evaluation.

Factors

We evaluate the following tasks on CSJ:

  • Dependency Parsing
  • Sentence Boundary
  • Important Sentence Extraction

Metrics

  • Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS)
  • Sentence Boundary: F1 Score
  • Important Sentence Extraction: F1 Score
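As a toy illustration with made-up trees (not CSJ data), UUAS is the fraction of gold dependency edges that the prediction recovers when edge direction and labels are ignored:

def uuas(gold_edges, pred_edges):
    # Compare edges as unordered pairs so that direction is ignored.
    undirected = lambda edges: {frozenset(e) for e in edges}
    gold, pred = undirected(gold_edges), undirected(pred_edges)
    return len(gold & pred) / len(gold)

gold = [(0, 1), (1, 2), (1, 3)]  # (head, dependent) pairs
pred = [(1, 0), (2, 1), (3, 2)]
print(uuas(gold, pred))  # 0.666...: two of the three gold edges match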

Results

Model            Dependency Parsing   Sentence Boundary   Important Sentence Extraction
written BERT     39.4                 61.6                36.8
1-6 layer-wise   44.6                 64.8                35.4
TAPT512 60K      -                    -                   40.2
DAPT128-TAPT512  42.9                 64.0                39.7

Citation

BibTeX:

@inproceedings{csjbert2021,
    title = {CSJを用いた日本語話し言葉BERTの作成},
    author = {勝又智 and 坂田大直},
    booktitle = {言語処理学会第27回年次大会},
    year = {2021},
}

(The title translates to "Construction of a Japanese Spoken-Language BERT Using CSJ"; the venue is the 27th Annual Meeting of the Association for Natural Language Processing, 2021.)

More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese)

Model Card Authors

Satoru Katsumata

Model Card Contact

pr@retrieva.jp

How to Get Started with the Model

Use the code below to get started with the model.

  1. Run download_wikipedia_bert.py to download the BERT model trained on Wikipedia.
python download_wikipedia_bert.py

This script downloads the config files and a vocab file, provided by the Inui Laboratory of Tohoku University, from the Hugging Face Model Hub: https://github.com/cl-tohoku/bert-japanese

  2. Run sample_mlm.py to confirm that you can use our models.
python sample_mlm.py
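For reference, a fill-mask check along these lines might look like the sketch below. This is not the contents of sample_mlm.py, and the checkpoint path is a placeholder.

import torch
from transformers import BertConfig, BertForMaskedLM, BertJapaneseTokenizer

BASE = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(BASE)
model = BertForMaskedLM(BertConfig.from_pretrained(BASE))
state = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")  # placeholder
model.load_state_dict(state, strict=False)
model.eval()

text = "今日は[MASK]へ行きます。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and print the five most likely fillers.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))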