---
license: apache-2.0
language:
  - ja
---

Model Card for japanese-spoken-language-bert

These BERT models are pre-trained on written Japanese (Wikipedia) and fine-tuned on spoken Japanese. For fine-tuning we used CSJ and the records of the Japanese Diet. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/). We provide only the model parameters; you have to download the config and vocab files separately to use these models (see the loading sketch after the list below and "How to Get Started with the Model").

We provide the following three models:

  • 1-6 layer-wise (Folder Name: models/1-6_layer-wise): only the 1st-6th layers of the encoder were fine-tuned, on CSJ.

  • TAPT512 60k (Folder Name: models/tapt512_60k): fine-tuned on CSJ.

  • DAPT128-TAPT512 (Folder Name: models/dapt128-tap512): fine-tuned on the Diet records and CSJ.
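Since only the parameters are distributed, loading them means pairing the weights with the written BERT's config and vocabulary. Below is a minimal sketch, assuming the downloaded checkpoint is a standard PyTorch state dict named pytorch_model.bin; the checkpoint path and key layout are assumptions, not part of this repository's documented interface:

import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

# Start from the written BERT to reuse its config and vocabulary
# (requires the fugashi and ipadic packages for Japanese tokenization).
base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Hypothetical checkpoint path: point this at the model folder you downloaded.
state_dict = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)  # may need key renaming if the layouts differ
model.eval()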

Table of Contents

  • Model Details
  • Training Details
  • Evaluation
  • Citation
  • More Information
  • Model Card Authors
  • Model Card Contact
  • How to Get Started with the Model

Model Details

Model Description


  • Model type: Language model

  • Language(s) (NLP): ja

  • License: Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”)

Training Details

Training Data

  • 1-6 layer-wise: CSJ
  • TAPT512 60K: CSJ
  • DAPT128-TAPT512: the Japanese Diet records and CSJ

Training Procedure

We continued training the pre-trained Japanese BERT model (cl-tohoku/bert-base-japanese-whole-word-masking; referred to as written BERT) on the spoken-language data above.

For details, see the Japanese blog post (linked under More Information) or the Japanese paper (see Citation). A sketch of this kind of continued training follows.
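As an illustration only (not the authors' actual training code): the 512-token block size and 60k steps below are inferred from the TAPT512 60k model name, while the corpus file csj_train.txt, the output directory, and the masking probability are assumptions.

from transformers import (
    BertForMaskedLM,
    BertJapaneseTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Hypothetical corpus file: one spoken-Japanese sentence per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="csj_train.txt", block_size=512
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt512_60k", max_steps=60_000),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()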

Evaluation

Testing Data, Factors & Metrics

Testing Data

We use CSJ for the evaluation.

Factors

We evaluate the following tasks on CSJ:

  • Dependency Parsing
  • Sentence Boundary Detection
  • Important Sentence Extraction

Metrics

  • Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS; illustrated below)
  • Sentence Boundary Detection: F1 score
  • Important Sentence Extraction: F1 score
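For reference, UUAS counts a predicted dependency edge as correct if it connects the right pair of tokens, ignoring direction and label. A small illustrative implementation (not the authors' evaluation code):

def uuas(gold_edges, pred_edges):
    """Fraction of gold dependency edges recovered, ignoring direction and labels."""
    gold = {frozenset(edge) for edge in gold_edges}
    pred = {frozenset(edge) for edge in pred_edges}
    return len(gold & pred) / len(gold)

# Three gold edges, two recovered (direction ignored) -> UUAS = 2/3.
print(uuas([(0, 1), (1, 2), (2, 3)], [(1, 0), (2, 1), (3, 4)]))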

Results

| Model           | Dependency Parsing (UUAS) | Sentence Boundary (F1) | Important Sentence Extraction (F1) |
|-----------------|---------------------------|------------------------|------------------------------------|
| written BERT    | 39.4                      | 61.6                   | 36.8                               |
| 1-6 layer-wise  | 44.6                      | 64.8                   | 35.4                               |
| TAPT512 60K     | -                         | -                      | 40.2                               |
| DAPT128-TAPT512 | 42.9                      | 64.0                   | 39.7                               |

Citation

BibTeX:

@inproceedings{csjbert2021,
    title = {CSJを用いた日本語話し言葉BERTの作成},
    author = {勝又智 and 坂田大直},
    booktitle = {言語処理学会第27回年次大会},
    year = {2021},
}

More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese)

Model Card Authors

Satoru Katsumata

Model Card Contact

More information needed

How to Get Started with the Model

Use the code below to get started with the model.

  1. Run download_wikipedia_bert.py to download the BERT model trained on Wikipedia.

python download_wikipedia_bert.py

This script downloads the config files and a vocab file, provided by the Inui Laboratory of Tohoku University (https://github.com/cl-tohoku/bert-japanese), from the Hugging Face Model Hub.

  2. Run sample_mlm.py to confirm that you can use our models; a fill-mask sketch follows.

python sample_mlm.py
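If you would rather sanity-check the weights directly, a fill-mask query is a quick test. A sketch, assuming model and tokenizer were loaded as in the snippet near the top of this card:

from transformers import pipeline

# `model` and `tokenizer` as loaded in the earlier sketch;
# the example sentence is arbitrary.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("今日は[MASK]へ行きました。"))  # "Today I went to [MASK]."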