---
license: apache-2.0
language:
- ja
---

# Model Card for japanese-spoken-language-bert

日本語READMEは[こちら](./README_JA.md)

These BERT models are pre-trained on written Japanese (Wikipedia) and fine-tuned on spoken Japanese. We used CSJ and the Japanese Diet records. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).

We provide only the model parameters; you need to download the other config files separately to use these models.

We provide the three models below:

- **1-6 layer-wise** (Folder Name: models/1-6_layer-wise)
  Fine-tuned on CSJ, updating only the 1st-6th layers of the encoder.
- **TAPT512 60K** (Folder Name: models/tapt512_60k)
  Fine-tuned on CSJ.
- **DAPT128-TAPT512** (Folder Name: models/dapt128-tap512)
  Fine-tuned on the Diet records and CSJ.

# Table of Contents

- [Model Card for japanese-spoken-language-bert](#model-card-for-japanese-spoken-language-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
    - [Testing Data](#testing-data)
    - [Factors](#factors)
    - [Metrics](#metrics)
  - [Results](#results)
- [Citation](#citation)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

These BERT models are pre-trained on written Japanese (Wikipedia) and fine-tuned on spoken Japanese. We used CSJ and the Japanese Diet records. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).

We provide only the model parameters; you need to download the other config files separately to use these models.

We provide the three models below:

- 1-6 layer-wise (Folder Name: models/1-6_layer-wise)
  Fine-tuned on CSJ, updating only the 1st-6th layers of the encoder.
- TAPT512 60K (Folder Name: models/tapt512_60k)
  Fine-tuned on CSJ.
- DAPT128-TAPT512 (Folder Name: models/dapt128-tap512)
  Fine-tuned on the Diet records and CSJ.

**Model Information**

- **Model type:** Language model
- **Language(s) (NLP):** ja
- **License:** Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”)

# Training Details

## Training Data

- 1-6 layer-wise: CSJ
- TAPT512 60K: CSJ
- DAPT128-TAPT512: the Japanese Diet records and CSJ

## Training Procedure

We continuously train the pre-trained Japanese BERT model ([cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking); written BERT) on the spoken-language data. For details, see the [Japanese blog post](https://tech.retrieva.jp/entry/2021/04/01/114943) or the [Japanese paper](https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P4-17.pdf).
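As a rough illustration of the 1-6 layer-wise setup, the sketch below freezes every parameter of the written BERT except the 1st-6th encoder layers before continued masked-language-model training. This is a minimal sketch using the Hugging Face `transformers` API, not the authors' actual training script; whether the embeddings and MLM head were also updated is not specified here.

```python
# Minimal sketch (not the authors' training script): prepare the written BERT so
# that only the 1st-6th encoder layers receive gradient updates, as in the
# "1-6 layer-wise" model described above.
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained(
    "cl-tohoku/bert-base-japanese-whole-word-masking"  # written BERT
)

# Freeze everything first, then unfreeze encoder layers 0-5 (the 1st-6th layers).
for param in model.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(len(trainable), "trainable parameter tensors, e.g.", trainable[0])

# Continued MLM training on CSJ would then proceed with a standard training loop
# or the transformers Trainer, which only updates the unfrozen layers.
```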
# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

We use CSJ for the evaluation.

### Factors

We evaluate the following tasks on CSJ:

- Dependency Parsing
- Sentence Boundary Detection
- Important Sentence Extraction

### Metrics

- Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS)
- Sentence Boundary Detection: F1 score
- Important Sentence Extraction: F1 score

## Results

|                 | Dependency Parsing | Sentence Boundary Detection | Important Sentence Extraction |
| :-------------- | ---: | ---: | ---: |
| written BERT    | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise  | 44.6 | 64.8 | 35.4 |
| TAPT512 60K     | -    | -    | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |

# Citation

**BibTeX:**

```bibtex
@inproceedings{csjbert2021,
    title = {CSJを用いた日本語話し言葉BERTの作成},
    author = {勝又智 and 坂田大直},
    booktitle = {言語処理学会第27回年次大会},
    year = {2021},
}
```

# More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (in Japanese)

# Model Card Authors

Satoru Katsumata

# Model Card Contact

pr@retrieva.jp

# How to Get Started with the Model

Use the code below to get started with the model.
1. Run `download_wikipedia_bert.py` to download the BERT model trained on Wikipedia.

   ```bash
   python download_wikipedia_bert.py
   ```

   This script downloads the config files and a vocab file provided by the Inui Laboratory of Tohoku University (https://github.com/cl-tohoku/bert-japanese) from the Hugging Face Model Hub.

2. Run `sample_mlm.py` to confirm that you can use our models.

   ```bash
   python sample_mlm.py
   ```
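After running the two scripts above, loading one of the fine-tuned checkpoints should look roughly like the following. This is a minimal sketch, not the contents of `sample_mlm.py`: the checkpoint path `models/tapt512_60k/pytorch_model.bin` is an assumed file layout, and the tokenizer (taken from the written BERT) requires the `fugashi` and `ipadic` packages.

```python
# Minimal sketch (assumed file layout; see sample_mlm.py for the reference script):
# combine the written-BERT config/vocab with the fine-tuned spoken-Japanese
# parameters and run a fill-mask prediction.
import torch
from transformers import BertConfig, BertForMaskedLM, BertJapaneseTokenizer

written_bert = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(written_bert)  # needs fugashi + ipadic
config = BertConfig.from_pretrained(written_bert)

model = BertForMaskedLM(config)
# Assumed location and file name of the downloaded TAPT512 60K parameters.
state_dict = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()

text = "今日は[MASK]へ行きます。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 predictions for the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```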