Katsumata420 committed
Commit: c805715
Parent: 6c96c5e

Update README for model card

Files changed (1)
  1. README.md +184 -1
README.md CHANGED
@@ -1,3 +1,186 @@
 ---
-license: apache-2.0
 ---

# Model Card for japanese-spoken-language-bert

<!-- Provide a quick summary of what the model is/does. [Optional] -->
These BERT models are pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.
We used CSJ and the records of the National Diet of Japan.
CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).
We provide only the model parameters; you will need to download the accompanying config and vocabulary files to use these models (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).

We provide the following three models:
- **1-6 layer-wise** (folder name: models/1-6_layer-wise)
  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- **TAPT512 60k** (folder name: models/tapt512_60k)
  Fine-tuned on CSJ.

- **DAPT128-TAPT512** (folder name: models/dapt128-tap512)
  Fine-tuned on the National Diet records and CSJ.

# Table of Contents

- [Model Card for japanese-spoken-language-bert](#model-card-for-japanese-spoken-language-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
    - [Testing Data](#testing-data)
    - [Factors](#factors)
    - [Metrics](#metrics)
  - [Results](#results)
- [Citation](#citation)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
These BERT models are pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.
We used CSJ and the records of the National Diet of Japan.
CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).
We provide only the model parameters; you will need to download the accompanying config and vocabulary files to use these models.

We provide the following three models:
- 1-6 layer-wise (folder name: models/1-6_layer-wise)
  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- TAPT512 60k (folder name: models/tapt512_60k)
  Fine-tuned on CSJ.

- DAPT128-TAPT512 (folder name: models/dapt128-tap512)
  Fine-tuned on the National Diet records and CSJ.

- **Model type:** Language model
- **Language(s) (NLP):** ja
- **License:** Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the "License")

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- 1-6 layer-wise: CSJ
- TAPT512 60K: CSJ
- DAPT128-TAPT512: the National Diet records and CSJ

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

We continue pre-training the published Japanese BERT model ([cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking), referred to below as "written BERT") on the spoken-language corpora listed above, as sketched below.

For details, see the [blog post](https://tech.retrieva.jp/entry/2021/04/01/114943) or the [paper](https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P4-17.pdf) (both in Japanese).
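
The original training scripts are not distributed with this card, so the following is only a minimal sketch of continued masked-language-model training with Hugging Face Transformers. The corpus path `spoken_corpus.txt` and all hyperparameter values other than the sequence length and step count suggested by the model names are illustrative assumptions.

```python
# Minimal sketch (not the original training script) of continued MLM training
# starting from the written-Japanese BERT checkpoint.
from transformers import (
    BertForMaskedLM,
    BertJapaneseTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)  # start from the written BERT weights

# For the "1-6 layer-wise" variant, only the lower encoder layers are updated.
# A rough approximation is to freeze every other parameter, e.g.:
# for name, p in model.named_parameters():
#     p.requires_grad = name.startswith("bert.encoder.layer.") and int(name.split(".")[3]) < 6

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="spoken_corpus.txt",  # hypothetical path: spoken-language transcripts, one utterance per line
    block_size=512,                 # sequence length suggested by the "512" in the model names
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="models/tapt512_60k",   # folder name follows the list above
    max_steps=60_000,                  # "60k" steps, as suggested by the model name
    per_device_train_batch_size=8,     # illustrative value only
    save_steps=10_000,
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```

Note that BertJapaneseTokenizer also requires a MeCab wrapper such as fugashi together with the ipadic dictionary package for word segmentation.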

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

We use CSJ for the evaluation.

### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

We evaluate the following tasks on CSJ:
- Dependency parsing
- Sentence boundary detection
- Important sentence extraction

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Dependency parsing: Undirected Unlabeled Attachment Score (UUAS); a short illustration follows this list
- Sentence boundary detection: F1 score
- Important sentence extraction: F1 score
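
For reference, UUAS counts a gold dependency edge as correct when the predicted tree contains the same head-dependent pair in either direction. The snippet below only illustrates that definition with made-up edge lists; it is not the evaluation code used for the results reported here.

```python
# Illustration of UUAS: the fraction of gold dependency edges also present in
# the prediction when edge direction is ignored. Edge lists are hypothetical.
def uuas(gold_edges, pred_edges):
    """Each edge is a (head, dependent) pair of token indices within one sentence."""
    gold = {frozenset(e) for e in gold_edges}  # drop direction
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold) if gold else 0.0

gold = [(2, 1), (0, 2), (2, 3), (3, 4)]
pred = [(1, 2), (0, 2), (2, 3), (4, 0)]
print(uuas(gold, pred))  # 3 of 4 undirected gold edges recovered -> 0.75
```

Sentence boundary detection and important sentence extraction are scored with the standard F1 score, the harmonic mean of precision and recall.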

## Results

| Model | Dependency parsing (UUAS) | Sentence boundary detection (F1) | Important sentence extraction (F1) |
| :--- | ---: | ---: | ---: |
| written BERT | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise | 44.6 | 64.8 | 35.4 |
| TAPT512 60K | - | - | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |

# Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bib
@inproceedings{csjbert2021,
  title = {CSJを用いた日本語話し言葉BERTの作成},
  author = {勝又智 and 坂田大直},
  booktitle = {言語処理学会第27回年次大会},
  year = {2021},
}
```

The paper is written in Japanese; the title translates to "Building a Japanese Spoken-Language BERT Using CSJ", presented at the 27th Annual Meeting of the Association for Natural Language Processing (2021).

# More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (in Japanese)

# Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Satoru Katsumata

# Model Card Contact

More information needed

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

1. Run download_wikipedia_bert.py to download the BERT model that was trained on Wikipedia.

   ```bash
   python download_wikipedia_bert.py
   ```

   This script downloads the config files and the vocab file provided by the Inui Laboratory of Tohoku University from the Hugging Face Model Hub.
   https://github.com/cl-tohoku/bert-japanese

2. Run sample_mlm.py to confirm that you can use our models. A manual loading sketch is also given after this list.

   ```bash
   python sample_mlm.py
   ```
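
If you prefer to load the checkpoints directly instead of going through sample_mlm.py, the sketch below shows one way to pair the distributed parameters with the config and vocabulary downloaded in step 1. The file paths (including models/tapt512_60k/pytorch_model.bin) are assumptions based on the folder names above rather than guaranteed file names, and the fill-mask check at the end is just an example prompt.

```python
# Sketch only: combine this repository's parameters with the written-BERT
# config/vocab from step 1. Paths and file names are assumptions.
import torch
from transformers import BertConfig, BertForMaskedLM, BertJapaneseTokenizer, pipeline

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)   # vocab from step 1
config = BertConfig.from_pretrained(base)                 # config from step 1

# Spoken-language parameters provided here (assumed file name inside the folder).
state_dict = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")
model = BertForMaskedLM.from_pretrained(None, config=config, state_dict=state_dict)

# Rough equivalent of what sample_mlm.py is meant to confirm: the model fills a mask.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("今日は[MASK]です。"))  # example prompt; any Japanese sentence containing [MASK] works
```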

</details>