Katsumata420 committed
Commit: c805715
Parent: 6c96c5e

Update README for model card

Files changed (1)
  1. README.md +184 -1
README.md CHANGED
@@ -1,3 +1,186 @@
 ---
-license: apache-2.0
 ---

# Model Card for japanese-spoken-language-bert

<!-- Provide a quick summary of what the model is/does. [Optional] -->
These BERT models are pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.
We used CSJ and the records of the National Diet of Japan.
CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).
We provide only the model parameters; you will need to download the accompanying config and vocabulary files to use these models (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).

We provide the following three models:
- **1-6 layer-wise** (folder name: models/1-6_layer-wise)
  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- **TAPT512 60k** (folder name: models/tapt512_60k)
  Fine-tuned on CSJ.

- **DAPT128-TAPT512** (folder name: models/dapt128-tap512)
  Fine-tuned on the National Diet records and CSJ.

# Table of Contents

- [Model Card for japanese-spoken-language-bert](#model-card-for-japanese-spoken-language-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
    - [Testing Data](#testing-data)
    - [Factors](#factors)
    - [Metrics](#metrics)
  - [Results](#results)
- [Citation](#citation)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
These BERT models are pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.
We used CSJ and the records of the National Diet of Japan.
CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).
We provide only the model parameters; you will need to download the accompanying config and vocabulary files to use these models.

We provide the following three models:
- 1-6 layer-wise (folder name: models/1-6_layer-wise)
  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- TAPT512 60k (folder name: models/tapt512_60k)
  Fine-tuned on CSJ.

- DAPT128-TAPT512 (folder name: models/dapt128-tap512)
  Fine-tuned on the National Diet records and CSJ.

- **Model type:** Language model
- **Language(s) (NLP):** ja
- **License:** Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the "License")

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- 1-6 layer-wise: CSJ
- TAPT512 60K: CSJ
- DAPT128-TAPT512: the National Diet records and CSJ

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

We continue pre-training the published Japanese BERT model ([cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking), referred to below as "written BERT") on the spoken-language corpora listed above, as sketched below.

For details, see the [blog post](https://tech.retrieva.jp/entry/2021/04/01/114943) or the [paper](https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P4-17.pdf) (both in Japanese).
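
The original training scripts are not distributed with this card, so the following is only a minimal sketch of continued masked-language-model training with Hugging Face Transformers. The corpus path `spoken_corpus.txt` and all hyperparameter values other than the sequence length and step count suggested by the model names are illustrative assumptions.

```python
# Minimal sketch (not the original training script) of continued MLM training
# starting from the written-Japanese BERT checkpoint.
from transformers import (
    BertForMaskedLM,
    BertJapaneseTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)  # start from the written BERT weights

# For the "1-6 layer-wise" variant, only the lower encoder layers are updated.
# A rough approximation is to freeze every other parameter, e.g.:
# for name, p in model.named_parameters():
#     p.requires_grad = name.startswith("bert.encoder.layer.") and int(name.split(".")[3]) < 6

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="spoken_corpus.txt",  # hypothetical path: spoken-language transcripts, one utterance per line
    block_size=512,                 # sequence length suggested by the "512" in the model names
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="models/tapt512_60k",   # folder name follows the list above
    max_steps=60_000,                  # "60k" steps, as suggested by the model name
    per_device_train_batch_size=8,     # illustrative value only
    save_steps=10_000,
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```

Note that BertJapaneseTokenizer also requires a MeCab wrapper such as fugashi together with the ipadic dictionary package for word segmentation.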

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

We use CSJ for the evaluation.

### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

We evaluate the following tasks on CSJ:
- Dependency parsing
- Sentence boundary detection
- Important sentence extraction

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Dependency parsing: Undirected Unlabeled Attachment Score (UUAS); a short illustration follows this list
- Sentence boundary detection: F1 score
- Important sentence extraction: F1 score
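
For reference, UUAS counts a gold dependency edge as correct when the predicted tree contains the same head-dependent pair in either direction. The snippet below only illustrates that definition with made-up edge lists; it is not the evaluation code used for the results reported here.

```python
# Illustration of UUAS: the fraction of gold dependency edges also present in
# the prediction when edge direction is ignored. Edge lists are hypothetical.
def uuas(gold_edges, pred_edges):
    """Each edge is a (head, dependent) pair of token indices within one sentence."""
    gold = {frozenset(e) for e in gold_edges}  # drop direction
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold) if gold else 0.0

gold = [(2, 1), (0, 2), (2, 3), (3, 4)]
pred = [(1, 2), (0, 2), (2, 3), (4, 0)]
print(uuas(gold, pred))  # 3 of 4 undirected gold edges recovered -> 0.75
```

Sentence boundary detection and important sentence extraction are scored with the standard F1 score, the harmonic mean of precision and recall.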

## Results

| Model | Dependency parsing (UUAS) | Sentence boundary detection (F1) | Important sentence extraction (F1) |
| :--- | ---: | ---: | ---: |
| written BERT | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise | 44.6 | 64.8 | 35.4 |
| TAPT512 60K | - | - | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |

# Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bib
@inproceedings{csjbert2021,
  title = {CSJを用いた日本語話し言葉BERTの作成},
  author = {勝又智 and 坂田大直},
  booktitle = {言語処理学会第27回年次大会},
  year = {2021},
}
```

The paper is written in Japanese; the title translates to "Building a Japanese Spoken-Language BERT Using CSJ", presented at the 27th Annual Meeting of the Association for Natural Language Processing (2021).

# More Information

https://tech.retrieva.jp/entry/2021/04/01/114943 (in Japanese)

# Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Satoru Katsumata

# Model Card Contact

More information needed

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

1. Run download_wikipedia_bert.py to download the BERT model that was trained on Wikipedia.

   ```bash
   python download_wikipedia_bert.py
   ```

   This script downloads the config files and the vocab file provided by the Inui Laboratory of Tohoku University from the Hugging Face Model Hub.
   https://github.com/cl-tohoku/bert-japanese

2. Run sample_mlm.py to confirm that you can use our models. A manual loading sketch is also given after this list.

   ```bash
   python sample_mlm.py
   ```
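
If you prefer to load the checkpoints directly instead of going through sample_mlm.py, the sketch below shows one way to pair the distributed parameters with the config and vocabulary downloaded in step 1. The file paths (including models/tapt512_60k/pytorch_model.bin) are assumptions based on the folder names above rather than guaranteed file names, and the fill-mask check at the end is just an example prompt.

```python
# Sketch only: combine this repository's parameters with the written-BERT
# config/vocab from step 1. Paths and file names are assumptions.
import torch
from transformers import BertConfig, BertForMaskedLM, BertJapaneseTokenizer, pipeline

base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)   # vocab from step 1
config = BertConfig.from_pretrained(base)                 # config from step 1

# Spoken-language parameters provided here (assumed file name inside the folder).
state_dict = torch.load("models/tapt512_60k/pytorch_model.bin", map_location="cpu")
model = BertForMaskedLM.from_pretrained(None, config=config, state_dict=state_dict)

# Rough equivalent of what sample_mlm.py is meant to confirm: the model fills a mask.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("今日は[MASK]です。"))  # example prompt; any Japanese sentence containing [MASK] works
```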

</details>