izumilab committed on
Commit
71ffc1d
1 Parent(s): 3875b31

Create README.md

Files changed (1)
  1. README.md +82 -0
README.md ADDED
@@ -0,0 +1,82 @@
1
+ ---
2
+
3
+ language: ja
4
+
5
+ license: cc-by-sa-4.0
6
+
7
+ tags:
8
+
9
+ - finance
10
+
11
+ datasets:
12
+
13
+ - wikipedia
14
+ - securities reports
15
+ - summaries of financial results
16
+
17
+ widget:
18
+
19
+ - text: 流動[MASK]は1億円となりました。
20
+
21
+ ---
22
+
23
+ # ELECTRA small Japanese finance generator
24
+
25
+ This is an [ELECTRA](https://github.com/google-research/electra) model pretrained on Japanese-language text.
26
+
27
+ The pretraining code is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/tree/v1.0).
28
+
29
+ ## Model architecture
30
+
31
+ The model architecture is the same as ELECTRA small in the [original ELECTRA implementation](https://github.com/google-research/electra): 12 layers, a hidden size of 256, and 4 attention heads.
32
+
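The sizes above imply a per-head width, which the following minimal sketch computes. The `SmallConfig` class and its field names are hypothetical and chosen for illustration only; they are not taken from the ELECTRA code base or from the pretraining repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmallConfig:
    """Hypothetical container for the sizes quoted in this README."""

    num_layers: int = 12
    hidden_size: int = 256
    num_attention_heads: int = 4

    @property
    def head_size(self) -> int:
        # Multi-head attention splits the hidden state evenly across heads,
        # so the hidden size must be divisible by the number of heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


config = SmallConfig()
print(config.head_size)  # 64
```

With 256 hidden dimensions and 4 heads, each attention head works on a 64-dimensional slice.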
33
+ ## Training Data
34
+
35
+ The models are trained on the Japanese version of Wikipedia and a financial corpus.
36
+
37
+ The Wikipedia corpus is generated from the Wikipedia dump file as of June 1, 2021.
38
+
39
+ The Wikipedia corpus file is 2.9GB, consisting of approximately 20M sentences.
40
+
41
+ The financial corpus consists of two corpora:
42
+
43
+ - Summaries of financial results from October 9, 2012, to December 31, 2020
44
+
45
+ - Securities reports from February 8, 2018, to December 31, 2020
46
+
47
+ The financial corpus file is 5.2GB, consisting of approximately 27M sentences.
48
+
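As a rough sanity check on the corpus figures above, the sketch below derives the combined size and the average sentence length in bytes. It assumes that "GB" means 10**9 bytes and treats the approximate sentence counts as exact; the names `CORPORA` and `bytes_per_sentence` are illustrative only.

```python
# Figures quoted in the README; "GB" is assumed to mean 10**9 bytes.
CORPORA = {
    "wikipedia": {"size_gb": 2.9, "sentences_m": 20},
    "financial": {"size_gb": 5.2, "sentences_m": 27},
}


def bytes_per_sentence(size_gb: float, sentences_m: float) -> float:
    """Average sentence length in bytes, given size in GB and count in millions."""
    return (size_gb * 1e9) / (sentences_m * 1e6)


total_gb = sum(c["size_gb"] for c in CORPORA.values())
total_sentences_m = sum(c["sentences_m"] for c in CORPORA.values())

for name, c in CORPORA.items():
    # Rounded because the inputs are themselves approximate.
    print(name, round(bytes_per_sentence(c["size_gb"], c["sentences_m"])))
print("total", round(total_gb, 1), "GB,", total_sentences_m, "M sentences")
```

Under these assumptions the combined corpus is about 8.1GB and 47M sentences, with the financial sentences somewhat longer on average (about 193 bytes versus 145 bytes for Wikipedia). If the files are UTF-8 encoded, where kana and kanji usually take three bytes each, that corresponds to roughly 48 to 64 characters per sentence.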
49
+ ## Tokenization
50
+
51
+ The texts are first tokenized by MeCab with IPA dictionary and then split into subwords by the WordPiece algorithm.
52
+
53
+ The vocabulary size is 32768.
54
+
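The second stage of the tokenization described above can be sketched as a greedy longest-match-first split. This is a minimal illustration, not the tokenizer shipped with the model: the toy vocabulary, the `##` continuation prefix (the convention used by BERT-style WordPiece vocabularies), the `[UNK]` fallback, and the word segmentation standing in for MeCab's output are all assumptions.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subwords."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration; the real one has 32768 entries.
TOY_VOCAB = {"流動", "##資産", "##負債", "資産", "は", "1", "億", "円"}

# Words as a morphological analyzer might emit them (assumed segmentation).
words = ["流動資産", "は", "1", "億", "円"]
print([wordpiece(w, TOY_VOCAB) for w in words])
```

Because the morphological analyzer runs first, subword splitting never crosses word boundaries; WordPiece only decides how to break up each word that is not itself in the vocabulary.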
55
+ ## Training
56
+
57
+ Except for size, the models are trained with the same configuration as ELECTRA small in the [original ELECTRA paper](https://arxiv.org/abs/2003.10555): 128 tokens per instance, 128 instances per batch, and 1M training steps.
58
+
59
+ The generator is the same size as the discriminator.
60
+
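The training budget implied by these settings can be worked out directly. The sketch below is plain arithmetic on the three numbers quoted above; the token total is an upper bound, since it assumes every instance is filled to the full 128 tokens with no padding.

```python
TOKENS_PER_INSTANCE = 128
INSTANCES_PER_BATCH = 128
TRAINING_STEPS = 1_000_000

# Tokens processed in a single optimizer step.
tokens_per_batch = TOKENS_PER_INSTANCE * INSTANCES_PER_BATCH
# Instances seen over the whole run (repeats across epochs count again).
total_instances = INSTANCES_PER_BATCH * TRAINING_STEPS
# Upper bound on tokens seen, assuming no padding.
max_total_tokens = tokens_per_batch * TRAINING_STEPS

print(tokens_per_batch)   # 16384
print(total_instances)    # 128000000
print(max_total_tokens)   # 16384000000
```

That is at most about 16.4 billion token positions over the full run.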
61
+ ## Citation
62
+
63
+ **Another paper on this pretrained model is forthcoming. Please check here again before citing.**
64
+
65
+ ```
66
+ @inproceedings{bert_electra_japanese,
67
+ title = {Construction and Validation of a Pre-Trained Language Model
68
+ Using Financial Documents},
69
+ author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
70
+ month = {oct},
71
+ year = {2021},
72
+ booktitle = {Proceedings of JSAI Special Interest Group on Financial Informatics (SIG-FIN) 27}
73
+ }
74
+ ```
75
+
76
+ ## Licenses
77
+
78
+ The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
79
+
80
+ ## Acknowledgments
81
+
82
+ This work was supported by JSPS KAKENHI Grant Number JP21K12010.