---
language: ja
license: cc-by-sa-4.0
datasets:
- YACIS corpus (pretraining)
- Harmful BBS Japanese comments dataset
- Twitter Japanese cyberbullying dataset
---

# yacis-electra-small-cyberbullying

This is an [ELECTRA](https://github.com/google-research/electra) Small model for the Japanese language, originally pretrained on 354 million sentences / 5.6 billion words of the [YACIS](https://github.com/ptaszynski/yacis-corpus) blog corpus, and finetuned on a balanced set built from two datasets, namely the "Harmful BBS Japanese comments dataset" and the "Twitter Japanese cyberbullying dataset".
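A minimal sketch of how the finetuned model could be loaded for inference with the `transformers` library. The model id below is an assumption derived from this repository's name, and the `text-classification` pipeline is the generic way to serve a sequence-classification checkpoint; adjust both to match how the model is actually hosted.

```python
def load_classifier(model_name="ptaszynski/yacis-electra-small-cyberbullying"):
    """Load the finetuned cyberbullying model as a text-classification pipeline.

    NOTE: the default model id is an assumption based on this repository's
    name; change it if the checkpoint is hosted under a different id.
    """
    # Imported inside the function so the sketch can be inspected offline.
    from transformers import pipeline
    return pipeline("text-classification", model=model_name)

# Usage (requires network access to download the checkpoint):
# clf = load_classifier()
# print(clf("ここに分類したい日本語の文章を入れてください。"))
```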

## Model architecture

The original (foundation) model was pretrained using ELECTRA Small model settings, with 12 layers, 128 dimensions of hidden states, and 12 attention heads. The original vocabulary size was set to 32,000 tokens.

The original foundation model is available here:
https://huggingface.co/ptaszynski/yacis-electra-small-japanese

## Training data and libraries

YACIS-ELECTRA is trained on the whole [YACIS](https://github.com/ptaszynski/yacis-corpus) blog corpus, a Japanese blog corpus containing 5.6 billion words in 354 million sentences.

The corpus was originally split into sentences using custom rules, and each sentence was tokenized using [MeCab](https://taku910.github.io/mecab/). Subword tokenization for pretraining was done with WordPiece.

We used the original [ELECTRA](https://github.com/google-research/electra) repository for pretraining. The pretraining process took 7 days and 6 hours in the following environment: CPU: Intel Core i9-7920X, RAM: 132 GB, GPU: GeForce GTX 1080 Ti x1.

## Licenses

The pretrained model with all attached files is distributed under the terms of the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

## Citations

Please cite the model using the following reference.

```
@inproceedings{shibata2022yacis-electra,
  title={日本語大規模ブログコーパスYACISに基づいたELECTRA事前学習済み言語モデルの作成及び性能評価},
  % title={Development and performance evaluation of ELECTRA pretrained language model based on YACIS large-scale Japanese blog corpus [in Japanese]}, %% for English citations
  author={柴田 祥伍 and プタシンスキ ミハウ and エロネン ユーソ and ノヴァコフスキ カロル and 桝井 文人},
  % author={Shibata, Shogo and Ptaszynski, Michal and Eronen, Juuso and Nowakowski, Karol and Masui, Fumito}, %% for English citations
  booktitle={言語処理学会第28回年次大会(NLP2022)},
  % booktitle={Proceedings of The 28th Annual Meeting of The Association for Natural Language Processing (NLP2022)}, %% for English citations
  pages={1--4},
  year={2022}
}
```

The model was built using sentences from the YACIS corpus, which should be cited using at least one of the following references.

```
@inproceedings{ptaszynski2012yacis,
  title={YACIS: A five-billion-word corpus of Japanese blogs fully annotated with syntactic and affective information},
  author={Ptaszynski, Michal and Dybala, Pawel and Rzepka, Rafal and Araki, Kenji and Momouchi, Yoshio},
  booktitle={Proceedings of the AISB/IACAP world congress},
  pages={40--49},
  year={2012},
  howpublished = "\url{https://github.com/ptaszynski/yacis-corpus}"
}
```

```
@article{ptaszynski2014automatically,
  title={Automatically annotating a five-billion-word corpus of Japanese blogs for sentiment and affect analysis},
  author={Ptaszynski, Michal and Rzepka, Rafal and Araki, Kenji and Momouchi, Yoshio},
  journal={Computer Speech \& Language},
  volume={28},
  number={1},
  pages={38--55},
  year={2014},
  publisher={Elsevier},
  howpublished = "\url{https://github.com/ptaszynski/yacis-corpus}"
}
```