nobu-g committed on
Commit
9553bb4
1 Parent(s): 8d528f5

first commit

README.md ADDED
@@ -0,0 +1,109 @@
+ ---
+ license: cc-by-sa-4.0
+ language:
+ - en
+ - ja
+ programming_language:
+ - C
+ - C++
+ - C#
+ - Go
+ - Java
+ - JavaScript
+ - Lua
+ - PHP
+ - Python
+ - Ruby
+ - Rust
+ - Scala
+ - TypeScript
+ library_name: transformers
+ tags:
+ - deberta
+ - deberta-v3
+ - fill-mask
+ datasets:
+ - wikipedia
+ - EleutherAI/pile
+ - bigcode/the-stack
+ - mc4
+ metrics:
+ - accuracy
+ mask_token: "[MASK]"
+ widget:
+ - text: "京都大学で自然言語処理を[MASK]する。"
+ ---
+
+ # Model Card for Japanese DeBERTa V3 base
+
+ ## Model description
+
+ This is a Japanese DeBERTa V3 base model pre-trained on the LLM-jp corpus v1.0.1.
+
+ ## How to use
+
+ You can use this model for masked language modeling as follows:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v3-base-japanese')
+ model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v3-base-japanese')
+
+ sentences = [
+     "京都大学で自然言語処理を[MASK]する。",
+     "I [MASK] NLP at Kyoto University.",
+     'int main() { printf("Hello, [MASK]!"); return 0; }',
+ ]
+ # Pad so that sentences of different lengths can be batched together.
+ encodings = tokenizer(sentences, return_tensors='pt', padding=True)
+ ...
+ ```
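+
+ As a rough sketch of how the example above might be continued (assuming the `tokenizer`, `model`, `sentences`, and `encodings` defined there, with the inputs padded into a single batch), the predictions at each `[MASK]` position could be decoded as follows:
+
+ ```python
+ import torch
+
+ with torch.no_grad():
+     outputs = model(**encodings)
+
+ mask_token_id = tokenizer.mask_token_id
+ for i, sentence in enumerate(sentences):
+     # Positions of [MASK] in the i-th (padded) input sequence.
+     mask_positions = (encodings.input_ids[i] == mask_token_id).nonzero(as_tuple=True)[0]
+     for pos in mask_positions:
+         # Top-5 candidate tokens for this masked position.
+         top_ids = outputs.logits[i, pos].topk(5).indices.tolist()
+         print(sentence, '->', tokenizer.convert_ids_to_tokens(top_ids))
+ ```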
+
+ You can also fine-tune this model on downstream tasks.
+
+ ## Tokenization
+
+ The tokenizer of this model is based on the Unigram byte-fallback model of [huggingface/tokenizers](https://github.com/huggingface/tokenizers).
+ The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
+ Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp/llm-jp-tokenizer` for details on the vocabulary construction procedure.
+
+ Note that unlike [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese), pre-segmentation by a morphological analyzer (e.g., Juman++) is no longer required for this model.
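+
+ For illustration, the tokenizer can be applied to raw, unsegmented Japanese text directly. A minimal sketch (assuming the same repository id as in the usage example above; the exact subword split depends on the vocabulary):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v3-base-japanese')
+
+ # Raw Japanese text; no Juman++ or other morphological analyzer is applied beforehand.
+ text = "京都大学で自然言語処理を研究する。"
+ print(tokenizer.tokenize(text))
+ ```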
+
+ ## Training data
+
+ We used the [LLM-jp corpus](https://github.com/llm-jp/llm-jp-corpus) v1.0.1 for pre-training.
+ The corpus consists of the following sub-corpora:
+
+ - Japanese
+   - Wikipedia (1B tokens)
+   - mC4 (129B tokens)
+ - English
+   - Wikipedia (4B tokens)
+   - The Pile (126B tokens)
+ - Code
+   - The Stack (10B tokens)
+
+ We shuffled the corpora, which contain 270B tokens in total, and trained the model for 2 epochs.
+ Thus, the total number of tokens fed to the model was 540B.
+
+ ## Training procedure
+
+ We slightly modified [the official implementation of DeBERTa V3](https://github.com/microsoft/DeBERTa) and followed the official training procedure.
+ The modified code is available at [nobu-g/DeBERTa](https://github.com/nobu-g/DeBERTa).
+
+ The following hyperparameters were used during pre-training:
+
+ - learning_rate: 1e-4
+ - per_device_train_batch_size: 800
+ - num_devices: 8
+ - gradient_accumulation_steps: 3
+ - total_train_batch_size: 2400
+ - max_seq_length: 512
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
+ - lr_scheduler_type: linear schedule with warmup
+ - training_steps: 475,000
+ - warmup_steps: 10,000
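+
+ The pre-training itself was run with the modified DeBERTa code linked above, not with the snippet below; purely as an illustrative sketch of the optimizer and learning-rate schedule settings listed here, an equivalent setup in PyTorch/`transformers` might look like this (the placeholder `model` is an assumption of the sketch):
+
+ ```python
+ import torch
+ from transformers import get_linear_schedule_with_warmup
+
+ # Placeholder module; in the actual setup this would be the DeBERTa model being pre-trained.
+ model = torch.nn.Linear(768, 768)
+
+ # Adam with betas=(0.9, 0.999), epsilon=1e-06, and learning_rate=1e-4 as listed above.
+ optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6)
+
+ # Linear schedule with warmup: 10,000 warmup steps out of 475,000 training steps.
+ scheduler = get_linear_schedule_with_warmup(
+     optimizer,
+     num_warmup_steps=10_000,
+     num_training_steps=475_000,
+ )
+ ```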
+
+ ## Acknowledgments
+
+ This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".
+ For training the model, we used mdx, a platform for the data-driven future.
added_tokens.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "[CLS]": 96871,
+   "[MASK]": 96867,
+   "[PAD]": 96869,
+   "[SEP]": 96868,
+   "[UNK]": 96870
+ }
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "norm_rel_ebd": "layer_norm",
+   "model_type": "deberta-v2",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "transformers_version": "4.37.2",
+   "type_vocab_size": 0,
+   "vocab_size": 96900
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfe7489b46b879cee9f6d35ec6d6a13f15e49f7e1cb41ee5cfa8e45501259e44
+ size 471366482
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fefde905766244f5e613a490d6e35236043d6483c4aae0eaac4b4a8fc365a88
+ size 1658609
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "96867": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "96868": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "96869": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "96870": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "96871": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "keep_accents": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "sp_model_kwargs": {},
+   "split_by_punct": false,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "unk_token": "[UNK]"
+ }