Joeran Bosma committed on
Commit
c74da97
1 Parent(s): d8d9ced

Initial release

README.md CHANGED
@@ -1,3 +1,104 @@
- ---
- license: cc-by-nc-sa-4.0
- ---
+ ---
+ license: cc-by-nc-sa-4.0
+ ---
+
+ # DRAGON Longformer large mixed-domain
+
+ Pretrained model on Dutch clinical reports using a masked language modeling (MLM) objective. It was introduced in [this](#pending) paper. The model was first pretrained using general-domain data, as specified [here](https://huggingface.co/allenai/longformer-large-4096). The pretrained model was taken from HuggingFace: [`allenai/longformer-large-4096`](https://huggingface.co/allenai/longformer-large-4096). Subsequently, the model was pretrained using domain-specific data (i.e., clinical reports). The tokenizer of [`allenai/longformer-large-4096`](https://huggingface.co/allenai/longformer-large-4096) was used.
+
+ ## Model description
+ Longformer is a transformers model that was pretrained on a large corpus of Dutch clinical reports in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way, using an automatic process to generate inputs and labels from those texts.
+
+ This way, the model learns an inner representation of the Dutch medical language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled reports, for instance, you can train a standard classifier using the features produced by the Longformer model as inputs.
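+
+ As a minimal sketch of that feature-based approach (not part of this release), the final hidden state of the `<s>` token can serve as a report-level feature vector for a scikit-learn classifier; the example reports, labels, and choice of logistic regression below are placeholders.
+
+ ```python
+ import torch
+ from sklearn.linear_model import LogisticRegression
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+ model = AutoModel.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+
+ def embed(reports):
+     """Return one feature vector per report: the final hidden state of the <s> token."""
+     features = []
+     for text in reports:
+         encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
+         with torch.no_grad():
+             output = model(**encoded)
+         features.append(output.last_hidden_state[0, 0].numpy())
+     return features
+
+ # placeholder labeled reports
+ reports, labels = ["first report ...", "second report ..."], [0, 1]
+ classifier = LogisticRegression(max_iter=1000).fit(embed(reports), labels)
+ ```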
+
+ ## Model variations
+ Multiple architectures were pretrained for the DRAGON challenge.
+
+ | Model | #params | Language |
+ |------------------------|--------------------------------|-------|
+ | [`joeranbosma/dragon-bert-base-mixed-domain`](https://huggingface.co/joeranbosma/dragon-bert-base-mixed-domain) | 109M | Dutch → Dutch |
+ | [`joeranbosma/dragon-roberta-base-mixed-domain`](https://huggingface.co/joeranbosma/dragon-roberta-base-mixed-domain) | 278M | Multiple → Dutch |
+ | [`joeranbosma/dragon-roberta-large-mixed-domain`](https://huggingface.co/joeranbosma/dragon-roberta-large-mixed-domain) | 560M | Multiple → Dutch |
+ | [`joeranbosma/dragon-longformer-base-mixed-domain`](https://huggingface.co/joeranbosma/dragon-longformer-base-mixed-domain) | 149M | English → Dutch |
+ | [`joeranbosma/dragon-longformer-large-mixed-domain`](https://huggingface.co/joeranbosma/dragon-longformer-large-mixed-domain) | 435M | English → Dutch |
+ | [`joeranbosma/dragon-bert-base-domain-specific`](https://huggingface.co/joeranbosma/dragon-bert-base-domain-specific) | 109M | Dutch |
+ | [`joeranbosma/dragon-roberta-base-domain-specific`](https://huggingface.co/joeranbosma/dragon-roberta-base-domain-specific) | 278M | Dutch |
+ | [`joeranbosma/dragon-roberta-large-domain-specific`](https://huggingface.co/joeranbosma/dragon-roberta-large-domain-specific) | 560M | Dutch |
+ | [`joeranbosma/dragon-longformer-base-domain-specific`](https://huggingface.co/joeranbosma/dragon-longformer-base-domain-specific) | 149M | Dutch |
+ | [`joeranbosma/dragon-longformer-large-domain-specific`](https://huggingface.co/joeranbosma/dragon-longformer-large-domain-specific) | 435M | Dutch |
+
+
+ ## Intended uses & limitations
+ You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
+
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole text (e.g., a clinical report) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at models like GPT-2.
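+
+ For illustration, here is a sketch of how such fine-tuning could start, assuming a report-level classification task; the number of labels is a placeholder and the classification head still needs to be trained on your own labeled data.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "joeranbosma/dragon-longformer-large-mixed-domain",
+     num_labels=2,  # placeholder: set to the number of classes in your task
+ )
+ # The classification head is newly initialized and must be trained,
+ # e.g. with the Hugging Face Trainer on your labeled reports.
+ ```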
+
+ ## How to use
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ from transformers import pipeline
+ unmasker = pipeline("fill-mask", model="joeranbosma/dragon-longformer-large-mixed-domain")
+ # Dutch: "This examination shows no evidence of significant carcinoma. PIRADS <mask>."
+ unmasker("Dit onderzoek geen aanwijzingen voor significant carcinoom. PIRADS <mask>.")
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+ model = AutoModel.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+ text = "Replace me by any text you'd like."
+ encoded_input = tokenizer(text, return_tensors="pt")
+ output = model(**encoded_input)
+ ```
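+
+ Longformer combines local windowed attention with global attention on selected tokens, and the base model applies global attention only where you request it. A common pattern, shown here as an assumption rather than a requirement of this model, is to give the `<s>` token global attention when a single sequence-level representation is needed:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+ model = AutoModel.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+
+ encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
+ # 1 = global attention, 0 = local attention; here only the <s> token attends globally
+ global_attention_mask = torch.zeros_like(encoded_input["input_ids"])
+ global_attention_mask[:, 0] = 1
+ output = model(**encoded_input, global_attention_mask=global_attention_mask)
+ ```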
+
+ ## Limitations and bias
+ Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
+
+ ## Training data
+ For pretraining, 4,333,201 clinical reports (466,351 consecutive patients) were selected from Ziekenhuisgroep Twente from patients with a diagnostic or interventional visit between 13 July 2000 and 25 April 2023. 180,439 duplicate clinical reports (179,808 patients) were excluded, resulting in 4,152,762 included reports (463,692 patients). These reports were split into training (80%, 3,322,209 reports), validation (10%, 415,276 reports), and testing (10%, 415,277 reports). The testing reports were set aside for future analysis and are not used for pretraining.
+
+ ## Training procedure
+
+ ### Pretraining
+ The model was pretrained using masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
+
+ The details of the masking procedure for each sentence are the following:
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `<mask>`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the remaining 10% of the cases, the masked tokens are left as is.
+
+ The HuggingFace implementation was used for pretraining: [`run_mlm.py`](https://github.com/huggingface/transformers/blob/7c6ec195adbfcd22cb6baeee64dd3c24a4b80c74/examples/pytorch/language-modeling/run_mlm.py).
+
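+ That script implements the masking described above with `DataCollatorForLanguageModeling`; the following is a minimal sketch of the same behaviour, for illustration only (not the exact pretraining code):
+
+ ```python
+ from transformers import AutoTokenizer, DataCollatorForLanguageModeling
+
+ tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-longformer-large-mixed-domain")
+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
+
+ batch = collator([tokenizer("Replace me by any text you'd like.")])
+ # batch["input_ids"] now holds <mask> (80%), random (10%) or unchanged (10%) tokens at the
+ # selected positions; batch["labels"] holds the original ids there and -100 everywhere else.
+ ```
+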
+ ### Pretraining hyperparameters
+
+ The following hyperparameters were used during pretraining (see the sketch after this list for how they map onto `TrainingArguments`):
+ - `learning_rate`: 5e-05
+ - `train_batch_size`: 4
+ - `eval_batch_size`: 4
+ - `seed`: 42
+ - `gradient_accumulation_steps`: 4
+ - `total_train_batch_size`: 16
+ - `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - `lr_scheduler_type`: linear
+ - `num_epochs`: 3.0
+ - `max_seq_length`: 4096
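+
+ As a rough sketch (an illustration, not the exact configuration used), these settings map onto the Hugging Face `TrainingArguments` as follows; the output directory is a placeholder.
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./dragon-longformer-large-mixed-domain",  # placeholder
+     learning_rate=5e-5,
+     per_device_train_batch_size=4,
+     per_device_eval_batch_size=4,
+     gradient_accumulation_steps=4,  # 4 x 4 = total train batch size of 16
+     num_train_epochs=3.0,
+     lr_scheduler_type="linear",
+     seed=42,
+ )
+ # The Adam settings listed above are the TrainingArguments defaults; `max_seq_length`
+ # is a data argument of run_mlm.py (tokenization length), not a TrainingArguments field.
+ ```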
+
+ ### Framework versions
+
+ - Transformers 4.29.0.dev0
+ - PyTorch 2.0.0+cu117
+ - Datasets 2.11.0
+ - Tokenizers 0.13.3
+
+ ## Evaluation results
+
+ Pending evaluation on the DRAGON benchmark.
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @article{PENDING}
+ ```
all_results.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "epoch": 3.0,
+ "eval_accuracy": 0.9609063974943755,
+ "eval_loss": 0.15329138934612274,
+ "eval_runtime": 11959.3592,
+ "eval_samples": 25413,
+ "eval_samples_per_second": 2.125,
+ "eval_steps_per_second": 0.531,
+ "perplexity": 1.1656645917037398,
+ "train_loss": 0.22367710902254853,
+ "train_runtime": 1559138.7742,
+ "train_samples": 203053,
+ "train_samples_per_second": 0.391,
+ "train_steps_per_second": 0.024
+ }
config.json ADDED
@@ -0,0 +1,56 @@
+ {
+ "_name_or_path": "allenai/longformer-large-4096",
+ "architectures": [
+ "LongformerForMaskedLM"
+ ],
+ "attention_mode": "longformer",
+ "attention_probs_dropout_prob": 0.1,
+ "attention_window": [
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512,
+ 512
+ ],
+ "bos_token_id": 0,
+ "eos_token_id": 2,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 1024,
+ "ignore_attention_mask": false,
+ "initializer_range": 0.02,
+ "intermediate_size": 4096,
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 4098,
+ "model_type": "longformer",
+ "num_attention_heads": 16,
+ "num_hidden_layers": 24,
+ "onnx_export": false,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "sep_token_id": 2,
+ "torch_dtype": "float32",
+ "transformers_version": "4.29.0.dev0",
+ "type_vocab_size": 1,
+ "vocab_size": 50265
+ }
eval_results.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "epoch": 3.0,
+ "eval_accuracy": 0.9609063974943755,
+ "eval_loss": 0.15329138934612274,
+ "eval_runtime": 11959.3592,
+ "eval_samples": 25413,
+ "eval_samples_per_second": 2.125,
+ "eval_steps_per_second": 0.531,
+ "perplexity": 1.1656645917037398
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7dbef382ef93b0e80052c84e281504842a83ef2c3e65302db46096781c54bdf7
+ size 1738801909
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "bos_token": "<s>",
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "mask_token": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "add_prefix_space": false,
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "errors": "replace",
+ "mask_token": "<mask>",
+ "model_max_length": 4096,
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "tokenizer_class": "LongformerTokenizer",
+ "trim_offsets": true,
+ "unk_token": "<unk>"
+ }
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "epoch": 3.0,
+ "train_loss": 0.22367710902254853,
+ "train_runtime": 1559138.7742,
+ "train_samples": 203053,
+ "train_samples_per_second": 0.391,
+ "train_steps_per_second": 0.024
+ }
trainer_state.json ADDED
@@ -0,0 +1,544 @@
1
+ {
2
+ "best_metric": 0.15368719398975372,
3
+ "best_model_checkpoint": "/output/longformer-large-4096-finetuned-full/checkpoint-35000",
4
+ "epoch": 3.0,
5
+ "global_step": 38073,
6
+ "is_hyper_param_search": false,
7
+ "is_local_process_zero": true,
8
+ "is_world_process_zero": true,
9
+ "log_history": [
10
+ {
11
+ "epoch": 0.04,
12
+ "learning_rate": 4.934336669030547e-05,
13
+ "loss": 0.8478,
14
+ "step": 500
15
+ },
16
+ {
17
+ "epoch": 0.08,
18
+ "learning_rate": 4.8686733380610934e-05,
19
+ "loss": 0.5067,
20
+ "step": 1000
21
+ },
22
+ {
23
+ "epoch": 0.12,
24
+ "learning_rate": 4.80301000709164e-05,
25
+ "loss": 0.4272,
26
+ "step": 1500
27
+ },
28
+ {
29
+ "epoch": 0.16,
30
+ "learning_rate": 4.7373466761221866e-05,
31
+ "loss": 0.3848,
32
+ "step": 2000
33
+ },
34
+ {
35
+ "epoch": 0.2,
36
+ "learning_rate": 4.671683345152733e-05,
37
+ "loss": 0.354,
38
+ "step": 2500
39
+ },
40
+ {
41
+ "epoch": 0.24,
42
+ "learning_rate": 4.60602001418328e-05,
43
+ "loss": 0.3333,
44
+ "step": 3000
45
+ },
46
+ {
47
+ "epoch": 0.28,
48
+ "learning_rate": 4.540356683213826e-05,
49
+ "loss": 0.319,
50
+ "step": 3500
51
+ },
52
+ {
53
+ "epoch": 0.32,
54
+ "learning_rate": 4.474693352244373e-05,
55
+ "loss": 0.3051,
56
+ "step": 4000
57
+ },
58
+ {
59
+ "epoch": 0.35,
60
+ "learning_rate": 4.40903002127492e-05,
61
+ "loss": 0.294,
62
+ "step": 4500
63
+ },
64
+ {
65
+ "epoch": 0.39,
66
+ "learning_rate": 4.343366690305466e-05,
67
+ "loss": 0.2837,
68
+ "step": 5000
69
+ },
70
+ {
71
+ "epoch": 0.39,
72
+ "eval_accuracy": 0.9413275913410438,
73
+ "eval_loss": 0.25030046701431274,
74
+ "eval_runtime": 11955.9114,
75
+ "eval_samples_per_second": 2.126,
76
+ "eval_steps_per_second": 0.531,
77
+ "step": 5000
78
+ },
79
+ {
80
+ "epoch": 0.43,
81
+ "learning_rate": 4.277703359336013e-05,
82
+ "loss": 0.2756,
83
+ "step": 5500
84
+ },
85
+ {
86
+ "epoch": 0.47,
87
+ "learning_rate": 4.212040028366559e-05,
88
+ "loss": 0.2684,
89
+ "step": 6000
90
+ },
91
+ {
92
+ "epoch": 0.51,
93
+ "learning_rate": 4.1463766973971054e-05,
94
+ "loss": 0.2627,
95
+ "step": 6500
96
+ },
97
+ {
98
+ "epoch": 0.55,
99
+ "learning_rate": 4.0807133664276524e-05,
100
+ "loss": 0.2564,
101
+ "step": 7000
102
+ },
103
+ {
104
+ "epoch": 0.59,
105
+ "learning_rate": 4.0150500354581986e-05,
106
+ "loss": 0.2531,
107
+ "step": 7500
108
+ },
109
+ {
110
+ "epoch": 0.63,
111
+ "learning_rate": 3.9493867044887455e-05,
112
+ "loss": 0.2457,
113
+ "step": 8000
114
+ },
115
+ {
116
+ "epoch": 0.67,
117
+ "learning_rate": 3.883723373519292e-05,
118
+ "loss": 0.2424,
119
+ "step": 8500
120
+ },
121
+ {
122
+ "epoch": 0.71,
123
+ "learning_rate": 3.818060042549839e-05,
124
+ "loss": 0.2384,
125
+ "step": 9000
126
+ },
127
+ {
128
+ "epoch": 0.75,
129
+ "learning_rate": 3.7523967115803856e-05,
130
+ "loss": 0.2352,
131
+ "step": 9500
132
+ },
133
+ {
134
+ "epoch": 0.79,
135
+ "learning_rate": 3.686733380610932e-05,
136
+ "loss": 0.2305,
137
+ "step": 10000
138
+ },
139
+ {
140
+ "epoch": 0.79,
141
+ "eval_accuracy": 0.9494652483570813,
142
+ "eval_loss": 0.20809219777584076,
143
+ "eval_runtime": 11956.3035,
144
+ "eval_samples_per_second": 2.125,
145
+ "eval_steps_per_second": 0.531,
146
+ "step": 10000
147
+ },
148
+ {
149
+ "epoch": 0.83,
150
+ "learning_rate": 3.621070049641479e-05,
151
+ "loss": 0.2286,
152
+ "step": 10500
153
+ },
154
+ {
155
+ "epoch": 0.87,
156
+ "learning_rate": 3.555406718672025e-05,
157
+ "loss": 0.2251,
158
+ "step": 11000
159
+ },
160
+ {
161
+ "epoch": 0.91,
162
+ "learning_rate": 3.489743387702571e-05,
163
+ "loss": 0.2233,
164
+ "step": 11500
165
+ },
166
+ {
167
+ "epoch": 0.95,
168
+ "learning_rate": 3.424080056733118e-05,
169
+ "loss": 0.2206,
170
+ "step": 12000
171
+ },
172
+ {
173
+ "epoch": 0.98,
174
+ "learning_rate": 3.3584167257636644e-05,
175
+ "loss": 0.2165,
176
+ "step": 12500
177
+ },
178
+ {
179
+ "epoch": 1.02,
180
+ "learning_rate": 3.292753394794211e-05,
181
+ "loss": 0.2135,
182
+ "step": 13000
183
+ },
184
+ {
185
+ "epoch": 1.06,
186
+ "learning_rate": 3.227090063824758e-05,
187
+ "loss": 0.2128,
188
+ "step": 13500
189
+ },
190
+ {
191
+ "epoch": 1.1,
192
+ "learning_rate": 3.1614267328553045e-05,
193
+ "loss": 0.2105,
194
+ "step": 14000
195
+ },
196
+ {
197
+ "epoch": 1.14,
198
+ "learning_rate": 3.0957634018858514e-05,
199
+ "loss": 0.2096,
200
+ "step": 14500
201
+ },
202
+ {
203
+ "epoch": 1.18,
204
+ "learning_rate": 3.0301000709163973e-05,
205
+ "loss": 0.2069,
206
+ "step": 15000
207
+ },
208
+ {
209
+ "epoch": 1.18,
210
+ "eval_accuracy": 0.9537146569573595,
211
+ "eval_loss": 0.18708214163780212,
212
+ "eval_runtime": 11958.4868,
213
+ "eval_samples_per_second": 2.125,
214
+ "eval_steps_per_second": 0.531,
215
+ "step": 15000
216
+ },
217
+ {
218
+ "epoch": 1.22,
219
+ "learning_rate": 2.9644367399469442e-05,
220
+ "loss": 0.2052,
221
+ "step": 15500
222
+ },
223
+ {
224
+ "epoch": 1.26,
225
+ "learning_rate": 2.8987734089774908e-05,
226
+ "loss": 0.2026,
227
+ "step": 16000
228
+ },
229
+ {
230
+ "epoch": 1.3,
231
+ "learning_rate": 2.833110078008037e-05,
232
+ "loss": 0.2013,
233
+ "step": 16500
234
+ },
235
+ {
236
+ "epoch": 1.34,
237
+ "learning_rate": 2.767446747038584e-05,
238
+ "loss": 0.2003,
239
+ "step": 17000
240
+ },
241
+ {
242
+ "epoch": 1.38,
243
+ "learning_rate": 2.7017834160691302e-05,
244
+ "loss": 0.1979,
245
+ "step": 17500
246
+ },
247
+ {
248
+ "epoch": 1.42,
249
+ "learning_rate": 2.636120085099677e-05,
250
+ "loss": 0.197,
251
+ "step": 18000
252
+ },
253
+ {
254
+ "epoch": 1.46,
255
+ "learning_rate": 2.5704567541302237e-05,
256
+ "loss": 0.1959,
257
+ "step": 18500
258
+ },
259
+ {
260
+ "epoch": 1.5,
261
+ "learning_rate": 2.50479342316077e-05,
262
+ "loss": 0.1943,
263
+ "step": 19000
264
+ },
265
+ {
266
+ "epoch": 1.54,
267
+ "learning_rate": 2.439130092191317e-05,
268
+ "loss": 0.1945,
269
+ "step": 19500
270
+ },
271
+ {
272
+ "epoch": 1.58,
273
+ "learning_rate": 2.3734667612218634e-05,
274
+ "loss": 0.1909,
275
+ "step": 20000
276
+ },
277
+ {
278
+ "epoch": 1.58,
279
+ "eval_accuracy": 0.9563928580029224,
280
+ "eval_loss": 0.17498859763145447,
281
+ "eval_runtime": 11959.4736,
282
+ "eval_samples_per_second": 2.125,
283
+ "eval_steps_per_second": 0.531,
284
+ "step": 20000
285
+ },
286
+ {
287
+ "epoch": 1.62,
288
+ "learning_rate": 2.30780343025241e-05,
289
+ "loss": 0.1904,
290
+ "step": 20500
291
+ },
292
+ {
293
+ "epoch": 1.65,
294
+ "learning_rate": 2.2421400992829566e-05,
295
+ "loss": 0.1895,
296
+ "step": 21000
297
+ },
298
+ {
299
+ "epoch": 1.69,
300
+ "learning_rate": 2.176476768313503e-05,
301
+ "loss": 0.1883,
302
+ "step": 21500
303
+ },
304
+ {
305
+ "epoch": 1.73,
306
+ "learning_rate": 2.1108134373440497e-05,
307
+ "loss": 0.1869,
308
+ "step": 22000
309
+ },
310
+ {
311
+ "epoch": 1.77,
312
+ "learning_rate": 2.0451501063745963e-05,
313
+ "loss": 0.1857,
314
+ "step": 22500
315
+ },
316
+ {
317
+ "epoch": 1.81,
318
+ "learning_rate": 1.9794867754051426e-05,
319
+ "loss": 0.1841,
320
+ "step": 23000
321
+ },
322
+ {
323
+ "epoch": 1.85,
324
+ "learning_rate": 1.9138234444356895e-05,
325
+ "loss": 0.1841,
326
+ "step": 23500
327
+ },
328
+ {
329
+ "epoch": 1.89,
330
+ "learning_rate": 1.848160113466236e-05,
331
+ "loss": 0.183,
332
+ "step": 24000
333
+ },
334
+ {
335
+ "epoch": 1.93,
336
+ "learning_rate": 1.7824967824967826e-05,
337
+ "loss": 0.1809,
338
+ "step": 24500
339
+ },
340
+ {
341
+ "epoch": 1.97,
342
+ "learning_rate": 1.7168334515273292e-05,
343
+ "loss": 0.181,
344
+ "step": 25000
345
+ },
346
+ {
347
+ "epoch": 1.97,
348
+ "eval_accuracy": 0.9582902012352269,
349
+ "eval_loss": 0.16556869447231293,
350
+ "eval_runtime": 11952.7546,
351
+ "eval_samples_per_second": 2.126,
352
+ "eval_steps_per_second": 0.532,
353
+ "step": 25000
354
+ },
355
+ {
356
+ "epoch": 2.01,
357
+ "learning_rate": 1.6511701205578758e-05,
358
+ "loss": 0.1791,
359
+ "step": 25500
360
+ },
361
+ {
362
+ "epoch": 2.05,
363
+ "learning_rate": 1.5855067895884224e-05,
364
+ "loss": 0.1782,
365
+ "step": 26000
366
+ },
367
+ {
368
+ "epoch": 2.09,
369
+ "learning_rate": 1.519843458618969e-05,
370
+ "loss": 0.1786,
371
+ "step": 26500
372
+ },
373
+ {
374
+ "epoch": 2.13,
375
+ "learning_rate": 1.4541801276495154e-05,
376
+ "loss": 0.1771,
377
+ "step": 27000
378
+ },
379
+ {
380
+ "epoch": 2.17,
381
+ "learning_rate": 1.3885167966800621e-05,
382
+ "loss": 0.1761,
383
+ "step": 27500
384
+ },
385
+ {
386
+ "epoch": 2.21,
387
+ "learning_rate": 1.3228534657106087e-05,
388
+ "loss": 0.1748,
389
+ "step": 28000
390
+ },
391
+ {
392
+ "epoch": 2.25,
393
+ "learning_rate": 1.2571901347411553e-05,
394
+ "loss": 0.1738,
395
+ "step": 28500
396
+ },
397
+ {
398
+ "epoch": 2.29,
399
+ "learning_rate": 1.1915268037717017e-05,
400
+ "loss": 0.1736,
401
+ "step": 29000
402
+ },
403
+ {
404
+ "epoch": 2.32,
405
+ "learning_rate": 1.1258634728022484e-05,
406
+ "loss": 0.1724,
407
+ "step": 29500
408
+ },
409
+ {
410
+ "epoch": 2.36,
411
+ "learning_rate": 1.0602001418327948e-05,
412
+ "loss": 0.172,
413
+ "step": 30000
414
+ },
415
+ {
416
+ "epoch": 2.36,
417
+ "eval_accuracy": 0.9597659108954908,
418
+ "eval_loss": 0.15860329568386078,
419
+ "eval_runtime": 11965.9783,
420
+ "eval_samples_per_second": 2.124,
421
+ "eval_steps_per_second": 0.531,
422
+ "step": 30000
423
+ },
424
+ {
425
+ "epoch": 2.4,
426
+ "learning_rate": 9.945368108633416e-06,
427
+ "loss": 0.1717,
428
+ "step": 30500
429
+ },
430
+ {
431
+ "epoch": 2.44,
432
+ "learning_rate": 9.28873479893888e-06,
433
+ "loss": 0.1711,
434
+ "step": 31000
435
+ },
436
+ {
437
+ "epoch": 2.48,
438
+ "learning_rate": 8.632101489244348e-06,
439
+ "loss": 0.1705,
440
+ "step": 31500
441
+ },
442
+ {
443
+ "epoch": 2.52,
444
+ "learning_rate": 7.975468179549812e-06,
445
+ "loss": 0.1703,
446
+ "step": 32000
447
+ },
448
+ {
449
+ "epoch": 2.56,
450
+ "learning_rate": 7.318834869855279e-06,
451
+ "loss": 0.1704,
452
+ "step": 32500
453
+ },
454
+ {
455
+ "epoch": 2.6,
456
+ "learning_rate": 6.662201560160744e-06,
457
+ "loss": 0.1695,
458
+ "step": 33000
459
+ },
460
+ {
461
+ "epoch": 2.64,
462
+ "learning_rate": 6.00556825046621e-06,
463
+ "loss": 0.168,
464
+ "step": 33500
465
+ },
466
+ {
467
+ "epoch": 2.68,
468
+ "learning_rate": 5.348934940771676e-06,
469
+ "loss": 0.168,
470
+ "step": 34000
471
+ },
472
+ {
473
+ "epoch": 2.72,
474
+ "learning_rate": 4.6923016310771415e-06,
475
+ "loss": 0.1676,
476
+ "step": 34500
477
+ },
478
+ {
479
+ "epoch": 2.76,
480
+ "learning_rate": 4.035668321382607e-06,
481
+ "loss": 0.1669,
482
+ "step": 35000
483
+ },
484
+ {
485
+ "epoch": 2.76,
486
+ "eval_accuracy": 0.9607826213936019,
487
+ "eval_loss": 0.15368719398975372,
488
+ "eval_runtime": 11962.7904,
489
+ "eval_samples_per_second": 2.124,
490
+ "eval_steps_per_second": 0.531,
491
+ "step": 35000
492
+ },
493
+ {
494
+ "epoch": 2.8,
495
+ "learning_rate": 3.3790350116880735e-06,
496
+ "loss": 0.1669,
497
+ "step": 35500
498
+ },
499
+ {
500
+ "epoch": 2.84,
501
+ "learning_rate": 2.722401701993539e-06,
502
+ "loss": 0.1653,
503
+ "step": 36000
504
+ },
505
+ {
506
+ "epoch": 2.88,
507
+ "learning_rate": 2.0657683922990046e-06,
508
+ "loss": 0.1655,
509
+ "step": 36500
510
+ },
511
+ {
512
+ "epoch": 2.92,
513
+ "learning_rate": 1.4091350826044704e-06,
514
+ "loss": 0.1649,
515
+ "step": 37000
516
+ },
517
+ {
518
+ "epoch": 2.95,
519
+ "learning_rate": 7.525017729099362e-07,
520
+ "loss": 0.1649,
521
+ "step": 37500
522
+ },
523
+ {
524
+ "epoch": 2.99,
525
+ "learning_rate": 9.5868463215402e-08,
526
+ "loss": 0.166,
527
+ "step": 38000
528
+ },
529
+ {
530
+ "epoch": 3.0,
531
+ "step": 38073,
532
+ "total_flos": 5.673660618532037e+18,
533
+ "train_loss": 0.22367710902254853,
534
+ "train_runtime": 1559138.7742,
535
+ "train_samples_per_second": 0.391,
536
+ "train_steps_per_second": 0.024
537
+ }
538
+ ],
539
+ "max_steps": 38073,
540
+ "num_train_epochs": 3,
541
+ "total_flos": 5.673660618532037e+18,
542
+ "trial_name": null,
543
+ "trial_params": null
544
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3f8649b9c747b7f3cd5ed18a3e258d16d7fcda69704b45424b81bf97945d249b
+ size 3899
vocab.json ADDED
The diff for this file is too large to render. See raw diff