Tags: Text Generation · Transformers · Safetensors · Czech · mpt · custom_code · text-generation-inference · Inference Endpoints
mfajcik committed
Commit 6b9f7d9
1 Parent(s): 25de714

Update README.md

Files changed (1)
  1. README.md +22 -8
README.md CHANGED
@@ -5,22 +5,36 @@ license: apache-2.0
 
 
 # Eval
- Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark)
- | Model | Model Accuracy |
+ Dev eval on CS-HellaSwag (an automatically translated HellaSwag benchmark).
+ | Model | CS-HellaSwag Accuracy |
 |---------------|----------------|
 | mistral7b | 0.4992 |
- | csmpt-130k | __0.5004__ |
- | csmpt-100k | 0.4959 |
- | csmpt-75k | 0.4895 |
- | csmpt-50k steps | 0.4755 |
- | csmpt-26.5k steps | 0.4524 |
+ | csmpt@130k steps [released] | __0.5004__ |
+ | csmpt@100k steps | 0.4959 |
+ | csmpt@75k steps | 0.4895 |
+ | csmpt@50k steps | 0.4755 |
+ | csmpt@26.5k steps | 0.4524 |
 
 
 However, we ran validation on CS-HellaSwag over the course of training, and after 100k steps the improvements, if any, were very noisy.
 The improvement over mistral7b is not significant.
 
+ <TBD> More evaluation details teaser.
+
 ## Loss
- tbd.
+ We encountered loss spikes during training. As the model always recovered, and our budget for training the 7b model was very constrained, we kept training. We had observed such loss spikes before in our ablations (with GPT-2 small), where we found them to be
+ - (a) influenced by the learning rate: the lower the learning rate, the less often they appear; as the learning rate gets higher, they start to appear, and with a learning rate that is too high, training might diverge on such a loss spike;
+ - (b) present only for continuously pretrained models (in preliminary ablations). While we do not know why they appear, we hypothesize this might be linked to the theory on [Adam instability in the time-domain correlation of update vectors](https://arxiv.org/pdf/2304.09871.pdf). However,
+ such instabilities were previously observed only for much larger models (larger than 65b).
+
+ The model was trained on 3 corpora. Corpus #1 was the same we used for GPT-2 training (~16b tokens). <TBD MF>
+
+ <img src="figures/tloss_full.png" width="900"/>
+ Figure 1: Training loss.
+ <img src="figures/tloss_closeup.png" width="900"/>
+ Figure 2: Training loss closeup. We mark the two hotswap points where training corpus #1 was switched for internal-corpus #2 and internal-corpus #2.1, respectively. <TBD MF>
+ <img src="figures/vloss_closeup.png" width="900"/>
+ Figure 3: Test loss closeup; testing was performed on internal-corpus #1. <TBD MF>
 
 
 ## Training Method
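
For context on the accuracy numbers in the table above: HellaSwag-style benchmarks are usually scored as 4-way multiple choice, picking the ending to which the model assigns the highest log-likelihood. The sketch below illustrates that protocol only; it is not the authors' evaluation code, and the model identifier and the example field names (`ctx`, `endings`) are placeholders/assumptions.

```python
# Minimal sketch of HellaSwag-style scoring: pick the ending with the highest
# summed log-likelihood under the model. NOT the authors' evaluation code;
# MODEL_ID and the field names "ctx"/"endings" are assumptions/placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "<released-csmpt-checkpoint>"  # placeholder, replace with the actual hub id or path
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True).eval()


@torch.no_grad()
def ending_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to the ending tokens given the context.

    Assumes the tokenizer adds no special tokens and tokenizes the context
    prefix of `context + ending` identically to `context` alone.
    """
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits              # [1, seq_len, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, ctx_len - 1:].sum().item()           # keep only the ending's tokens


def predict_ending(example: dict) -> int:
    """Index of the most likely of the candidate endings for one example."""
    scores = [ending_logprob(example["ctx"], " " + e) for e in example["endings"]]
    return max(range(len(scores)), key=scores.__getitem__)
```

Accuracy is then the fraction of examples where `predict_ending` matches the gold label; some harnesses length-normalize the ending log-likelihood before taking the argmax.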
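
As background for the hypothesis in point (b) of the loss discussion: the "update vectors" referred to are the per-step Adam updates. For reference only (standard Adam as defined by Kingma & Ba, not a claim about this training run's optimizer settings):

```latex
% Standard Adam update; u_t is the update vector whose time-domain
% correlation the cited instability analysis studies.
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
\hat m_t &= \frac{m_t}{1-\beta_1^{t}}, &
\hat v_t &= \frac{v_t}{1-\beta_2^{t}}, \\
u_t &= \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}, &
\theta_t &= \theta_{t-1} - \eta\, u_t .
\end{aligned}
\]
```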