jrahn committed
Commit
37d7cbd
1 Parent(s): b9249c4

Update README.md

Files changed (1):
  1. README.md +5 -5
README.md CHANGED
@@ -45,7 +45,7 @@ p("In a shocking finding, scientist discovered a herd of unicorns living in a re
 
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
- Datasets used: Fineweb-Edu 10B + OpenHermes 2.5 (chatml)
+ Datasets used: Fineweb-Edu 10B + OpenHermes 2.5
 
  Dataset proportions:
  - Part 1: FWE 4,836,050 + OH 100,000 (2.03%) = 4,936,050
@@ -69,7 +69,7 @@ Total documents: 10,669,024
  - **Training regime:**
  - bf16
  - context length 1024
- - per device batch size 16, global batch size 524288, gradient accumulation 16
+ - per device batch size 16, global batch size 524,288 -> gradient accumulation 16
  - zero stage 1
  - lr 3e-4, cosine schedule, 700 warmup steps
  - more details see [run script](run_gpt2_350M_edu_hermes.sh)
@@ -78,9 +78,9 @@ Total documents: 10,669,024
 
  <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
- Params: 355M -> Checkpoint: 710MB
- Tokens: ~10B
- Total training time: 30hrs
+ Params: 355M -> 710MB / checkpoint
+ Tokens: ~10B (10,287,579,136)
+ Total training time: ~30hrs
  Hardware: 2x RTX4090
  MFU: 71% (110,000 tok/s)
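
The edited lines tie several figures together (per-device batch 16, context 1024, 2 GPUs, gradient accumulation 16, 355M parameters in bf16, ~10.29B tokens at 110,000 tok/s). A minimal sketch of the arithmetic, using only the numbers quoted in the diff above; the helper name is illustrative and not part of the repo:

```python
# Sanity checks on the figures in the updated model card.
# All inputs come from the README diff; the function name is illustrative.

def tokens_per_optimizer_step(per_device_batch: int, ctx_len: int,
                              n_gpus: int, grad_accum: int) -> int:
    """Global batch size in tokens for one optimizer step."""
    return per_device_batch * ctx_len * n_gpus * grad_accum

# 16 sequences/GPU * 1024 tokens * 2 GPUs * 16 accumulation steps = 524,288 tokens
assert tokens_per_optimizer_step(16, 1024, 2, 16) == 524_288

# Checkpoint size: 355M parameters at 2 bytes each (bf16) ~ 710 MB,
# matching "Params: 355M -> 710MB / checkpoint".
print(f"~{355e6 * 2 / 1e6:.0f} MB")

# Throughput vs. wall clock: ~10.29B tokens at 110,000 tok/s is roughly
# 26 hours of pure training, consistent with "~30hrs" end to end.
print(f"~{10_287_579_136 / 110_000 / 3600:.1f} h")
```

In other words, with only 16 sequences fitting per RTX 4090 at context 1024, 16 gradient-accumulation steps are what bring the effective batch up to the 524,288-token global batch stated in the card.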