Text Generation
Transformers
PyTorch
English
olmo
conversational
custom_code
hamishivi committed
Commit
103ff2e
1 Parent(s): 802acfa

Update README.md

Files changed (1)
  1. README.md +2 -4
README.md CHANGED
@@ -144,15 +144,13 @@ For training data details, please see the [Dolma](https://huggingface.co/dataset
 
 ### Hyperparameters
 
- The hyperparameters for the two phases of training are below.
- Certainly! Here's the table with SFT and DPO as rows:
+ The hyperparameters for SFT training are below:
 
 | | Learning Rate | Beta | Epochs | Warmup | Weight Decay | Gradient Clipping | Maximum Sequence Length |
 |-------------------------|---------------|------|--------|------------------------------------------------------------------------|--------------|-------------------|-------------------------|
 | **SFT** | 2 × 10^-6 | N/A | 3 | Linear warmup for the first 3% of total training time, then cooldown to 0 | 0 | 0 | 2048 |
- | **DPO** | 5 × 10^-7 | 0.1 | 3 | Linear warmup for the first 10% of total training time, then cooldown to 0 | 0 | 0 | 2048 |
 
- Compared to Tulu 2, DPO hyperparameters are the same. SFT is lower LR and 3 epochs instead of 2 (and 2k length instead of 8k).
+ Compared to Tulu 2, SFT uses a lower LR, 3 epochs instead of 2, and 2048 length instead of 8192.
 
 ## Bias, Risks, and Limitations
 
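
For reference, the SFT hyperparameters in the updated table could be mapped onto a standard Hugging Face `TrainingArguments` configuration. This is a minimal, illustrative sketch only, assuming a vanilla `transformers` Trainer setup; it is not the training code actually used for this model, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Illustrative sketch: the SFT row of the table above expressed as
# TrainingArguments. Not the authors' actual training script.
sft_args = TrainingArguments(
    output_dir="olmo-sft",       # placeholder
    learning_rate=2e-6,          # 2 × 10^-6
    num_train_epochs=3,          # 3 epochs
    lr_scheduler_type="linear",  # linear decay ("cooldown") to 0 after warmup
    warmup_ratio=0.03,           # linear warmup over the first 3% of training
    weight_decay=0.0,            # no weight decay
    max_grad_norm=0.0,           # disables gradient clipping in the Trainer
)

# The 2048-token maximum sequence length is enforced at tokenization time,
# e.g. tokenizer(..., truncation=True, max_length=2048), not via TrainingArguments.
```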