Text Generation
Transformers
PyTorch
English
olmo
conversational
custom_code
hamishivi committed
Commit
103ff2e
1 Parent(s): 802acfa

Update README.md

Files changed (1)
  1. README.md +2 -4
README.md CHANGED
@@ -144,15 +144,13 @@ For training data details, please see the [Dolma](https://huggingface.co/dataset
 
 ### Hyperparameters
 
- The hyperparameters for the two phases of training are below.
- Certainly! Here's the table with SFT and DPO as rows:
+ The hyperparameters for SFT training are below:
 
 | | Learning Rate | Beta | Epochs | Warmup | Weight Decay | Gradient Clipping | Maximum Sequence Length |
 |-------------------------|---------------|------|--------|------------------------------------------------------------------------|--------------|-------------------|-------------------------|
 | **SFT** | 2 × 10^-6 | N/A | 3 | Linear warmup for the first 3% of total training time, then cooldown to 0 | 0 | 0 | 2048 |
- | **DPO** | 5 × 10^-7 | 0.1 | 3 | Linear warmup for the first 10% of total training time, then cooldown to 0 | 0 | 0 | 2048 |
 
- Compared to Tulu 2, DPO hyperparameters are the same. SFT is lower LR and 3 epochs instead of 2 (and 2k length instead of 8k).
+ Compared to Tulu 2, SFT uses a lower LR, 3 epochs instead of 2, and 2048 length instead of 8192.
 
 ## Bias, Risks, and Limitations
 
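
For reference, the SFT hyperparameters in the updated table could be mapped onto a standard Hugging Face `TrainingArguments` configuration. This is a minimal, illustrative sketch only, assuming a vanilla `transformers` Trainer setup; it is not the training code actually used for this model, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Illustrative sketch: the SFT row of the table above expressed as
# TrainingArguments. Not the authors' actual training script.
sft_args = TrainingArguments(
    output_dir="olmo-sft",       # placeholder
    learning_rate=2e-6,          # 2 × 10^-6
    num_train_epochs=3,          # 3 epochs
    lr_scheduler_type="linear",  # linear decay ("cooldown") to 0 after warmup
    warmup_ratio=0.03,           # linear warmup over the first 3% of training
    weight_decay=0.0,            # no weight decay
    max_grad_norm=0.0,           # disables gradient clipping in the Trainer
)

# The 2048-token maximum sequence length is enforced at tokenization time,
# e.g. tokenizer(..., truncation=True, max_length=2048), not via TrainingArguments.
```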