amd /

Safetensors · llama · alignment-handbook · Generated from Trainer
Mingyuyang-1 committed · Commit 254f9db · verified · 1 Parent(s): e066f80

Update README.md

Files changed (1):
  1. README.md +11 -0
README.md CHANGED
@@ -42,6 +42,17 @@ The Zebra-Llama models are not trained from scratch. Instead, they are composed
  | 5. SFT | End-to-End Knowledge Distillation | The composed hybrid model is fine-tuned via knowledge distillation, using an 8B model as a teacher to transfer rich, pre-trained knowledge. |
  | 6. Alignment | Direct Preference Optimization (DPO) | In the final stage, DPO is used to align the model's preferences, with the distilled student model itself serving as the reference model for stability. |

+ ## Training Data
+
+ |Stage | Dataset | License |
+ |-----------|---------------------------------------------------------------------------|------------------------|
+ | ILD/SFT | https://huggingface.co/datasets/teknium/OpenHermes-2.5 | Refer source materials |
+ | ILD/SFT | https://huggingface.co/datasets/tomg-group-umd/GenQA | CC BY-NC 4.0 |
+ | ILD/SFT | https://huggingface.co/datasets/BAAI/Infinity-Instruct | CC BY-SA 4.0 |
+ | DPO | https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized | MIT |
+ | DPO | https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs | MIT |
+ | DPO | https://huggingface.co/datasets/JunxiongWang/llama3-ultrafeedback-armorm | MIT |
+
  ## Getting Started

  ### Installation
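
For reference, a minimal sketch of how two of the corpora listed in the added Training Data table can be loaded with the Hugging Face `datasets` library. This is not part of the commit; the split names shown ("train", "train_prefs") reflect the public dataset cards and should be checked against each dataset before use.

```python
# Illustrative only: load one SFT corpus and one DPO preference corpus
# from the Training Data table above.
from datasets import load_dataset

# ILD/SFT corpus (single "train" split of instruction/response conversations)
sft_data = load_dataset("teknium/OpenHermes-2.5", split="train")

# DPO corpus; the binarized UltraFeedback release exposes preference pairs
# (with "chosen" and "rejected" columns) under the "train_prefs" split
dpo_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

print(sft_data)                # instruction-tuning examples for the SFT/ILD stages
print(dpo_data.column_names)   # includes "chosen" and "rejected" for DPO
```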