Experiments were conducted with BitsandBytes to load the quantized model.
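As a reference for what this loading step looks like, here is a minimal sketch (an assumed setup, not the exact script from this repository) of loading one of the base models in 4-bit with bitsandbytes through `transformers`; the quantization parameters shown are illustrative:

```python
# Minimal sketch: load a base model in 4-bit with bitsandbytes
# (assumed settings, not necessarily the configuration used in these experiments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "EleutherAI/pythia-2.8b"  # the second run used internlm2-chat-1_8b-sft

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NF4 quantization (assumption)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 (assumption)
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```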
In SFT, the run lasted about 2h20min. As the figures below show, the eval loss slowly decreases as the number of training steps grows.
![pythia2.8B](./output/Pythia_SFT.png)
![internlm2-chat-1_8b-sft](./output/Intern_SFT.png)
For more details, check the corresponding runs [here](https://wandb.ai/qiyuwu/pythia2_8B_DPO_Quant/runs/co6guc8k?nw=nwuserwqy123202108) and [here](https://wandb.ai/qiyuwu/internlm_1_8B_DPO_Quant/runs/w03t6lsm?nw=nwuserwqy123202108).
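For reference, supervised fine-tuning of a quantized model is typically done by attaching LoRA adapters on top of the frozen 4-bit weights. The sketch below shows this with TRL's `SFTTrainer`; the dataset file, text field, and hyperparameters are illustrative assumptions, not this repository's actual settings:

```python
# Minimal SFT sketch on top of the 4-bit model loaded above, using LoRA adapters
# (assumed setup; dataset file and hyperparameters are placeholders).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumption: a local JSON file whose records contain a "text" field.
dataset = load_dataset("json", data_files="sft_data.json", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=model,                 # 4-bit base model from the loading sketch
    train_dataset=dataset,
    peft_config=peft_config,     # train LoRA adapters, keep 4-bit weights frozen
    args=SFTConfig(
        output_dir="./output/pythia_sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        logging_steps=10,
    ),
)
trainer.train()
```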
In DPO, the run lasted about 7h30min. As the figures below show, the reward accuracies and margins slowly increase as the number of training steps grows.
![pythia2.8B](./output/Pythia_DPO.png)
![internlm2-chat-1_8b-sft](./output/Intern_DPO.png)
For more details, check the corresponding runs [here](https://wandb.ai/qiyuwu/pythia2_8B_DPO_Quant/runs/0tejjuhj?nw=nwuserwqy123202108) and [here](https://wandb.ai/qiyuwu/internlm_1_8B_DPO_Quant/runs/4gnv19ir?nw=nwuserwqy123202108).
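For readers unfamiliar with these curves: the "accuracies" and "margins" reported by common DPO implementations are derived from the implicit reward, i.e. the scaled log-probability ratio between the policy and the frozen reference model. This definition is standard DPO background rather than something stated in the original experiments:

$$
\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}, \qquad
\text{margin} = \hat r_\theta(x, y_w) - \hat r_\theta(x, y_l), \qquad
\text{accuracy} = \Pr\!\big[\hat r_\theta(x, y_w) > \hat r_\theta(x, y_l)\big]
$$

The DPO objective minimizes $-\log \sigma$ of that margin, so both curves rising slowly is consistent with the training loss going down.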
Compared to the example in [eric-mitchell/direct-preference-optimization](https://github.com/eric-mitchell/direct-preference-optimization), our training is less stable, but it achieves fairly good reward accuracy. In addition, due to time constraints, our models were trained on only about 25K conversations, which is why the experiments did not achieve notably good results on some of the other metrics.