Experiments were conducted with BitsandBytes to load the quantized model.
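As a reference for what this loading step looks like, here is a minimal sketch (an assumed setup, not the exact script from this repository) of loading one of the base models in 4-bit with bitsandbytes through `transformers`; the quantization parameters shown are illustrative:

```python
# Minimal sketch: load a base model in 4-bit with bitsandbytes
# (assumed settings, not necessarily the configuration used in these experiments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "EleutherAI/pythia-2.8b"  # the second run used internlm2-chat-1_8b-sft

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NF4 quantization (assumption)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 (assumption)
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```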
In SFT, the run lasted about 2h20min. As the figures below show, the eval loss slowly decreases as the number of training steps grows.
![pythia2.8B](./output/Pythia_SFT.png)
![internlm2-chat-1_8b-sft](./output/Intern_SFT.png)
For more details, check the corresponding runs [here](https://wandb.ai/qiyuwu/pythia2_8B_DPO_Quant/runs/co6guc8k?nw=nwuserwqy123202108) and [here](https://wandb.ai/qiyuwu/internlm_1_8B_DPO_Quant/runs/w03t6lsm?nw=nwuserwqy123202108).
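For reference, supervised fine-tuning of a quantized model is typically done by attaching LoRA adapters on top of the frozen 4-bit weights. The sketch below shows this with TRL's `SFTTrainer`; the dataset file, text field, and hyperparameters are illustrative assumptions, not this repository's actual settings:

```python
# Minimal SFT sketch on top of the 4-bit model loaded above, using LoRA adapters
# (assumed setup; dataset file and hyperparameters are placeholders).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumption: a local JSON file whose records contain a "text" field.
dataset = load_dataset("json", data_files="sft_data.json", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=model,                 # 4-bit base model from the loading sketch
    train_dataset=dataset,
    peft_config=peft_config,     # train LoRA adapters, keep 4-bit weights frozen
    args=SFTConfig(
        output_dir="./output/pythia_sft",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        logging_steps=10,
    ),
)
trainer.train()
```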
In DPO, the run lasted about 7h30min. As the figures below show, the reward accuracies and margins slowly increase as the number of training steps grows.
![pythia2.8B](./output/Pythia_DPO.png)
![internlm2-chat-1_8b-sft](./output/Intern_DPO.png)
For more details, check the corresponding runs [here](https://wandb.ai/qiyuwu/pythia2_8B_DPO_Quant/runs/0tejjuhj?nw=nwuserwqy123202108) and [here](https://wandb.ai/qiyuwu/internlm_1_8B_DPO_Quant/runs/4gnv19ir?nw=nwuserwqy123202108).
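For readers unfamiliar with these curves: the "accuracies" and "margins" reported by common DPO implementations are derived from the implicit reward, i.e. the scaled log-probability ratio between the policy and the frozen reference model. This definition is standard DPO background rather than something stated in the original experiments:

$$
\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}, \qquad
\text{margin} = \hat r_\theta(x, y_w) - \hat r_\theta(x, y_l), \qquad
\text{accuracy} = \Pr\!\big[\hat r_\theta(x, y_w) > \hat r_\theta(x, y_l)\big]
$$

The DPO objective minimizes $-\log \sigma$ of that margin, so both curves rising slowly is consistent with the training loss going down.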
Compared to the example in [eric-mitchell/direct-preference-optimization](https://github.com/eric-mitchell/direct-preference-optimization), our training is less stable, but it achieves fairly good reward accuracy. In addition, due to time constraints, our models were trained on only about 25K conversations, which is why the experiments did not achieve notably good results on some of the other metrics.