killawhale2 committed
Commit
6625d6d
1 Parent(s): 00ae028

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -21,6 +21,8 @@ Solar 10.7B is an ideal choice for fine-tuning. SOLAR-10.7B offers robustness an
  We utilize state-of-the-art instruction fine-tuning methods, including supervised fine-tuning (SFT) and direct preference optimization (DPO) [1].
  Using open-source datasets in the Alpaca and OpenOrca styles, together with generated synthetic datasets, we apply iterative DPO training, a proprietary alignment strategy, to maximize the performance of the resulting model.
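
For readers unfamiliar with DPO, the loss from [1] can be written in a few lines. The sketch below is a minimal PyTorch illustration under assumed shapes (one summed per-token log-probability per example); the function name, arguments, and `beta` value are our own illustrative choices, not the SOLAR-10.7B training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # KL-tradeoff coefficient; this value is an assumption
) -> torch.Tensor:
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the policy to widen the margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

Iterative DPO, as referenced above, presumably repeats this preference-optimization step over successive rounds of preference data; the exact schedule is proprietary.
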
+ *Note:* We took care to avoid data contamination during SFT and DPO, e.g., by removing data created using TruthfulQA's prompts.
+
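
The contamination check mentioned in the note can be pictured as a simple n-gram overlap filter. The sketch below is purely illustrative: the helper names, the 8-token window, and the exact matching rule are assumptions, not the pipeline actually used.

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    # Lowercased whitespace tokens; an 8-token window is an arbitrary choice here.
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_prompts: list[str], n: int = 8) -> bool:
    # Flag a training sample if it shares any n-gram with a benchmark prompt,
    # e.g., a prompt from TruthfulQA.
    grams = ngrams(sample, n)
    return any(grams & ngrams(prompt, n) for prompt in benchmark_prompts)

# Samples flagged by is_contaminated(...) would be dropped before SFT and DPO.
```
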
  [1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D. and Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
 
  # **Evaluation Results**