Commit 62c23f5 (parent: 4b43161)
dvilasuero committed

Update README.md

Files changed (1): README.md (+3, -1)
README.md CHANGED
@@ -174,7 +174,9 @@ model-index:
 
 Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO (Direct Preference Optimization) over `zephyr-7b-sft-full`, which is the SFT model produced to create `zephyr-7b-beta`.
 
- Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we've found data issues in the original UltraFeedback dataset, leading to high-scores for bad responses. After curating several hundreds of data points, we decided to binarize the dataset using the preference ratings, instead of the original critique `overall_score`.
+ Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO.
+
+ In particular, when we started building [distilabel](https://github.com/argilla-io/distilabel), we took some time to deep-dive into the UltraFeedback dataset. Using [Argilla](https://argilla.io/), we've found data issues in the original UltraFeedback dataset, leading to high-scores for bad responses (more details in the training data section). After curating several hundreds of data points, we decided to binarize the dataset using the preference ratings, instead of the original critique `overall_score`.
 
 Using preference ratings, instead of critiques scores, led to a new dataset where the chosen response is different in ~50% of the cases. Using this new dataset with DPO we fine-tuned Notus, a 7B model, that **surpasses Zephyr-7B-beta, Claude 2, and Cohere Command on AlpacaEval**.
 
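
To make the binarization described in the diff above concrete, here is a minimal sketch that re-ranks each UltraFeedback record by the mean of its per-aspect preference ratings instead of the critique's `overall_score`. It is not the exact Notus pipeline: the `openbmb/UltraFeedback` dataset id, the `instruction`/`completions`/`annotations`/`Rating` column layout, and picking the rejected response at random among the remaining completions are assumptions made for this example.

```python
# Hypothetical sketch: binarize UltraFeedback by average preference rating.
# Dataset id and column names are assumptions; adjust them to the data you load.
import random
from datasets import load_dataset

def mean_rating(completion):
    # Average the per-aspect ratings (helpfulness, honesty, instruction
    # following, truthfulness); non-numeric entries such as "N/A" are skipped.
    ratings = [
        int(aspect["Rating"])
        for aspect in completion["annotations"].values()
        if str(aspect.get("Rating", "")).isdigit()
    ]
    return sum(ratings) / len(ratings) if ratings else 0.0

def binarize(example):
    # Chosen = highest average rating; rejected = random pick among the rest.
    ranked = sorted(example["completions"], key=mean_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": random.choice(ranked[1:])["response"],
    }

ds = load_dataset("openbmb/UltraFeedback", split="train")
ds = ds.filter(lambda ex: len(ex["completions"]) > 1)  # need at least one rejected candidate
binarized = ds.map(binarize, remove_columns=ds.column_names)
```

Ranking by mean preference rating rather than the critique's `overall_score` is what makes the chosen response flip in roughly half of the records, as noted above.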
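
For the fine-tuning step itself, a rough sketch with TRL's `DPOTrainer` over the Zephyr SFT checkpoint is shown below. The `alignment-handbook/zephyr-7b-sft-full` model id, the hyperparameters, and the toy preference pairs are illustrative assumptions rather than the Notus training recipe, and `DPOTrainer` argument names differ between TRL versions.

```python
# Hypothetical DPO fine-tuning sketch; not the exact Notus configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "alignment-handbook/zephyr-7b-sft-full"  # assumed Hub id of the SFT base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # the base tokenizer ships without a pad token

# Toy preference pairs; in practice this would be the binarized UltraFeedback
# dataset from the previous sketch (columns: prompt, chosen, rejected).
train_dataset = Dataset.from_dict({
    "prompt": ["What does DPO optimize?"],
    "chosen": ["DPO optimizes the policy directly on preference pairs, without a separate reward model."],
    "rejected": ["I'm not sure."],
})

training_args = TrainingArguments(
    output_dir="notus-7b-dpo-sketch",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # when None, TRL keeps a frozen copy of `model` as the reference
    args=training_args,
    beta=0.1,              # weight of the implicit KL penalty toward the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```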