dvilasuero committed
Commit 4b43161
1 Parent(s): fbe2d77

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -172,13 +172,13 @@ model-index:
 
 # Model Card for Notus 7B v1
 
- Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO over `zephyr-7b-sft-full`, which is the SFT model produced to create `zephyr-7b-beta`.
+ Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO (Direct Preference Optimization) over `zephyr-7b-sft-full`, which is the SFT model produced to create `zephyr-7b-beta`.
 
Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset, leading to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
 
- Using preference ratings, instead of critique scores, led to a new dataset where the chosen response is different in ~50% of the cases.
+ Using preference ratings, instead of critique scores, led to a new dataset where the chosen response is different in ~50% of the cases. Using this new dataset with DPO, we fine-tuned Notus, a 7B model that **surpasses Zephyr-7B-beta, Claude 2, and Cohere Command on AlpacaEval**.
 
- This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook), and it's based on fruitful discussions with the HuggingFace H4 team. In particular, we used `zephyr-7b-beta`'s recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.
+ This model **wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook)**, and it's based on fruitful discussions with the HuggingFace H4 team. In particular, we used `zephyr-7b-beta`'s recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.
 
Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model and other 7B models.
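To make the ratings-based binarization mentioned in the updated card more concrete, here is a minimal sketch of the idea: take the highest-rated completion as `chosen` and a lower-rated one as `rejected`, instead of ranking by the critique's `overall_score`. The field names used below (`instruction`, `completions`, `response`, `rating`) are assumptions for illustration, not the exact UltraFeedback schema or the script used to build the Notus preference dataset.

```python
# Illustrative sketch only: binarize a multi-completion preference record
# using per-response ratings rather than the critique's overall_score.
# Field names are assumed, not the exact UltraFeedback columns.

def binarize_by_rating(example: dict) -> dict:
    """Turn one record with several rated completions into a
    (prompt, chosen, rejected) pair for DPO-style training."""
    ranked = sorted(
        example["completions"],
        key=lambda c: c["rating"],  # preference rating, not overall_score
        reverse=True,
    )
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],    # best-rated response
        "rejected": ranked[-1]["response"],  # lowest-rated response
    }


# Toy usage example:
record = {
    "instruction": "Name the capital of France.",
    "completions": [
        {"response": "Paris.", "rating": 5},
        {"response": "Lyon.", "rating": 2},
    ],
}
pair = binarize_by_rating(record)
# pair == {"prompt": "Name the capital of France.",
#          "chosen": "Paris.", "rejected": "Lyon."}
```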