dvilasuero committed
Commit
1806313
1 Parent(s): dfabe92

Update README.md

Files changed (1)
  1. README.md +11 -4
README.md CHANGED
@@ -180,6 +180,8 @@ In particular, when we started building [distilabel](https://github.com/argilla-
 
 Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases. Using this new dataset with DPO, we fine-tuned Notus, a 7B model that **surpasses Zephyr-7B-beta and Claude 2 on AlpacaEval**.
 
+> **Important note**: While we opted for the average of ratings while we fix the original dataset, a very interesting open question remains: once the critique data is fixed, which works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, stay tuned!
+
 This model **wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook) and [OpenBMB](https://www.openbmb.cn/home) releasing the UltraFeedback dataset**, and it's based on fruitful discussions with the HuggingFace H4 team. In particular, we used `zephyr-7b-beta`'s recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.
 
 Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model and other 7B models.
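To make the binarization above concrete, here is a minimal sketch of picking `chosen`/`rejected` pairs by average preference rating. It assumes the `openbmb/UltraFeedback` schema (a `completions` list whose entries carry per-aspect `annotations` with a `Rating`); it is not the exact curation script behind `argilla/ultrafeedback-binarized-preferences`, whose dataset card describes `chosen` as the highest-mean-rating response and `rejected` as a random pick among the rest.

```python
# Minimal sketch of rating-based binarization, NOT the exact curation
# script behind argilla/ultrafeedback-binarized-preferences. Field names
# ("completions", "annotations", "Rating", "response") assume the
# openbmb/UltraFeedback schema; adjust if the layout differs.
import random
from datasets import load_dataset

def mean_rating(completion: dict) -> float:
    # Average the per-aspect preference ratings instead of trusting the
    # critique's overall_score; skip non-numeric ratings such as "N/A".
    ratings = [
        float(a["Rating"])
        for a in completion["annotations"].values()
        if str(a["Rating"]).replace(".", "", 1).isdigit()
    ]
    return sum(ratings) / len(ratings) if ratings else 0.0

def binarize(example: dict) -> dict:
    # Chosen = highest mean rating; rejected = a random lower-rated response.
    ranked = sorted(example["completions"], key=mean_rating, reverse=True)
    rejected = random.choice(ranked[1:]) if len(ranked) > 1 else ranked[0]
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": rejected["response"],
    }

uf = load_dataset("openbmb/UltraFeedback", split="train")
binarized = uf.map(binarize, remove_columns=uf.column_names)
```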
@@ -326,23 +328,28 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs, but while experimenting w
 
 ### Training Data
 
-We used a a new curated version [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
-of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [argilla/ultrafeedback-binarized-preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
+We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [Ultrafeedback binarized preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
 
 TL;DR
 
 After visually browsing some examples using Argilla's sort and filter features (sorting by highest rating for chosen responses), we noticed a strong mismatch between the `overall_score` in the original UF dataset (and in the Zephyr train_prefs dataset) and the quality of the chosen response.
 
-By adding the critique rationale to our Argilla Dataset, we confirmed the critique rationale was highly negative, whereas the rating was very high (the highest in fact: `10`).
+By adding the critique rationale to our Argilla Dataset, **we confirmed the critique rationale was highly negative, whereas the rating was very high** (in most cases the highest: `10`).
 
 See the screenshot below for one example of this issue.
 
-After some quick investigation, we identified hundreds of examples having the same issue, reported a bug on the UltraFeedback repo, and informed the H4 team.
+After some quick investigation, we:
+
+* identified hundreds of examples with the same issue,
+* reported a bug on the [UltraFeedback repo](https://github.com/OpenBMB/UltraFeedback/issues/8),
+* and informed the H4 team, which was incredibly responsive and ran an additional experiment to validate the new rating binarization approach.
 
 While we work on fixing the original dataset (we have already narrowed down ~2K problematic examples), we decided to leverage the multi-preference ratings, leading to Notus!
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/M9qCKyAB_G1MbVBAPeitd.png)
 
+> **Important note**: While we opted for the average of ratings while we fix the dataset, a very interesting open question remains: once the data is fixed, which works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, stay tuned!
+
 You can find more details about the dataset analysis and curation on the [ultrafeedback-binarized-preferences dataset card](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
 
 ## Prompt template
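A quick way to reproduce the "chosen response differs in ~50% of the cases" observation is to compare, per example, the top response by critique `overall_score` against the top response by mean preference rating. Again a hedged sketch over the assumed `openbmb/UltraFeedback` schema (the `overall_score` field name is an assumption), not the analysis script itself:

```python
# Rough sketch of the mismatch check described above: how often does the
# response with the highest critique overall_score differ from the one
# with the highest mean preference rating?
from datasets import load_dataset

def mean_rating(c: dict) -> float:
    # Same helper as in the earlier snippet: average per-aspect ratings,
    # skipping non-numeric values such as "N/A".
    r = [float(a["Rating"]) for a in c["annotations"].values()
         if str(a["Rating"]).replace(".", "", 1).isdigit()]
    return sum(r) / len(r) if r else 0.0

uf = load_dataset("openbmb/UltraFeedback", split="train")

mismatches = 0
for example in uf:
    completions = example["completions"]
    if len(completions) < 2:
        continue
    by_score = max(completions, key=lambda c: float(c["overall_score"]))
    by_rating = max(completions, key=mean_rating)
    if by_score["response"] != by_rating["response"]:
        mismatches += 1

print(f"chosen response changes in {mismatches / len(uf):.0%} of examples")
```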
 
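For completeness, this is roughly how the curated dataset feeds a DPO fine-tune. Notus itself was trained with the Alignment Handbook's `zephyr-7b-beta` recipe; the sketch below instead uses TRL's `DPOTrainer` purely for illustration (its exact keyword arguments vary across TRL versions, and the dataset's column names may need remapping):

```python
# Illustrative DPO fine-tuning sketch with TRL, NOT the actual Notus
# recipe (Notus followed the Alignment Handbook's zephyr-7b-beta config).
# DPOTrainer's API changes across TRL versions; this follows the older
# (model, ref_model, beta, tokenizer) call style.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "alignment-handbook/zephyr-7b-sft-full"  # SFT starting point of the Zephyr recipe
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
# DPOTrainer expects string columns named prompt/chosen/rejected; remap the
# dataset's actual columns here if they differ (assumption, check the card).

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL builds a frozen reference copy when None
    beta=0.1,        # KL penalty strength; a commonly used default
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="notus-dpo-sketch",
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```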