dvilasuero committed
Commit
1806313
1 Parent(s): dfabe92

Update README.md

Files changed (1)
  1. README.md +11 -4
README.md CHANGED
@@ -180,6 +180,8 @@ In particular, when we started building [distilabel](https://github.com/argilla-
 
 Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases. Using this new dataset with DPO, we fine-tuned Notus, a 7B model that **surpasses Zephyr-7B-beta and Claude 2 on AlpacaEval**.
 
+> **Important note**: While we opted for the average of ratings while we fix the original dataset, a very interesting open question remains: once the critique data is fixed, which works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, stay tuned!
+
 This model **wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook) and [OpenBMB](https://www.openbmb.cn/home) releasing the UltraFeedback dataset**, and it's based on fruitful discussions with the HuggingFace H4 team. In particular, we used `zephyr-7b-beta`'s recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.
 
 Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model and other 7B models.
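To make the binarization above concrete, here is a minimal sketch of picking `chosen`/`rejected` pairs by average preference rating. It assumes the `openbmb/UltraFeedback` schema (a `completions` list whose entries carry per-aspect `annotations` with a `Rating`); it is not the exact curation script behind `argilla/ultrafeedback-binarized-preferences`, whose dataset card describes `chosen` as the highest-mean-rating response and `rejected` as a random pick among the rest.

```python
# Minimal sketch of rating-based binarization, NOT the exact curation
# script behind argilla/ultrafeedback-binarized-preferences. Field names
# ("completions", "annotations", "Rating", "response") assume the
# openbmb/UltraFeedback schema; adjust if the layout differs.
import random
from datasets import load_dataset

def mean_rating(completion: dict) -> float:
    # Average the per-aspect preference ratings instead of trusting the
    # critique's overall_score; skip non-numeric ratings such as "N/A".
    ratings = [
        float(a["Rating"])
        for a in completion["annotations"].values()
        if str(a["Rating"]).replace(".", "", 1).isdigit()
    ]
    return sum(ratings) / len(ratings) if ratings else 0.0

def binarize(example: dict) -> dict:
    # Chosen = highest mean rating; rejected = a random lower-rated response.
    ranked = sorted(example["completions"], key=mean_rating, reverse=True)
    rejected = random.choice(ranked[1:]) if len(ranked) > 1 else ranked[0]
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": rejected["response"],
    }

uf = load_dataset("openbmb/UltraFeedback", split="train")
binarized = uf.map(binarize, remove_columns=uf.column_names)
```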
@@ -326,23 +328,28 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs, but while experimenting w
 
 ### Training Data
 
-We used a a new curated version [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
-of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [argilla/ultrafeedback-binarized-preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
+We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [Ultrafeedback binarized preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
 
 TL;DR
 
 After visually browsing some examples using Argilla's sort and filter features (sorting by highest rating for chosen responses), we noticed a strong mismatch between the `overall_score` in the original UF dataset (and in the Zephyr train_prefs dataset) and the quality of the chosen response.
 
-By adding the critique rationale to our Argilla Dataset, we confirmed the critique rationale was highly negative, whereas the rating was very high (the highest in fact: `10`).
+By adding the critique rationale to our Argilla Dataset, **we confirmed the critique rationale was highly negative, whereas the rating was very high** (in most cases the highest: `10`).
 
 See the screenshot below for one example of this issue.
 
-After some quick investigation, we identified hundreds of examples having the same issue, reported a bug on the UltraFeedback repo, and informed the H4 team.
+After some quick investigation, we:
+
+* identified hundreds of examples with the same issue,
+* reported a bug on the [UltraFeedback repo](https://github.com/OpenBMB/UltraFeedback/issues/8),
+* and informed the H4 team, which was incredibly responsive and ran an additional experiment to validate the new rating binarization approach.
 
 While we work on fixing the original dataset (we have already narrowed down ~2K problematic examples), we decided to leverage the multi-preference ratings, leading to Notus!
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/M9qCKyAB_G1MbVBAPeitd.png)
 
+> **Important note**: While we opted for the average of ratings while we fix the dataset, a very interesting open question remains: once the data is fixed, which works better, the critique scores or the preference ratings? We're very excited to run this comparison in the coming weeks, stay tuned!
+
 You can find more details about the dataset analysis and curation on the [ultrafeedback-binarized-preferences dataset card](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
 
 ## Prompt template
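A quick way to reproduce the "chosen response differs in ~50% of the cases" observation is to compare, per example, the top response by critique `overall_score` against the top response by mean preference rating. Again a hedged sketch over the assumed `openbmb/UltraFeedback` schema (the `overall_score` field name is an assumption), not the analysis script itself:

```python
# Rough sketch of the mismatch check described above: how often does the
# response with the highest critique overall_score differ from the one
# with the highest mean preference rating?
from datasets import load_dataset

def mean_rating(c: dict) -> float:
    # Same helper as in the earlier snippet: average per-aspect ratings,
    # skipping non-numeric values such as "N/A".
    r = [float(a["Rating"]) for a in c["annotations"].values()
         if str(a["Rating"]).replace(".", "", 1).isdigit()]
    return sum(r) / len(r) if r else 0.0

uf = load_dataset("openbmb/UltraFeedback", split="train")

mismatches = 0
for example in uf:
    completions = example["completions"]
    if len(completions) < 2:
        continue
    by_score = max(completions, key=lambda c: float(c["overall_score"]))
    by_rating = max(completions, key=mean_rating)
    if by_score["response"] != by_rating["response"]:
        mismatches += 1

print(f"chosen response changes in {mismatches / len(uf):.0%} of examples")
```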
 
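For completeness, this is roughly how the curated dataset feeds a DPO fine-tune. Notus itself was trained with the Alignment Handbook's `zephyr-7b-beta` recipe; the sketch below instead uses TRL's `DPOTrainer` purely for illustration (its exact keyword arguments vary across TRL versions, and the dataset's column names may need remapping):

```python
# Illustrative DPO fine-tuning sketch with TRL, NOT the actual Notus
# recipe (Notus followed the Alignment Handbook's zephyr-7b-beta config).
# DPOTrainer's API changes across TRL versions; this follows the older
# (model, ref_model, beta, tokenizer) call style.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "alignment-handbook/zephyr-7b-sft-full"  # SFT starting point of the Zephyr recipe
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
# DPOTrainer expects string columns named prompt/chosen/rejected; remap the
# dataset's actual columns here if they differ (assumption, check the card).

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL builds a frozen reference copy when None
    beta=0.1,        # KL penalty strength; a commonly used default
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="notus-dpo-sketch",
        per_device_train_batch_size=2,
    ),
)
trainer.train()
```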