alvarobartt (HF staff) committed
Commit 75dfa6c
1 Parent(s): 517e565

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -175,19 +175,20 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs, but while experimenting w

### Training Data

- We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/argilla/ultrafeedback-binarized-preferences).
+ We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).

- TLDR:
+ TL;DR

After visually browsing through some examples using the sort and filter features of Argilla (sorting by highest rating for chosen responses), we noticed a strong mismatch between the `overall_score` in the original UF dataset (and the Zephyr train_prefs dataset) and the quality of the chosen response.

By adding the critique rationale to our Argilla Dataset, we confirmed that the critique rationale was highly negative, whereas the rating was very high (in fact, the highest: `10`).
+
See screenshot below for one example of this issue.
+
After some quick investigation, we identified hundreds of examples with the same issue, reported a bug on the UltraFeedback repo, and informed the H4 team.

While we're working on fixing the original dataset (we've already narrowed down ~2K problematic examples), we decided to leverage the multi-preference ratings, leading to Notus!

-
![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/M9qCKyAB_G1MbVBAPeitd.png)

## Prompt template
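For reference, the curated dataset the updated section points to can be pulled straight from the Hub with 🤗 `datasets`. A minimal sketch, assuming the default `train` split; the inspection lines are only there to show what the preference pairs look like before plugging them into DPO-style training:

```python
from datasets import load_dataset

# Curated, binarized preference pairs derived from UltraFeedback
ds = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

# Inspect the schema and a single row before using it for preference tuning
print(ds.column_names)
print(ds[0])
```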
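The commit itself doesn't spell out the binarization recipe, but the idea of leveraging the per-aspect ratings instead of the unreliable `overall_score` can be sketched against the raw `openbmb/UltraFeedback` rows. The field names below (`completions`, `annotations`, `Rating`, `response`) are assumptions about that schema, and the chosen/rejected selection is illustrative rather than the exact procedure behind `argilla/ultrafeedback-binarized-preferences`:

```python
from statistics import mean

from datasets import load_dataset

# Raw, multi-aspect UltraFeedback: each row holds several completions,
# each annotated per aspect in addition to an `overall_score`
raw = load_dataset("openbmb/UltraFeedback", split="train")

# Aspect keys assumed to live under each completion's `annotations` dict
ASPECTS = ("helpfulness", "honesty", "instruction_following", "truthfulness")


def mean_rating(completion: dict) -> float:
    """Average the per-aspect ratings, skipping aspects rated 'N/A'."""
    ratings = [
        float(completion["annotations"][aspect]["Rating"])
        for aspect in ASPECTS
        if completion["annotations"][aspect]["Rating"] != "N/A"
    ]
    return mean(ratings) if ratings else 0.0


def binarize(example: dict) -> dict:
    """Keep the best-rated completion as `chosen` and the worst as `rejected`."""
    ranked = sorted(example["completions"], key=mean_rating, reverse=True)
    best, worst = ranked[0], ranked[-1]
    return {
        "prompt": example["instruction"],
        "chosen": best["response"],
        "rejected": worst["response"],
        "chosen_rating": mean_rating(best),
        "rejected_rating": mean_rating(worst),
    }


# Drop rows without at least two completions, then build the preference pairs
pairs = raw.filter(lambda ex: len(ex["completions"]) >= 2)
pairs = pairs.map(binarize, remove_columns=pairs.column_names)
```

Ranking by the mean of the per-aspect ratings is one simple way to sidestep the inflated `overall_score` values described above; other selection rules for the rejected response (e.g. sampling a random non-best completion) would work with the same scoring helper.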