alvarobartt HF staff committed on
Commit
8194e3d
1 Parent(s): 34197ae

Update README.md

  1. README.md +15 -1
README.md CHANGED
@@ -39,7 +39,7 @@ with Zephyr fine-tuned models also using DPO.
39
  - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
40
  - **Finetuned from model:** [`alignment-handbook/zephyr-7b-sft-full`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)
41
 
42
- ### Model Sources [optional]
43
 
44
  - **Repository:** https://github.com/argilla-io/notus-7b
45
  - **Paper:** N/A
@@ -51,6 +51,10 @@ Notus 7B v1 was trained along November, 2023. And the data as generated by GPT-4
51
 
52
  ## Evaluation
53
 
 
 
 
 
54
  We ran the evaluation using [`EleutherAI/lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) from the `big-refactor` branch, aiming to mimic the [Open LLM Leaderboard by HuggingFace H4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), but running everything on our VMs instead, as we're still experimenting.
55
 
56
  From a first evaluation on the benchmark, we could see that Notus 7B DPO **slightly improved** compared to Zephyr 7B Beta/Alpha and Mistral 7B as we see from the average metric of 7 tasks from the leaderboard.
@@ -62,8 +66,18 @@ From a first evaluation on the benchmark, we could see that Notus 7B DPO **sligh
62
  |[HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
63
  | **Ours** | **54.09** | 64.25 | 84.90 | 61.69 | 52.77 | 74.51 | 39.5 | 0.98 |
64
 
 
 
 
 
 
 
65
  ## Training Details
66
 
 
 
 
 
67
  ### Training Data
68
 
69
  We used a slightly curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/argilla/ultrafeedback-binarized-avg-rating-for-dpo).
 
39
  - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
40
  - **Finetuned from model:** [`alignment-handbook/zephyr-7b-sft-full`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)
41
 
42
+ ### Model Sources
43
 
44
  - **Repository:** https://github.com/argilla-io/notus-7b
45
  - **Paper:** N/A
 
51
 
52
  ## Evaluation
53
 
54
+ LM Eval Harness is a useful benchmark, but we have found that Alpaca Eval and MT Bench results usually say more about how models will perform in real scenarios and when interacting with humans via chat applications. The results shown below are therefore reported only as reference metrics and for comparison with existing, similar LLMs.
55
+
56
+ ### LM Eval Harness
57
+
58
  We ran the evaluation using [`EleutherAI/lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) from the `big-refactor` branch, aiming to mimic the [Open LLM Leaderboard by HuggingFace H4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), but running everything on our VMs instead, as we're still experimenting.
59
 
60
  From a first evaluation on the benchmark, we could see that Notus 7B DPO **slightly improved** compared to Zephyr 7B Beta/Alpha and Mistral 7B as we see from the average metric of 7 tasks from the leaderboard.
 
66
  |[HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
67
  | **Ours** | **54.09** | 64.25 | 84.90 | 61.69 | 52.77 | 74.51 | 39.5 | 0.98 |
68
 
69
+ We will also add our model to the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) queue so that it is evaluated on Hugging Face's end and we can check that our results match theirs, as we found some inconsistencies for DROP when using the `big-refactor` branch of `lm-eval-harness`.
70
+
71
+ ### MT Bench (Coming soon!)
72
+
73
+ ### Alpaca Eval (Coming soon!)
74
+
75
  ## Training Details
76
 
77
+ ### Training Hardware
78
+
79
+ We used a VM with 8 x A100 40GB GPUs hosted on Lambda Labs.
80
+
81
  ### Training Data
82
 
83
  We used a slightly curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/argilla/ultrafeedback-binarized-avg-rating-for-dpo).
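
Reviewer note on the evaluation table: the README describes the first numeric column as "the average metric of 7 tasks from the leaderboard". The sketch below is a quick sanity check of that claim for the two rows visible in this diff. It assumes the average is a plain unweighted mean of the seven remaining columns. The table header is not part of the hunk, so the task scores are kept positional rather than named, and `rows` and `mean_score` are illustrative names, not part of the repository.

```python
# Scores copied from the two table rows visible in the diff.
# Assumption: the first numeric column is the unweighted mean of the other seven.
rows = {
    "HuggingFaceH4/zephyr-7b-beta": {
        "reported_avg": 52.15,
        "tasks": [62.03, 84.36, 61.07, 57.45, 77.74, 12.74, 9.66],
    },
    "Ours": {
        "reported_avg": 54.09,
        "tasks": [64.25, 84.90, 61.69, 52.77, 74.51, 39.5, 0.98],
    },
}


def mean_score(scores):
    """Return the unweighted mean of a list of task scores."""
    return sum(scores) / len(scores)


for name, row in rows.items():
    computed = round(mean_score(row["tasks"]), 2)
    # Flag any row whose reported average differs from the recomputed one.
    status = "ok" if abs(computed - row["reported_avg"]) < 0.005 else "MISMATCH"
    print(f"{name}: reported={row['reported_avg']} computed={computed} [{status}]")
```

Under that assumption both reported averages are consistent with the per-task columns, so the "slightly improved" statement rests on the mean alone; the last two task columns differ sharply between the two rows, which matches the DROP inconsistency the commit mentions.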