leaderboard-pr-bot committed on
Commit adf1a5b
1 Parent(s): 6b6e2a6

Adding Evaluation Results

This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions

Files changed (1)
  1. README.md +118 -2
README.md CHANGED
@@ -25,11 +25,114 @@ datasets:
  - Intel/orca_dpo_pairs
  - unalignment/toxic-dpo-v0.1
  - jondurbin/truthy-dpo-v0.1
- - allenai/ultrafeedback_binarized_cleaned
+ - allenai/ultrafeedback_binarized_cleaned
  - Squish42/bluemoon-fandom-1-1-rp-cleaned
  - LDJnr/Capybara
  - JULIELab/EmoBank
  - kingbri/PIPPA-shareGPT
+ model-index:
+ - name: bagel-8x7b-v0.2
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: AI2 Reasoning Challenge (25-Shot)
+       type: ai2_arc
+       config: ARC-Challenge
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: acc_norm
+       value: 68.26
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=jondurbin/bagel-8x7b-v0.2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: HellaSwag (10-Shot)
+       type: hellaswag
+       split: validation
+       args:
+         num_few_shot: 10
+     metrics:
+     - type: acc_norm
+       value: 86.32
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=jondurbin/bagel-8x7b-v0.2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU (5-Shot)
+       type: cais/mmlu
+       config: all
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 70.4
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=jondurbin/bagel-8x7b-v0.2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: TruthfulQA (0-shot)
+       type: truthful_qa
+       config: multiple_choice
+       split: validation
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: mc2
+       value: 60.03
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=jondurbin/bagel-8x7b-v0.2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: Winogrande (5-shot)
+       type: winogrande
+       config: winogrande_xl
+       split: validation
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 81.29
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=jondurbin/bagel-8x7b-v0.2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GSM8k (5-shot)
+       type: gsm8k
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 4.7
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=jondurbin/bagel-8x7b-v0.2
+       name: Open LLM Leaderboard
  ---
 
  # A bagel, with everything (except DPO)
@@ -433,4 +536,17 @@ I am not a lawyer, so I can't help determine if this is actually commercially vi
  - If the dataset was released under a permissive license, but actually includes OpenAI generated data, does that ToS supersede the license?
  - Does the dataset fall completely under fair use anyways, since the model isn't really capable of reproducing the entire training set verbatim?
 
- Use your best judgement and seek legal advice if you are concerned about the terms. In any case, by using this model, you agree to completely indemnify me.
+ Use your best judgement and seek legal advice if you are concerned about the terms. In any case, by using this model, you agree to completely indemnify me.
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_jondurbin__bagel-8x7b-v0.2)
+
+ |             Metric              |Value|
+ |---------------------------------|----:|
+ |Avg.                             |61.83|
+ |AI2 Reasoning Challenge (25-Shot)|68.26|
+ |HellaSwag (10-Shot)              |86.32|
+ |MMLU (5-Shot)                    |70.40|
+ |TruthfulQA (0-shot)              |60.03|
+ |Winogrande (5-shot)              |81.29|
+ |GSM8k (5-shot)                   | 4.70|
+
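The `Avg.` row in the added table is consistent with an unweighted mean of the six benchmark scores: (68.26 + 86.32 + 70.40 + 60.03 + 81.29 + 4.70) / 6 = 371.00 / 6 ≈ 61.83. The Python sketch below recomputes that average from the values in the `model-index` entries and renders a table with the same columns. The `RESULTS` list and the `leaderboard_average` and `render_table` helpers are hypothetical names for illustration; they are not part of the leaderboard PR bot or of `huggingface_hub`.

```python
# Hypothetical data: per-benchmark scores copied from the model-index entries
# in this commit, as (display name, dataset type, metric type, value).
RESULTS = [
    ("AI2 Reasoning Challenge (25-Shot)", "ai2_arc", "acc_norm", 68.26),
    ("HellaSwag (10-Shot)", "hellaswag", "acc_norm", 86.32),
    ("MMLU (5-Shot)", "cais/mmlu", "acc", 70.40),
    ("TruthfulQA (0-shot)", "truthful_qa", "mc2", 60.03),
    ("Winogrande (5-shot)", "winogrande", "acc", 81.29),
    ("GSM8k (5-shot)", "gsm8k", "acc", 4.70),
]


def leaderboard_average(results):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    values = [value for _, _, _, value in results]
    return round(sum(values) / len(values), 2)


def render_table(results):
    """Render a Markdown table with a Metric column and a right-aligned Value column."""
    # Pad every name cell to the longest benchmark name.
    width = max(len(name) for name, *_ in results)
    lines = [
        f"|{'Metric'.center(width)}|Value|",
        f"|{'-' * width}|----:|",
        f"|{'Avg.'.ljust(width)}|{leaderboard_average(results):5.2f}|",
    ]
    for name, _, _, value in results:
        # 5.2f keeps two decimals and right-pads short values, e.g. " 4.70".
        lines.append(f"|{name.ljust(width)}|{value:5.2f}|")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(RESULTS))
```

Markdown renderers ignore the padding inside table cells, so minor whitespace differences from the bot's output do not change how the table displays.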