leaderboard-pr-bot committed on
Commit 0cbd480
1 parent: 9c8f78a

Adding Evaluation Results

This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions

Files changed (1): README.md (+117, -1)
README.md CHANGED

@@ -1,5 +1,108 @@
---
license: mit
model-index:
- name: ASTS-PFAF
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 61.26
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 82.94
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.96
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 43.74
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 76.87
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 23.81
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ericpolewski/ASTS-PFAF
      name: Open LLM Leaderboard
---
Ok, so this guy offers [this challenge](https://www.reddit.com/r/ArtificialInteligence/comments/1akestf/day_3_prove_i_am_full_of_bs_and_my_dataset_doesnt/) and I don't actually have a lot going on in my life right now. So I'm like, fine. Your idea looks interesting. I have no idea why you're spamming it; it does not appear you make any money from this. Why would you offer to pay for our fine-tuning if we don't like the results after fine-tuning on your data? Does this thing Trojan-horse in some crazy thing that lets you control all robots later, even though it improves performance now? I dunno. I don't even know if I'm doing this right. It says to fine-tune your model on it, but I don't know if that means make my model first and then fine-tune using his thing, or if I can just sprinkle it into mine and cross my fingers. I'm going to sprinkle in his data and cross my fingers.

@@ -34,4 +137,17 @@

I trained from my workstation. I have 2x 3090s and an AMD 5900X. Chicago power is 15¢/kWh. Each 3090 draws about 350 watts, and the rest of the system probably draws another 200 watts or so. But then my room gets hot and I have to turn on the overhead fan and kick on the HVAC vent fan with the windows open, or else my place gets really hot even in the middle of winter. We'll call it a kilowatt even, since we're not billing wear and tear on the cards; I think you have to depreciate those by time anyway and not usage, at least for tax purposes. Anyway, dataset prep and training took about 3 hours in total. Looking at raw data sizes, the PFAF data was about 500 KB and my data around 2.1 MB. So if we calculate that out, we get 3 * 0.15 * (500/(2100+500)) = 0.0865 as the portion of the fine-tuning cost attributable to PFAF (someone check my math, I'm stoned). I feel like this guy owes me 9 cents, but I'm not gonna be petty about it. You can't give fractions of a penny. We'll call it 8 cents. If the scores don't improve.
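The back-of-the-envelope math above, written out. These are just the paragraph's own assumed numbers (a 1 kW whole-system draw, 3 hours, 15¢/kWh, 500 KB of PFAF data vs. 2.1 MB of mine), not measured values:

```python
# Cost attribution for the PFAF share of the fine-tune, using the
# rough figures from the paragraph above.
hours = 3.0            # total dataset prep + training time
kw = 1.0               # whole-system draw, rounded up for fans/HVAC
price_per_kwh = 0.15   # Chicago residential rate, $/kWh
pfaf_kb = 500.0        # raw size of the PFAF data
own_kb = 2100.0        # raw size of my own data

total_cost = hours * kw * price_per_kwh    # $0.45 for the whole run
pfaf_share = pfaf_kb / (pfaf_kb + own_kb)  # ~19.2% of the training mix by size
pfaf_cost = total_cost * pfaf_share        # ~$0.0865, i.e. about 9 cents

print(f"PFAF-attributable cost: ${pfaf_cost:.4f}")
```

So the math checks out: roughly 8.65 cents, which rounds up to the 9 cents mentioned, or down to 8 if you refuse to split pennies.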

(We'll see, probably tomorrow or so when the leaderboard updates, whether this dataset does anything worth exploring just by dumping it in as the guy suggested. Compare it to TacoBeLLM and Palworld-SME-13b on the leaderboard for bots I made in similar ways.)

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_ericpolewski__ASTS-PFAF).

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 57.93 |
| AI2 Reasoning Challenge (25-Shot) | 61.26 |
| HellaSwag (10-Shot)               | 82.94 |
| MMLU (5-Shot)                     | 58.96 |
| TruthfulQA (0-shot)               | 43.74 |
| Winogrande (5-shot)               | 76.87 |
| GSM8k (5-shot)                    | 23.81 |
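The Avg. row is just the arithmetic mean of the six benchmark scores, which is easy to verify:

```python
# The leaderboard's Avg. is the mean of the six benchmark scores above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 61.26,
    "HellaSwag (10-Shot)": 82.94,
    "MMLU (5-Shot)": 58.96,
    "TruthfulQA (0-shot)": 43.74,
    "Winogrande (5-shot)": 76.87,
    "GSM8k (5-shot)": 23.81,
}
avg = sum(scores.values()) / len(scores)
print(f"Avg. = {avg:.2f}")  # matches the 57.93 in the table
```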