hjc-puro committed
Commit 7952630 · 1 parent: 0e0cfef

Update README.md

Files changed (1): README.md (+64 −62)
README.md CHANGED
@@ -59,76 +59,80 @@ Hermes 3 is competitive, if not superior, to Llama-3.1 Instruct models at genera
 
  ## GPT4All:
  ```
- |    Task     |Version| Metric |Value |   |Stderr|
- |-------------|------:|--------|-----:|---|-----:|
- |arc_challenge|      0|acc     |0.5529|±  |0.0145|
- |             |       |acc_norm|0.5870|±  |0.0144|
- |arc_easy     |      0|acc     |0.8371|±  |0.0076|
- |             |       |acc_norm|0.8144|±  |0.0080|
- |boolq        |      1|acc     |0.8599|±  |0.0061|
- |hellaswag    |      0|acc     |0.6133|±  |0.0049|
- |             |       |acc_norm|0.7989|±  |0.0040|
- |openbookqa   |      0|acc     |0.3940|±  |0.0219|
- |             |       |acc_norm|0.4680|±  |0.0223|
- |piqa         |      0|acc     |0.8063|±  |0.0092|
- |             |       |acc_norm|0.8156|±  |0.0090|
- |winogrande   |      0|acc     |0.7372|±  |0.0124|
  ```
 
- Average: 72.59
 
  ## AGIEval:
  ```
- |             Task             |Version| Metric |Value |   |Stderr|
- |------------------------------|------:|--------|-----:|---|-----:|
- |agieval_aqua_rat              |      0|acc     |0.2441|±  |0.0270|
- |                              |       |acc_norm|0.2441|±  |0.0270|
- |agieval_logiqa_en             |      0|acc     |0.3687|±  |0.0189|
- |                              |       |acc_norm|0.3840|±  |0.0191|
- |agieval_lsat_ar               |      0|acc     |0.2304|±  |0.0278|
- |                              |       |acc_norm|0.2174|±  |0.0273|
- |agieval_lsat_lr               |      0|acc     |0.5471|±  |0.0221|
- |                              |       |acc_norm|0.5373|±  |0.0221|
- |agieval_lsat_rc               |      0|acc     |0.6617|±  |0.0289|
- |                              |       |acc_norm|0.6357|±  |0.0294|
- |agieval_sat_en                |      0|acc     |0.7670|±  |0.0295|
- |                              |       |acc_norm|0.7379|±  |0.0307|
- |agieval_sat_en_without_passage|      0|acc     |0.4417|±  |0.0347|
- |                              |       |acc_norm|0.4223|±  |0.0345|
- |agieval_sat_math              |      0|acc     |0.4000|±  |0.0331|
- |                              |       |acc_norm|0.3455|±  |0.0321|
  ```
 
- Average: 44.05
 
  ## BigBench:
 
  ```
-
- |                      Task                      |Version|       Metric        |Value |   |Stderr|
- |------------------------------------------------|------:|---------------------|-----:|---|-----:|
- |bigbench_causal_judgement                       |      0|multiple_choice_grade|0.6000|±  |0.0356|
- |bigbench_date_understanding                     |      0|multiple_choice_grade|0.6585|±  |0.0247|
- |bigbench_disambiguation_qa                      |      0|multiple_choice_grade|0.3178|±  |0.0290|
- |bigbench_geometric_shapes                       |      0|multiple_choice_grade|0.2340|±  |0.0224|
- |                                                |       |exact_str_match      |0.0000|±  |0.0000|
- |bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|0.2980|±  |0.0205|
- |bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|0.2057|±  |0.0153|
- |bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|0.5367|±  |0.0288|
- |bigbench_movie_recommendation                   |      0|multiple_choice_grade|0.4040|±  |0.0220|
- |bigbench_navigate                               |      0|multiple_choice_grade|0.4970|±  |0.0158|
- |bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|0.7075|±  |0.0102|
- |bigbench_ruin_names                             |      0|multiple_choice_grade|0.4821|±  |0.0236|
- |bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|0.2295|±  |0.0133|
- |bigbench_snarks                                 |      0|multiple_choice_grade|0.6906|±  |0.0345|
- |bigbench_sports_understanding                   |      0|multiple_choice_grade|0.5375|±  |0.0159|
- |bigbench_temporal_sequences                     |      0|multiple_choice_grade|0.6270|±  |0.0153|
- |bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|0.2216|±  |0.0118|
- |bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|0.1594|±  |0.0088|
- |bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|0.5367|±  |0.0288|
  ```
 
- Average: 44.13
 
  # Prompt Format
@@ -171,7 +175,7 @@ To utilize the prompt format without a system prompt, simply leave the line out.
 
  ## Prompt Format for Function Calling
 
- # Note: This version uses USER as both the user prompt and the tool response role. This is due to a bug we experienced when training. It will require modification to the function calling code!
 
  Our model was trained on specific system prompts and structures for Function Calling.
 
@@ -200,7 +204,7 @@ The model will then generate a tool call, which your inference code must parse,
 
  Once you parse the tool call, call the api and get the returned values for the call, and pass it back in as a new role, `tool` like so:
  ```
- <|im_start|>user
  <tool_response>
  {"name": "get_stock_fundamentals", "content": {'symbol': 'TSLA', 'company_name': 'Tesla, Inc.', 'sector': 'Consumer Cyclical', 'industry': 'Auto Manufacturers', 'market_cap': 611384164352, 'pe_ratio': 49.604652, 'pb_ratio': 9.762013, 'dividend_yield': None, 'eps': 4.3, 'beta': 2.427, '52_week_high': 299.29, '52_week_low': 152.37}}
  </tool_response>
@@ -305,6 +309,4 @@ GGUF Quants: https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11857},
  }
- ```
-
-
 
  ## GPT4All:
  ```
+ |    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
+ |-------------|------:|------|-----:|--------|---|-----:|---|-----:|
+ |arc_challenge|      1|none  |     0|acc     |↑  |0.4411|±  |0.0145|
+ |             |       |none  |     0|acc_norm|↑  |0.4377|±  |0.0145|
+ |arc_easy     |      1|none  |     0|acc     |↑  |0.7399|±  |0.0090|
+ |             |       |none  |     0|acc_norm|↑  |0.6566|±  |0.0097|
+ |boolq        |      2|none  |     0|acc     |↑  |0.8327|±  |0.0065|
+ |hellaswag    |      1|none  |     0|acc     |↑  |0.5453|±  |0.0050|
+ |             |       |none  |     0|acc_norm|↑  |0.7047|±  |0.0046|
+ |openbookqa   |      1|none  |     0|acc     |↑  |0.3480|±  |0.0213|
+ |             |       |none  |     0|acc_norm|↑  |0.4280|±  |0.0221|
+ |piqa         |      1|none  |     0|acc     |↑  |0.7639|±  |0.0099|
+ |             |       |none  |     0|acc_norm|↑  |0.7584|±  |0.0100|
+ |winogrande   |      1|none  |     0|acc     |↑  |0.6590|±  |0.0133|
  ```
 
+ Average: 64.00
 
  ## AGIEval:
  ```
+ |            Tasks             |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
+ |------------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
+ |agieval_aqua_rat              |      1|none  |     0|acc     |↑  |0.2283|±  |0.0264|
+ |                              |       |none  |     0|acc_norm|↑  |0.2441|±  |0.0270|
+ |agieval_logiqa_en             |      1|none  |     0|acc     |↑  |0.3057|±  |0.0181|
+ |                              |       |none  |     0|acc_norm|↑  |0.3272|±  |0.0184|
+ |agieval_lsat_ar               |      1|none  |     0|acc     |↑  |0.2304|±  |0.0278|
+ |                              |       |none  |     0|acc_norm|↑  |0.1957|±  |0.0262|
+ |agieval_lsat_lr               |      1|none  |     0|acc     |↑  |0.3784|±  |0.0215|
+ |                              |       |none  |     0|acc_norm|↑  |0.3588|±  |0.0213|
+ |agieval_lsat_rc               |      1|none  |     0|acc     |↑  |0.4610|±  |0.0304|
+ |                              |       |none  |     0|acc_norm|↑  |0.4275|±  |0.0302|
+ |agieval_sat_en                |      1|none  |     0|acc     |↑  |0.6019|±  |0.0342|
+ |                              |       |none  |     0|acc_norm|↑  |0.5340|±  |0.0348|
+ |agieval_sat_en_without_passage|      1|none  |     0|acc     |↑  |0.3981|±  |0.0342|
+ |                              |       |none  |     0|acc_norm|↑  |0.3981|±  |0.0342|
+ |agieval_sat_math              |      1|none  |     0|acc     |↑  |0.2500|±  |0.0293|
+ |                              |       |none  |     0|acc_norm|↑  |0.2636|±  |0.0298|
  ```
 
+ Average: 34.36
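 
For reference, the reported averages appear to be plain means of the per-task scores. Below is a minimal sketch, assuming each task contributes its acc_norm (or its primary metric where no acc_norm is reported) scaled by 100; the README does not state the rule, but the same assumption reproduces the BigBench average further down.

```python
# Hedged sketch: reproduce the reported AGIEval average from the table above.
# Assumption (not stated in the README): the average is the mean of each
# task's acc_norm, scaled by 100.
agieval_acc_norm = {
    "agieval_aqua_rat":               0.2441,
    "agieval_logiqa_en":              0.3272,
    "agieval_lsat_ar":                0.1957,
    "agieval_lsat_lr":                0.3588,
    "agieval_lsat_rc":                0.4275,
    "agieval_sat_en":                 0.5340,
    "agieval_sat_en_without_passage": 0.3981,
    "agieval_sat_math":               0.2636,
}

average = 100 * sum(agieval_acc_norm.values()) / len(agieval_acc_norm)
print(f"Average: {average:.2f}")  # -> Average: 34.36
```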
 
  ## BigBench:
 
  ```
+ |                         Tasks                          |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
+ |--------------------------------------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
+ |leaderboard_bbh_boolean_expressions                     |      1|none  |     3|acc_norm|↑  |0.7560|±  |0.0272|
+ |leaderboard_bbh_causal_judgement                        |      1|none  |     3|acc_norm|↑  |0.6043|±  |0.0359|
+ |leaderboard_bbh_date_understanding                      |      1|none  |     3|acc_norm|↑  |0.3280|±  |0.0298|
+ |leaderboard_bbh_disambiguation_qa                       |      1|none  |     3|acc_norm|↑  |0.5880|±  |0.0312|
+ |leaderboard_bbh_formal_fallacies                        |      1|none  |     3|acc_norm|↑  |0.5280|±  |0.0316|
+ |leaderboard_bbh_geometric_shapes                        |      1|none  |     3|acc_norm|↑  |0.3560|±  |0.0303|
+ |leaderboard_bbh_hyperbaton                              |      1|none  |     3|acc_norm|↑  |0.6280|±  |0.0306|
+ |leaderboard_bbh_logical_deduction_five_objects          |      1|none  |     3|acc_norm|↑  |0.3400|±  |0.0300|
+ |leaderboard_bbh_logical_deduction_seven_objects         |      1|none  |     3|acc_norm|↑  |0.2880|±  |0.0287|
+ |leaderboard_bbh_logical_deduction_three_objects         |      1|none  |     3|acc_norm|↑  |0.4160|±  |0.0312|
+ |leaderboard_bbh_movie_recommendation                    |      1|none  |     3|acc_norm|↑  |0.6760|±  |0.0297|
+ |leaderboard_bbh_navigate                                |      1|none  |     3|acc_norm|↑  |0.5800|±  |0.0313|
+ |leaderboard_bbh_object_counting                         |      1|none  |     3|acc_norm|↑  |0.3640|±  |0.0305|
+ |leaderboard_bbh_penguins_in_a_table                     |      1|none  |     3|acc_norm|↑  |0.3836|±  |0.0404|
+ |leaderboard_bbh_reasoning_about_colored_objects         |      1|none  |     3|acc_norm|↑  |0.3560|±  |0.0303|
+ |leaderboard_bbh_ruin_names                              |      1|none  |     3|acc_norm|↑  |0.4160|±  |0.0312|
+ |leaderboard_bbh_salient_translation_error_detection     |      1|none  |     3|acc_norm|↑  |0.3080|±  |0.0293|
+ |leaderboard_bbh_snarks                                  |      1|none  |     3|acc_norm|↑  |0.5618|±  |0.0373|
+ |leaderboard_bbh_sports_understanding                    |      1|none  |     3|acc_norm|↑  |0.6600|±  |0.0300|
+ |leaderboard_bbh_temporal_sequences                      |      1|none  |     3|acc_norm|↑  |0.2320|±  |0.0268|
+ |leaderboard_bbh_tracking_shuffled_objects_five_objects  |      1|none  |     3|acc_norm|↑  |0.1640|±  |0.0235|
+ |leaderboard_bbh_tracking_shuffled_objects_seven_objects |      1|none  |     3|acc_norm|↑  |0.1480|±  |0.0225|
+ |leaderboard_bbh_tracking_shuffled_objects_three_objects|       1|none  |     3|acc_norm|↑  |0.3120|±  |0.0294|
+ |leaderboard_bbh_web_of_lies                             |      1|none  |     3|acc_norm|↑  |0.5080|±  |0.0317|
  ```
 
+ Average: 43.76
 
  # Prompt Format
 
 
  ## Prompt Format for Function Calling
 
+ # Note: A previous version used USER as both the user prompt and the tool response role, but this has now been fixed. Please use USER for the user prompt role and TOOL for the tool response role.
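
To make the fixed roles concrete, here is a hedged sketch of the round trip this section describes: parse the model's `<tool_call>`, execute it, and send the result back under the `tool` role. The helpers `parse_tool_call` and `format_chatml` and the sample payload are illustrative assumptions, not the repository's official function-calling code.

```python
import json
import re

# Hedged sketch of the fixed roles described in the note above: the user turn
# uses the "user" role, and the tool result goes back under the "tool" role
# (older checkpoints expected "user" for both).

def parse_tool_call(assistant_text):
    # Pull the JSON payload out of the model's <tool_call>...</tool_call> turn.
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
                      assistant_text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

def format_chatml(messages):
    # Render messages in the ChatML layout shown in this README:
    # <|im_start|>{role}\n{content}<|im_end|>
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

# Suppose the model emitted this tool call and your code executed it:
call = parse_tool_call(
    '<tool_call>{"name": "get_stock_fundamentals", '
    '"arguments": {"symbol": "TSLA"}}</tool_call>'
)
tool_result = {"name": call["name"],
               "content": {"symbol": "TSLA", "pe_ratio": 49.604652}}

messages = [
    {"role": "user",
     "content": "Fetch the stock fundamentals data for Tesla (TSLA)"},
    # ...the assistant's <tool_call> turn would normally stay in the history...
    {"role": "tool",  # the fixed role for tool responses
     "content": "<tool_response>\n" + json.dumps(tool_result) + "\n</tool_response>"},
]

# A trailing assistant header cues the model to produce its final answer.
prompt = format_chatml(messages) + "<|im_start|>assistant\n"
print(prompt)
```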
 
  Our model was trained on specific system prompts and structures for Function Calling.
 
 
  Once you parse the tool call, call the api and get the returned values for the call, and pass it back in as a new role, `tool` like so:
  ```
+ <|im_start|>tool
  <tool_response>
  {"name": "get_stock_fundamentals", "content": {'symbol': 'TSLA', 'company_name': 'Tesla, Inc.', 'sector': 'Consumer Cyclical', 'industry': 'Auto Manufacturers', 'market_cap': 611384164352, 'pe_ratio': 49.604652, 'pb_ratio': 9.762013, 'dividend_yield': None, 'eps': 4.3, 'beta': 2.427, '52_week_high': 299.29, '52_week_low': 152.37}}
  </tool_response>
 
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11857},
  }
+ ```