leaderboard-pr-bot committed
Commit a46d46a · verified · 1 Parent(s): 8bdf7c4

Adding Evaluation Results


This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions
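Side note for readers who want to consume these results programmatically rather than from the rendered card: once a PR like this is merged, the `model-index` block it adds is machine-readable via the `huggingface_hub` library. A minimal sketch, assuming `huggingface_hub` is installed and using the repo id that appears in the leaderboard URLs in the diff below:

```python
# Minimal sketch (not part of this PR): read back the evaluation results that
# the merged model-index block exposes. Assumes `pip install huggingface_hub`.
from huggingface_hub import ModelCard

card = ModelCard.load("FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B")

# `eval_results` is parsed from the card's `model-index` YAML;
# it is None when the card carries no model-index block.
for result in card.data.eval_results or []:
    print(f"{result.dataset_name}: {result.metric_type} = {result.metric_value}")
```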

Files changed (1)
  1. README.md +118 -2
README.md CHANGED
@@ -1,10 +1,113 @@
  ---
+ license: llama2
  tags:
  - llm
  - llama
  - spellcheck
  - grammar
- license: llama2
+ model-index:
+ - name: Karen_TheEditor_V2_STRICT_Mistral_7B
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: AI2 Reasoning Challenge (25-Shot)
+       type: ai2_arc
+       config: ARC-Challenge
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: acc_norm
+       value: 59.56
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: HellaSwag (10-Shot)
+       type: hellaswag
+       split: validation
+       args:
+         num_few_shot: 10
+     metrics:
+     - type: acc_norm
+       value: 81.79
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU (5-Shot)
+       type: cais/mmlu
+       config: all
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 59.56
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: TruthfulQA (0-shot)
+       type: truthful_qa
+       config: multiple_choice
+       split: validation
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: mc2
+       value: 49.36
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: Winogrande (5-shot)
+       type: winogrande
+       config: winogrande_xl
+       split: validation
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 74.35
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GSM8k (5-shot)
+       type: gsm8k
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 30.17
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=FPHam/Karen_TheEditor_V2_STRICT_Mistral_7B
+       name: Open LLM Leaderboard
  ---
 
  <!-- header start -->
@@ -136,4 +239,17 @@ After probably 10 different versions with subsequent changes, I can now say that
 
  The goal was to create a model that wouldn't change the style of the text. Often, LLM models, when asked to edit text, will attempt to rewrite the text even if the text is already fine. This proved to be quite challenging for such a small model where the main task was to determine the right balance between fixing the text (and not changing its style) and copying it verbatim.
 
- The strict model assumes that you're already a good writer that doesn't need hand-holding and that every word you've written you've meant.
+ The strict model assumes that you're already a good writer that doesn't need hand-holding and that every word you've written you've meant.
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_FPHam__Karen_TheEditor_V2_STRICT_Mistral_7B)
+
+ | Metric |Value|
+ |---------------------------------|----:|
+ |Avg. |59.13|
+ |AI2 Reasoning Challenge (25-Shot)|59.56|
+ |HellaSwag (10-Shot) |81.79|
+ |MMLU (5-Shot) |59.56|
+ |TruthfulQA (0-shot) |49.36|
+ |Winogrande (5-shot) |74.35|
+ |GSM8k (5-shot) |30.17|
+
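One detail worth confirming in the table this diff adds: the leaderboard's `Avg.` row is the plain, unweighted arithmetic mean of the six benchmark scores. A quick sanity-check sketch:

```python
# Sanity check: the "Avg." row is the unweighted mean of the six scores above.
scores = [59.56, 81.79, 59.56, 49.36, 74.35, 30.17]
avg = sum(scores) / len(scores)
print(f"{avg:.2f}")  # -> 59.13, matching the Avg. row in the table
```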