Adding Evaluation Results #2
opened by leaderboard-pr-bot

README.md CHANGED
@@ -1,20 +1,114 @@
 ---
-license: mit
-license_link: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/resolve/main/LICENSE
-
 language:
 - multilingual
-
+license: mit
 tags:
 - nlp
 - code
+license_link: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/resolve/main/LICENSE
+pipeline_tag: text-generation
 inference:
   parameters:
     temperature: 0.7
 widget:
-  - messages:
-      - role: user
-        content: Can you provide ways to eat combinations of bananas and dragonfruits?
+- messages:
+  - role: user
+    content: Can you provide ways to eat combinations of bananas and dragonfruits?
+model-index:
+- name: Phi-3-medium-4k-instruct-abliterated-v3
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: IFEval (0-Shot)
+      type: HuggingFaceH4/ifeval
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: inst_level_strict_acc and prompt_level_strict_acc
+      value: 63.19
+      name: strict accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=failspy/Phi-3-medium-4k-instruct-abliterated-v3
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: BBH (3-Shot)
+      type: BBH
+      args:
+        num_few_shot: 3
+    metrics:
+    - type: acc_norm
+      value: 46.73
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=failspy/Phi-3-medium-4k-instruct-abliterated-v3
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MATH Lvl 5 (4-Shot)
+      type: hendrycks/competition_math
+      args:
+        num_few_shot: 4
+    metrics:
+    - type: exact_match
+      value: 14.12
+      name: exact match
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=failspy/Phi-3-medium-4k-instruct-abliterated-v3
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GPQA (0-shot)
+      type: Idavidrein/gpqa
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 8.95
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=failspy/Phi-3-medium-4k-instruct-abliterated-v3
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MuSR (0-shot)
+      type: TAUR-Lab/MuSR
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 18.52
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=failspy/Phi-3-medium-4k-instruct-abliterated-v3
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU-PRO (5-shot)
+      type: TIGER-Lab/MMLU-Pro
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 37.78
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=failspy/Phi-3-medium-4k-instruct-abliterated-v3
+      name: Open LLM Leaderboard
 ---
 
 # Phi-3-medium-4k-instruct-abliterated-v3
@@ -77,3 +171,17 @@ This model may come with interesting quirks, with the methodology being so new.
 If you manage to develop further improvements, please share! This is really the most basic way to use ablation, but there are other possibilities that I believe are as-yet unexplored.
 
 Additionally, feel free to reach out in any way about this. I'm on the Cognitive Computations Discord, I'm watching the Community tab, reach out! I'd love to see this methodology used in other ways, and so would gladly support whoever whenever I can.
+
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_failspy__Phi-3-medium-4k-instruct-abliterated-v3)
+
+|      Metric       |Value|
+|-------------------|----:|
+|Avg.               |31.55|
+|IFEval (0-Shot)    |63.19|
+|BBH (3-Shot)       |46.73|
+|MATH Lvl 5 (4-Shot)|14.12|
+|GPQA (0-shot)      | 8.95|
+|MuSR (0-shot)      |18.52|
+|MMLU-PRO (5-shot)  |37.78|
+
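As a sanity check on the appended table: the leaderboard's "Avg." row is the unweighted mean of the six benchmark scores listed in the model-index metadata. A minimal sketch in plain Python (the score values are copied from the diff above; the dict itself is just an illustration, not part of the PR):

```python
# Recompute the Open LLM Leaderboard "Avg." from the six per-benchmark
# scores added in this PR's model-index metadata.
scores = {
    "IFEval (0-Shot)": 63.19,
    "BBH (3-Shot)": 46.73,
    "MATH Lvl 5 (4-Shot)": 14.12,
    "GPQA (0-shot)": 8.95,
    "MuSR (0-shot)": 18.52,
    "MMLU-PRO (5-shot)": 37.78,
}

# Unweighted mean, rounded to two decimals as on the leaderboard.
avg = round(sum(scores.values()) / len(scores), 2)
print(avg)  # 31.55 — matches the "Avg." row in the table
```

This confirms the table is internally consistent: (63.19 + 46.73 + 14.12 + 8.95 + 18.52 + 37.78) / 6 ≈ 31.55.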