Commit e85021f
1 Parent(s): 12ff2e8

Adding Evaluation Results (#5)

- Adding Evaluation Results (687c28b6176490e303b126ce589b3f51c07ee056)


Co-authored-by: Open LLM Leaderboard PR Bot <leaderboard-pr-bot@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +109 -0
README.md CHANGED
@@ -18,6 +18,101 @@ pipeline_tag: text-generation
 inference: false
 model_creator: MaziyarPanahi
 quantized_by: MaziyarPanahi
+model-index:
+- name: calme-2.3-qwen2-72b
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: IFEval (0-Shot)
+      type: HuggingFaceH4/ifeval
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: inst_level_strict_acc and prompt_level_strict_acc
+      value: 38.5
+      name: strict accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-2.3-qwen2-72b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: BBH (3-Shot)
+      type: BBH
+      args:
+        num_few_shot: 3
+    metrics:
+    - type: acc_norm
+      value: 51.23
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-2.3-qwen2-72b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MATH Lvl 5 (4-Shot)
+      type: hendrycks/competition_math
+      args:
+        num_few_shot: 4
+    metrics:
+    - type: exact_match
+      value: 14.73
+      name: exact match
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-2.3-qwen2-72b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GPQA (0-shot)
+      type: Idavidrein/gpqa
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 16.22
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-2.3-qwen2-72b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MuSR (0-shot)
+      type: TAUR-Lab/MuSR
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: acc_norm
+      value: 11.24
+      name: acc_norm
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-2.3-qwen2-72b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU-PRO (5-shot)
+      type: TIGER-Lab/MMLU-Pro
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 49.1
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=MaziyarPanahi/calme-2.3-qwen2-72b
+      name: Open LLM Leaderboard
 ---
 
 <img src="./calme-2.webp" alt="Qwen2 fine-tune" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
@@ -107,3 +202,17 @@ model = AutoModelForCausalLM.from_pretrained("MaziyarPanahi/calme-2.3-qwen2-72b"
 
 As with any large language model, users should be aware of potential biases and limitations. We recommend implementing appropriate safeguards and human oversight when deploying this model in production environments.
 
+
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_MaziyarPanahi__calme-2.3-qwen2-72b)
+
+|      Metric       |Value|
+|-------------------|----:|
+|Avg.               |30.17|
+|IFEval (0-Shot)    |38.50|
+|BBH (3-Shot)       |51.23|
+|MATH Lvl 5 (4-Shot)|14.73|
+|GPQA (0-shot)      |16.22|
+|MuSR (0-shot)      |11.24|
+|MMLU-PRO (5-shot)  |49.10|
+
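
Note on the Avg. row: the commit does not say how the average is computed. The figure is consistent with an unweighted mean of the six benchmark scores, which is the assumption made in the minimal sketch below. The `scores` dictionary and the `mean_score` helper are illustrative names, not part of the leaderboard tooling.

```python
# Scores copied from the results table added in this commit.
scores = {
    "IFEval (0-Shot)": 38.50,
    "BBH (3-Shot)": 51.23,
    "MATH Lvl 5 (4-Shot)": 14.73,
    "GPQA (0-shot)": 16.22,
    "MuSR (0-shot)": 11.24,
    "MMLU-PRO (5-shot)": 49.10,
}


def mean_score(values):
    """Unweighted arithmetic mean (assumed averaging rule), rounded to 2 places."""
    values = list(values)
    return round(sum(values) / len(values), 2)


average = mean_score(scores.values())
print(f"Avg. = {average:.2f}")
```

If the leaderboard ever weights benchmarks differently, only `mean_score` would need to change; the per-benchmark values stay as reported in the table.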