jaspercatapang committed
Commit 924e54b
1 Parent(s): 5b8717a

Update README.md

Files changed (1)
  1. README.md +20 -1
README.md CHANGED
@@ -15,7 +15,7 @@ Released January 11, 2024
 
  ![bagel-burger](bagel-burger.png)
 
- This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit). For more information, kindly refer to the model cards from jondurbin linked in a section below.
+ This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit). For more information, refer to the model cards from jondurbin linked in the section below. This model debuted on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) at rank #4 (January 11, 2024).
 
  ## Merge Details
  ### Merge Method
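The paragraph above describes a mergekit merge of pre-trained checkpoints. As a minimal sketch of using such a merged model once its weights are on the Hugging Face Hub, assuming `transformers` and `accelerate` are installed; the repo id below is a placeholder, not this model's actual id:

```python
# Minimal sketch of loading a merged model with transformers.
# Assumption: "your-username/merged-bagel-34b" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/merged-bagel-34b"  # placeholder, not the real repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # shard across available devices (requires accelerate)
)

prompt = "Explain what a model merge is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```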
@@ -28,6 +28,25 @@ The following models were included in the merge:
  * [jondurbin/bagel-dpo-34b-v0.2](https://huggingface.co/jondurbin/bagel-dpo-34b-v0.2)
  * [jondurbin/nontoxic-bagel-34b-v0.2](https://huggingface.co/jondurbin/nontoxic-bagel-34b-v0.2)
 
+ ## Open LLM Leaderboard Metrics (as of January 11, 2024)
+ | Metric              | Value |
+ |---------------------|-------|
+ | MMLU (5-shot)       | 76.60 |
+ | ARC (25-shot)       | 72.70 |
+ | HellaSwag (10-shot) | 85.44 |
+ | TruthfulQA (0-shot) | 71.42 |
+ | Winogrande (5-shot) | 82.72 |
+ | GSM8K (5-shot)      | 60.73 |
+ | Average             | 74.93 |
+
+ According to the leaderboard description, these are the benchmarks used for the evaluation:
+ - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy across 57 tasks, including elementary mathematics, US history, computer science, law, and more.
+ - [AI2 Reasoning Challenge (ARC)](https://arxiv.org/abs/1803.05457) (25-shot) - a set of grade-school science questions.
+ - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference that is easy for humans (~95%) but challenging for SOTA models.
+ - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
+ - [Winogrande](https://arxiv.org/abs/1907.10641) (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
+ - [GSM8K](https://arxiv.org/abs/2110.14168) (5-shot) - diverse grade-school math word problems to measure a model’s ability to solve multi-step mathematical reasoning problems.
+
  ### Configuration
 
  The following YAML configuration was used to produce this model:
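The leaderboard table added in the diff above reports an average over the six benchmark scores. As a small sketch recomputing it from those numbers (the exact mean is 74.935, consistent with the reported 74.93 after the leaderboard's two-decimal display):

```python
# Recompute the Open LLM Leaderboard average from the six scores in the table above.
# Decimal keeps the arithmetic exact rather than relying on float rounding.
from decimal import Decimal

scores = {
    "MMLU (5-shot)": Decimal("76.60"),
    "ARC (25-shot)": Decimal("72.70"),
    "HellaSwag (10-shot)": Decimal("85.44"),
    "TruthfulQA (0-shot)": Decimal("71.42"),
    "Winogrande (5-shot)": Decimal("82.72"),
    "GSM8K (5-shot)": Decimal("60.73"),
}

average = sum(scores.values()) / Decimal(len(scores))
print(average)  # 74.935, which the leaderboard displays as 74.93
```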