jaspercatapang committed
Commit 6696617
1 Parent(s): c091ae8

Update README.md

Files changed (1):
  1. README.md +23 -10
README.md CHANGED
@@ -14,33 +14,46 @@ datasets:
 Released August 11, 2023
 
 ## Model Description
-GodziLLa 2 70B is an experimental combination of various proprietary LoRAs from Maya Philippines and [Guanaco LLaMA 2 1K dataset](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k), with LLaMA 2 70B. This model's primary purpose is to stress test the limitations of composite, instruction-following LLMs and observe its performance with respect to other LLMs available on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). This model debuted in the leaderboard at rank #4 (August 17, 2023) and operates under the Llama 2 license.
+GodziLLa 2 70B is an experimental combination of various proprietary LoRAs from Maya Philippines and the [Guanaco LLaMA 2 1K dataset](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k) with LLaMA 2 70B. This model's primary purpose is to stress test the limitations of composite, instruction-following LLMs and observe its performance with respect to other LLMs available on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). This model debuted on the leaderboard at rank #4 (August 17, 2023), reached rank #2 in the Fall 2023 update (November 10, 2023), and operates under the Llama 2 license.
 ![Godzilla Happy GIF](https://i.pinimg.com/originals/81/3a/e0/813ae09a30f0bc44130cd2c834fe2eba.gif)
 
-## Open LLM Leaderboard Metrics
+## Open LLM Leaderboard Metrics (Fall 2023 update)
 | Metric                | Value |
 |-----------------------|-------|
 | MMLU (5-shot)         | 69.88 |
 | ARC (25-shot)         | 71.42 |
 | HellaSwag (10-shot)   | 87.53 |
 | TruthfulQA (0-shot)   | 61.54 |
-| Average               | 72.59 |
+| Winogrande (5-shot)   | 83.19 |
+| GSM8K (5-shot)        | 43.21 |
+| DROP (3-shot)         | 52.31 |
+| Average               | 67.01 |
 
 According to the leaderboard description, here are the benchmarks used for the evaluation:
 - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
 - [AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457) -ARC- (25-shot) - a set of grade-school science questions.
 - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
+- [Winogrande](https://arxiv.org/abs/1907.10641) (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
+- [GSM8k](https://arxiv.org/abs/2110.14168) (5-shot) - diverse grade-school math word problems that measure a model's ability to solve multi-step mathematical reasoning problems.
+- [DROP](https://arxiv.org/abs/1903.00161) (3-shot) - an English reading comprehension benchmark requiring Discrete Reasoning Over the content of Paragraphs.
 
 A detailed breakdown of the evaluation can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_MayaPH__GodziLLa2-70B). Huge thanks to [@thomwolf](https://huggingface.co/thomwolf).
 
-## Leaderboard Highlights (as of August 17, 2023)
-- Godzilla 2 70B debuts at 4th place worldwide in the Open LLM Leaderboard.
-- Godzilla 2 70B ranks #3 in the ARC challenge.
-- Godzilla 2 70B ranks #5 in the TruthfulQA benchmark.
-- *Godzilla 2 70B beats GPT-3.5 (ChatGPT) in terms of average performance and the HellaSwag benchmark (87.53 > 85.5).
-- *Godzilla 2 70B outperforms GPT-3.5 (ChatGPT) and GPT-4 on the TruthfulQA benchmark (61.54 for G2-70B, 47 for GPT-3.5, 59 for GPT-4).
-- *Godzilla 2 70B is on par with GPT-3.5 (ChatGPT) on the MMLU benchmark (<0.12%).
+## Open LLM Leaderboard Metrics (before Fall 2023 update)
+| Metric                | Value |
+|-----------------------|-------|
+| MMLU (5-shot)         | 69.88 |
+| ARC (25-shot)         | 71.42 |
+| HellaSwag (10-shot)   | 87.53 |
+| TruthfulQA (0-shot)   | 61.54 |
+| Average               | 72.59 |
+
+## Leaderboard Highlights (Fall 2023 update, November 10, 2023)
+- Godzilla 2 70B debuts at 2nd place worldwide in the newly updated Open LLM Leaderboard.
+- *Godzilla 2 70B beats GPT-3.5 (ChatGPT) in terms of average performance and on the HellaSwag benchmark (87.53 > 85.5).
+- *Godzilla 2 70B outperforms GPT-3.5 (ChatGPT) and GPT-4 on the TruthfulQA benchmark (61.54 for G2-70B, 47 for GPT-3.5, 59 for GPT-4).
+- *Godzilla 2 70B is on par with GPT-3.5 (ChatGPT) on the MMLU benchmark (within 0.12%).
 
 *Based on a [leaderboard clone](https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard) with GPT-3.5 and GPT-4 included.
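The averages reported in the metrics tables above are consistent with a simple unweighted mean of the listed benchmark scores. A minimal Python sketch, with the score lists copied from the tables and the unweighted mean taken as an assumption that happens to reproduce the reported values:

```python
# Reproduce the reported leaderboard averages as unweighted means of the table rows.
fall_2023 = [69.88, 71.42, 87.53, 61.54, 83.19, 43.21, 52.31]  # MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K, DROP
pre_fall_2023 = [69.88, 71.42, 87.53, 61.54]                   # MMLU, ARC, HellaSwag, TruthfulQA

print(round(sum(fall_2023) / len(fall_2023), 2))          # 67.01 (Fall 2023 update)
print(round(sum(pre_fall_2023) / len(pre_fall_2023), 2))  # 72.59 (before the update)
```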
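For completeness, here is a hedged usage sketch for the model this README describes. The repository id MayaPH/GodziLLa2-70B is inferred from the evaluation-details link above, and the float16 plus device_map="auto" settings are only one reasonable way to load a 70B-parameter model with the transformers library, not an official recommendation:

```python
# Minimal, unofficial sketch: load and prompt GodziLLa 2 70B with Hugging Face transformers.
# Assumes the repo id MayaPH/GodziLLa2-70B (inferred, not confirmed by this README),
# the `accelerate` package for device_map="auto", and enough GPU memory for float16 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MayaPH/GodziLLa2-70B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; quantized loading may be needed on smaller setups
    device_map="auto",          # spread layers across available devices
)

prompt = "Summarize the AI2 Reasoning Challenge benchmark in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```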