Files changed (1): README.md (+9 −9)

README.md CHANGED
@@ -17,11 +17,11 @@ SuperPlatty-30B is a merge of [lilloukas/Platypus-30B](https://huggingface.co/li
 
 | Metric                | Value |
 |-----------------------|-------|
-| MMLU (5-shot)         |       |
-| ARC (25-shot)         |       |
-| HellaSwag (10-shot)   |       |
-| TruthfulQA (0-shot)   |       |
-| Avg.                  |       |
+| MMLU (5-shot)         | 62.6  |
+| ARC (25-shot)         | 66.1  |
+| HellaSwag (10-shot)   | 83.9  |
+| TruthfulQA (0-shot)   | 54.0  |
+| Avg.                  | 66.6  |
 
 We use state-of-the-art EleutherAI [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above.
 
@@ -51,22 +51,22 @@ Each task was evaluated on a single A100 80GB GPU.
 
 ARC:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/arc_challenge_25shot.json --device cuda --num_fewshot 25
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/arc_challenge_25shot.json --device cuda --num_fewshot 25
 ```
 
 HellaSwag:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/hellaswag_10shot.json --device cuda --num_fewshot 10
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/hellaswag_10shot.json --device cuda --num_fewshot 10
 ```
 
 MMLU:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks hendrycksTest-* --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/mmlu_5shot.json --device cuda --num_fewshot 5
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks hendrycksTest-* --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/mmlu_5shot.json --device cuda --num_fewshot 5
 ```
 
 TruthfulQA:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/truthfulqa_0shot.json --device cuda
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/truthfulqa_0shot.json --device cuda
 ```
 
 ## Limitations and bias
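As a quick sanity check on the values added to the table, the new `Avg.` row can be verified against the four benchmark scores. This sketch assumes `Avg.` is a plain unweighted arithmetic mean of the four metrics (the convention used by leaderboards built on the Evaluation Harness); that assumption is not stated in the diff itself:

```python
# Hypothetical sanity check: recompute the Avg. row of the updated table,
# assuming it is the unweighted mean of the four benchmark scores.
scores = {
    "MMLU (5-shot)": 62.6,
    "ARC (25-shot)": 66.1,
    "HellaSwag (10-shot)": 83.9,
    "TruthfulQA (0-shot)": 54.0,
}
avg = sum(scores.values()) / len(scores)
print(f"mean = {avg:.2f}")  # 66.65; the table reports this to one decimal as 66.6
```

The exact mean is 66.65, consistent with the 66.6 shown in the table up to the one-decimal display precision.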