jaspercatapang committed
Commit e0c9db3 • 1 Parent(s): 9711067
Update README.md
README.md CHANGED
@@ -41,6 +41,7 @@ According to the leaderboard description, here are the benchmarks used for the e
*Based on a [leaderboard clone](https://huggingface.co/spaces/gsaivinay/open_llm_leaderboard) with GPT-3.5 and GPT-4 included.

### Reproducing Evaluation Results
+*Instruction template taken from [Platypus 2 70B instruct](https://huggingface.co/garage-bAInd/Platypus2-70B-instruct).

Install LM Evaluation Harness:
```
@@ -53,26 +54,25 @@ git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
# install
pip install -e .
```
-Each task was evaluated on a single A100 80GB GPU.

ARC:
```
-python main.py --model hf-causal-experimental --model_args pretrained=
+python main.py --model hf-causal-experimental --model_args pretrained=MayaPH/GodziLLa2-70B --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/G270B/arc_challenge_25shot.json --device cuda --num_fewshot 25
```

HellaSwag:
```
-python main.py --model hf-causal-experimental --model_args pretrained=
+python main.py --model hf-causal-experimental --model_args pretrained=MayaPH/GodziLLa2-70B --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/G270B/hellaswag_10shot.json --device cuda --num_fewshot 10
```

MMLU:
```
-python main.py --model hf-causal-experimental --model_args pretrained=
+python main.py --model hf-causal-experimental --model_args pretrained=MayaPH/GodziLLa2-70B --tasks hendrycksTest-* --batch_size 1 --no_cache --write_out --output_path results/G270B/mmlu_5shot.json --device cuda --num_fewshot 5
```

TruthfulQA:
```
-python main.py --model hf-causal-experimental --model_args pretrained=
+python main.py --model hf-causal-experimental --model_args pretrained=MayaPH/GodziLLa2-70B --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/G270B/truthfulqa_0shot.json --device cuda
```

### Prompt Template
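A note on the install block: only `# install` and `pip install -e .` fall inside the hunk, with `git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463` visible in the hunk header. A minimal sketch of the full sequence, assuming the harness comes from the usual EleutherAI repository (the clone URL is an inference, not part of this commit):

```
# Sketch only: clone URL inferred; the pinned commit hash is taken from the hunk header above.
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
# install
pip install -e .
```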
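The four completed commands differ only in task name, few-shot count, and output file. A minimal sketch that runs them in sequence under the same flags; the `MODEL`, `OUTDIR`, and `run_eval` names are illustrative, not from the README:

```
# Hypothetical wrapper around the four commands in the diff above.
MODEL=MayaPH/GodziLLa2-70B
OUTDIR=results/G270B
mkdir -p "$OUTDIR"

run_eval() {  # usage: run_eval <task> <num_fewshot> <output_file>
  python main.py --model hf-causal-experimental \
    --model_args pretrained="$MODEL" \
    --tasks "$1" --num_fewshot "$2" \
    --batch_size 1 --no_cache --write_out \
    --output_path "$OUTDIR/$3" --device cuda
}

run_eval arc_challenge 25 arc_challenge_25shot.json
run_eval hellaswag 10 hellaswag_10shot.json
run_eval 'hendrycksTest-*' 5 mmlu_5shot.json     # quoted so the harness, not the shell, expands the pattern
run_eval truthfulqa_mc 0 truthfulqa_0shot.json   # the README omits --num_fewshot here; 0 is the harness default
```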
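Each run leaves a JSON file under `results/G270B/` via `--write_out`/`--output_path`. A hedged sketch for reading the headline numbers back out with `jq`, assuming the `results.<task>.<metric>` layout the harness used around the pinned commit (treat the exact keys as assumptions):

```
# Leaderboard-style metrics: acc_norm for ARC/HellaSwag, mc2 for TruthfulQA,
# and plain acc averaged over the hendrycksTest-* subtasks for MMLU.
jq '.results.arc_challenge.acc_norm' results/G270B/arc_challenge_25shot.json
jq '.results.hellaswag.acc_norm' results/G270B/hellaswag_10shot.json
jq '.results.truthfulqa_mc.mc2' results/G270B/truthfulqa_0shot.json
jq '[.results[].acc] | add / length' results/G270B/mmlu_5shot.json
```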