Commit 4cf6de5 by Jae-Won Chung (parent: 68300e6)
Add one more commit to lm-evaluation-harness
Files changed:
- deployment/benchmark.Dockerfile  +1 -1
- docs/leaderboard.md  +1 -1
- pegasus/README.md  +1 -1
deployment/benchmark.Dockerfile  CHANGED

@@ -26,7 +26,7 @@ RUN cd /workspace/leaderboard && pip install -e .[benchmark]
 RUN cd /workspace \
     && git clone https://github.com/EleutherAI/lm-evaluation-harness.git \
     && cd lm-evaluation-harness \
-    && git checkout \
+    && git checkout d1537059b515511801ae9b742f8e949f1bfcd010 \
     && rm -r .git \
     && pip install -e .
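Pinning the clone to an exact commit keeps rebuilds reproducible even as the upstream harness moves. As a quick sanity check that the pin took effect, one could build the image and inspect the installed package; this is a minimal sketch, and the image tag is an assumption rather than anything defined in this repository:

    # Build the benchmark image from the repository root (tag name is assumed).
    docker build -f deployment/benchmark.Dockerfile -t leaderboard-benchmark .
    # The harness is installed editable from /workspace/lm-evaluation-harness, so pip
    # should report that path as its location (.git itself is removed by the Dockerfile).
    # Assumes the image has no entrypoint that overrides the command.
    docker run --rm leaderboard-benchmark pip show lm_eval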
docs/leaderboard.md  CHANGED

@@ -42,7 +42,7 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 - PyTorch 2.0.1
 - [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement
 - [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
-- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/
+- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/d1537059b515511801ae9b742f8e949f1bfcd010) -- For NLP evaluation metrics
 
 ### Hardware
pegasus/README.md  CHANGED

@@ -65,7 +65,7 @@ After all the tasks finish, aggregate all the data into one node and run [`compu
 
 ## NLP benchmark
 
-We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/
+We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/d1537059b515511801ae9b742f8e949f1bfcd010) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
 
 Use Pegasus to run benchmarks for all the models across all nodes.
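For reference, the harness at the pinned commit is driven through its main.py CLI. Below is a minimal sketch of evaluating a single model on the three datasets; the model (facebook/opt-1.3b), batch size, and output path are illustrative assumptions, and the task names arc_challenge, hellaswag, and truthfulqa_mc are assumed to correspond to the `arc`, `hellaswag`, and `truthfulqa` keys above. The actual per-model, per-node invocations are generated by Pegasus rather than run by hand.

    # Run one model through the three leaderboard NLP tasks (illustrative values).
    # Path below is where the benchmark image checks out the harness; adjust locally.
    cd /workspace/lm-evaluation-harness
    python main.py \
        --model hf-causal \
        --model_args pretrained=facebook/opt-1.3b \
        --tasks arc_challenge,hellaswag,truthfulqa_mc \
        --batch_size 8 \
        --output_path results/opt-1.3b.json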