Jae-Won Chung committed
Commit 48843fe
1 Parent(s): 91c65f8

Install lm-evaluation-harness in Dockerfile

Files changed (3)
  1. Dockerfile +8 -0
  2. LEADERBOARD.md +1 -1
  3. pegasus/README.md +1 -1
Dockerfile CHANGED
@@ -26,6 +26,14 @@ ADD . /workspace/leaderboard
 RUN cd /workspace/leaderboard \
     && pip install -r requirements-benchmark.txt
 
+# Clone lm-evaluation-harness and install
+RUN cd /workspace \
+    && git clone https://github.com/EleutherAI/lm-evaluation-harness.git \
+    && cd lm-evaluation-harness \
+    && git checkout 72b7f0c00a6ff94632c5b873fc24e093ae74fa47 \
+    && rm -r .git \
+    && pip install -e .
+
 # Where all the weights downloaded from Hugging Face Hub will go to
 ENV TRANSFORMERS_CACHE=/data/leaderboard/hfcache
 
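As a quick sanity check of this change (a minimal sketch, not part of the commit; the image tag `leaderboard` is illustrative and it assumes the image lets `python` be invoked directly), one could rebuild the image and confirm that the pinned, editable install is importable:

```sh
# Illustrative tag; build from the repository root that holds this Dockerfile.
docker build -t leaderboard .

# lm-evaluation-harness installs as the `lm_eval` package; printing its path
# confirms the editable install under /workspace/lm-evaluation-harness resolved.
docker run --rm leaderboard python -c "import lm_eval; print(lm_eval.__file__)"
```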
 
LEADERBOARD.md CHANGED
@@ -42,7 +42,7 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 - PyTorch 2.0.1
 - [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement
 - [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
-- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
+- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
 
 ### Hardware
 
pegasus/README.md CHANGED
@@ -65,7 +65,7 @@ After all the tasks finish, aggregate all the data into one node and run [`compu
 
 ## NLP benchmark
 
-We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
+We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
 
 Use Pegasus to run benchmarks for all the models across all nodes.
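For reference (not part of this commit), a single-model run against those three datasets with the pinned harness might look roughly like the sketch below; the `hf-causal` model type, the task identifiers `arc_challenge`, `hellaswag`, and `truthfulqa_mc`, and the example checkpoint `facebook/opt-1.3b` are assumptions about that revision's `main.py` CLI, not something stated in the diff:

```sh
# Hypothetical invocation; flags and task names assume the main.py CLI of that
# era of lm-evaluation-harness and are not taken from this commit.
cd /workspace/lm-evaluation-harness
python main.py \
    --model hf-causal \
    --model_args pretrained=facebook/opt-1.3b \
    --tasks arc_challenge,hellaswag,truthfulqa_mc \
    --batch_size 8 \
    --device cuda:0 \
    --output_path results.json
```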