Jae-Won Chung committed
Commit 48843fe
1 Parent(s): 91c65f8

Install lm-evaluation-harness in Dockerfile

Files changed (3)
  1. Dockerfile +8 -0
  2. LEADERBOARD.md +1 -1
  3. pegasus/README.md +1 -1
Dockerfile CHANGED
@@ -26,6 +26,14 @@ ADD . /workspace/leaderboard
 RUN cd /workspace/leaderboard \
     && pip install -r requirements-benchmark.txt
 
+# Clone lm-evaluation-harness and install
+RUN cd /workspace \
+    && git clone https://github.com/EleutherAI/lm-evaluation-harness.git \
+    && cd lm-evaluation-harness \
+    && git checkout 72b7f0c00a6ff94632c5b873fc24e093ae74fa47 \
+    && rm -r .git \
+    && pip install -e .
+
 # Where all the weights downloaded from Hugging Face Hub will go to
 ENV TRANSFORMERS_CACHE=/data/leaderboard/hfcache
 
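As a quick sanity check of this change (a minimal sketch, not part of the commit; the image tag `leaderboard` is illustrative and it assumes the image lets `python` be invoked directly), one could rebuild the image and confirm that the pinned, editable install is importable:

```sh
# Illustrative tag; build from the repository root that holds this Dockerfile.
docker build -t leaderboard .

# lm-evaluation-harness installs as the `lm_eval` package; printing its path
# confirms the editable install under /workspace/lm-evaluation-harness resolved.
docker run --rm leaderboard python -c "import lm_eval; print(lm_eval.__file__)"
```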
 
LEADERBOARD.md CHANGED
@@ -42,7 +42,7 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 - PyTorch 2.0.1
 - [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement
 - [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
-- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
+- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
 
 ### Hardware
 
pegasus/README.md CHANGED
@@ -65,7 +65,7 @@ After all the tasks finish, aggregate all the data into one node and run [`compu
 
 ## NLP benchmark
 
-We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
+We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
 
 Use Pegasus to run benchmarks for all the models across all nodes.
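For reference (not part of this commit), a single-model run against those three datasets with the pinned harness might look roughly like the sketch below; the `hf-causal` model type, the task identifiers `arc_challenge`, `hellaswag`, and `truthfulqa_mc`, and the example checkpoint `facebook/opt-1.3b` are assumptions about that revision's `main.py` CLI, not something stated in the diff:

```sh
# Hypothetical invocation; flags and task names assume the main.py CLI of that
# era of lm-evaluation-harness and are not taken from this commit.
cd /workspace/lm-evaluation-harness
python main.py \
    --model hf-causal \
    --model_args pretrained=facebook/opt-1.3b \
    --tasks arc_challenge,hellaswag,truthfulqa_mc \
    --batch_size 8 \
    --device cuda:0 \
    --output_path results.json
```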