ahmedashrafay committed on
Commit f4eebef · verified · 1 Parent(s): 716fd77

Add metrics/generations for ahmedashrafay/staradapters-python

Files changed (4):
  1. README.md +212 -0
  2. group_jsons.py +42 -0
  3. multiple_eval.slurm +83 -0
  4. throughput_config.yaml +37 -0
README.md ADDED
@@ -0,0 +1,212 @@
<h1 align="center">:star: Multilingual Code Evaluation LeaderBoard Guide</h1>

<h4 align="center">
<p>
<a href="#running-the-evaluation">Running Evaluation</a> |
<a href="#submission-of-results-to-the-leaderboard">Results Submission</a>
</p>
</h4>

This is a guide to submit and reproduce the numbers in the [Multilingual Code Evaluation LeaderBoard](https://huggingface.co/spaces/bigcode/multilingual-code-evals).
The LeaderBoard is a demo for evaluating and comparing the performance of language models on code generation tasks.

The LeaderBoard is open for submissions of results produced by the community. If you have a model that you want to submit results for, please follow the instructions below.

## Running the evaluation
We report the pass@1 for the [HumanEval](https://huggingface.co/datasets/openai_humaneval) Python benchmark and several languages from the [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) benchmark. We use the same template and parameters for all models.

### 1- Setup
Follow the setup instructions in the evaluation harness [README](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main#setup).
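A typical setup looks roughly like the sketch below; the harness README is the authoritative reference, and the environment and login steps depend on your machine:
```bash
# sketch of a typical setup; follow the harness README for the exact steps
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .

# configure accelerate for your GPUs and log in to the Hub
# (needed for --use_auth_token and gated/private models)
accelerate config
huggingface-cli login
```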

Create two folders `generations_$model` and `metrics_$model` where you will save the generated code and the metrics respectively for your model `$model`.
```bash
cd bigcode-evaluation-harness
mkdir generations_$model
mkdir metrics_$model
```

To run the evaluation, we first generate the code solutions for the target tasks on GPUs, then execute the code in a Docker container (only CPUs are needed).

### 2- Generation
Below are the instructions for generating the code solutions sequentially or in parallel with SLURM. You might need to reduce the batch size for some models or change the precision based on your device.
```bash
# after activating the env and setting up accelerate...
langs=(py js java cpp swift php d jl lua r rkt rs)

model=YOUR_MODEL
org=HF_ORGANISATION

for lang in "${langs[@]}"; do
    # use humaneval for py and multipl-e for the rest
    if [ "$lang" == "py" ]; then
        task=humaneval
    else
        task=multiple-$lang
    fi

    echo "Running task $task"
    generations_path=generations_$model/generations_$task\_$model.json
    accelerate launch main.py \
        --model $org/$model \
        --task $task \
        --n_samples 50 \
        --batch_size 50 \
        --max_length_generation 512 \
        --temperature 0.2 \
        --precision bf16 \
        --trust_remote_code \
        --use_auth_token \
        --generation_only \
        --save_generations \
        --save_generations_path $generations_path
    echo "Task $task done"
done
```
This will generate and save the code solutions for all tasks in the `generations_$model` folder.
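Before moving on, you can sanity-check the generated files. This is a minimal sketch that assumes the harness's usual layout of one list of `n_samples` completions per problem in each JSON file:
```bash
# quick sanity check: one JSON per task, n_samples completions per problem
for f in generations_$model/*.json; do
    python3 -c "
import json, sys
gens = json.load(open(sys.argv[1]))
print(sys.argv[1], '-', len(gens), 'problems,', len(gens[0]) if gens else 0, 'samples each')
" "$f"
done
```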

If you want to submit the jobs in parallel with SLURM, run `multiple_eval.slurm` (provided in this repository) for each task with:
```bash
langs=(py js java cpp swift php d jl lua r rkt rs)

model=YOUR_MODEL
org=HF_ORGANISATION
out_path=generations_$model

for lang in "${langs[@]}"; do
    if [ "$lang" == "py" ]; then
        task=humaneval
    else
        task=multiple-$lang
    fi
    echo "Submitting task $task"
    sbatch -J "eval-$model-$task" multiple_eval.slurm "$model" "$task" "$org" "$out_path"
done
```
This will submit one job for each task.
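You can then monitor the jobs with standard SLURM commands; the log path below follows the `--output` pattern set at the top of `multiple_eval.slurm` (`%x` is the job name, `%j` the job id):
```bash
# check the queue and follow one of the logs
squeue -u $USER -o "%.18i %.40j %.8T %.10M"
tail -f /fsx/loubna/logs/evaluation/leaderboard/eval-$model-humaneval-<JOBID>.out
```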

### 3- Execution

We execute and evaluate the solutions inside a Docker container; you can either build the image or pull the one we provide:
```bash
# to build it:
# sudo make DOCKERFILE=Dockerfile-multiple all
sudo docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
sudo docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple
```

Then, you can run the evaluation on the generated code:
```bash
langs=(py js java cpp swift php d jl lua r rkt rs)

model=YOUR_MODEL
org=HF_ORGANISATION
# if you provide absolute paths remove the $(pwd) from the command below
generations_path=generations_$model
metrics_path=metrics_$model

for lang in "${langs[@]}"; do
    if [ "$lang" == "py" ]; then
        task=humaneval
    else
        task=multiple-$lang
    fi

    gen_suffix=generations_$task\_$model.json
    metric_suffix=metrics_$task\_$model.json
    echo "Evaluation of $model on $task benchmark, data in $generations_path/$gen_suffix"

    sudo docker run -v $(pwd)/$generations_path/$gen_suffix:/app/$gen_suffix:ro -v $(pwd)/$metrics_path:/app/$metrics_path -it evaluation-harness-multiple python3 main.py \
        --model $org/$model \
        --tasks $task \
        --load_generations_path /app/$gen_suffix \
        --metric_output_path /app/$metrics_path/$metric_suffix \
        --allow_code_execution \
        --use_auth_token \
        --temperature 0.2 \
        --n_samples 50 | tee -a logs_$model.txt
    echo "Task $task done, metric saved at $metrics_path/$metric_suffix"
done
```
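Before grouping the results, check that every task produced a metrics file:
```bash
# expect one metrics JSON per evaluated task (12 with the language list above)
ls -1 metrics_$model
ls -1 metrics_$model/*.json | wc -l
```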

## Submission of results to the LeaderBoard
If you followed the steps above, you now have a folder `metrics_$model` with `json` files, each containing the result of one task. To submit the results to the LeaderBoard, you need to create a json summarizing these metrics using `group_jsons.py` and submit it [here](https://huggingface.co/spaces/bigcode/multilingual-code-evals). Follow the instructions in the `Submit here` section.
```bash
python group_jsons.py --metrics_path metrics_$model --model $model --org $org --username $your_hf_username
```
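The script writes a file named `{org}_{model}_{username}.json` containing a `results` list (one `task`/`pass@1` entry per metrics file) and a `meta` block with the model id; a quick way to inspect it:
```bash
# print the grouped summary; the keys mirror what group_jsons.py writes
python3 -c "
import json, sys
summary = json.load(open(sys.argv[1]))
print('model:', summary['meta']['model'])
for entry in summary['results']:
    print(entry['task'], entry['pass@1'])
" ${org}_${model}_${your_hf_username}.json
```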
For credibility, we invite you to add the generations and json metrics to your submission.

Now you're ready to submit your results by opening a PR on the leaderboard; go to the `Submit results :rocket:` section for more details.

## Notes
Some models might require extra arguments. For example, [CodeGeeX2-6b](https://huggingface.co/THUDM/codegeex2-6b) requires providing the language tag as a prefix and running generation under torch 2.0, and [replit-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) requires passing extra model keyword arguments (see the command further below). For CodeGeeX2, you can pass the prefix as a new argument:
```bash
# define prefixes based on the codegeex-2 repo
declare -A langs
langs=( [py]="# Python" [js]="// JavaScript" [java]="// Java" [cpp]="// C++" [swift]="// Swift" [php]="// PHP" [jl]="# Julia" [lua]="// Lua" [r]="# R" [rkt]="; Racket" [rs]="// Rust" [d]="" )

model="codegeex2-6b"
org="THUDM"

for lang in "${!langs[@]}"; do
    task=multiple-$lang
    prefix="language: ${langs[$lang]}"
    echo "For language $lang, the prefix is: $prefix"
    generations_path=generations_$model/generations_$task\_$model.json
    accelerate launch main.py \
        --model $org/$model \
        --task $task \
        --n_samples 5 \
        --batch_size 5 \
        --limit 8 \
        --max_length_generation 512 \
        --temperature 0.2 \
        --precision bf16 \
        --trust_remote_code \
        --use_auth_token \
        --generation_only \
        --save_generations_path $generations_path \
        --prefix "$prefix"
    echo "Task $task done"
done
```
Replit model command (pull the code from this [PR](https://github.com/bigcode-project/bigcode-evaluation-harness/pull/115)):
```bash
accelerate launch main.py \
    --model replit/replit-code-v1-3b \
    --tasks multiple-$lang \
    --max_length_generation 512 \
    --batch_size 50 \
    --n_samples 10 \
    --temperature 0.2 \
    --precision fp16 \
    --allow_code_execution \
    --trust_remote_code \
    --save_generations \
    --use_auth_token \
    --generation_only \
    --save_generations_path /fsx/loubna/code/bigcode-evaluation-harness/multiple_gens_replit/replit-$lang.json \
    --automodel_kwargs '{"attn_config": {"alibi": true, "alibi_bias_max": 8, "attn_impl": "triton", "attn_pdrop": 0, "attn_type": "multihead_attention", "attn_uses_sequence_id": false, "clip_qkv": null, "prefix_lm": false, "qk_ln": false, "softmax_scale": null}}'
```

## Bonus
For the throughput and peak memory measurements, we point you to [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) (check out commit `49f0924e2bb041cf17d78dd0848d8e2cad31632d` [here](https://github.com/huggingface/optimum-benchmark/commit/49f0924e2bb041cf17d78dd0848d8e2cad31632d)).
You can follow the instructions in the repo, copy our config yaml and run the command below:
```bash
cp throughput_config.yaml optimum-benchmark/examples
# run from inside the repo so that --config-dir examples resolves
cd optimum-benchmark
device=cuda:0
batch=1
optimum-benchmark --config-dir examples --config-name throughput_config model=$org/$model device=$device benchmark.input_shapes.batch_size=$batch
```
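Other fields of `throughput_config.yaml` (included at the bottom of this page) can be overridden on the command line in the same way. A sketch, assuming you want a different dtype and generation length (`backend.torch_dtype` and `benchmark.input_shapes.new_tokens` are keys defined in that config):
```bash
# override config fields defined in throughput_config.yaml
optimum-benchmark --config-dir examples --config-name throughput_config \
    model=$org/$model device=cuda:0 \
    backend.torch_dtype=bfloat16 \
    benchmark.input_shapes.new_tokens=512
```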
group_jsons.py ADDED
@@ -0,0 +1,42 @@
import argparse
import json
import os
import glob


parser = argparse.ArgumentParser(description='Process metric files')
parser.add_argument('--metrics_path', type=str, required=True, help='Path where metric files are stored')
parser.add_argument('--model', type=str, required=True, help='Name of the model')
parser.add_argument('--org', type=str, required=True, help='Organization/user hosting the model')
parser.add_argument('--username', type=str, required=True, help='Your HF username')
args = parser.parse_args()


# List of valid tasks
valid_tasks = ["humaneval"] + ["multiple-" + lang for lang in ["js", "java", "cpp", "swift", "php", "d", "jl", "lua", "r", "rkt", "rb", "rs"]]

final_results = {"results": [], "meta": {"model": f"{args.org}/{args.model}"}}

# Iterate over all .json files in the metrics_path
for json_file in glob.glob(os.path.join(args.metrics_path, '*.json')):

    # Extract the task name from the file name (metrics_<task>_<model>.json)
    print(f"Processing {json_file}")
    task = os.path.splitext(os.path.basename(json_file))[0].split('_')[1]
    if task not in valid_tasks:
        print(f"Skipping invalid task: {task}")
        continue

    with open(json_file, 'r') as f:
        data = json.load(f)

    pass_at_1 = data.get(task, {}).get("pass@1", None)
    output = {"task": task, "pass@1": pass_at_1}
    final_results["results"].append(output)


with open(f"{args.org}_{args.model}_{args.username}.json", 'w') as f:
    json.dump(final_results, f)

print(f"Saved {args.org}_{args.model}_{args.username}.json")
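Example usage, with illustrative model/org/username values (substitute your own); the output file name follows the `{org}_{model}_{username}.json` pattern from the script:
```bash
python group_jsons.py --metrics_path metrics_starcoder --model starcoder --org bigcode --username myuser
# -> writes bigcode_starcoder_myuser.json
```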
multiple_eval.slurm ADDED
@@ -0,0 +1,83 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
#SBATCH --partition=production-cluster
#SBATCH --output=/fsx/loubna/logs/evaluation/leaderboard/%x-%j.out

set -x -e
source /admin/home/loubna/.bashrc

conda activate brr4

# File Path setup
echo "START TIME: $(date)"

GPUS_PER_NODE=4
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
NNODES=$SLURM_NNODES
NODE_RANK=$SLURM_PROCID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))


model=$1
task=$2
org=$3
out_path=$4

CMD=" \
    /fsx/loubna/code/bigcode-evaluation-harness/main.py \
    --model $org/$model \
    --tasks $task \
    --max_length_generation 512 \
    --batch_size 50 \
    --n_samples 50 \
    --temperature 0.2 \
    --precision bf16 \
    --allow_code_execution \
    --trust_remote_code \
    --save_generations \
    --use_auth_token \
    --generation_only \
    --save_generations_path $out_path/generations_$task\_$model.json \
    "

export LAUNCHER="accelerate launch \
    --multi_gpu \
    --num_machines $NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --role $SLURMD_NODENAME: \
    --rdzv_conf rdzv_backend=c10d \
    --max_restarts 0 \
    --tee 3 \
    "

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

# AWS specific
export NCCL_PROTO=simple
export RDMAV_FORK_SAFE=1
export FI_EFA_FORK_SAFE=1
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_LOG_LEVEL=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=ens

echo $CMD

SRUN_ARGS=" \
    --wait=60 \
    --kill-on-bad-exit=1 \
    "

clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER $CMD"

echo "END TIME: $(date)"
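To launch a single task without the loop from the README, the script takes the same four positional arguments (model, task, org, output path):
```bash
# submit one generation job for HumanEval only
model=YOUR_MODEL
org=HF_ORGANISATION
sbatch -J "eval-$model-humaneval" multiple_eval.slurm "$model" humaneval "$org" "generations_$model"
```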
throughput_config.yaml ADDED
@@ -0,0 +1,37 @@
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # default benchmark
  - experiment # inheriting experiment schema
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: runs/${experiment_name}
  sweep:
    dir: sweeps/${experiment_name}
  job:
    chdir: true
    env_set:
      CUDA_VISIBLE_DEVICES: 0,1,2,3,4,5,6,7

experiment_name: code_evals

model: bigcode/santacoder

hub_kwargs:
  use_auth_token: true
  trust_remote_code: true

backend:
  torch_dtype: float16

device: cuda:0

benchmark:
  memory: true
  input_shapes:
    batch_size: 1
    sequence_length: 1
    new_tokens: 1000
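Since this is a Hydra config, several values can also be swept in one invocation with Hydra's multirun mode; a sketch, assuming the pinned optimum-benchmark commit supports standard Hydra multirun (sweep outputs would land under `sweeps/code_evals` per the config above):
```bash
# sweep a few batch sizes in one multirun launch
optimum-benchmark --config-dir examples --config-name throughput_config \
    --multirun model=$org/$model device=cuda:0 \
    benchmark.input_shapes.batch_size=1,8,32
```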