English Benchmarking

#4 by msalhab96 - opened

I'm trying to replicate the results reported in the paper under the zero-shot setting using lm-evaluation-harness, but the values in the paper do not match the values I get for the foundation model. Is it the same model that was benchmarked in the paper?

Yes, it's the same model. Please share the results that you got and the metric you are trying to compare.

I tried HellaSwag, ARC-Challenge, and MMLU, all zero-shot as in the paper,

using https://github.com/EleutherAI/lm-evaluation-harness

For arc_challenge, here is what my command looks like:

python main.py --model hf-causal --model_args pretrained=inception-mbzuai/jais-13b,trust_remote_code=True --tasks arc_challenge --num_fewshot 0

Thanks! Could you please share the results (the actual numbers you got), and also which metric you are looking at, "acc" or "acc_norm"?

Here are the results for hellaswag. The reported number in the paper is 71.8, while the acc_norm I get is 43.75:

{
  "results": {
    "hellaswag": {
      "acc": 0.37134037044413465,
      "acc_stderr": 0.0048217577341567374,
      "acc_norm": 0.43756223859788884,
      "acc_norm_stderr": 0.004950723480149755
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=inception-mbzuai/jais-13b,trust_remote_code=True",
    "num_fewshot": 0,
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

The command used:

python main.py --model hf-causal --model_args pretrained=inception-mbzuai/jais-13b,trust_remote_code=True --tasks hellaswag --num_fewshot 0

Thank you for the message!

We have verified the results for the above-mentioned tasks and we get the same numbers as reported in our paper. The reported results are reproducible when we load the model in its original precision, i.e. float32.
However, when we load the model in float16, we get the results you noted above, so the lower precision appears to be lowering the performance of the model.
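For reference, here is a minimal sketch of how the dtype choice is made when loading the checkpoint directly with transformers; only the torch_dtype argument differs between the two cases, and this sketch is for illustration rather than the exact code path the harness uses:

import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "inception-mbzuai/jais-13b"

# Original precision: the setting that reproduces the paper numbers.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,  # ~4 bytes per parameter, ~52 GB for 13B weights
    trust_remote_code=True,
)
print(model.dtype)  # torch.float32

# Half precision halves the memory footprint but gave the lower scores above:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID, torch_dtype=torch.float16, trust_remote_code=True
# )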

We suggest loading the model in its original float32 precision to reproduce the results.
You can do this by reserving around 60 GB of GPU/CPU memory, on a single GPU or across multiple GPUs (13B parameters × 4 bytes per float32 parameter ≈ 52 GB, plus overhead). The following command can be used to run the evaluations:

python main.py --model hf-causal-experimental --model_args use_accelerate=True,pretrained=inception-mbzuai/jais-13b,trust_remote_code=True --tasks hellaswag --num_fewshot 0

The flag use_accelerate=True will load the model over multiple GPUs in an efficient manner.
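Outside the harness, a roughly equivalent multi-GPU load with transformers and accelerate would use device_map="auto"; this is a sketch of the general technique, not necessarily what the harness does internally:

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate shard the float32 weights across all
# visible GPUs (spilling to CPU RAM if needed), so no single device has
# to hold the full ~52 GB of parameters.
model = AutoModelForCausalLM.from_pretrained(
    "inception-mbzuai/jais-13b",
    torch_dtype=torch.float32,
    device_map="auto",
    trust_remote_code=True,
)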

msalhab96 changed discussion status to closed
