(Rebuked: this claim proven false) "Fake coding scores" .73 at best

#4
by rombodawg - opened

And no, the fact that it's run in 8-bit doesn't make a difference; if it does, it's only by 0.01-0.03.
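For context, here is a minimal sketch (illustrative only, not part of the harness run) of what the fp16 and 8-bit loading paths in the config below correspond to with transformers and bitsandbytes:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "Qwen/CodeQwen1.5-7B-Chat"

# Plain fp16 weights ("precision": "fp16" with "load_in_8bit": false).
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit quantized weights via bitsandbytes ("load_in_8bit": true).
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)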

{
  "humaneval": {
    "pass@1": 0.7378048780487805
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "Qwen/CodeQwen1.5-7B-Chat",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4000,
    "precision": "fp16",
    "load_in_8bit": true,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
Qwen org

Hi, thank you for your attention to CodeQwen. The popular HumanEval testing method employs greedy decoding rather than sampling, and the following link provides our fully reproducible evaluation code and results:
https://github.com/QwenLM/CodeQwen1.5/tree/main/evaluation/eval_plus
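For reference, a minimal sketch of the decoding difference in question, using the Hugging Face transformers generate API (illustrative only, not taken from the repository above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/CodeQwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, always picks the highest-probability token.
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Sampling (as in the config posted above): temperature/top_p add randomness,
# so a single-sample pass@1 can differ from run to run.
sampled_out = model.generate(
    **inputs, do_sample=True, temperature=0.2, top_p=0.95, max_new_tokens=256
)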

huybery changed discussion status to closed

Here is the code I ran on Google Colab to get these results, to answer your question on Twitter:

# Install dependencies
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
# Set up the bigcode evaluation harness
!git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
%cd bigcode-evaluation-harness
!pip install -r requirements.txt
# Run HumanEval with the harness's default prompt and sampling, loading the model in 8-bit
!accelerate launch main.py --tasks humaneval --model Qwen/CodeQwen1.5-7B-Chat --load_in_8bit --allow_code_execution --max_length_generation 4000 --precision fp16
Qwen org

@rombodawg The CodeQwen1.5-7B-Chat model needs to use ChatML (see, for example, https://github.com/QwenLM/CodeQwen1.5/blob/main/evaluation/eval_plus/model.py#L161) as the input template. If you directly use the default prompt of the bigcode harness for inference, it evaluates the model in continuation (completion) mode, which does not align with CodeQwen's normal chat behavior.
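As a minimal sketch of what that means (the system message and user wording here are placeholders, not the exact prompt built in model.py), this is how a HumanEval problem gets wrapped in ChatML via the stock tokenizer before generation:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")

humaneval_prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Complete the following Python function:\n\n" + humaneval_prompt},
]

# Produces "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n"
chat_input = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(chat_input)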

Qwen org

If you want to evaluate CodeQwen1.5-7B-Chat with bigcode-evaluation-harness, you can use the following command.

accelerate launch main.py \
        --model Qwen/CodeQwen1.5-7B-Chat \
        --tasks humanevalsynthesize-python \
        --max_length_generation 2048 \
        --prompt codeqwen \
        --temperature 0.0 \
        --trust_remote_code \
        --top_k 1 \
        --top_p 0 \
        --do_sample False \
        --n_samples 1 \
        --batch_size 1 \
        --precision bf16 \
        --allow_code_execution \
        --save_generations

It can reproduce the reported results.

Qwen org

@rombodawg @Suparious https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat/discussions/4#661f476032baa05e0643ca5f

We have answered your question and explained how to replicate the results. Before you have made sure everything on your end is right, please stop trashing the model and keep the discussion fair, OK?

I used the above test configuration and the results are indeed exciting.
HumanEval Python pass@1: 0.8719512195121951

{
  "humanevalsynthesize-python": {
    "pass@1": 0.8719512195121951
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.0,
    "top_k": 1,
    "top_p": 0.0,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/data/models/qwen/CodeQwen1.5-7B-Chat",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalsynthesize-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "codeqwen",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

@JustinLin610 Thank you for this information. I will use the ChatML format and redo my testing, as well as redo my hand testing. It's nice to see that there are people standing up for you. Trust me, I'm not against you; I am just for the greater good, so this information excites me. I apologize for my misconception, and I hope you have a wonderful rest of your week.

rombodawg changed discussion title from "Fake coding scores" .73 at best to (Rebuked: this claim proven false) "Fake coding scores" .73 at best
