(Rebuked: this claim proven false) "Fake coding scores" .73 at best

#4
by rombodawg - opened

And no, the fact that it's run in 8-bit doesn't make a difference; if it does, it's only by 0.01-0.03.
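For context, here is a minimal sketch (illustrative only, not part of the harness run) of what the fp16 and 8-bit loading paths in the config below correspond to with transformers and bitsandbytes:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "Qwen/CodeQwen1.5-7B-Chat"

# Plain fp16 weights ("precision": "fp16" with "load_in_8bit": false).
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit quantized weights via bitsandbytes ("load_in_8bit": true).
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)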

{
  "humaneval": {
    "pass@1": 0.7378048780487805
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "Qwen/CodeQwen1.5-7B-Chat",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4000,
    "precision": "fp16",
    "load_in_8bit": true,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
Qwen org

Hi, thank you for your attention to CodeQwen. The popular HumanEval testing method employs greedy decoding rather than sampling, and the following link provides our fully reproducible evaluation code and results:
https://github.com/QwenLM/CodeQwen1.5/tree/main/evaluation/eval_plus
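For reference, a minimal sketch of the decoding difference in question, using the Hugging Face transformers generate API (illustrative only, not taken from the repository above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/CodeQwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, always picks the highest-probability token.
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Sampling (as in the config posted above): temperature/top_p add randomness,
# so a single-sample pass@1 can differ from run to run.
sampled_out = model.generate(
    **inputs, do_sample=True, temperature=0.2, top_p=0.95, max_new_tokens=256
)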

huybery changed discussion status to closed

Here is the code I ran on Google Colab to get these results, to answer your question on Twitter:

# Install dependencies
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
# Set up the bigcode evaluation harness
!git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
%cd bigcode-evaluation-harness
!pip install -r requirements.txt
# Run HumanEval with the harness's default prompt and sampling, loading the model in 8-bit
!accelerate launch main.py --tasks humaneval --model Qwen/CodeQwen1.5-7B-Chat --load_in_8bit --allow_code_execution --max_length_generation 4000 --precision fp16
Qwen org

@rombodawg The CodeQwen1.5-7B-Chat model needs to use ChatML (see, for example, https://github.com/QwenLM/CodeQwen1.5/blob/main/evaluation/eval_plus/model.py#L161) as the input template. If you directly use the default prompt of the bigcode harness for inference, it evaluates the model in continuation (completion) mode, which does not align with CodeQwen's normal chat behavior.
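As a minimal sketch of what that means (the system message and user wording here are placeholders, not the exact prompt built in model.py), this is how a HumanEval problem gets wrapped in ChatML via the stock tokenizer before generation:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")

humaneval_prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Complete the following Python function:\n\n" + humaneval_prompt},
]

# Produces "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n"
chat_input = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(chat_input)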

Qwen org

If you want to evaluate CodeQwen1.5-7B-Chat with bigcode-evaluation-harness, you can use the following command.

accelerate launch main.py \
        --model Qwen/CodeQwen1.5-7B-Chat \
        --tasks humanevalsynthesize-python \
        --max_length_generation 2048 \
        --prompt codeqwen \
        --temperature 0.0 \
        --trust_remote_code \
        --top_k 1 \
        --top_p 0 \
        --do_sample False \
        --n_samples 1 \
        --batch_size 1 \
        --precision bf16 \
        --allow_code_execution \
        --save_generations

It can reproduce the reported results.

Qwen org

@rombodawg @Suparious https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat/discussions/4#661f476032baa05e0643ca5f

We have answered your question and explained how to replicate the results. Before you have made sure everything on your end is right, please stop trashing the model and keep the discussion fair, OK?

I used the above test configuration and the results are indeed exciting.
HumanEval Python pass@1: 0.8719512195121951

{
  "humanevalsynthesize-python": {
    "pass@1": 0.8719512195121951
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.0,
    "top_k": 1,
    "top_p": 0.0,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/data/models/qwen/CodeQwen1.5-7B-Chat",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalsynthesize-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "codeqwen",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

@JustinLin610 Thank you for this information. I will use the ChatML format and redo my testing, as well as redo my hand testing. It's nice to see that there are people standing up for you. Trust me, I'm not against you; I am just for the greater good, so this information excites me. I apologize for my misconception, and I hope you have a wonderful rest of your week.

rombodawg changed discussion title from "Fake coding scores" .73 at best to (Rebuked: this claim proven false) "Fake coding scores" .73 at best
