---
license: llama2
metrics:
- code_eval
library_name: transformers
tags:
- code
model-index:
- name: WizardCoder-Python-34B-V1.0
  results:
  - task:
      type: text-generation
    dataset:
      type: openai_humaneval
      name: HumanEval
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.732
      verified: false
---

🤗 HF Repo • 🐱 Github Repo • 🐦 Twitter • 📃 [WizardLM] • 📃 [WizardCoder] • 📃 [WizardMath]

👋 Join our Discord

## News

- 🔥🔥🔥[2023/08/26] We released **WizardCoder-Python-34B-V1.0**, which achieves **73.2 pass@1** and surpasses **GPT4 (2023/03/15)**, **ChatGPT-3.5**, and **Claude2** on the [HumanEval Benchmarks](https://github.com/openai/human-eval).
- [2023/06/16] We released **WizardCoder-15B-V1.0**, which achieves **57.3 pass@1** and surpasses **Claude-Plus (+6.8)**, **Bard (+15.3)** and **InstructCodeT5+ (+22.3)** on the [HumanEval Benchmarks](https://github.com/openai/human-eval).

❗Note: There are two sets of HumanEval results for GPT4 and ChatGPT-3.5. The 67.0 and 48.1 are reported in the official GPT4 Report (2023/03/15) from [OpenAI](https://arxiv.org/abs/2303.08774). The 82.0 and 72.5 were measured by ourselves with the latest API (2023/08/26).

| Model | Checkpoint | Paper | HumanEval | MBPP | Demo | License |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| WizardCoder-Python-34B-V1.0 | 🤗 HF Link | 📃 [WizardCoder] | 73.2 | 61.2 | [Demo](http://47.103.63.15:50085/) | Llama2 |
| WizardCoder-15B-V1.0 | 🤗 HF Link | 📃 [WizardCoder] | 59.8 | 50.6 | -- | OpenRAIL-M |
| WizardCoder-Python-13B-V1.0 | 🤗 HF Link | 📃 [WizardCoder] | 64.0 | 55.6 | -- | Llama2 |
| WizardCoder-3B-V1.0 | 🤗 HF Link | 📃 [WizardCoder] | 34.8 | 37.4 | [Demo](http://47.103.63.15:50086/) | OpenRAIL-M |
| WizardCoder-1B-V1.0 | 🤗 HF Link | 📃 [WizardCoder] | 23.8 | 28.6 | -- | OpenRAIL-M |

- Our **WizardMath-70B-V1.0** model slightly outperforms some closed-source LLMs on GSM8K, including **ChatGPT 3.5**, **Claude Instant 1** and **PaLM 2 540B**.
- Our **WizardMath-70B-V1.0** model achieves **81.6 pass@1** on the [GSM8k Benchmarks](https://github.com/openai/grade-school-math), which is **24.8** points higher than the SOTA open-source LLM, and achieves **22.7 pass@1** on the [MATH Benchmarks](https://github.com/hendrycks/math), which is **9.2** points higher than the SOTA open-source LLM.

| Model | Checkpoint | Paper | GSM8k | MATH | Online Demo | License |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| WizardMath-70B-V1.0 | 🤗 HF Link | 📃 [WizardMath] | **81.6** | **22.7** | [Demo](http://47.103.63.15:50083/) | Llama 2 |
| WizardMath-13B-V1.0 | 🤗 HF Link | 📃 [WizardMath] | **63.9** | **14.0** | [Demo](http://47.103.63.15:50082/) | Llama 2 |
| WizardMath-7B-V1.0 | 🤗 HF Link | 📃 [WizardMath] | **54.9** | **10.7** | [Demo](http://47.103.63.15:50080/) | Llama 2 |

- [08/09/2023] We released the **WizardLM-70B-V1.0** model. Here is the [Full Model Weight](https://huggingface.co/WizardLM/WizardLM-70B-V1.0).

| Model | Checkpoint | Paper | MT-Bench | AlpacaEval | GSM8k | HumanEval | License |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| **WizardLM-70B-V1.0** | 🤗 HF Link | 📃 **Coming Soon** | **7.78** | **92.91%** | **77.6%** | **50.6** | Llama 2 License |
| WizardLM-13B-V1.2 | 🤗 HF Link | | 7.06 | 89.17% | 55.3% | 36.6 | Llama 2 License |
| WizardLM-13B-V1.1 | 🤗 HF Link | | 6.76 | 86.32% | | 25.0 | Non-commercial |
| WizardLM-30B-V1.0 | 🤗 HF Link | | 7.01 | | | 37.8 | Non-commercial |
| WizardLM-13B-V1.0 | 🤗 HF Link | | 6.35 | 75.31% | | 24.0 | Non-commercial |
| WizardLM-7B-V1.0 | 🤗 HF Link | 📃 [WizardLM] | | | | 19.1 | Non-commercial |

## Comparing WizardCoder-Python-34B-V1.0 with Other LLMs.

🔥 The following figure shows that our **WizardCoder-Python-34B-V1.0 attains the second position in this benchmark**, surpassing GPT4 (2023/03/15, 73.2 vs. 67.0), ChatGPT-3.5 (73.2 vs. 72.5) and Claude2 (73.2 vs. 71.2).

[Figure: WizardCoder]

## Prompt Format

```
"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
```

## Inference Demo Script

We provide the inference demo code [here](https://github.com/nlpxucan/WizardLM/tree/main/demo).

## Citation

Please cite the repo if you use the data, method or code in this repo.

```
@article{luo2023wizardcoder,
  title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
  author={Luo, Ziyang and Xu, Can and Zhao, Pu and Sun, Qingfeng and Geng, Xiubo and Hu, Wenxiang and Tao, Chongyang and Ma, Jing and Lin, Qingwei and Jiang, Daxin},
  journal={arXiv preprint arXiv:2306.08568},
  year={2023}
}
```

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_WizardLM__WizardCoder-Python-34B-V1.0)

| Metric | Value |
|-----------------------|---------------------------|
| Avg. | 46.83 |
| ARC (25-shot) | 52.13 |
| HellaSwag (10-shot) | 74.78 |
| MMLU (5-shot) | 49.15 |
| TruthfulQA (0-shot) | 48.85 |
| Winogrande (5-shot) | 68.35 |
| GSM8K (5-shot) | 9.48 |
| DROP (3-shot) | 25.06 |
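
## Prompt Format Example

Returning to the Prompt Format section, here is a minimal sketch of how the template might be filled in before it is sent to the model. The template string is copied from that section. The helper names `build_prompt` and `extract_response` are hypothetical and are not part of the official demo code. Splitting the generated text on the `### Response:` marker is an assumption about how the model echoes the prompt, not documented behavior.

```python
# Template copied from the Prompt Format section; {instruction} is the only placeholder.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(instruction: str) -> str:
    """Fill the template with one instruction (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def extract_response(generated_text: str) -> str:
    """Return the text after the last '### Response:' marker (hypothetical helper).

    Assumes the generated text still contains the prompt, which is common
    for causal language models but is not confirmed by this model card.
    """
    return generated_text.split("### Response:")[-1].strip()


if __name__ == "__main__":
    prompt = build_prompt("Write a Python function that reverses a string.")
    print(prompt)
```

The resulting string can be passed to whichever generation setup you use. The script linked under Inference Demo Script remains the authors' reference for actual inference.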