Edit model card

THIS MODEL HAS EVAL DATA LEAKED INTO THE DATASET

THIS IS NOT AN OFFICIAL MODEL CARD

NewHope: Harnessing 99% of GPT-4's Programming Capabilities

We introduce NewHope, a fine-tuned chat model based on llama-2-13b, aiming to provide a strong coding capability. NewHope handle different languages including Python, C++, Java, JavaScript, Go, and more. Preliminary evaluation on HumanEval shows that NewHope possesses 99% of GPT-4's programming capabilities.

Contact: SLAM (SUFE Large AI Model) is a research group at Shanghai University of Finance and Economics. cui.wanyun@sufe.edu.cn

TODO: We will release more evaluatation results and training details later.

Evaluation Results

We evaluated NewHope on HumanEval using the official evaluation script by OpenAI. We compared the Pass@1 metric of NewHope with other models. The results of other models are from PapersWithCode.

Model Pass@1
GPT-4 67.0
NewHope 66.5
PanGu-Coder2 15B 61.6
WizardCoder 15B 57.3
phi-1 1.3B 50.6
GPT-3.5 48.1
phi-1-small 45.0
PaLM-Coder 36.0
CodeGeeX2-6B 35.9

Model Weights

We have open-sourced the model weights NewHope.

We are uploading the model weights. The weights will be available in a few hours.

Usage

To load the NewHope model using Transformers, use the following code:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

base_model = "SLAM-group/NewHope"
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
# model.config.use_cache is default to `False`. For inference: `model.config.use_cache = True`

Note: At least Huggingface Transformers 4.31.0 is required to load this model!

You can ask NewHope to generate code with instructions. We provide a simple example of how NewHope model generates code with the specific prompt:

# Suppose required tokenizer and model have already been loaded

instruction = "Write a Python function to tell me what the date is today."
prompt = f"<s> ### Instruction:\n{instruction}\n\n### Response:\n"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=2048)[0]
decoded_output = tokenizer.decode(output, skip_special_tokens=True).split("### Response:\n")[-1].strip()
print(decoded_output)

You can also interact with NewHope in a dialog manner with the following prompt:

<s> ### Instruction:\nQ1\n\n### Response:\nA1</s><s> ### Instruction:\nQ2\n\n### Response:\nA2</s>

Evaluation

Local setup

  1. Install HumanEval for evaluation. Details

  2. Install dependencies

    pip install -r requirements.txt
    

For HumanEval, we use the following prompt:

example_input = 'def is_odd(number: int) -> bool:\n    """ Check whether the given number is odd\n    >>> is_odd(3)\n    True\n    >>> is_odd(6)\n    False\n    """\n'
example_output = 'def is_odd(number: int) -> bool:\n    """ Check whether the given number is odd\n    >>> is_odd(3)\n    True\n    >>> is_odd(6)\n    False\n    """\n    return number % 2 == 1'

task_in_humaneval = "REPLACE `task_in_humaneval` WITH THE SPECIFIC TASK IN HUMANEVAL DATA"

prompt = f"<s> ### Instruction:\nComplete the given function below:\n\n{example_input}\n\n### Response:\n{example_output}</s><s> ### Instruction:\nComplete the given function below:\n\n{task_in_human_eval}\n\n### Response:\n"

To reproduce the results on HumanEval, use the following script:

python complete.py --base_model SLAM-group/NewHope --output_dir output --n_gpu 8

The above script will generate samples.jsonl in output_dir, which can be directly evaluated by HumanEval. Evaluation procedure. We conducted the experiment with fp16 on 8xA800, 80GB GPUs, reaching 66.5% on Pass@1 (v.s. GPT4 67.0%).

Citation

@misc{2023newhope,
    title={NewHope: Harnessing 99% of GPT-4's Programming Capabilities},
    author={Wanyun Cui and Qianle Wang},
    howpublished = https://github.com/SLAM-group/newhope,
    year={2023}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 51.9
ARC (25-shot) 61.09
HellaSwag (10-shot) 84.03
MMLU (5-shot) 55.73
TruthfulQA (0-shot) 44.96
Winogrande (5-shot) 74.98
GSM8K (5-shot) 15.85
DROP (3-shot) 26.66
Downloads last month
1,504
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Spaces using WhoTookMyAmogusNickname/NewHope_HF_not_official 21