Performance?

#1
by flashvenom - opened

Have you tried benchmarking this model? Curious to see what the Orca approach does.

Hi,
Thanks! Yes, I am in the process of doing so; we just finished training last night. Any particular recommendations?

HumanEval seems appropriate to check, since better reasoning should also lead to better programming performance and it's typically most indicative of relative model performance.

Yeah, though HE is pretty limited and not representative of real-world development, it is the simplest way to evaluate the functional correctness of a model's code generation capability.
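
(For reference, a minimal sketch of how such a HumanEval run can be wired up with OpenAI's human-eval harness; generate_one_completion is a hypothetical wrapper around the model's generation call, not part of the harness:)

from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # 164 tasks, each with a "task_id" and a "prompt"

samples = []
for task_id, problem in problems.items():
    # generate_one_completion is a hypothetical wrapper that should return
    # only the completed function body for the given prompt.
    completion = generate_one_completion(problem["prompt"])
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score afterwards with the harness CLI:
#   evaluate_functional_correctness samples.jsonl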

Do you have any plans to retrain the model for a larger context window, possibly with QLoRA and Landmark attention? This model seems to perform so well under 1k tokens; I would love to see it able to do more!
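
(For reference, a rough sketch of the kind of QLoRA setup usually meant here, using transformers + peft + bitsandbytes; the model id, target modules, and hyperparameters are placeholders rather than the author's actual recipe, and Landmark attention would additionally require a patched attention implementation that is not shown:)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "path/to/orca-mini-checkpoint"  # placeholder

# Load the base model in 4-bit so it fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the LLaMA attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()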

As a matter of fact I do, but it has to wait for some time; I am trying to arrange a few other things like GPU credits, etc.

I will be waiting patiently, thank you! I'll try training the 3B (or SantaCoder if I can somehow get the training working) with LoRA on a 3090. I'm very interested in how this dataset impacts the base performance of a model, and then how added LoRAs impact performance once the model has a larger foundation built on these datasets. WizardCoder, for example, would benefit significantly more from a dataset like this than from the original Wizard dataset. WizardCoder was about 10% behind GPT-3.5 on the coding benchmark; with this dataset the gap could become negligible. Also, seriously, thank you for making this dataset and training these models; from what it looks like, you are a one-person team, which makes all this WAY more impressive. Thank you!

Actually, I have made some attempts at HumanEval-Python, but I ran into two issues that I could not get around, and I am not sure whether they are fundamental problems of the model or of the data:

  1. Hard to follow explicit instructions:
    I expect the model to output pure Python code, but it does not seem possible; I tried several prompts:
# system = 'You are an AI programming assistant that follows instruction to write Python3 code extremely well. Given the instruction and function signature, you are only allowed to implement the function in response.'
# system = 'You are an AI programming assistant that follows instruction to write Python3 code extremely well. Given the instruction and function signature, you must implement the function in response. The response MUST only consists of Python3 code.'
system = 'You are an AI programming assistant that follows instruction to write Python3 code extremely well.'
...
# Build the Orca-style prompt from the system message, instruction, and optional input.
def generate(system, instruction, input=None):
    if input:
        prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    else:
        prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:\n"
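    # (Sketch of the omitted generation step, assuming `tokenizer` and `model`
    # are already loaded via transformers; not necessarily the exact original code.)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Return only the newly generated text, without the echoed prompt tokens.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)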

For the instruction part, I have tried formats like these:

1. from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n
2. Check if in given list of numbers, are any two numbers closer to each other than\ngiven threshold.\n>>> has_close_elements([1.0, 2.0, 3.0], 0.5)\nFalse\n>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\nTrue\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:

The output still contains instructions, explanations, and code snippets in Markdown format.
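
(A common workaround is to post-process the completion and keep only the code inside the Markdown fences; a minimal sketch, assuming the raw completion is available as a string:)

import re

def extract_python_code(completion):
    # Prefer a fenced ```python / ``` block; fall back to the raw text
    # if the model did not use fences at all.
    match = re.search(r"```(?:python|py)?\s*\n(.*?)```", completion, re.DOTALL)
    return match.group(1) if match else completion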

  2. The indentation is missing in the code snippets:
    Here is one typical output, for the last question in HumanEval:

Here's the Python3 code to generate the even digits between two positive integers in ascending order:

def generate_integers(a, b):
 even_digits = []
 for i in range(a, b+1):
 even_digits.append(i)
 even_digits.sort(reverse=True)
 return even_digits

This code will return a list of even digits between a and b, in ascending order.


I printed it directly with print() and loguru, but the code is apparently not executable due to indentation errors. I also ran it in debug mode and tried other models (like StarCoder); this seems to be a problem specific to Orca. Maybe the tokenizer merged multiple neighboring white spaces?
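
(One quick way to test the whitespace hypothesis is a tokenizer round trip; a sketch, with the model id left as a placeholder:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/orca-mini-checkpoint")  # placeholder

snippet = "def f(x):\n    return x\n"
decoded = tokenizer.decode(tokenizer(snippet)["input_ids"], skip_special_tokens=True)

# If the tokenizer collapsed the leading spaces, the round-tripped string
# will differ from the original snippet.
print(repr(snippet))
print(repr(decoded))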
