Unable to get accurate infilling

#1
by narphorium - opened

According to the model card, the way to do infilling is to pass in the input as:

<SUF> {some text following cursor} <PRE> {some prelude text here} <MID>

In the example code, the special token IDs are specified as:

<SUF> = 50253
<PRE> = 50254
<MID> = 50255

However, when I generate completions using those tokens I haven't been able to get any accurate results. For example:

prefix = "def top_k(values):\n"
suffix = "  return results"

... infills as:

def top_k(values):
return results.count(values  return results
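For reference, this is roughly how I'm producing that completion (a sketch; my exact generation settings may differ slightly):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

prefix = "def top_k(values):\n"
suffix = "  return results"

# <SUF> {suffix} <PRE> {prefix} <MID>, using the IDs from the example code above
model_input = [50253, *tok(suffix)["input_ids"], 50254, *tok(prefix)["input_ids"], 50255]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])
print(output)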

This looks like the suffix is being ignored and the model is just completing after the prefix.

When I decode the special tokens back to text I get:

50253 = ' Outcomes'
50254 = 24 spaces
50255 = 23 spaces
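(That's from decoding each ID directly, roughly like this:)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")
# decode each sentinel ID from the example code back to text
for token_id in (50253, 50254, 50255):
    print(token_id, repr(tok.decode([token_id])))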

So I'm wondering if those are really the correct tokens to separate the FIM inputs?

CarperAI org

Thanks for bringing this to our attention! Looking into this and will get back to you asap.

Thank you for raising this concern. It seems like it's an issue with the tokenizer. Unfortunately all of our engineers are OOO for the long weekend, we should have a patch out Tuesday or Wednesday. Thanks.

CarperAI org

There was an issue where the sentinel <|SUF|>, <|PRE|>, and <|MID|> tokens did not have the correct IDs in the uploaded tokenizer and model card! Please try clearing the Hugging Face cache and redownloading the model :))
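If the old files are still cached, passing force_download=True to from_pretrained should also pull the fixed versions (the expected token strings in the comment are based on the example below):

from transformers import AutoTokenizer, AutoModelForCausalLM

# re-download so the cached tokenizer/model files are replaced by the fixed ones
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B", force_download=True)
model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B", force_download=True)

# the corrected sentinel IDs should now map to the FIM tokens
print(tok.convert_ids_to_tokens([50277, 50278, 50279]))  # expected: <|SUF|>, <|PRE|>, <|MID|>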

This is what I get when trying out open-ended generation on a simple code function:

def score(x,y) -> int:
    """
    

and also infilling with:

def score(x,y) -> int:
    """
    <|MID|> (infill here)
    """

    score = x + y
    return score
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
suffix = '"""\n\n    score = x + y\n    return score'

# corrected sentinel IDs: <|SUF|> = 50277, <|PRE|> = 50278, <|MID|> = 50279
model_input = [50277, *tok(suffix)["input_ids"], 50278, *tok(prefix)["input_ids"], 50279]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])

print(output)

'<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# non-infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
model_input = [*tok(prefix)["input_ids"]]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=100)[0])
print(output)

'def score(x, y) -> int:\n"""\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y_list))\n\ndef get_point_score(x, y) -> int:\n """\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y'

Hope this helps! I will also update the model card with this example :)
