bigcode/santacoder · How to run the Fill-in-the-middle setting

Jan 11, 2023

I have been able to get the model to generate autoregressively. However, when I tokenize a sequence containing the special tokens shown in the FIM example in the model card ("<fim-prefix>", "<fim-suffix>", "<fim-middle>"), each special token is split into multiple token ids. I am also unable to get good generations in the FIM setting, even with the example provided in the model card. The FIM tokens do not appear among the tokenizer's special tokens either.

Kindly suggest how to use the fill-in-the-middle setting of SantaCoder.

loubnabnl

BigCode org Jan 14, 2023

edited Jan 14, 2023

Hi, you need to manually add the FIM special tokens to the vocab. You will also need to pass return_token_type_ids=False when tokenizing, so that the tokenizer output does not include token type ids that might confuse the order. We will try to make the model card clearer about this. Here's a working example. You can also find more details in this notebook.

# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", revision="fim", trust_remote_code=True)
tokenizer_fim = AutoTokenizer.from_pretrained("bigcode/santacoder", padding_side="left")

FIM_PREFIX = "<fim-prefix>"
FIM_MIDDLE = "<fim-middle>"
FIM_SUFFIX = "<fim-suffix>"
FIM_PAD = "<fim-pad>"
EOD = "<|endoftext|>"

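# Register the FIM markers as special tokens so each one maps to a single token id,
# and reuse the end-of-document token for padding.
tokenizer_fim.add_special_tokens({
  "additional_special_tokens": [EOD, FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD],
  "pad_token": EOD,
})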

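# FIM prompt layout: prefix, then suffix, then <fim-middle>; the model generates the missing middle.
input_text = f"{FIM_PREFIX}def fib(n):{FIM_SUFFIX}    else:\n        return fib(n - 2) + fib(n - 1){FIM_MIDDLE}"
# return_token_type_ids=False keeps token_type_ids out of the inputs passed to generate().
inputs = tokenizer_fim(input_text, return_tensors="pt", padding=True, return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=25)
# Keep special tokens in the decoded text so the FIM markers stay visible.
generation = [tokenizer_fim.decode(tensor, skip_special_tokens=False) for tensor in outputs]

print(generation[0])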

<fim-prefix>def fib(n):<fim-suffix>    else:
        return fib(n - 2) + fib(n - 1)<fim-middle>
    if n == 0:
        return 0
    elif n == 1:
        return 1
<|endoftext|><fim-prefix>
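
To build prompts in this layout and pull just the infilled code out of a decoded string like the one above, two small helpers along these lines could work. This is only a sketch: `build_fim_prompt` and `extract_middle` are hypothetical names (not part of transformers or the SantaCoder repo), and it assumes the decoded text keeps its special tokens and that the model ends the infill with `<|endoftext|>`.

```python
FIM_PREFIX = "<fim-prefix>"
FIM_MIDDLE = "<fim-middle>"
FIM_SUFFIX = "<fim-suffix>"
EOD = "<|endoftext|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prompt in the prefix / suffix / middle order used above."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_middle(decoded: str) -> str:
    """Return the text generated after <fim-middle>, cut at <|endoftext|> if present."""
    _, marker, generated = decoded.partition(FIM_MIDDLE)
    if not marker:
        raise ValueError("decoded text does not contain the <fim-middle> marker")
    # Anything after the end-of-document token is not part of the infill.
    middle, _, _ = generated.partition(EOD)
    return middle


prompt = build_fim_prompt(
    "def fib(n):",
    "    else:\n        return fib(n - 2) + fib(n - 1)",
)
# Decoded output copied from the example above.
decoded = (
    prompt
    + "\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n"
    + "<|endoftext|><fim-prefix>"
)
print(extract_middle(decoded))
```

If generation is cut off by max_new_tokens before `<|endoftext|>` appears, `extract_middle` simply returns everything after the `<fim-middle>` marker.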

loubnabnl changed discussion status to closed Jan 24, 2023

loubnabnl

BigCode org Jan 25, 2023

edited Jan 28, 2023

FYI the special tokens are now in the tokenizer by default: https://huggingface.co/bigcode/santacoder/discussions/11

You also no longer need to specify return_token_type_ids=False; we turned it off by default.
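
To confirm that a given tokenizer treats the FIM markers as single tokens (the symptom in the original question was several ids per marker), a quick check along these lines may help. It is a sketch: `find_split_tokens` is a hypothetical helper, and it relies only on the tokenizer's `encode(text, add_special_tokens=False)` method.

```python
FIM_TOKENS = ["<fim-prefix>", "<fim-middle>", "<fim-suffix>", "<fim-pad>"]


def find_split_tokens(tokenizer, tokens=FIM_TOKENS):
    """Return {token: ids} for every marker that does not encode to exactly one id."""
    split = {}
    for token in tokens:
        ids = tokenizer.encode(token, add_special_tokens=False)
        if len(ids) != 1:
            split[token] = ids
    return split
```

With a real tokenizer, for example `find_split_tokens(AutoTokenizer.from_pretrained("bigcode/santacoder"))`, an empty dict would be expected when the markers are registered as special tokens.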