The sample code generates bad code

#3
by devymex - opened

I ran the sample code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('/data/models/codegen25/7b-instruct', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('/data/models/codegen25/7b-instruct')


def format(prefix, suffix):
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"


prefix = "def hello_world():\n    "
suffix = "    return name"
text = format(prefix, suffix)
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False)[len(text):])
```

I got the following result:

```
    return render_template('<eom><|endoftext|>#
```

It doesn't seem to be the expected result.
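For what it's worth, the infill recipe for this model family terminates the infilled span with an `<eom>` sentinel (visible in the output above), so the decoded completion is normally truncated there before use. A minimal sketch of that post-processing step, with a hypothetical decoded string standing in for real model output:

```python
def extract_infill(decoded: str, prompt: str) -> str:
    # Keep only the text generated after the infill prompt.
    completion = decoded[len(prompt):]
    # The infilled span ends at the <eom> sentinel, if one was emitted.
    return completion.split("<eom>")[0]

# Hypothetical decoded output; a real run would produce `decoded`
# via tokenizer.decode(generated_ids[0], skip_special_tokens=False).
prompt = 'def hello_world():\n    <mask_1>    return name<|endoftext|><sep><mask_1>'
decoded = prompt + 'name = "world"\n<eom><|endoftext|>'
print(extract_infill(decoded, prompt))  # prints: name = "world"
```

This does not explain the bad generation itself, only how the raw output is meant to be trimmed.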

```
$ md5sum ./*
72abc1c968a3591ca78b4b3627182151  ./config.json
185162afdfbe7b61b786b1556233efcb  ./generation_config.json
a859f8a89685747ffd4171b870540c41  ./gitattributes.txt
957e7d6eba323e9fadfe67a0fc235fa5  ./pytorch_model-00001-of-00003.bin
0d25abaa01bde623d3c9b2c7e052f240  ./pytorch_model-00002-of-00003.bin
62e4b3239286f72cafc5e3f55b8d1cf2  ./pytorch_model-00003-of-00003.bin
238155cf5ccec23d742a2c2347063a15  ./pytorch_model.bin.index.json
e0d2431919f2d456fbc22f2aaf4488d7  ./README.md
cf2859a1a9efba39aa84d82b0f3ef426  ./tokenization_codegen25.py
fd3285d0e1655a66e051cfb520afb8e0  ./tokenizer_config.json
```
