Russian words in output

#18 opened by syf97

When I tried out the model, I received Russian words in my replies in a lot of test cases, like:
"Ъ answер:",
"Исходный текст: БСч (Поч",
"зеро Новембар 2019 г."

All the prompts I tried were in English, and I asked it to answer simple questions about a given text.

Odd. Could you provide the steps to reproduce?

Sure:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "senseable/WestLake-7B-v2"

# 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.add_eos_token = True

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
device = "cuda:0"

TESTING THE BASE MODEL

inputs = tokenizer("What is that movie where a bunch of knights search for a golden cup using coconuts as horses?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

"What is that movie where a bunch of knights search for a golden cup using coconuts as horses?зеро Новембар 2019 г.

This movie you're describing sounds like it could be "Monty Python and the Holy Grail" (1975). In this classic British comedy film by Mont"

It looks like the way you're setting up your tokenizer's padding_side and add_eos_token may be the issue.
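
One quick way to check this (a minimal sketch, reusing the tokenizer object from your snippet; the probe prompt is just an example): with add_eos_token = True the tokenizer appends the EOS token to the encoded prompt itself, which can push the model into starting an unrelated continuation.

probe = tokenizer("What is that movie where a bunch of knights search for a golden cup using coconuts as horses?", return_tensors="pt")
print(probe["input_ids"])      # check whether the prompt now ends with the EOS id
print(tokenizer.eos_token_id)  # if the last prompt id matches this, EOS is being appended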

Here's an example of how I generate output_ids; adding a chat_template is optional but good for testing:

...
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
tokenizer.pad_token_id = 2  # use the EOS id as the pad token

chat_template = "{% for message in messages %}{{bos_token + message['role'] + '\n' + message['content'] + eos_token + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ bos_token + 'assistant\n' }}{% endif %}"        
tokenizer.chat_template = chat_template

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
model.eval()

# system_prompt example
system_prompt = "You are an attentive, unbiased, and knowledgeable guide, always striving to deliver accurate and helpful responses to a wide range of inquiries. You're here to assist, not to entertain or form personal opinions, but aim to be as friendly and approachable."

messages = [
     {"role": "system", "content": system_prompt},
     {"role": "user", "content": "What is that movie where a bunch of knights search for a golden cup using coconuts as horses?" },
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors='pt').to(device)
outputs = model.generate(input_ids=inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
...
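
The generated ids can then be decoded the same way as in your snippet above (a sketch, assuming the outputs tensor from the generate call):

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)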
