Russian words in output

#18 opened by syf97

When I tried out the model, I received Russian words in my replies in a lot of test cases, like:
"Ъ answер:",
"Исходный текст: БСч (Поч",
"зеро Новембар 2019 г."

All the prompts I tried were in English, and I asked it to answer simple questions about a given text.

Odd. Could you provide the steps to reproduce?

Sure:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "senseable/WestLake-7B-v2"

# 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.add_eos_token = True

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
device = "cuda:0"

TESTING THE BASE MODEL

inputs = tokenizer("What is that movie where a bunch of knights search for a golden cup using coconuts as horses?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

"What is that movie where a bunch of knights search for a golden cup using coconuts as horses?зеро Новембар 2019 г.

This movie you're describing sounds like it could be "Monty Python and the Holy Grail" (1975). In this classic British comedy film by Mont"

It looks like the way you're setting up your tokenizer's padding_side and add_eos_token may be the issue.
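
One quick way to check this (a minimal sketch, reusing the tokenizer object from your snippet; the probe prompt is just an example): with add_eos_token = True the tokenizer appends the EOS token to the encoded prompt itself, which can push the model into starting an unrelated continuation.

probe = tokenizer("What is that movie where a bunch of knights search for a golden cup using coconuts as horses?", return_tensors="pt")
print(probe["input_ids"])      # check whether the prompt now ends with the EOS id
print(tokenizer.eos_token_id)  # if the last prompt id matches this, EOS is being appended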

Here's an example of how I generate output_ids; adding a chat_template is optional but good for testing:

...
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
tokenizer.pad_token_id = 2  # use the EOS id as the pad token

chat_template = "{% for message in messages %}{{bos_token + message['role'] + '\n' + message['content'] + eos_token + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ bos_token + 'assistant\n' }}{% endif %}"        
tokenizer.chat_template = chat_template

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
model.eval()

# system_prompt example
system_prompt = "You are an attentive, unbiased, and knowledgeable guide, always striving to deliver accurate and helpful responses to a wide range of inquiries. You're here to assist, not to entertain or form personal opinions, but aim to be as friendly and approachable."

messages = [
     {"role": "system", "content": system_prompt},
     {"role": "user", "content": "What is that movie where a bunch of knights search for a golden cup using coconuts as horses?" },
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors='pt').to(device)
outputs = model.generate(input_ids=inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
...
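
The generated ids can then be decoded the same way as in your snippet above (a sketch, assuming the outputs tensor from the generate call):

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)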
