Issues with tokenizer causing bad performance of model.

#66
by Takuonline - opened

I am seeing a strange issue with the Phi-3-mini tokenizer: it appears to misinterpret a character in the prompt, which in turn changes how the model responds.

Here is an example where I load and run the model with the same configuration as the example on its Hugging Face page:

messages = [
{"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
{"role": "user", "content": """Why does the string r"F:\ml_models\llm\2102Phi-3-mini-128k-instruct" function correctly in my Python code, while the string "F:\ml_models\llm\2102Phi-3-mini-128k-instruct" causes an error?"""},
]

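from transformers import pipeline

# model and tokenizer are loaded as in the model card example (not shown here)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_args = {"max_new_tokens": 1_000, "return_full_text": False, "temperature": 0.0, "do_sample": False}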

output = pipe(messages, **generation_args)

print(output[0]['generated_text'])

The model's output:

The issue you're experiencing is likely due to the presence of a non-standard character in your string. In your case, the character "ˆ" is causing the error. This character is not recognized in standard ASCII or Unicode, which Python uses for string handling.

Here's how you can resolve this issue:

1. Identify the non-standard character: In your string, the non-standard character is "ˆ". You can identify it by using the `ord()` function in Python, which returns the Unicode code point of a character.

```python
print(ord('ˆ'))
```

2. Replace the non-standard character: Once you've identified the non-standard character, you can replace it with a standard one. In this case, you can replace "ˆ" with a space or any other character that makes sense in your context.

```python
corrected_string = "F:\ml_models\llm 2 Phi-3-mini-128k-instruct"
```

3. Test your code: After making the replacement, run your code again to see if the error persists.

```python
print(corrected_string)
```

Remember, it's crucial to ensure that all characters in your strings are standard ASCII or Unicode characters. This will help avoid such issues in the future.

I tried this with both transformers==4.40.0 (recommended by the Microsoft team) and the latest transformers==4.41.0.dev0. I am not using flash attention.
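
One thing worth checking before blaming the tokenizer is what Python itself does with the prompt literal. The user message above is a normal (non-raw) triple-quoted string, so the sequence `\210` in the path is parsed as an octal escape, and the inner `r"..."` has no effect because it is just text inside the outer literal. The sketch below uses a shortened path fragment (an illustrative assumption, not the full prompt) to show the difference. Byte 0x88 maps to "ˆ" under Windows-1252, which may be why the reply mentions that character, though that link is a guess.

```python
# Minimal sketch with a shortened path fragment (illustrative, not the full prompt).
plain = "llm\2102Phi-3"   # non-raw: "\210" is an octal escape -> one char, U+0088
raw = r"llm\2102Phi-3"    # raw: the backslash and digits are kept verbatim

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))

# Byte 0x88 decodes to U+02C6 (modifier letter circumflex) under Windows-1252.
circumflex = bytes([ord(plain[3])]).decode("cp1252")
print(hex(ord(plain[3])), hex(ord(circumflex)))
```

If this is what is happening, writing the message content as a raw string (or doubling the backslashes) should send the intended path to the model.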

Meta Llama org

Hey! Wrong repo: the Llama 3 tokenizer is very different from the Llama 2 one, and this issue is about Phi-3 mini 😉

Meta Llama org

Closing as unrelated

ArthurZ changed discussion status to closed
