Issues with tokenizer causing bad performance of model.

#30
by Takuonline - opened

I am experiencing a weird issue with the Phi-3-mini tokenizer: it misinterprets a character, which in turn influences how the model responds.

Here is an example in which I load and run the model with the same configuration as the example on their Hugging Face page:

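from transformers import pipeline

# `model` and `tokenizer` are loaded beforehand, as in the example on the model page (not shown here).
messages = [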
{"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
{"role": "user", "content": """Why does the string r"F:\ml_models\llm\2102Phi-3-mini-128k-instruct" function correctly in my Python code, while the string "F:\ml_models\llm\2102Phi-3-mini-128k-instruct" causes an error?"""},
]

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_args = {"max_new_tokens": 1_000, "return_full_text": False, "temperature": 0.0, "do_sample": False}

output = pipe(messages, **generation_args)

print(output[0]['generated_text'])

Output:

The issue you're experiencing is likely due to the presence of a non-standard character in your string. In your case, the character "ˆ" is causing the error. This character is not recognized in standard ASCII or Unicode, which Python uses for string handling.

Here's how you can resolve this issue:

1. Identify the non-standard character: In your string, the non-standard character is "ˆ". You can identify it by using the `ord()` function in Python, which returns the Unicode code point of a character.

```python
print(ord('ˆ'))
```

2. Replace the non-standard character: Once you've identified the non-standard character, you can replace it with a standard one. In this case, you can replace "ˆ" with a space or any other character that makes sense in your context.

```python
corrected_string = "F:\ml_models\llm 2 Phi-3-mini-128k-instruct"
```

3. Test your code: After making the replacement, run your code again to see if the error persists.

```python
print(corrected_string)
```

Remember, it's crucial to ensure that all characters in your strings are standard ASCII or Unicode characters. This will help avoid such issues in the future.

I tried this with both transformers==4.40.0 (recommended by the Microsoft team) and the latest transformers==4.41.0.dev0.
I am not using flash attention.
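
One possible contributing factor, which nobody in this thread confirms: the prompt above is written as a normal (non-raw) triple-quoted literal, so Python processes the backslash escapes before the tokenizer ever sees the text. The `r` prefix that appears inside the prompt is just part of the text and protects nothing. In particular, `\210` is a valid octal escape for the control character U+0088. Under Windows-1252 the byte 0x88 is displayed as "ˆ", which may be related to the character the model mentions, though that is a guess. The sketch below uses shortened stand-ins for the path, and the helper `find_control_chars` is a hypothetical name introduced only for illustration.

```python
import unicodedata

# Shortened stand-ins for the path fragment in the prompt (illustrative only).
normal = "llm\2102Phi-3"   # normal literal: "\210" is read as an octal escape
raw = r"llm\2102Phi-3"     # raw literal: the backslash and digits stay as typed


def find_control_chars(text):
    """Return (index, code point) pairs for characters in Unicode category Cc."""
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc"
    ]


print(repr(normal))                # 'llm\x882Phi-3'
print(len(normal), len(raw))       # 10 13
print(find_control_chars(normal))  # [(3, '0x88')]
print(find_control_chars(raw))     # []
```

Running a check like this on the prompt before passing it to the pipeline helps separate what Python did to the string from what the tokenizer did to it. Writing the whole prompt as a raw literal (`r"""..."""`) would keep the backslashes as typed.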

Maybe it is because of the pipeline interface. I tried the model here (its app.py loads the model and tokenizer separately):
https://huggingface.co/spaces/eswardivi/Phi-3-mini-128k-instruct
I got a different answer (screenshot attached: image.png).

Microsoft org

The issue should be mitigated now. Please try reloading the model (this will cache the new files); it should then produce a proper output.

gugarosa changed discussion status to closed