2-bit quantization produces "nonsense" output

#8
by igi-hf - opened

Converting the dbrx-instruct model to 4-bit using this command:

python -m mlx_lm.convert -q --q-bits 4 --hf-path databricks/dbrx-instruct --mlx-path dbrx-model-4bit

and running inference with this command:

python -m mlx_lm.generate --model dbrx-model-4bit --prompt "<|im_start|>system
You are DBRX, <using systemprompt copied from MLX-community>…<|im_start|>user
What's the difference between PCA vs UMAP vs t-SNE?<|im_end|>
<|im_start|>assistant
The difference" --trust-remote-code --max-tokens 500

works, but because my M1 has only 64 GB of shared memory, it generates just 0.045 tokens per second, which is not practical.
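A rough back-of-the-envelope estimate suggests why the 4-bit model struggles on 64 GB. The sketch below is only an approximation under stated assumptions that are not from this thread: roughly 132 billion total parameters for DBRX, every weight quantized, and one 16-bit scale plus one 16-bit bias stored per group of 64 weights. The helper name `quantized_weight_gb` is made up for illustration.

```python
# Rough weight-memory estimate for a group-wise quantized model.
# Assumptions (not from the thread): ~132e9 total parameters, all weights
# quantized, and a 16-bit scale plus a 16-bit bias per group of 64 weights.

def quantized_weight_gb(n_params, bits, group_size=64, overhead_bits=32):
    """Return the approximate size of the quantized weights in GB (1e9 bytes)."""
    bits_per_weight = bits + overhead_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 132e9  # assumed total parameter count

for bits in (4, 2):
    size = quantized_weight_gb(N_PARAMS, bits)
    print(f"{bits}-bit: ~{size:.1f} GB of weights")
```

Under these assumptions the 4-bit weights alone come to roughly 74 GB, more than the 64 GB available, which would be consistent with heavy swapping. The 2-bit weights come to roughly 41 GB and should fit.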

Therefore I converted the model using 2-bit quantization:

python -m mlx_lm.convert -q --q-bits 2 --hf-path databricks/dbrx-instruct --mlx-path dbrx-model-2bit

and running inference with this command:

python -m mlx_lm.generate --model dbrx-model-2bit --prompt "<|im_start|>system
You are DBRX,…<using systemprompt copied from MLX-community>…<|im_start|>user
What's the difference between PCA vs UMAP vs t-SNE?<|im_end|>
<|im_start|>assistant
The difference" --trust-remote-code --max-tokens 500

The response looks like:

_ knowing&ungl i ant restrict i reasonably_ola YOUR195_t i hindeter i i i i i i u iare fv rad t t delays managed AS pressingare i comm i reasonably…
...
Generation: 9.747 tokens-per-sec

installed packages:

mlx                     0.9.1
mlx-lm                  0.7.0
safetensors             0.4.2

tensorflow              2.16.1
tensorflow-macos        2.16.1
tokenizers              0.15.2
torch                   2.2.2
transformers            4.39.3

Any idea what has gone wrong?

MLX Community org

If you run this:

python -m mlx_lm.generate --model dbrx-model-2bit --prompt "What's the difference between PCA vs UMAP vs t-SNE?" --trust-remote-code --use-default-chat-template  --max-tokens 1000

what do you get? (The only difference is that the chat template is applied directly via --use-default-chat-template.)

Thank you for your quick response. Unfortunately, the response is the same:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
==========
Prompt: <|im_start|>system
You are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.
YOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.
You assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).
(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)
This is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.
YOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>
<|im_start|>user
What's the difference between PCA vs UMAP vs t-SNE?<|im_end|>
<|im_start|>assistant

 hugely i? iib inev Imp i i i i i the bel i i marked i reasonably pressingare reasonably 'arked pressing ' Imp bel pressing unt pressing i i pressing i& i& mutuallyoiolaare reasonablyola pressing wholly Imp i empowering.) pressing scor't.


 i comm_t imper reasonably i.


_start i.
... <and so on> ...
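For context on how much precision is given up when going from 4 bits to 2 bits, here is a minimal sketch of group-wise affine quantization in plain Python. It is not the MLX implementation. The group size of 64, the Gaussian toy weights, and the helper names `quantize_group` and `rms_error` are all assumptions for illustration. With 2 bits each group has only 4 representable values, compared with 16 at 4 bits.

```python
import random

def quantize_group(values, bits):
    """Affine-quantize one group to 2**bits levels; return dequantized values."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    return [round((v - lo) / scale) * scale + lo for v in values]

def rms_error(weights, bits, group_size=64):
    """Root-mean-square error after a quantize/dequantize round trip."""
    squared = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored = quantize_group(group, bits)
        squared += sum((a - b) ** 2 for a, b in zip(group, restored))
    return (squared / len(weights)) ** 0.5

random.seed(0)
# Toy stand-in for a weight matrix; real model weights are distributed differently.
weights = [random.gauss(0.0, 0.02) for _ in range(64 * 256)]

err4 = rms_error(weights, 4)
err2 = rms_error(weights, 2)
print(f"4-bit RMS error: {err4:.5f}")
print(f"2-bit RMS error: {err2:.5f}")
print(f"2-bit error is {err2 / err4:.1f}x the 4-bit error")
```

This only illustrates that the rounding error per weight grows several-fold at 2 bits. It does not show that quantization error alone explains the output above, and a conversion or loading bug cannot be ruled out from this thread.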
