Issue with q8_0

by sm54 - opened Jun 6, 2024

sm54

Jun 6, 2024

edited Jun 6, 2024

Hello,

I downloaded the q8_0 model and it is giving me a strange response, shown below. I am using text-generation-webui with the chat template set to "Custom (obtained from model metadata)". My other parameters are fairly standard.

AI
How can I help you today?

You
Hello

AI
Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly

bartowski

Qwen org Jun 6, 2024

Try either not offloading to CUDA, or enabling flash attention (-fa in llama.cpp).

sm54

Jun 6, 2024

With flash attention enabled I get the same result, and if I set gpu layers to zero and tensorcores to off I get this:

AI
How can I help you today?

You
Hello

AI
Blockly is a visual programming language that allows users to create programs using blocks. It is designed to be intuitive and easy to use, making it a popular choice for teaching programming concepts to beginners. Here are some ways Blockly can help you:

Educational Tool: Blockly is often used in educational settings to teach children and adults the basics of programming. It breaks down complex concepts into simple, manageable blocks that are easy to understand. This makes it an excellent tool for learning programming logic, algorithms, and basic syntax.
Interactive Learning: The visual nature of Blockly allows for interactive learning.

sm54

Jun 6, 2024

Okay, if I set it to CPU-only mode as well, it seems to work now; it just runs slowly.

bartowski

Qwen org Jun 6, 2024

Ah okay. Was hoping fa would work but maybe not. Either way, there's a CUDA bug.

zhicwu

Jun 6, 2024

Confirmed working after applying this patch.

najomi

Jun 7, 2024

I'm using jan.ai and I get the Blockly error when using CUDA. Using the CPU works fine, but it's super slow. Is there any fix for CUDA?

shishao

Jun 8, 2024

I encountered the same issue. I was running qwen2 on ollama, and qwen2 returned a stream of repeated characters. The issue was resolved by setting the environment variable OLLAMA_FLASH_ATTENTION=1. The relevant link is https://github.com/QwenLM/Qwen2?tab=readme-ov-file.

shishao

Jun 8, 2024

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"

[Install]
WantedBy=default.target

Above is my Ollama service configuration. I didn't know how to add more than one setting to a single Environment line, so I added a second Environment line. Maybe that was wrong, but it worked.
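
On the question of combining settings: systemd allows the Environment= directive to be repeated, and every listed variable is set, so the unit above is valid as written. A single line carrying several quoted assignments should be equivalent. A minimal sketch of just the [Service] fragment, reusing the two values from the unit above (the other directives stay unchanged):

```ini
[Service]
# Two assignments on one Environment= line, each quoted separately.
# Equivalent to writing two separate Environment= lines.
Environment="OLLAMA_HOST=0.0.0.0" "OLLAMA_FLASH_ATTENTION=1"
```

After editing the unit, a `systemctl daemon-reload` followed by a restart of the service is normally needed before the new environment takes effect.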