Working with 2x RTX 4090 and GPTQ, but extremely slow

#9
by mullerse - opened

Hey @TheBloke ,

this Mixtral GPTQ release of yours is really awesome. It works, and the quality of the responses is outstanding!
My problem is the response time of the local model.

To compare, I installed the Python script (see source code below) and Oobabooga on the same system. Both are running with:
Model: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
Branch: gptq-4bit-32g-actorder_True

Oobabooga:

Question: How much drinking water is there on our planet?
Answer: 00:00:11
Task: Write a poem with minimum 200 words about dolphins.
Answer: 00:00:21

Python/PyCharm:

Question: How much drinking water is there on our planet?
Answer Generate: 00:01:53
Answer Pipeline: 00:01:39

Task: Write a poem with minimum 200 words about dolphins.
Answer Generate: 00:02:25
Answer Pipeline: 00:02:34

In both test cases the model is loaded into the dedicated GPU memory.

As you can see, the response times for the same model on the same hardware are extremely different!
Do you have any idea why Oobabooga is so much faster, or what I can install in the PyCharm project to make it perform much better?
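
To make the comparison fairer, I could also measure tokens per second instead of raw wall-clock time, since the two answers probably differ in length. A minimal sketch, reusing the model and tokenizer from the source code below:

import time

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=512):
    # Rough throughput: number of newly generated tokens divided by elapsed time
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    start = time.perf_counter()
    output = model.generate(inputs=input_ids, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - input_ids.shape[-1]
    return new_tokens / elapsed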

System:

GPU: 2x RTX 4090
RAM: 100 GB DDR4
CPU: AMD EPYC 7282
OS: Windows 10
Python: 3.10
PyCharm: 2023.2.1
torch.__version__: 2.1.0+cu121
torch.cuda.is_available(): True
transformers: 4.37.0.dev0
optimum: 1.16.0
auto-gptq: 0.7.0.dev0+cu121*
*: You told us to do this:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
DISABLE_QIGEN=1 pip3 install .

DISABLE_QIGEN=1 does not work for me:
DISABLE_QIGEN=1 : The term "DISABLE_QIGEN=1" is not recognized as the name of a cmdlet, function, script file, or executable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:1 + DISABLE_QIGEN=1 pip3 install . + ~~~~~~~~~~~~~~~ + CategoryInfo : ObjectNotFound: (DISABLE_QIGEN=1:String) [], CommandNotFoundException + FullyQualifiedErrorId : CommandNotFoundException

So I just installed it with "pip3 install .".
Could this unset variable be responsible for the slow behaviour?
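
As far as I understand, DISABLE_QIGEN=1 pip3 install . is POSIX shell syntax; in PowerShell the variable would have to be set separately before the install, roughly like this:

# PowerShell: set the variable for the current session, then install
$env:DISABLE_QIGEN = "1"
pip3 install .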

Source code:

from datetime import datetime
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ

model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=False,
    revision="gptq-4bit-32g-actorder_True",
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Write a poem with minimum 200 words about dolphins."
system_message = "You have an extremely high level of general knowledge and always answer in English."
prompt_template = f'''[INST] <<SYS>>{system_message}<</SYS>>{prompt} [/INST]'''

print("\n\n*** Generate:")
print(str(datetime.now().strftime('%d.%m.%Y - %H:%M:%S')) + ": Start Generate")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
print(str(datetime.now().strftime('%d.%m.%Y - %H:%M:%S')) + ": End Generate\n")

print("*** Pipeline:")
print(str(datetime.now().strftime('%d.%m.%Y - %H:%M:%S')) + ": Start Pipeline")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)
print(pipe(prompt_template)[0]['generated_text'])
print(str(datetime.now().strftime('%d.%m.%Y - %H:%M:%S')) + ": End Pipeline")

No errors appear during execution, only the following warning:
Warning (from warnings module):
File "(...)\Python\Python310\lib\site-packages\transformers\generation\utils.py", line 1547
warnings.warn(
UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
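
Following the link in the warning, I could presumably move the sampling parameters into a GenerationConfig instead of passing them ad hoc; a minimal sketch with the values from my script, reusing the model and input_ids from above:

from transformers import GenerationConfig

# Bundle the sampling parameters instead of modifying the model config
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512,
)
output = model.generate(inputs=input_ids, generation_config=gen_config)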

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.23 Driver Version: 536.23 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:03:00.0 Off | Off |
| 0% 41C P8 12W / 450W | 514MiB / 24564MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:03:01.0 Off | Off |
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2632 C+G ....Search_cw5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 2924 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 6764 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 7656 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 7748 C+G ....Search_cw5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 8816 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 10340 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 1 N/A N/A 4980 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
+---------------------------------------------------------------------------------------+

Thank you very much in advance! :)


@mullerse any insights? I observe the same phenomenon (Oobabooga fast, Python scripting excruciatingly slow) with similar specs, albeit 2x 48 GB A6000s.
I'm even running the script from the same conda environment Oobabooga uses.
I observe the model loaded onto the GPU; the GPU is not sitting idle.
