Issue with GPU Utilization in Colab Notebook

#14
by Sagar3745 - opened

Hi

I'm encountering an issue with my Google Colab notebook where it doesn't seem to utilize the GPU, despite having the GPU runtime enabled. I've been working on a project that requires GPU acceleration, and this issue is hindering my progress.

It takes 11 minutes to respond to a simple "Hi". Something is wrong; can someone help, please?

Here is the link to my Colab notebook for reference: https://colab.research.google.com/drive/1331XPrqg4wKvT5ymQOwG4QY3Xk_kNLtl?usp=sharing

To provide more context:

I've ensured that the notebook settings are set to use GPU.
I've tried restarting the runtime and resetting all runtimes, but the issue persists.
Model I have used: mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
I'm new to using llama-cpp and GGUF files.

Thank you in advance for your time and assistance.

Here is a screenshot of where it is running. This is a second run; the first one took 11 minutes. It is not using the CPU or the GPU at all. I don't get it.

[screenshot]

@Sagar3745 I think the problem might be with n_gpu_layers set to zero. If I understand correctly, this is the number of layers you are offloading from the CPU to the GPU. Try setting n_gpu_layers to a number and test from there: the higher the number, the faster inference runs, but you might run out of VRAM and the notebook will crash.
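
A minimal sketch of what that looks like with llama-cpp-python; the model path, layer count, and context size below are placeholders to adjust for your setup:
....................................................................................
from llama_cpp import Llama

# Load the GGUF model and offload part of it to the GPU.
# n_gpu_layers=0 keeps everything on the CPU; higher values move more
# layers onto the GPU (faster inference, more VRAM used).
llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers=20,
    n_ctx=2048,
)

output = llm("[INST] Hi [/INST]", max_tokens=64)
print(output["choices"][0]["text"])
....................................................................................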

Hi @hammad93, thanks for the help.
I tried some values for n_gpu_layers, but it did not help. I have executed the same code on a server and it works on the CPU, but inference is slow; if the prompt has many tokens it takes quite a bit longer.
Now I'm working on running it on the GPU, but no result so far.

Thanks a lot for the help, though.

@Sagar3745 try adding LLAMA_CUBLAS=1 to CMAKE_ARGS when (re)installing llama-cpp-python.

Hi @hammad93, tried that, but it didn't help.

This is what the model is printing in my terminal.

[screenshot]

[screenshot]
for each generation.

I'm using the standard template, the same as mentioned in the repo:
[INST] {prompt} [/INST]
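
(For reference, applying that template in code before calling the model looks roughly like this; the helper name is just illustrative:)
....................................................................................
# Illustrative helper: wrap the user message in the instruct template.
def build_prompt(user_message: str) -> str:
    return f"[INST] {user_message} [/INST]"

prompt = build_prompt("Hi")
# output = llm(prompt, max_tokens=64)  # llm created with llama-cpp-python as above
....................................................................................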

@Sagar3745 That's weird. I got it to work on the 2x T4 GPUs on Kaggle using LLAMA_CUBLAS=1 and n_gpu_layers, and it uses both GPUs while running inference. Try pulling the llama.cpp repository and building it from source.
GitHub repo: https://github.com/ggerganov/llama.cpp
Make command: make LLAMA_CUBLAS=1

@hammad93 Thanks, I got it. I was able to load it on the GPU, but it crashed because Colab only has 15 GB of GPU memory. I'm not sure how you did it on Kaggle, though: the model is 24 GB and the allowed storage is only 19.5 GB, so I'm not able to download the model.

Can you tell me how you download the model on Kaggle?
Model: mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

For now I have tried a smaller model:

[screenshot]

@Sagar3745 Great! To use the full 73 GB, change directory to /kaggle. Also try playing around with the n_gpu_layers number to fit the model in VRAM, and check out the docs, as there are other options that should help with memory allocation.
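
One way to pull the GGUF file straight onto the bigger disk is with huggingface_hub; a rough sketch, assuming the file comes from TheBloke's Mixtral GGUF repo (adjust repo_id, filename, and local_dir to what you actually use):
....................................................................................
from huggingface_hub import hf_hub_download

# Download the GGUF file into /kaggle/working, which sits on the larger disk.
model_file = hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",  # assumed repo id
    filename="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    local_dir="/kaggle/working",
)
print(model_file)  # local path to the downloaded file
....................................................................................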

How did you guys do it on Kaggle? Which accelerator? Could you be kind enough to share the notebook?

@hammad93 Hey, could you share your notebook? I would greatly appreciate it. I'm having a hard time with the implementation...

@LilWonga @ianuvrat
repo: https://github.com/mth93/mixtral_llama_cpp

You can use the main branch instead of the mixtral branch, as it's already merged into main. I tested this notebook with many of the open-source LLMs in GGUF format; it works great and utilizes both GPUs in the 2x T4 notebook. You'll have to play around with the context size (-c) and n-gpu-layers for each LLM, as the GPU layers setting (if I understand correctly) is how you use more VRAM instead of RAM. There are also other options for the llama.cpp server that might improve performance, which I'm still experimenting with. And the bigger the context, the better, as it lets the LLM remember larger parts of the conversation history.
This works for launching a chatbot and an OpenAI-compatible API.

Also check whether the model you're going to use needs a prompt template; you'll find it on the original model's page on Hugging Face. There is currently no way to add the prompt template to the llama.cpp server itself, so you'll have to add it to the prompt in the request (if you're using the API) or add it to the chat in the chatbot.
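
A rough sketch of adding the template to an API request, assuming the server exposes an OpenAI-compatible completions endpoint at /v1/completions (as llama-cpp-python's server does); the host, port, and route depend on how the server in the repo above is launched:
....................................................................................
import requests

# Send a completion request with the instruct template embedded in the prompt.
resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",  # assumed host/port/route
    json={
        "prompt": "[INST] What is a GGUF file? [/INST]",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["text"])
....................................................................................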

Based on what I tested, most LLMs will give you very weird responses if they require a prompt template and you don't use it.

Hope this helps!

Also, I'd appreciate it if anyone could explain the difference between GGUF, GGML, AWQ, etc. If I understand correctly, these are different quantization algorithms, but I have no idea what the differences between them are and how they affect performance and model size.

An update on the GPU issue: it could not be solved on my server, so I downloaded the original model mistralai/Mixtral-8x7B-Instruct-v0.1 and loaded it quantized.
GPU usage: 24 GB (without inference).

This works perfectly fine for me.

@Sagar3745, sorry, but I did not understand. What did you do, and how? I want to use this model with LangChain agents for inference.

@ianuvrat, initially I faced issues loading the GGUF model onto the GPU on my server; it worked in a Kaggle notebook and in Google Colab, where I was able to load it onto the GPU.
So instead of the .gguf model, I downloaded the original Mixtral instruct model from the mistralai Hugging Face page and just loaded it as a quantized model, which only takes 24 GB of GPU memory.

There is nothing new I did: before GGUF models, I used to load models like Llama 2 13B and Orca 2 13B as quantized models, and I did the same for the https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 model.
Hope this helps.
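
(For reference, a minimal sketch of what "loading it quantized" means here with transformers and bitsandbytes, using the mistralai/Mixtral-8x7B-Instruct-v0.1 repo id mentioned above; the fuller code Sagar shares below follows the same pattern:)
....................................................................................
import torch
import transformers

# 4-bit NF4 quantization via bitsandbytes; device_map="auto" spreads the
# model across whatever GPUs are available.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1"
)
....................................................................................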

@Sagar3745 Interesting. So you downloaded the original model and loaded it quantized, right?

Could you be kind enough to share the Colab notebook showing how you did this? I'll also try the same and see whether it works for me.

@ianuvrat, sure.
Here is the code, which uses the same model-loading method as I did.

Loading the model:
................................................................................................................................
import torch
import transformers
from torch import bfloat16
from langchain import HuggingFacePipeline, PromptTemplate, LLMChain
import re
import time

model_path = 'microsoft/Orca-2-13b'
hf_key = "hugging_api"  # your Hugging Face access token


def load_model():
    device = f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu"

    # 4-bit NF4 quantization config for bitsandbytes
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16,
    )
    model_config = transformers.AutoConfig.from_pretrained(
        model_path, use_auth_token=hf_key
    )

    # Load the model quantized and let accelerate place it on the available GPUs
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map="auto",
        use_auth_token=hf_key,
    )
    model.eval()
    print(f"Model loaded on {device}")

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path, use_auth_token=hf_key
    )

    # Text-generation pipeline wrapped for use with LangChain
    generate_text = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=True,
        task="text-generation",
        temperature=0.0,
        max_new_tokens=400,
        repetition_penalty=1.1,
        do_sample=False,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    return HuggingFacePipeline(
        pipeline=generate_text, model_kwargs={"temperature": 0}
    )


llm_model = load_model()
....................................................................................................................

Requirements:
....................................................................................
pip install -U llama-cpp-python
pip install -U transformers
pip install -U accelerate
pip install -U bitsandbytes
pip install -U langchain
pip install -U sentencepiece
....................................................................................
Remember, the original model is too huge to download in Colab or Kaggle unless you are a pro user. I did it on my server, which has enough disk space to download the model.

Colab link: https://colab.research.google.com/drive/1oHRk8dHYhGc9z6Olrx4pmFymf-LwhF_F?usp=sharing

Thanks mate. Will try with this!
