WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
I am getting "WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu." while loading Mixtral into the text-generation pipeline.
You don't have enough GPU memory. Consider renting a GPU, or loading the model in a more memory-efficient way (e.g. in 4-bit).
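To make "not enough GPU memory" concrete, here is a rough back-of-the-envelope sketch. It only counts the bytes needed to hold the weights, so the real requirement is higher (activations, KV cache, and quantization metadata are ignored). The parameter counts are approximate public figures, and the helper name `weight_memory_gib` is made up for this illustration.

```python
# Weights-only memory estimate. Parameter counts are approximate, and the
# result is a lower bound: activations and the KV cache are not included.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


# Approximate total parameter counts (assumptions for this sketch).
MODELS = {
    "Mixtral-8x7B (~46.7B params)": 46.7e9,
    "Llama-2-7B (~6.7B params)": 6.7e9,
}

for name, n_params in MODELS.items():
    for dtype in ("float32", "bfloat16", "4bit"):
        gib = weight_memory_gib(n_params, dtype)
        print(f"{name:30s} {dtype:9s} {gib:7.1f} GiB")
```

Under these assumptions, Mixtral needs roughly 87 GiB for its weights in bfloat16 and about 22 GiB at 4 bits, which is why it spills onto the CPU on most single-GPU machines.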
I second what @cekal said: you probably don't have enough GPU RAM to fit the model. Try either loading it with lower precision (e.g. float16 or load_in_4bit), or using the serialized 4-bit checkpoint here: https://huggingface.co/ybelkada/Mixtral-8x7B-Instruct-v0.1-bnb-4bit
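For reference, a minimal sketch of what the 4-bit suggestion might look like with transformers. It assumes `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available. The function name `load_quantized` and the default model ID are choices made for this example, not something prescribed in the thread.

```python
def load_quantized(model_id: str = "mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Load a causal LM with 4-bit weights (sketch; needs bitsandbytes + a GPU)."""
    # Imported lazily so that defining this function does not require the GPU stack.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # Store weights in 4 bits, run the matrix multiplications in float16.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the layers
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_quantized()` downloads and quantizes the full checkpoint on the fly. Pointing `model_id` at the pre-quantized repository linked above should avoid the on-the-fly quantization step, though that has not been verified here.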
Hi @ybelkada
Any idea what the minimum system requirements are to run this model (e.g. GPU)? I am trying to run the Python code below using Streamlit, and I get the above error (or rather, warning):
import streamlit as st
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch
token = ""
model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    eos_token_id=tokenizer.eos_token_id,
)
llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})
template = """
You are an intelligent chatbot that gives out useful information to humans.
You return the responses in sentences with arrows at the start of each sentence
{query}
"""
prompt = PromptTemplate(template=template, input_variables=["query"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.invoke('What are the 3 causes of glacier meltdowns?'))
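One way to check whether the warning actually applies to a script like the one above is to inspect where `device_map="auto"` placed each module. The sketch below summarizes a device map of the shape that models loaded with a `device_map` expose as `hf_device_map`. The helper names and the example map are hypothetical and only illustrate the idea.

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> Counter:
    """Count how many modules were placed on each device."""
    return Counter(str(device) for device in device_map.values())


def offloaded_modules(device_map: dict) -> list:
    """Return the names of modules placed on the CPU or on disk."""
    return [
        name
        for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    ]


# Hypothetical map, shaped like `model.hf_device_map` (GPU index, "cpu", or "disk").
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "lm_head": "disk",
}

print(summarize_device_map(example_map))
print(offloaded_modules(example_map))
```

With the script above, the real input would be `pipeline.model.hf_device_map`, assuming the model was loaded with `device_map="auto"`. A non-empty result from `offloaded_modules` means part of the model did not fit on the GPU, and generation will be correspondingly slow.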