RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.


Hi, I got the RuntimeError shown below while building a chatbot with "TheBloke/Falcon-7B-Instruct-GPTQ".

The error message:

RuntimeError Traceback (most recent call last)
in <cell line: 6>()
10 break
11 print(f"{yellow}" + query)
---> 12 llm_response = qa_chain({'question': query, 'chat_history':chat_history})
13 print(process_llm_response(llm_response['answer']))
14 chat_history.append((query, llm_response['answer']))

45 frames
~/.cache/huggingface/modules/transformers_modules/TheBloke/Falcon-7B-Instruct-GPTQ/d6ce55f4e840bbbd596d1a65f64888f0a3c3326b/modelling_RW.py in forward(self, hidden_states, alibi, attention_mask, layer_past, head_mask, use_cache, output_attentions)
277 value_layer_ = value_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
278
--> 279 attn_output = F.scaled_dot_product_attention(
280 query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
281 )

RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
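If I read the traceback right, query_layer_ and key_layer_ are float32 while value_layer_ is float16 at the point where F.scaled_dot_product_attention is called. The same error can be reproduced outside the model with plain tensors (just to illustrate what the message means; the shapes here are made up):

import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 4, 8)                       # float32
k = torch.randn(1, 1, 4, 8)                       # float32
v = torch.randn(1, 1, 4, 8, dtype=torch.float16)  # float16
# Raises: RuntimeError: Expected query, key, and value to have the same dtype ...
F.scaled_dot_product_attention(q, k, v)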

The relevant code:

# imports used below (transformers + langchain)
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging
from langchain.llms import HuggingFacePipeline
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

model_name_or_path = "TheBloke/Falcon-7B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=True,
    revision="main",
)

logging.set_verbosity(logging.CRITICAL)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,           # True gives more creative, varied outputs; False gives more predictable, coherent outputs
    top_p=0.95,               # lower value = more predictable, less random output; higher value = more randomness and creativity
    repetition_penalty=1.15,  # discourages the model from repeating the same text when generating (1 means no penalty)
    return_full_text=True,
    temperature=0.6,
)

llm = HuggingFacePipeline(pipeline=pipe)
memory = ConversationBufferMemory(
    memory_key="chat_history",
    k=3,  # how many previous messages to keep in memory
    return_messages=True,
    input_key='question',
    output_key='answer',
)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    db.as_retriever(search_kwargs={'k': 3}),
    return_source_documents=True,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": prompt},  # pass the custom prompt into the qa_chain
    rephrase_question=True,  # True rephrases the question for clarity, which should improve answer quality
    output_key='answer',
)

yellow = "\033[0;33m"
green = "\033[0;32m"
red = "\033[0;31m"

chat_history = []
while True:
    query = input('Prompt: ')
    if query == "exit" or query == "quit" or query == "q":
        print(f"{red}Exiting")
        break
    print(f"{yellow}" + query)
    llm_response = qa_chain({'question': query, 'chat_history': chat_history})
    print(process_llm_response(llm_response['answer']))
    chat_history.append((query, llm_response['answer']))
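From what I can tell, some part of the GPTQ checkpoint ends up in half precision while the query/key path stays in float32, but I am not sure why. Two things I am thinking of trying, in case they are relevant. First, listing which parameter dtypes are actually present in the loaded model:

import torch
# Print the distinct parameter dtypes of the loaded model.
print({p.dtype for p in model.parameters()})

Second, would forcing a single dtype at load time be the right direction? A rough, untested sketch of what I mean (torch_dtype is a standard from_pretrained argument, but I do not know whether it interacts correctly with the GPTQ weights):

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=True,
    revision="main",
    torch_dtype=torch.float16,  # guess: keep weights and activations in half precision
)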

I would appreciate it if someone could enlighten me on this issue.
