Unable to load in ooba

#1
by dareposte - opened

Could not get this to load in Transformers, ExLlama, AutoGPTQ, or ExLlamaV2. The model card states otherwise, so I'm curious if I'm doing it wrong or if the card has an error in it. Thanks for the conversion. I know it's a new model; just curious what other people are doing, and whether this has the Llama adaptations applied to it or is just a conversion that keeps the Yi naming convention.

ExLlama: KeyError: 'model.layers.0.input_layernorm.weight' --> Looks due to the Yi model's naming of the layers.
ExLlamaV2: ValueError: ## Could not find model.layers.0.input_layernorm.* in model --> Same problem?
Transformers: raise ImportError( ... is_auto_gptq_available --> AutoGPTQ does not support the Yi model yet?
AutoGPTQ: raise TypeError(f"{config.model_type} isn't supported yet.") --> Same?
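
For anyone hitting the same KeyError, a quick way to see which naming convention a quant actually uses is to list the tensor keys in its safetensors shard. A minimal sketch, assuming a single shard called model.safetensors (the file may be split or named differently in this repo):

from safetensors import safe_open

# List the layer-0 tensor names to check whether the checkpoint uses
# Llama-style keys (input_layernorm, post_attention_layernorm) or Yi-style keys.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    layer0_keys = [k for k in f.keys() if k.startswith("model.layers.0.")]

print("\n".join(sorted(layer0_keys)))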

I believe exllamav2 has support at least; you just have to update it.

Tried this with the latest exllamav2 (git pull, pip install -e .) but still getting the issue. Was able to load the other AWQ model using AutoAWQ, and able to load the GGUF using llama.cpp. Also was able to get ExLlamaV2 to load an EXL2 quant from another user, just not this particular one. I'm sure the issue is on my end; thanks for the advice.
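
For reference, the GGUF route can also be scripted rather than run through the llama.cpp CLI. A minimal sketch using llama-cpp-python, with a placeholder file name and settings (not values from this thread):

from llama_cpp import Llama

# Load a GGUF quant; model_path is a placeholder, and n_ctx is deliberately
# far below the model's 200K maximum so it fits in memory.
llm = Llama(
    model_path="yi-34b-200k.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,   # offload all layers to GPU where possible
)

out = llm("Once upon a time ", max_tokens=128, temperature=0.7)
print(out["choices"][0]["text"])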

Same, have tried everything.

2023-11-11 10:54:10 ERROR:Failed to load the model.
Traceback (most recent call last):
File "C:\Users\xxxx\Deep\text-generation-webui\modules\ui_model_menu.py", line 210, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\modules\models.py", line 85, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\modules\models.py", line 350, in ExLlama_HF_loader
return ExllamaHF.from_pretrained(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\modules\exllama_hf.py", line 174, in from_pretrained
return ExllamaHF(config)
^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\modules\exllama_hf.py", line 31, in init
self.ex_model = ExLlama(self.ex_config)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\installer_files\env\Lib\site-packages\exllama\model.py", line 889, in init
layer = ExLlamaDecoderLayer(self.config, tensors, f"model.layers.{i}", i, sin, cos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\installer_files\env\Lib\site-packages\exllama\model.py", line 520, in init
self.input_layernorm = ExLlamaRMSNorm(self.config, tensors, key + ".input_layernorm.weight")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxxx\Deep\text-generation-webui\installer_files\env\Lib\site-packages\exllama\model.py", line 284, in init
self.weight = tensors[key]
~~~~~~~^^^^^
KeyError: 'model.layers.0.input_layernorm.weight'

@dareposte are you able to check if you can load my Yi-34B-GPTQ, ie the non 200K version, using ExLlama?

It's expected that AutoGPTQ won't be able to load it. I didn't realise that Transformers wouldn't be able to load it, but now that I see the error message I guess that's expected too, though disappointing. Transformers can make GPTQs that AutoGPTQ can't make, so I thought it could load them too. I guess not.

But I've been told that ExLlama added specific support for Yi, so I'm surprised that ExLlama can't load it. I actually made this GPTQ differently to the non-200K version: I made it with AutoGPTQ, using an AutoGPTQ PR which adds Yi support. The non-200K was made with Transformers, and I was told that was working with ExLlama after turboderp added support.

Let me know if the non-200K works and if necessary I'll re-make these 200K GPTQs with Transformers, like I made the non-200K ones. And/or get in touch with turboderp about it.
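
For context, a GPTQ can also be produced through Transformers' built-in GPTQ integration rather than by calling AutoGPTQ directly. A minimal sketch, where the model ID, bit width, and calibration dataset are illustrative assumptions, not the exact recipe used for these quants (requires optimum and auto-gptq installed):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "01-ai/Yi-34B-200K"   # hypothetical source model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# 4-bit GPTQ with a standard calibration dataset; settings are placeholders
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
    trust_remote_code=True,
)

model.save_pretrained("Yi-34B-200K-GPTQ")      # hypothetical output directory
tokenizer.save_pretrained("Yi-34B-200K-GPTQ")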

Yep, still no luck on my end for this. I cloned the ExLlamaV2 repo directly into a fresh venv, and it did not load the model there either. (Edit - this actually does work, it was an issue with the venv - confirmed below.)

AutoAWQ was able to load the AWQ quant model, but it outputs garbage, so that's a dead end as well. (Note: this does work fine in AutoAWQ directly, but outputs garbage via the AutoAWQ loader in ooba. The problem is in ooba, not the model.)
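
For reference, a minimal sketch of loading an AWQ quant with AutoAWQ directly, outside ooba; the local path is a placeholder:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "/path/to/Yi-34B-200K-AWQ"   # placeholder path to the downloaded quant

# Load the quantised model and its tokenizer with AutoAWQ / Transformers
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

tokens = tokenizer("Once upon a time ", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))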


I'm pulling it down now, will report back shortly. I've been successfully using the 4.65bpw EXL2 quant uploaded by LoneStriker in ExLlamaV2, but I'm finding ExLlamaV2 is just extremely buggy right now.

Did you successfully load and test this quant in any software, or just script it in? I'm not tied to any particular implementation, but I'm looking for something less buggy than ExLlamaV2 currently is. It runs great for about 10-15 minutes, then slows to a crawl for me, even with an empty context. I don't usually use Ooba much, but it's a bit quicker to iterate prompts and model types in there when it's working.

AWQ version works fine for me via Python code:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "/workspace/process/01-ai_yi-34b-200k/awq/main/"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True, trust_remote_code=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "The meaning of life is"
prompt_template=f'''{prompt}'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

Output:

 to find your gift. The purpose of life is to give it away.” ~Pablo Picasso
I’m really starting to understand what this quote means. For the past few months, I have been working with a young man who has a form of Autism and suffers from many learning disabilities. He has been my student since last year, but he really hasn’t opened up until recently. I was frustrated at first because I couldn’t get him to talk about his feelings or anything else. Now that he trusts me enough, he tells me when he doesn’t like something and why. It helps us communicate so much better. We’ve also bonded on another level: one where we can joke around and be silly together. I think that’s the best part of all! 🙂
Now you may ask how this relates to Picasso’s quote? Well, I feel like I found my calling in life as a teacher. I know that teaching isn’t for everyone, but I absolutely love it! I love being able to help students learn new things and challenge them everyday. I love the fact that each day is different and there are no two days alike. I love making a difference in the lives of others. I could go on forever!
My passion in life is to make sure that my kids succeed and reach their full potential. I want them to realize that they can do anything they put their minds to. I want them to believe in themselves and never let anyone tell them otherwise. Most importantly, I want them to find happiness within themselves.
This is why I chose to become an educator: because I wanted to change lives and inspire greatness. My goal every single day is to teach with passion and compassion; two qualities which are not always easy to come by these days…but definitely worth fighting for!
In conclusion, here are some tips for finding your true passion in life:
What is Your Life Purpose
“I am convinced that life is 10% what happens to me and 90% how I react to it.” – Charles R. Swindoll
Have you ever felt lost or confused about what you should do next in life? Do you feel like you don’t know what your purpose is? If so, then you are not alone. Many people struggle with these same questions. However, if we look closely at our own experiences, we will see that there are certain things that happen over and over again. These patterns are called “recurring

GPTQ version also works fine for me from Transformers, using the released version of AutoGPTQ 0.5.1, which does not have the PR. So I was right the first time: Transformers can load GPTQs without AutoGPTQ having specific support. I don't know why it's not working for you, but it works fine for me:

Test code:

import argparse
parser = argparse.ArgumentParser(description='Process and upload quantisations')
parser.add_argument('model_dir', type=str, help='model dir')
args = parser.parse_args()

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = args.model_dir
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True, trust_remote_code=True)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])

Output:

*** Generate:
Tell me about AI and Machine Learning in the context of security?
Machine learning is a subset of Artificial Intelligence. It is an algorithm that learns from the data provided to it and makes decisions. For example, if I give you a data set of 1000 images of cats and 1000 images of dogs, and tell you to classify them as cats or dogs, you would look at the images, understand what a cat looks like and what a dog looks like and then classify them. If I give you 10,000 images of cats and dogs, it would be difficult for you to classify them. But, a machine learning algorithm would be able to look at the images and tell you which one is a cat and which one is a dog. That’s the power of machine learning.
In the context of security, machine learning can be used for anomaly detection, intrusion detection, and malware detection. Machine learning algorithms are trained on a large amount of data and then used to detect anomalies or intrusions in the system.
Can you elaborate on anomaly detection?
Anomaly detection is the process of identifying outliers or anomalies in data. In the context of security, anomaly detection can be used to identify unusual or unexpected activity on a network or system. For example, if there is a sudden spike in network traffic or a large number of failed login attempts, this could be an indication of an attack. Machine learning algorithms can be used to analyze network traffic and identify patterns of anomalous behavior.
What are the other use cases of machine learning in the security context?
Machine learning can also be used for intrusion detection and malware detection. In intrusion detection, machine learning algorithms can be used to analyze network traffic and identify patterns of malicious activity. For example, if there is a sudden spike in network traffic or a large number of failed login attempts, this could be an indication of an attack. Machine learning algorithms can be used to analyze network traffic and identify patterns of anomalous behavior.
In malware detection, machine learning algorithms can be used to identify unknown or previously unseen malware. Machine learning algorithms can be trained on a large number of known malware samples and then used to identify new or unknown malware samples.
What are the different types of machine learning models?
There are three main types of machine learning models: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning is a type of machine learning in which the algorithm is trained on a set of labeled data. In supervised learning, the algorithm is given

So as far as I can tell, this GPTQ and the AWQ are completely fine via Transformers, confirming they're structurally OK. I've not tested ExLlama yet; I'll try to later.

If you continue to have problems, wait a few hours for me to complete the "Llamafied" version of Yi 34B 200K, which should work automatically everywhere.
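
For anyone curious what "Llamafied" means in practice: the Yi checkpoints name their layer norms differently from Llama, and llamafication is essentially a key-renaming pass plus a config change. A rough sketch, assuming the commonly reported ln1/ln2 naming in the original Yi weights (verify against the actual checkpoint before relying on this):

import re
import torch

# Load an unquantised Yi shard; the file name is a placeholder.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

renamed = {}
for key, tensor in state_dict.items():
    # Map the assumed Yi layer-norm names to the Llama convention
    new_key = re.sub(r"\.ln1\.", ".input_layernorm.", key)
    new_key = re.sub(r"\.ln2\.", ".post_attention_layernorm.", new_key)
    renamed[new_key] = tensor

torch.save(renamed, "pytorch_model_llamafied.bin")   # placeholder output name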

@TheBloke - False alarm, they're both working for me now in base exllamav2 with a fresh venv. Thanks again for the conversion; it must have been something in my venv being mismatched from all the trials.

For others -- still not able to get them working in Ooba for unknown reasons, but loading in the latest exllamav2 does in fact work with this quant. Below is proof on a pair of A6000s, where it loaded the full context.

(venv) xxx@cn1:~/AI/api_llm/exllamav2$ python3 test_inference.py -m /home/xxx/AI/models/Yi-34B-200K-GPTQ -p "Once upon a time " -gs auto
-- Model: /home/xxx/AI/models/Yi-34B-200K-GPTQ
-- Options: ['gpu_split: auto', 'rope_scale 1.0', 'rope_alpha 1.0']
-- Loading tokenizer...
-- Loading model...
-- Warmup...
-- Generating...

Once upon a time there was a king. He had 130 sons and one daughter. The princess loved her father very much, but hated her brothers for they were cruel to their poor sister. They used to make fun of the girl and say: “You are so ugly that you look like an old witch!”
One day the princess asked her father to let her leave his palace because she wanted to marry. And he consented. Soon after this the young lady came back home again. She told everybody how badly her husband treated her in fact he didn’t love her at all. The king got angry with the son-in-

-- Response generated in 4.12 seconds, 128 tokens, 31.06 tokens/second (includes prompt eval.)
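
For anyone who would rather script it than use test_inference.py, a minimal ExLlamaV2 generation sketch along the same lines; the model directory is a placeholder and the API reflects ExLlamaV2 as of late 2023:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/Yi-34B-200K-GPTQ"   # placeholder path
config.prepare()
config.max_seq_len = 4096    # keep well below the 200K maximum to fit in VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)          # split across available GPUs, like -gs auto

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.95

print(generator.generate_simple("Once upon a time ", settings, num_tokens=128))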

@TheBloke - Thanks for the confirmation, it looks like the problems are all related to attempting to use ooba - now I remember why I don't use it much.

I'm switching back to python / transformers for the moment.

Appreciate your help and responses; your ongoing contribution to the community is valued by all.

I got the Yi-34B model working with both the latest textgen webui and the latest exllamav2: https://huggingface.co/01-ai/Yi-34B/discussions/22

@dareposte

You should use the instructions here for installing exllamav2: https://github.com/turboderp/exllamav2#installation

python setup.py install --user

@vdruts

You were using exllama, but Yi requires exllamav2, I believe.
