dolphin-2.9.4-llama3.1-8b?

#1
by vaclavkosar - opened

Would it be possible to quantize cognitivecomputations/dolphin-2.9.4-llama3.1-8b? I wanted to do that myself, but I am getting an error, perhaps due to my setup. I created an issue for that.

SolidRusT Networks org

Sure, I will do it now.
Thank you.

SolidRusT Networks org

I can't seem to do any AWQ quants anymore.
I also get the same error with the llm-quantkit Python package.

SolidRusT Networks org
edited Aug 15
pip install --upgrade llm-quantkit[cuda]
quantkit awq cognitivecomputations/dolphin-2.9.4-llama3.1-8b -out dolphin-2.9.4-llama3.1-8b-AWQ

Even the simplest example is not working.

SolidRusT Networks org
edited Aug 15

It seems this model needs transformers>=4.44.0.dev0, while the AutoAWQ library wants 4.35 or something like that.
I will try downgrading the transformers version to see if that works.

SolidRusT Networks org
edited Aug 15

OK, I've changed the autoawq-kernels and am rebuilding the wheels for it; maybe I can get this working after all.
Basically, both of them (the AWQ and kernels repos) pin the PyTorch version to exactly 2.3.1, and we need 2.4.0 to work with the new transformers.

SolidRusT Networks org

(screenshot attached)
Let's go!

SolidRusT Networks org

Completed: https://huggingface.co/solidrust/dolphin-2.9.4-llama3.1-8b-AWQ

  • Thank you for your encouragement; this problem pissed me off and I had given up on it.
Suparious changed discussion status to closed

You are the best! Thank you.

These Minitron models that distill Llama 3.1 8B into 4B look to be working. One fine-tune is: Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05

SolidRusT Networks org
edited Aug 24

Getting a weird error with that one:

in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 3072]) in "weight" (which has shape torch.Size([768, 3072])), this looks incorrect.

I will need to debug it. Here is an example of how to reproduce the error (feeling too lazy to fix my repo, so using quantkit today):

# Pinned requirements for this repro: transformers==4.42.3, llm-quantkit[cuda]
import json
import os
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoConfig
from huggingface_hub import snapshot_download, create_repo, upload_folder

model_path = 'Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05'
quant_path = 'Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
quanter = 'suparious'
quant_org = 'solidrust'

# Download the model to a local directory
local_model_path = snapshot_download(model_path)

# Load the model configuration file from the local directory
config_file = os.path.join(local_model_path, "config.json")

with open(config_file, "r") as f:
    config_dict = json.load(f)

# Modify the rope_scaling dictionary to include only the required fields
if 'rope_scaling' in config_dict:
    rope_scaling = config_dict['rope_scaling']
    if 'type' in rope_scaling and 'factor' in rope_scaling:
        # Ensure the type is one of the valid values
        if rope_scaling['type'] not in ['linear', 'dynamic']:
            rope_scaling['type'] = 'linear'  # Set to a default valid value
        # Ensure the factor is a float greater than 1
        if not isinstance(rope_scaling['factor'], float) or rope_scaling['factor'] <= 1.0:
            rope_scaling['factor'] = 2.0  # Set to a default valid value
        config_dict['rope_scaling'] = {'type': rope_scaling['type'], 'factor': rope_scaling['factor']}
    else:
        # If 'type' or 'factor' is missing, set default values
        config_dict['rope_scaling'] = {'type': 'linear', 'factor': 2.0}

# Save the modified configuration file
with open(config_file, "w") as f:
    json.dump(config_dict, f, indent=2)

# Load the model with the modified configuration
model = AutoAWQForCausalLM.from_pretrained(
    local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

# Upload the quantized model to the Hugging Face Hub
create_repo(repo_id=f"{quant_org}/{quant_path}")
upload_folder(
    folder_path=quant_path,
    repo_id=f"{quant_org}/{quant_path}",
)

print(f'Quantized model uploaded to "{quant_org}/{quant_path}"')

SolidRusT Networks org

I had locked the transformers version to work around a Llama 3.1 bug, but maybe the latest library works / is required for this model?

Nope, the latest transformers doesn't stop AWQ from choking on these rope scaling methods that people are using to extend LLM context windows.

python quantize.py
Fetching 15 files: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 156503.88it/s]
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards:   0%|                                                                                      | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/smd/quantize.py", line 42, in <module>
    model = AutoAWQForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/awq/models/auto.py", line 71, in from_pretrained
    return AWQ_CAUSAL_LM_MODEL_MAP[model_type].from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/awq/models/base.py", line 380, in from_pretrained
    model = target_cls.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3960, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4434, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 3072]) in "weight" (which has shape torch.Size([768, 3072])), this looks incorrect.
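
A quick way to see exactly where the checkpoint and the model disagree is to dump the attention projection shapes stored in the shards and compare them against what the config implies for a Llama-style attention block. This is just a debugging sketch, not part of the quant pipeline; the local path is a placeholder for wherever snapshot_download put the files.

# Debugging sketch: compare checkpoint tensor shapes against the config.
import glob
import json
import os

from safetensors import safe_open

local_model_path = "/path/to/local/snapshot"  # placeholder

with open(os.path.join(local_model_path, "config.json")) as f:
    cfg = json.load(f)

hidden = cfg["hidden_size"]
heads = cfg["num_attention_heads"]
kv_heads = cfg.get("num_key_value_heads", heads)
head_dim = cfg.get("head_dim", hidden // heads)

print("expected q_proj weight:", (heads * head_dim, hidden))
print("expected k/v_proj weight:", (kv_heads * head_dim, hidden))

# Shapes actually stored in the safetensors shards.
for shard in sorted(glob.glob(os.path.join(local_model_path, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if any(p in name for p in ("q_proj", "k_proj", "v_proj")):
                print(name, tuple(f.get_tensor(name).shape))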

SolidRusT Networks org

I stopped doing AWQ quants because I would need to really learn the AutoAWQ library to keep up with all the weird stuff the community does to the foundation models.
I'll try my best to debug this one.

SolidRusT Networks org
edited Aug 24

OK, I managed to handle this in such a shitty and miserable way....

import transformers

def override_rope_embeddings():
    # rotate_half is needed by the replacement; the stock apply_rotary_pos_emb
    # is what gets monkey-patched below.
    from transformers.models.llama.modeling_llama import rotate_half

    def custom_apply_rotary_pos_emb(q, k, cos, sin):
        # Truncate q/k and the cos/sin caches to a common last dimension.
        min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
        q = q[..., :min_dim]
        k = k[..., :min_dim]
        cos = cos[..., :min_dim]
        sin = sin[..., :min_dim]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

    # Override the module-level function that the Llama attention layers call
    transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb

This is so stupid....

The attention layers in this model are transitioning from computing the RoPE embeddings internally through position_ids (2D tensor with the indexes of the tokens), to using externally computed position_embeddings (Tuple of tensors, containing cos and sin). In transformers v4.45 position_ids will be removed and position_embeddings will be mandatory.

It is quantizing now...

SolidRusT Networks org
edited Aug 24

OK, @vaclavkosar - thank you for the Saturday morning algebra challenge.

solidrust/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ is ready; not sure if it will work properly or not, I did a lot of messing around this time.

Trying now!

I have to say that solidrust/Meta-Llama-3.1-8B-Instruct-abliterated-AWQ was the best model so far. Maybe because the Llama fine-tuning is exceptional and abliteration just adds the free-range talk back in.

It failed with:

/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    360     if value is not None:
    361         if old_value.shape != value.shape:
--> 362             raise ValueError(
    363                 f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this look incorrect.'
    364             )

ValueError: Trying to set a tensor of shape torch.Size([3072, 96]) in "qweight" (which has shape torch.Size([3072, 128])), this look incorrect.

I think quant config needs to be added like: https://huggingface.co/solidrust/Starling-LM-7B-beta-AWQ/blob/main/quant_config.json
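
Just to illustrate, something like this would show what that file contains (a small sketch using hf_hub_download; I haven't checked whether AutoAWQ itself reads it):

# Sketch: fetch and print the referenced quant_config.json for comparison.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="solidrust/Starling-LM-7B-beta-AWQ", filename="quant_config.json")
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))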

SolidRusT Networks org

Yeah, these models have this issue.

Here is the script that I used, but it is not great, as the quantized model seems shit afterwards...

import json
import os
import torch
import transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoConfig
from huggingface_hub import snapshot_download, create_repo, upload_folder

model_path = 'Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05'
quant_path = 'Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
quant_org = 'solidrust'

def download_model(model_path):
    try:
        return snapshot_download(model_path)
    except Exception as e:
        print(f"Error downloading model: {e}")
        raise

def sanitize_rope_scaling(config_dict):
    if 'rope_scaling' in config_dict:
        rope_scaling = config_dict['rope_scaling']
        if isinstance(rope_scaling, dict):
            valid_keys = ['type', 'factor']
            rope_scaling = {k: v for k, v in rope_scaling.items() if k in valid_keys}
            if rope_scaling.get('type') not in ['linear', 'dynamic']:
                print(f"Invalid 'type' in rope_scaling. Setting to 'linear'.")
                rope_scaling['type'] = 'linear'
            if not isinstance(rope_scaling.get('factor'), float) or rope_scaling['factor'] <= 1.0:
                print(f"Invalid 'factor' in rope_scaling. Setting to 2.0.")
                rope_scaling['factor'] = 2.0
        else:
            print("Unexpected format for 'rope_scaling'. Removing it.")
            del config_dict['rope_scaling']
    return config_dict

def override_rope_embeddings():
    from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, rotate_half

    def custom_apply_rotary_pos_emb(q, k, cos, sin):
        # Truncate or pad the tensors to match dimensions
        min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
        q = q[..., :min_dim]
        k = k[..., :min_dim]
        cos = cos[..., :min_dim]
        sin = sin[..., :min_dim]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

    # Override the function within transformers
    transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb

def load_model(local_model_path):
    try:
        config_file = os.path.join(local_model_path, "config.json")
        with open(config_file, "r") as f:
            config_dict = json.load(f)
        
        config_dict = sanitize_rope_scaling(config_dict)
        
        with open(config_file, "w") as f:
            json.dump(config_dict, f, indent=2)
        
        # Load model with to_empty to avoid copying from meta tensors
        model = AutoAWQForCausalLM.from_pretrained(
            local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}, ignore_mismatched_sizes=True
        )
        model.to_empty(device=torch.device("cuda"))
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

def quantize_model(model, tokenizer):
    try:
        model.quantize(tokenizer, quant_config=quant_config)
    except ValueError as ve:
        print(f"Quantization Error: {ve}")
        raise

def upload_to_hf(quant_path, quant_org):
    try:
        create_repo(repo_id=f"{quant_org}/{quant_path}", exist_ok=True)
        upload_folder(
            folder_path=quant_path,
            repo_id=f"{quant_org}/{quant_path}",
        )
        print(f'Quantized model uploaded to "{quant_org}/{quant_path}"')
    except Exception as e:
        print(f"Error uploading to Hugging Face: {e}")
        raise

def main():
    local_model_path = download_model(model_path)
    
    # Override RoPE embedding calculations
    override_rope_embeddings()

    model = load_model(local_model_path)
    tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

    quantize_model(model, tokenizer)
    
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

    print(f'Model is quantized and saved at "{quant_path}"')
    upload_to_hf(quant_path, quant_org)

if __name__ == "__main__":
    main()

SolidRusT Networks org
edited Aug 24

I was told by Casper Hansen that quant_config.json is no longer supported, as he is adding this JSON block into the model's native config.json. I don't like this approach and prefer to use quant_config.json in order to avoid molesting the native model's config JSON. Also, some tools and apps still look for this file. So my quant process usually adds this file, but today we used the quantkit project, which seems to have been abandoned but simplifies what I wanted to do with my srt-model-quantizing repo. I might pick up that project and deprecate my version.

So that is the reason why my AWQ quants typically always have the quant_config.json, despite @casperhansen's advice.
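
For reference, the extra step my process adds is basically just this, right after save_quantized(); a minimal sketch reusing the quant_config and quant_path from the script above:

# Minimal sketch: persist the AWQ settings as a standalone quant_config.json
# next to the quantized model, so tools that still expect this file keep working.
import json
import os

quant_path = "Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

with open(os.path.join(quant_path, "quant_config.json"), "w") as f:
    json.dump(quant_config, f, indent=2)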

SolidRusT Networks org
edited Aug 24

Also, I'm super pissed off about AutoAWQ requiring torch==2.3.1, which is such a shitty torch version, with most of the issues fixed in 2.4.
I always have to build my own AutoAWQ, and the kernels, to support the latest transformers and torch. Such a useless waste of my time.
I made a fork of AutoAWQ that just always uses the latest versions of everything, but sometimes it is unstable, so that is the current state of this issue.

I see. Well, these things probably will get resolved later by the package authors.

This particular fine-tune is probably not that good. But in general, these Llama Minitron model fine-tunes will probably be used, since they are efficient distillations.

Hey, can you share your autoawq repository?

SolidRusT Networks org

It is currently https://github.com/SolidRusT/srt-model-quantizing but I haven't worked on it in a while.
This PyPI package may be easier to use: https://pypi.org/project/llm-quantkit/

Thanks.

In general, I must tell you, pip dependencies are horrible. I just wanted to run an old, unrelated notebook, and it wouldn't start even when I pinned the original versions of the main packages. I would have to pin down every single one, for example with a Poetry lock file.

SolidRusT Networks org

I will set up a Docker image with all the Python stuff sorted out, to help users with this exact issue. However, I was able to get the https://github.com/SolidRusT/srt-model-quantizing/awq repo to a stable state. I literally worked on this all day today.
I still can't quant on my 12 GB GPU, which is how I had previously done over 500 AWQ quants. The reason I stopped doing them is that the memory management in AutoAWQ is nonexistent / incomplete, and I haven't figured out how to solve it. I even connected with Casper Hansen on it.

But I can rent an NVIDIA A10G machine from Amazon, and the quant works fine on that 24 GB GPU.
I also got the new Llama 3.1 models to quant there, using my AWQ repo.

That sounds painful... But is it solvable on the Python side, with the new 3.12? That memory thing.

SolidRusT Networks org

Unfortunately, this seems to be a compounded issue with AutoAWQ, Llama 3.1 rope_scaling hackery, and then GPU VRAM.
There is no way to use Python 3.12 here. I tried for weeks, and there is just too much work to figure out by myself, so I had to take a break.

The problem with people using rope scaling methods to extend the shitty context limitation of Llama 3.1 (8192 tokens) is that, to quantize for AWQ, you need to ensure your tensors are on a single device (a single CPU or a single GPU), and there is no conceivable way to distribute them. I really detest this methodology of increasing model context windows, for this reason.

So now I have my repo exclusively using a single device for tensors, which solves for the shitty Llama 3.1 rope scaling, but this disables multi-GPU, partial CPU offload, and other memory management techniques. And I can't even quant an 8B model on my 12 GB GPU, of which I have over 12, even though I intended to automate AWQ quants with this hardware investment.
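
For context, "single device" here means something like this when loading for quantization; a rough sketch using plain transformers (AutoAWQ's from_pretrained may or may not forward these kwargs the same way):

# Rough sketch of the single-device constraint: force every tensor onto one GPU
# instead of letting accelerate shard the model across devices or offload to CPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map={"": 0},  # everything on cuda:0; no multi-GPU, no CPU offload
)
tokenizer = AutoTokenizer.from_pretrained(model_path)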

But now that it is all memory fucked, I have to just release my pipeline and help people make their own AWQ.

I can't do more than 24 GB in AWS. I can do multi-GPU, but this is now disabled in AutoAWQ.

I am sincerely considering making a fork of it, and using Claude 3.5 Sonnet to fix AutoAWQ automagically for us.
Maybe tomorrow; I am exhausted from today's refactoring.

I got 98% code coverage and all unit tests passing now in my repo.

Great work!
But... let's say we delete rope scaling from config.json, quantize, and put it back after quantization (roughly as sketched below); will that be a problem? But I don't know, their rope scaling type is different than usual, isn't it?
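
Something like this is what I mean; just a sketch of the idea, where the quantize step in the middle is whatever AWQ script is already being used:

# Sketch of the workaround: strip rope_scaling from config.json before quantizing,
# then put the original value back into the quantized model's config afterwards.
import json
import os
import shutil

from huggingface_hub import snapshot_download

model_path = "Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05"
quant_path = "Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ"

local_model_path = snapshot_download(model_path)
config_file = os.path.join(local_model_path, "config.json")

# 1. Back up the original config and drop rope_scaling.
shutil.copy(config_file, config_file + ".orig")
with open(config_file) as f:
    config = json.load(f)
original_rope_scaling = config.pop("rope_scaling", None)
with open(config_file, "w") as f:
    json.dump(config, f, indent=2)

# 2. Quantize local_model_path with the existing AWQ script / quantkit here ...

# 3. Restore rope_scaling in the quantized model's config so inference engines
#    still see the extended-context settings.
quant_config_file = os.path.join(quant_path, "config.json")
if original_rope_scaling is not None and os.path.exists(quant_config_file):
    with open(quant_config_file) as f:
        quant_cfg = json.load(f)
    quant_cfg["rope_scaling"] = original_rope_scaling
    with open(quant_config_file, "w") as f:
        json.dump(quant_cfg, f, indent=2)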

SolidRusT Networks org

Trying this now, seeing this message, but it might still work... we'll see:

The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be removed and `position_embeddings` will be mandatory.

SolidRusT Networks org

Genius idea, seems to work:
solidrust/Hermes-3-Llama-3.1-8B-lorablated-AWQ
Let me play more with my script and get it stable.

That's unexpected
