Aria-sequential_mlp-bnb_nf4

BitsAndBytes NF4 quantization of Aria-sequential_mlp. It requires about 15.5 GB of VRAM and runs on an RTX 3090, and, less practically, on an RTX 4060 Ti 16 GB (only without device_map="auto"). The model is currently not sharded into 5 GB files, because sharding seems to cause problems when loading serialized BNB models; this might make it impossible to load in the free tier of Google Colab.
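
For reference, loading "without device_map=auto" on a single 16 GB card can look like the sketch below, which pins every module to GPU 0 (an assumption about the setup; the full example in the Inference section uses device_map="auto" instead):

import torch
from transformers import AutoModelForCausalLM

# Pin the whole quantized model to GPU 0 instead of letting accelerate auto-dispatch it
model = AutoModelForCausalLM.from_pretrained(
    "thwin27/Aria-sequential_mlp-bnb_nf4",
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)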

Installation

pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow bitsandbytes
pip install flash-attn --no-build-isolation
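
Optionally, a quick sanity check (not part of the original instructions, just a convenience) that CUDA is visible and bitsandbytes imports cleanly:

import torch
import bitsandbytes as bnb

# CUDA must be available for NF4 inference; both prints should succeed without errors
print("torch", torch.__version__, "cuda available:", torch.cuda.is_available())
print("bitsandbytes", bnb.__version__)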

Inference

Run this model with:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
torch.cuda.set_device(0)

model_id_or_path = "thwin27/Aria-sequential_mlp-bnb_nf4"

model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"

image = Image.open(requests.get(image_path, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

# Build the prompt from the chat template, then match the model's dtype and device
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate under inference mode with bf16 autocast; stop at the chat end-of-turn token
with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    # Decode only the newly generated tokens, skipping the prompt
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)
print(f'Max allocated memory: {torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3:.3f}GiB')
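
To print tokens as they are generated instead of waiting for the full completion, generate() also accepts a streamer. The snippet below is an optional variation that reuses the model, processor and inputs prepared above:

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        streamer=streamer,
    )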

Quantization

The quantized model was created with:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "rhymes-ai/Aria-sequential_mlp"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
    llm_int8_skip_modules=["language_model.lm_head", "multi_modal_projector", "vision_tower"],
)

model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # required for the custom Aria architecture
    device_map="auto",
)
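
To serialize the result without the default 5 GB sharding mentioned above, a large max_shard_size can be passed when saving. The output path and shard size below are placeholders, not necessarily the values used for this repository:

# Save the quantized checkpoint as a single file by raising the shard size
# (5 GB sharding reportedly caused problems when reloading serialized BNB models)
model_nf4.save_pretrained("Aria-sequential_mlp-bnb_nf4", max_shard_size="40GB")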