# Kanana-2-30B-A3B-Instruct AWQ (W4A16)

An AWQ 4-bit quantized version of kakaocorp/kanana-2-30b-a3b-instruct, produced with compressed-tensors==0.13.0.

## Model Details

| Attribute | Value |
|---|---|
| Base Model | kakaocorp/kanana-2-30b-a3b-instruct |
| Quantization | AWQ (W4A16) |
| Bits | 4-bit weights, 16-bit activations |
| Calibration Dataset | ChuGyouk/Asan-AMC-Healthinfo |
| Quantization Tool | llmcompressor |
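
For a rough sense of why W4A16 matters at this scale, the weight storage alone shrinks about 4x versus BF16. A back-of-the-envelope sketch (illustrative figures only, assuming a group size of 128 for the per-group scales and ignoring unquantized layers and KV cache; not measured checkpoint sizes):

```python
# Back-of-the-envelope weight-memory estimate for a ~30B-parameter model.
# These are illustrative figures, not measured checkpoint sizes.
PARAMS = 30e9
GB = 1e9

bf16_gb = PARAMS * 2 / GB    # 16-bit weights: 2 bytes per parameter
w4_gb = PARAMS * 0.5 / GB    # 4-bit weights: 0.5 bytes per parameter

# Per-group 16-bit scales add a small overhead (assuming group size 128).
scales_gb = PARAMS / 128 * 2 / GB

print(f"BF16 weights : {bf16_gb:.1f} GB")
print(f"W4A16 weights: {w4_gb + scales_gb:.1f} GB")
```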

## Quantization Config

```python
AWQModifier(
    ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
    scheme="W4A16",
    targets=["Linear"],
)
```
- `lm_head`: the output projection layer is excluded from quantization
- `mlp.gate`: the MoE router gates are excluded from quantization
- `shared_expert_gate`: the shared-expert gates are excluded from quantization
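
Entries prefixed with `re:` in the ignore list are treated as regular expressions matched against module names, while plain entries match exactly. A minimal sketch of that matching behavior (the module names are illustrative assumptions, and llmcompressor's exact resolution logic may differ):

```python
import re

# Ignore patterns from the quantization config above.
IGNORE = ["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]

def is_ignored(module_name: str, patterns: list[str] = IGNORE) -> bool:
    """Return True if a module name is excluded from quantization."""
    for p in patterns:
        if p.startswith("re:"):
            # "re:" entries are regexes; the trailing $ anchors the match
            # to the end of the name, so e.g. gate_proj is NOT ignored.
            if re.match(p[3:], module_name):
                return True
        elif module_name == p:
            # Plain entries match the module name exactly.
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate"))             # router gate
print(is_ignored("model.layers.0.mlp.experts.5.gate_proj"))  # expert MLP
```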

## Installation

```bash
pip install compressed-tensors==0.13.0
```

Installing this exact version is recommended for compatibility.

## Usage

### With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="NotoriousH2/kanana-awq-w4a16")
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

# "Please explain the dietary guidelines for patients with hypertension."
prompt = "고혈압 환자의 식이요법에 대해 설명해주세요."
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NotoriousH2/kanana-awq-w4a16",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NotoriousH2/kanana-awq-w4a16")

# "Please explain the dietary guidelines for patients with hypertension."
messages = [{"role": "user", "content": "고혈압 환자의 식이요법에 대해 설명해주세요."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## License

This model inherits the license from the base model. Please refer to kakaocorp/kanana-2-30b-a3b-instruct for license details.
