SmolLM2-360M-Instruct GLQ 4bpw

SmolLM2-360M-Instruct quantized using GLQ (Golay-Leech Quantization).

Note on effective bpw: This model was quantized with power-of-2 FHT padding. The 4bpw label refers to the codebook information content; effective storage is ~6.4 bpw because dimensions are padded up to a power of two (for example, hidden_size=960 is padded to 1024). The quality benchmarks below should be read against this effective rate.
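The padding arithmetic in the note above can be sketched for a single dimension. This is only an illustration: the helper names `next_power_of_two` and `padded_bpw` are hypothetical and not part of the glq package, and the sketch does not attempt to reproduce the ~6.4 bpw figure, which depends on how GLQ pads every weight matrix in the model.

```python
from typing import Iterable


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def padded_bpw(nominal_bpw: float, dims: Iterable[int]) -> float:
    """Scale a nominal bpw by the padding overhead of each padded dimension.

    Assumption: storage grows in proportion to padded_size / original_size
    for every dimension listed in `dims`.
    """
    overhead = 1.0
    for d in dims:
        overhead *= next_power_of_two(d) / d
    return nominal_bpw * overhead


hidden_size = 960  # value quoted in the note above
padded = next_power_of_two(hidden_size)
print(padded)                # 1024
print(padded / hidden_size)  # overhead factor for this one dimension
```

Padding hidden_size alone accounts for only part of the overhead; other padded dimensions contribute as well.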

Quality (lm-eval 5-task)

Method          avg      % of bf16
bf16            0.5572   100%
GLQ 4bpw        0.5548   99.6%
GPTQ W4 (g64)   0.4855   87.2%

Tasks: arc_easy, hellaswag, piqa, winogrande, lambada_openai
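A 5-task average like the one in the table could be computed along the following lines with the lm-eval Python API. This is a rough sketch, not the author's evaluation script: the metric key (`acc,none`), the unweighted mean, the default few-shot settings, and the helper names `average_accuracy` and `evaluate_glq` are all assumptions.

```python
from statistics import mean

TASKS = ["arc_easy", "hellaswag", "piqa", "winogrande", "lambada_openai"]


def average_accuracy(results: dict, tasks=TASKS, metric: str = "acc,none") -> float:
    """Unweighted mean of one metric over the listed tasks.

    Assumption: the card averages plain accuracy; it does not say which
    metric (acc or acc_norm) was used.
    """
    return mean(results["results"][task][metric] for task in tasks)


def evaluate_glq(repo_id: str = "xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw") -> float:
    """Run the five tasks through lm-eval and return the average accuracy."""
    # Imported lazily so average_accuracy works without these packages installed.
    import glq.hf_integration  # noqa: F401  (imported before loading, as in Usage)
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={repo_id},dtype=float16",
        tasks=TASKS,
    )
    return average_accuracy(results)
```

Calling `evaluate_glq()` downloads the model and the task datasets, so it needs a GPU and network access. The resulting number may differ from the table if the original run used other settings.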

Usage

pip install glq

# Imported for its side effects; nothing from this module is referenced below.
import glq.hf_integration
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw",
    device_map="cuda",
    dtype="float16",
)
tokenizer = AutoTokenizer.from_pretrained("xv0y5ncu/SmolLM2-360M-Instruct-GLQ-4bpw")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Requirements

  • transformers >= 5.0
  • torch >= 2.0
  • glq >= 0.2.8 (pip install glq)

License

Apache 2.0
