SmolLM2-135M-Instruct GLQ 4bpw

SmolLM2-135M-Instruct quantized using GLQ (Golay-Leech Quantization).

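Note on effective bpw: This model was quantized with power-of-2 FHT padding, so the nominal 4 bpw understates the actual storage cost. Effective storage is ~6.4 bpw because weight dimensions are padded up to the next power of two (for example, hidden_size=576 is padded to 1024).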
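The padding overhead can be illustrated with a little arithmetic. The sketch below is not taken from the glq package: the helper names are hypothetical, and it assumes the stored size of a weight matrix scales with its padded element count. It shows how a single padded dimension inflates the nominal rate; the ~6.4 bpw quoted above is an aggregate over all layers, so it will not match any one matrix exactly.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def effective_bpw(nominal_bpw: float, rows: int, cols: int,
                  pad_rows: bool = True, pad_cols: bool = True) -> float:
    """Bits stored per original weight when dimensions are padded.

    Assumption (not confirmed by the model card): storage is
    proportional to the padded element count. Which dimensions are
    actually padded depends on the quantizer, hence the two flags.
    """
    padded_rows = next_power_of_two(rows) if pad_rows else rows
    padded_cols = next_power_of_two(cols) if pad_cols else cols
    return nominal_bpw * (padded_rows * padded_cols) / (rows * cols)


# hidden_size=576 is padded to 1024, as stated in the note above.
print(next_power_of_two(576))

# A 576x576 matrix with only one dimension padded, at a nominal 4 bpw.
print(round(effective_bpw(4.0, 576, 576, pad_rows=False), 2))
```

Dimensions that are already close to a power of two incur less overhead, which is one reason the model-wide average can sit below the worst-case per-matrix figure.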

Usage

pip install glq

import glq.hf_integration
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/SmolLM2-135M-Instruct-GLQ-4bpw",
    device_map="cuda",
    dtype="float16",
)
tokenizer = AutoTokenizer.from_pretrained("xv0y5ncu/SmolLM2-135M-Instruct-GLQ-4bpw")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Requirements

  • transformers >= 5.0
  • torch >= 2.0
  • glq >= 0.2.8 (pip install glq)
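To confirm that an environment meets these minimums before loading the model, something like the following sketch may help. It uses only the standard library; the helper names are hypothetical, and the version comparison is deliberately simplified: it looks only at the leading numeric components, so a pre-release such as 5.0.0rc1 is treated as 5.0.0.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list above.
MINIMUMS = {"transformers": "5.0", "torch": "2.0", "glq": "0.2.8"}


def release_tuple(v: str) -> tuple:
    """Leading numeric components of a version string.

    For example "2.3.1+cu121" becomes (2, 3, 1). Simplification:
    pre-release and local suffixes are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", v)
    if match is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if `installed` is at least `minimum`."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "5.0" and "5.0.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS) -> dict:
    """Map each unmet requirement to a short description of the problem."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (need >= {minimum})"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems


if __name__ == "__main__":
    issues = check_requirements()
    for package, problem in issues.items():
        print(f"{package}: {problem}")
    if not issues:
        print("All requirements satisfied.")
```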

License

Apache 2.0

Safetensors: model size 95.7M params; tensor types BF16, F16, I16