Qwable-v1-AWQ
AWQ 4-bit (W4A16) quantization of lordx64/Qwable-v1 — a 35B-total /
3B-active text generation Mixture-of-Experts model (Qwen3_5MoeForConditionalGeneration, Qwen3.6
family, with hybrid linear / full attention). Per the base model card it is text-only and aimed at
reasoning, agentic tool-use, and coding (see Capabilities).
Variant: AWQ weight-only (W4A16) — int4 symmetric weights, group size 128, activation-aware scaling; activations stay BF16
Disk size: ~22 GB (vs ~72 GB in BF16, about 3.3× smaller)
Quantized by: sahilchachra
Tooling: llm-compressor AWQ (oneshot) — activation-aware, calibrated on general instruct chat (UltraChat-200k)
Note on what is quantized: only the linear weights that hold the bulk of the parameters are taken to int4, namely the 256-way routed experts, the shared experts, and the full-attention projections. The linear-attention (Gated-Delta-Net, mamba-style) layers, the MoE routers, embeddings, lm_head, the MTP head and all norms are kept in BF16 for stability. The architecture also carries a vision tower (Qwen3_5MoeForConditionalGeneration), which is likewise kept in BF16; the base model is documented as text-only, so this quantization neither adds nor validates any image capability. The headline variant name reflects the dominant (expert/attention) quantization; the on-disk size reflects the mix of int4 and BF16 tensors rather than a pure 4-bit model.
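As a rough sanity check on the size figures above, the sketch below back-solves what share of the parameters would have to be int4 to land near 22 GB. It assumes one 16-bit scale per group of 128 int4 weights and no zero points (plausible for symmetric quantization, but the exact on-disk overhead is an assumption) and ignores file metadata.

```python
# Back-of-envelope check of the reported sizes (~22 GB AWQ vs ~72 GB BF16).
# Assumption: one 16-bit scale per group of 128 int4 weights, no zero points.

BF16_BITS = 16.0
GROUP_SIZE = 128
INT4_BITS = 4.0 + 16.0 / GROUP_SIZE  # 4.125 effective bits per quantized weight

bf16_gb = 72.0  # from the card
awq_gb = 22.0   # from the card

compression = bf16_gb / awq_gb
avg_bits = BF16_BITS / compression

# Solve avg_bits = f * INT4_BITS + (1 - f) * BF16_BITS for the int4 share f.
int4_fraction = (BF16_BITS - avg_bits) / (BF16_BITS - INT4_BITS)

print(f"compression ratio: {compression:.2f}x")
print(f"average bits per weight: {avg_bits:.2f}")
print(f"implied int4 share of parameters: {int4_fraction:.1%}")
```

Under these assumptions the figures are consistent with the large majority of parameters being int4 and a small remainder staying in BF16; the result is an estimate, not a measurement of the checkpoint.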
Capabilities
Unchanged in kind from the base model: quantization changes weight precision, not the prompt format or feature set, though output quality has not been benchmarked (see Notes). Per the base model card:
- Reasoning: thinks in explicit `<think>…</think>` chains-of-thought.
- Agentic tool-use: emits `<tool_use>` XML blocks for file/shell operations (activates with agent-style system prompts or prior `<tool_result>` turns).
- Coding: designed for agentic coding tasks with multi-turn agent interactions.
- Context length: 4096 tokens (training) / 16384 tokens (serving).
See the base card for limitations (narrow training distribution, tool-name differences, reasoning inherited from the Opus-4.7 distill).
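Since the model interleaves reasoning and tool calls with its answer, client code usually needs to separate them. The following is a minimal sketch using only the standard library. It assumes the blocks are closed with `</think>` and `</tool_use>`; the inner XML of the sample tool call is made up for illustration, and `split_output` is a hypothetical helper, not part of any library.

```python
import re

# Assumption: both block types are closed with a matching end tag.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool_use>(.*?)</tool_use>", re.DOTALL)


def split_output(text: str) -> dict:
    """Separate reasoning, raw tool-call blocks, and the remaining answer text."""
    reasoning = [m.strip() for m in THINK_RE.findall(text)]
    tool_calls = [m.strip() for m in TOOL_RE.findall(text)]
    answer = TOOL_RE.sub("", THINK_RE.sub("", text)).strip()
    return {"reasoning": reasoning, "tool_calls": tool_calls, "answer": answer}


# Hypothetical model output; the tool XML schema here is invented.
sample = (
    "<think>The user wants the file list.</think>"
    "I'll list the directory.\n"
    "<tool_use><name>shell</name><command>ls</command></tool_use>"
)
parts = split_output(sample)
print(parts["reasoning"])
print(parts["tool_calls"])
print(parts["answer"])
```

The tool-call bodies are returned as raw strings on purpose: the base card notes tool-name differences, so any schema-specific parsing is best done against real outputs.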
Smoke test
Loaded and run with transformers on an NVIDIA Thor (Blackwell) device. The model loads, runs the
hybrid linear-attention + int4 MoE path, and produces coherent text from a chat-templated prompt. A
structure census confirms only the intended decoder Linears are int4 (routed experts, shared expert,
full-attention q/k/v/o) with the routers, linear-attention, vision, MTP and norms left in BF16. This
is a functional smoke test only — it is not a quality benchmark.
Test device
- GPU: NVIDIA Thor (Blackwell)
- CPU / memory: 14-core ARM (aarch64), 122 GB unified memory
- Software: JetPack / L4T R38.4 (Ubuntu 24.04), CUDA 13.0, driver 580, kernel 6.8.12-tegra
What's quantized
| Quantized → int4 (AWQ W4A16) | Kept in BF16 |
|---|---|
| Routed experts (`mlp.experts.*.{gate,up,down}_proj`, 40 layers × 256 experts) | Linear / Gated-Delta-Net layers (`*.linear_attn.*`) |
| Shared experts (`mlp.shared_expert.{gate,up,down}_proj`) | MoE routers (`mlp.gate`), shared-expert gates |
| Full-attention projections (`self_attn.{q,k,v,o}_proj`) | Embeddings, lm_head, MTP head, all norms |
| | Vision tower (`model.visual.*`), present in the arch, unused for text |
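The split in the table above can be expressed as a small name-based classifier, in the spirit of the structure census mentioned in the smoke test. This is only a sketch: the regular expressions are derived from the table's patterns, the example module names are hypothetical, and `expected_precision` is an illustrative helper rather than the script actually used.

```python
import re

# Patterns mirror the "Quantized" column; everything else is expected in BF16.
INT4_PATTERNS = [
    r"\.mlp\.experts\.\d+\.(gate|up|down)_proj$",    # routed experts
    r"\.mlp\.shared_expert\.(gate|up|down)_proj$",   # shared experts
    r"\.self_attn\.[qkvo]_proj$",                    # full-attention projections
]


def expected_precision(module_name: str) -> str:
    """Return 'int4' if the name matches a quantized pattern, else 'bf16'."""
    if any(re.search(p, module_name) for p in INT4_PATTERNS):
        return "int4"
    return "bf16"


# Hypothetical module names for illustration only.
names = [
    "model.layers.3.mlp.experts.17.up_proj",
    "model.layers.3.mlp.shared_expert.down_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.2.linear_attn.in_proj",
    "model.layers.3.mlp.gate",
    "model.visual.blocks.0.attn.qkv",
    "lm_head",
]
census = {n: expected_precision(n) for n in names}
for name, precision in census.items():
    print(f"{precision:5s} {name}")
```

Comparing such an expected map against the precision actually found in a loaded checkpoint is one way to confirm that only the intended Linears were quantized.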
Usage (vLLM)
from vllm import LLM, SamplingParams
llm = LLM(model="sahilchachra/Qwable-v1-AWQ", dtype="bfloat16", max_model_len=16384, trust_remote_code=True)
out = llm.generate(["Hello!"], SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128))
print(out[0].outputs[0].text)
Runs on GPUs with compressed-tensors W4A16 support (vLLM unpacks the int4 weights for you).
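The snippet above runs offline inference on a raw prompt. For chat-style use, one option is to query vLLM's OpenAI-compatible endpoint instead. The sketch below assumes a server was started with `vllm serve "sahilchachra/Qwable-v1-AWQ"` on the default port 8000; `build_request` and `chat` are hypothetical helper names, and the sampling values simply reuse those from the example above.

```python
import json
import urllib.request

MODEL = "sahilchachra/Qwable-v1-AWQ"
# Assumption: a local vLLM server listening on its default port.
URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no server; call chat(...) once one is running.
req = build_request("What is the capital of France?")
print(req.get_method(), req.full_url)
```

Going through the chat endpoint lets the server apply the model's chat template, which the raw `llm.generate(["Hello!"])` call does not do.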
Notes
- Weight-only AWQ (W4A16): weights are int4 (group size 128, symmetric, activation-aware scales), activations remain BF16.
- Format: `pack-quantized` (compressed-tensors), per-expert layout, the standard layout vLLM consumes for quantized MoE.
- Loading requires `compressed-tensors` and a recent `transformers` (the `qwen3_5_moe` architecture).
- Smoke-tested only; not formally benchmarked for quality.
- Sibling quantization: sahilchachra/Qwable-v1-NVFP4A16 (NVFP4 for Blackwell GPUs).
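To make the W4A16 note concrete, here is a minimal pure-Python sketch of symmetric int4 quantization with group size 128. It deliberately omits AWQ's activation-aware scaling step and the packed storage format, so it illustrates only the rounding scheme, not what llm-compressor does internally. The function names and the toy weights are assumptions for illustration.

```python
import math


def quantize_group(weights, qmin=-8, qmax=7):
    """Symmetric int4 quantization of one group: returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize(weights, group_size=128):
    """Quantize a flat weight list group by group, one scale per group."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate weights: w_hat = code * scale."""
    return [c * s for codes, s in groups for c in codes]


# Toy weights (not from the model): 256 values, i.e. two groups of 128.
weights = [math.sin(0.1 * i) * (1 + i / 256) for i in range(256)]
groups = quantize(weights)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_err:.4f}")
```

Each group's rounding error is bounded by half its scale, which is why smaller groups (and AWQ's rescaling of salient channels) tend to reduce quantization error at the cost of storing more scales.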
Original model
See lordx64/Qwable-v1 for full lineage, intended use, and limitations. License (AGPL-3.0) is inherited from the base model.
Model tree for sahilchachra/Qwable-v1-AWQ
Base model
Qwen/Qwen3.6-35B-A3B