NPC Agentic 7B — GPTQ 4-bit


W4A16 GPTQ-quantized build of ramankrishna10/npc-agentic-7b-v3 for fast, memory-efficient inference. It loads in ~5 GB of VRAM and is well suited to vLLM serving.

See the FP16 reference card for the full training recipe, eval numbers, and known limitations. In particular, the model regresses on GSM8K relative to the base model, so use base Qwen2.5 or Qwen2.5-Math-7B for math-heavy workflows.

Quantization details

  • Method: GPTQ via llm-compressor
  • Scheme: W4A16 (4-bit weights, fp16 activations)
  • Group size: 128
  • Desc-act: true
  • Symmetric: false
  • Calibration: 512 samples from the training set, 2048 tokens each
  • Ignored layers: lm_head (kept in full precision)
  • Size on disk: ~4.5 GB
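
To make the scheme settings above concrete, here is a minimal sketch of what "4-bit weights, group size 128, asymmetric" means for a single weight row. It uses plain round-to-nearest, not GPTQ's error-compensating update, and it is not llm-compressor's code. All function names are hypothetical, and the bits-per-weight estimate assumes one fp16 scale and one 4-bit zero-point per group, ignoring the extra index tensor that act-order adds.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, int]  # (4-bit codes, scale, integer zero-point)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    # Extend the range to include 0.0 so the zero-point fits in [0, qmax].
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:  # all-zero group: any positive scale works
        scale = 1.0
    zero_point = min(qmax, max(0, round(-lo / scale)))
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4) -> List[Group]:
    """Quantize a row in consecutive groups, each with its own scale and zero-point."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def effective_bits_per_weight(bits: int = 4, group_size: int = 128, scale_bits: int = 16) -> float:
    """Storage per weight, assuming one scale and one `bits`-wide zero-point per group."""
    return bits + (scale_bits + bits) / group_size
```

Under these assumptions, a group size of 128 adds 20 bits of metadata per 128 weights, so the quantized layers cost slightly more than 4 bits per weight. Smaller groups track the weight distribution more closely at the price of more metadata.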

Inference (vLLM)

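from vllm import LLM, SamplingParams

# Load the quantized checkpoint; activations run in fp16 (the "A16" in W4A16).
llm = LLM(model="ramankrishna10/npc-agentic-7b-v3-gptq-4bit", dtype="float16")

# generate() takes a list of prompts and returns one result per prompt.
out = llm.generate(
    ["Design an event-sourced microservice with exactly-once command handling."],
    SamplingParams(max_tokens=1024, temperature=0.7, top_p=0.9),
)

# Print the first completion for the first prompt.
print(out[0].outputs[0].text)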
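
For serving rather than offline generation, the same request can be sent to vLLM's OpenAI-compatible HTTP server. The sketch below assumes the server was started separately with `vllm serve ramankrishna10/npc-agentic-7b-v3-gptq-4bit` and is listening on the default `http://localhost:8000`; adjust the URL for your deployment. The helper names `build_completion_request` and `send_completion_request` are hypothetical, and only the standard library is used.

```python
import json
import urllib.request
from typing import Any, Dict

MODEL_ID = "ramankrishna10/npc-agentic-7b-v3-gptq-4bit"


def build_completion_request(
    prompt: str,
    max_tokens: int = 1024,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> Dict[str, Any]:
    """Build a /v1/completions payload with the same sampling settings as above."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_completion_request(
    payload: Dict[str, Any],
    base_url: str = "http://localhost:8000",  # assumed default address
) -> str:
    """POST the payload to a running server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `send_completion_request(build_completion_request("Design an event-sourced microservice with exactly-once command handling."))` returns the completion text for the same prompt used in the offline example.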

See also


Built by Bottensor.

Citation

If you use NPC Agentic 7B in your work, please cite:

@misc{bachu2026npcagentic7b,
  title        = {NPC Agentic 7B: A Single-GPU QLoRA Recipe for a Laptop-Scale Conversational Model},
  author       = {Bachu, Rama Krishna},
  year         = {2026},
  month        = may,
  publisher    = {Zenodo},
  version      = {v1},
  doi          = {10.5281/zenodo.19954103},
  url          = {https://doi.org/10.5281/zenodo.19954103},
  note         = {Preprint}
}

Paper: https://doi.org/10.5281/zenodo.19954103


Model tree

Base model: Qwen/Qwen2.5-7B (this model is a quantized derivative)