Nex-N2-mini GPTQ-Pro
This is a GPTQ-Pro 4-bit quantization of nex-agi/Nex-N2-mini.
It is a deployment artifact, not a new fine-tune. The goal is to make the Nex-N2-mini MoE checkpoint easier to test in GPTQ-compatible local serving stacks while keeping the model card honest about the validation status.
The source checkpoint includes vision/visual tensors. This artifact preserves them, but validation for this release covers only text and coding-agent serving; vision behavior has not yet been validated for the quantized artifact.
Source And Credits
Source model: nex-agi/Nex-N2-mini
Quantization tooling and reference recipe: groxaxo/GPTQ-Pro and groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
Artifact Summary
| Field | Value |
|---|---|
| Source model | nex-agi/Nex-N2-mini |
| Architecture | Qwen3_5MoeForConditionalGeneration |
| Model type | qwen3_5_moe |
| Tensor files | 5 |
| Safetensors size | 19.23 GiB |
| Indexed tensors | 124576 |
| Quantized qweight tensors | 30970 |
| mtp.* tensors in index | false |
| vision/visual tensors in index | true |
| Index metadata size matches shards | true |
The source index/logs showed no mtp.* tensors. This artifact therefore
normalizes text_config.mtp_num_hidden_layers to 0 and records the change
under artifact_notes.mtp.
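The index facts in the table above can be re-derived from model.safetensors.index.json. The following is a minimal sketch, not the script used for this artifact: the function name `summarize_index` is hypothetical, and the name-matching rules (a `mtp.` prefix, a `.qweight` suffix, and the substrings `visual`/`vision`) are assumptions about how the tensors are named.

```python
import json


def summarize_index(index: dict) -> dict:
    """Summarize tensor names in a safetensors index (weight_map: name -> shard)."""
    weight_map = index["weight_map"]
    names = list(weight_map)
    return {
        "indexed_tensors": len(names),
        # Assumption: quantized layers expose a tensor ending in ".qweight".
        "qweight_tensors": sum(n.endswith(".qweight") for n in names),
        # Assumption: MTP tensors, if any, live under a top-level "mtp." prefix.
        "has_mtp": any(n.startswith("mtp.") for n in names),
        # Assumption: vision tensors carry "visual" or "vision" in their name.
        "has_vision": any("visual" in n or "vision" in n for n in names),
        "shards": len(set(weight_map.values())),
    }


def summarize_index_file(path: str) -> dict:
    """Load an index file from disk and summarize it."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_index(json.load(f))
```

Calling `summarize_index_file("model.safetensors.index.json")` on a local copy should report the same counts as the table if the naming assumptions hold.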
Quantization Recipe
| Setting | Value |
|---|---|
| Method | GPTQ-Pro / GPTQModel |
| Quantizer | gptqmodel:6.1.0-dev |
| Bits | 4 |
| Group size | 128 |
| Symmetric quantization | true |
| Desc act | false |
| True sequential | true |
| Calibration dataset | WikiText |
| Calibration samples | 256 |
| Calibration sequence length | 2048 |
| MSE | 2.0 |
| Damp percent | 0.05 |
| Damp auto increment | 0.01 |
| FOEM alpha | 0.25 |
| FOEM beta | 0.2 |
| FOEM device | cuda:0 |
| MoE routing | ExpertsRoutingBypass |
| MoE bypass batch size | 320 |
| Dense VRAM strategy | exclusive |
| MoE VRAM strategy | balanced |
| Pack implementation | cpu |
Fallback smoothing was enabled for difficult groups with threshold 0.5%.
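Two numbers follow directly from the recipe and may help when comparing against other quantizations. This is a rough sketch with hypothetical helper names; the bits-per-weight estimate assumes the common GPTQ layout of one 16-bit scale and one packed zero point per group, and it ignores the group index tensor and any layers left unquantized.

```python
def calibration_token_budget(samples: int, seq_len: int) -> int:
    """Upper bound on calibration tokens if every sample fills the sequence."""
    return samples * seq_len


def approx_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Approximate storage per quantized weight, including per-group overhead.

    Assumption: each group stores one scale (scale_bits) and one zero point
    packed at the same width as the weights.
    """
    return bits + (scale_bits + bits) / group_size


# Values taken from the recipe table: 256 samples, length 2048, 4-bit, group size 128.
tokens = calibration_token_budget(256, 2048)
bpw = approx_bits_per_weight(4, 128)
print(f"calibration tokens (upper bound): {tokens}")
print(f"approximate bits per quantized weight: {bpw}")
```

Under these assumptions the group overhead is small compared with the 4-bit payload, so most of the checkpoint size beyond 4 bits per weight would come from tensors that were not quantized.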
Intended Serving Shape
This checkpoint is intended for advanced users testing text-only GPTQ serving for Qwen3.6-style MoE models.
A starting vLLM shape for text-only testing:
vllm serve XReyRobert/Nex-N2-mini-GPTQ-Pro \
--served-model-name nex-n2-mini-gptq-pro \
--language-model-only \
--dtype float16 \
--quantization gptq_marlin \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--trust-remote-code
Treat this as a starting point. Loader compatibility depends on vLLM, Transformers, GPTQModel, GPTQ-Marlin, and Qwen3.6 MoE support.
The RTX 3090 image above reflects separate 262k-context serving validation.
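Once a server with the shape above is running, it can be exercised through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. It assumes the server listens on http://localhost:8000 (vLLM's default port) and uses the served model name from the command above; the helper names and the `max_tokens` default are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "nex-n2-mini-gptq-pro",
    base_url: str = "http://localhost:8000",  # assumption: default vLLM port
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs):
    """Send one prompt and return the assistant message content (may be None)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("Write a shell one-liner that counts files in a directory.")` returns the final answer text. With a reasoning parser enabled, the reasoning trace is typically returned in a separate field of the message, so a small `max_tokens` value can leave the content empty.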
Validation And Benchmarks
Completed artifact checks:
- Local shard index inspection completed before upload.
- Remote file list verified after upload.
- Remote model.safetensors.index.json verified after upload.
- Index metadata total size matches the local safetensor shards.
- The remote artifact contains the expected five safetensor shards.
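A check like the size match above can be reproduced locally without loading any weights, because each safetensors file starts with an 8-byte little-endian header length followed by a JSON header listing every tensor's byte offsets. The following is a sketch under one assumption: that `metadata.total_size` in the index equals the sum of tensor payload bytes across shards. It is not the script used to produce UPLOAD_MANIFEST.json, and the function names are hypothetical.

```python
import json
import struct
from pathlib import Path


def tensor_bytes_in_shard(path: Path) -> int:
    """Sum the tensor payload sizes declared in one safetensors header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    return sum(
        entry["data_offsets"][1] - entry["data_offsets"][0]
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


def index_matches_shards(model_dir: Path) -> bool:
    """Compare the index's total_size with the bytes declared by its shards."""
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))
    shards = sorted(set(index["weight_map"].values()))
    total = sum(tensor_bytes_in_shard(model_dir / shard) for shard in shards)
    return total == index["metadata"]["total_size"]
```

`index_matches_shards(Path("/path/to/local/snapshot"))` reads only the headers, so it stays fast even for multi-gigabyte shards.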
Terminal-Bench 2.0 Smoke24 result and associated vLLM serving measurements.
This Smoke24 run used max_model_len=131072 for apples-to-apples comparison
with the other local models in this publication batch:
| Run | Score | Success rate | Wall-time | Output tokens | Observed decode | LLM API time |
|---|---|---|---|---|---|---|
| nex-n2-mini-gptq-pro | 14/24 | 58.3% | 314.6m | 1670.6k | 140.8 tok/s | 197.4m |
Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run. In this harness, Nex-N2-mini GPTQ-Pro tied the Qwen3.6 27B GPTQ reference on solved tasks but used more wall time and far more output tokens. That makes it a useful candidate for further serving and generation-control tuning, not an efficiency leader in this specific test.
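A few derived figures make the verbosity point concrete. The sketch below recomputes them from the table row, reading "1670.6k" as 1,670,600 tokens and the time columns as minutes; both readings are assumptions. The throughput it derives divides output tokens by LLM API time, which may not be how the "Observed decode" column was measured.

```python
def smoke24_derived(solved: int, tasks: int, wall_min: float,
                    api_min: float, output_tokens: int) -> dict:
    """Derive summary ratios from one Smoke24 result row."""
    return {
        "success_rate_pct": round(100 * solved / tasks, 1),
        # Share of wall-clock time spent waiting on the model server.
        "api_share_of_wall_pct": round(100 * api_min / wall_min, 1),
        # Average output tokens generated per task.
        "tokens_per_task": round(output_tokens / tasks),
        # Output tokens divided by total LLM API seconds.
        "tokens_per_api_second": round(output_tokens / (api_min * 60), 1),
    }


# Values from the nex-n2-mini-gptq-pro row above.
row = smoke24_derived(solved=14, tasks=24, wall_min=314.6,
                      api_min=197.4, output_tokens=1_670_600)
for key, value in row.items():
    print(f"{key}: {value}")
```

By this arithmetic the run averages roughly 70k output tokens per task and spends a little under two thirds of its wall time inside LLM calls.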
Task list and harness shape:
MTP And Vision Status
- mtp.* tensors are not present in this artifact.
- text_config.mtp_num_hidden_layers was normalized to 0.
- Do not enable MTP speculative decoding for this artifact.
- Vision/visual tensors are present, but multimodal serving has not been validated for this quantized artifact.
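A launcher script can guard against enabling MTP speculative decoding by reading the normalized field from config.json first. This is a small sketch with a hypothetical function name; it only inspects the config key named above and does not touch any serving flags.

```python
import json


def mtp_speculation_allowed(config: dict) -> bool:
    """Return True only if the config declares at least one MTP layer."""
    text_config = config.get("text_config") or {}
    layers = text_config.get("mtp_num_hidden_layers") or 0
    return int(layers) > 0


def mtp_speculation_allowed_for(config_path: str) -> bool:
    """Load config.json from disk and apply the same check."""
    with open(config_path, "r", encoding="utf-8") as f:
        return mtp_speculation_allowed(json.load(f))
```

For this artifact the field is 0, so the check returns False and the launcher should leave speculative decoding off.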
Limitations
- Experimental quantization.
- Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
- Nex-N2-mini was verbose and reasoning-heavy in the Smoke24 harness; generation controls may need further tuning.
- MTP speculative decoding is not supported by this artifact.
- Vision tensors are preserved, but vision behavior has not been validated.
- Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.
Files
Key files:
- model.safetensors.index.json
- model-00001-of-00005.safetensors through model-00005-of-00005.safetensors
- config.json
- quantize_config.json
- processor_config.json
- tokenizer.json
- UPLOAD_MANIFEST.json
UPLOAD_MANIFEST.json records the upload guardrail checks and artifact
inspection summary.
References
- Source model: nex-agi/Nex-N2-mini
- GPTQ-Pro tooling: groxaxo/GPTQ-Pro
- Reference recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
- Terminal-Bench: laude-institute/terminal-bench
Individual Project Notice
This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.