Instructions for using XReyRobert/Nex-N2-mini-GPTQ-Pro with Transformers and local serving apps (vLLM, SGLang, Docker Model Runner).

Libraries

How to use XReyRobert/Nex-N2-mini-GPTQ-Pro with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

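# The chat messages below include an image, so the image-text-to-text task is used.
# Note: vision behavior has not been validated for this quantized artifact.
pipe = pipeline("image-text-to-text", model="XReyRobert/Nex-N2-mini-GPTQ-Pro")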
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("XReyRobert/Nex-N2-mini-GPTQ-Pro")
model = AutoModelForMultimodalLM.from_pretrained("XReyRobert/Nex-N2-mini-GPTQ-Pro")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use XReyRobert/Nex-N2-mini-GPTQ-Pro with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "XReyRobert/Nex-N2-mini-GPTQ-Pro"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XReyRobert/Nex-N2-mini-GPTQ-Pro",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/XReyRobert/Nex-N2-mini-GPTQ-Pro

SGLang

How to use XReyRobert/Nex-N2-mini-GPTQ-Pro with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "XReyRobert/Nex-N2-mini-GPTQ-Pro" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XReyRobert/Nex-N2-mini-GPTQ-Pro",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "XReyRobert/Nex-N2-mini-GPTQ-Pro" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use XReyRobert/Nex-N2-mini-GPTQ-Pro with Docker Model Runner:
```
docker model run hf.co/XReyRobert/Nex-N2-mini-GPTQ-Pro
```

Nex-N2-mini GPTQ-Pro

This is a GPTQ-Pro 4-bit quantization of nex-agi/Nex-N2-mini.

It is a deployment artifact, not a new fine-tune. The goal is to make the Nex-N2-mini MoE checkpoint easier to test in GPTQ-compatible local serving stacks while keeping the model card honest about the validation status.

The source checkpoint includes vision/visual tensors. This artifact preserves those tensors, but only text and coding-agent serving has been validated so far. Vision behavior has not yet been validated for the quantized artifact.

Source And Credits

Source model:

nex-agi/Nex-N2-mini

Quantization tooling and reference recipe:

groxaxo/GPTQ-Pro (tooling)
groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit (reference recipe)

Artifact Summary

| Field | Value |
| --- | --- |
| Source model | `nex-agi/Nex-N2-mini` |
| Architecture | `Qwen3_5MoeForConditionalGeneration` |
| Model type | `qwen3_5_moe` |
| Tensor files | `5` |
| Safetensors size | `19.23 GiB` |
| Indexed tensors | `124576` |
| Quantized `qweight` tensors | `30970` |
| `mtp.*` tensors in index | `false` |
| vision/visual tensors in index | `true` |
| Index metadata size matches shards | `true` |

The source index/logs showed no mtp.* tensors. This artifact therefore normalizes text_config.mtp_num_hidden_layers to 0 and records the change under artifact_notes.mtp.
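The check behind this note can be sketched in a few lines of Python. This is a minimal sketch, not the tooling actually used for this artifact: it assumes the usual sharded-checkpoint index layout (a `weight_map` from tensor name to shard file) and that vision tensors can be recognized by `visual` or `vision` in their names. The helper name `summarize_index` and the sample tensor names are hypothetical.

```python
def summarize_index(index: dict, config: dict) -> dict:
    """Report MTP/vision tensor presence and whether the config agrees."""
    names = list(index.get("weight_map", {}))
    has_mtp = any(n.startswith("mtp.") or ".mtp." in n for n in names)
    has_vision = any("visual" in n or "vision" in n for n in names)
    mtp_layers = config.get("text_config", {}).get("mtp_num_hidden_layers", 0)
    return {
        "mtp_tensors": has_mtp,
        "vision_tensors": has_vision,
        "mtp_num_hidden_layers": mtp_layers,
        # The config should not advertise MTP layers when no mtp.* tensors exist.
        "consistent": has_mtp or mtp_layers == 0,
    }


# Hypothetical tensor names, for illustration only.
sample_index = {
    "weight_map": {
        "model.language_model.layers.0.mlp.experts.0.down_proj.qweight":
            "model-00001-of-00005.safetensors",
        "model.visual.blocks.0.attn.qkv.weight":
            "model-00001-of-00005.safetensors",
    }
}
sample_config = {"text_config": {"mtp_num_hidden_layers": 0}}

summary = summarize_index(sample_index, sample_config)
print(summary)
```

To run it against the real files, load `model.safetensors.index.json` and `config.json` with `json.load` and pass the resulting dictionaries in place of the samples.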

Quantization Recipe

| Setting | Value |
| --- | --- |
| Method | GPTQ-Pro / GPTQModel |
| Quantizer | `gptqmodel:6.1.0-dev` |
| Bits | `4` |
| Group size | `128` |
| Symmetric quantization | `true` |
| Desc act | `false` |
| True sequential | `true` |
| Calibration dataset | WikiText |
| Calibration samples | `256` |
| Calibration sequence length | `2048` |
| MSE | `2.0` |
| Damp percent | `0.05` |
| Damp auto increment | `0.01` |
| FOEM alpha | `0.25` |
| FOEM beta | `0.2` |
| FOEM device | `cuda:0` |
| MoE routing | `ExpertsRoutingBypass` |
| MoE bypass batch size | `320` |
| Dense VRAM strategy | `exclusive` |
| MoE VRAM strategy | `balanced` |
| Pack implementation | `cpu` |

Fallback smoothing was enabled for difficult groups, with a threshold of 0.5%.
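As a rough illustration of what the 4-bit, group-size-128 setting implies for storage, the arithmetic below estimates the effective bits per quantized weight. It is a back-of-envelope sketch that assumes the common GPTQ layout of one 16-bit scale and one packed 4-bit zero point per group. The on-disk layout of this artifact may differ, and unquantized tensors are not counted. The function name is hypothetical.

```python
from typing import Optional


def effective_bits_per_weight(
    bits: int,
    group_size: int,
    scale_bits: int = 16,
    zero_bits: Optional[int] = None,
) -> float:
    """Weight bits plus per-group scale/zero-point overhead, amortized per weight."""
    if zero_bits is None:
        zero_bits = bits  # assumption: zero points are packed at the weight bit width
    return bits + (scale_bits + zero_bits) / group_size


bpw = effective_bits_per_weight(bits=4, group_size=128)
print(f"{bpw:.5f} bits per quantized weight")
```

Under these assumptions the per-group metadata adds about 0.16 bits on top of the nominal 4 bits per weight.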

Intended Serving Shape

This checkpoint is intended for advanced users testing text-only GPTQ serving for Qwen3.6-style MoE models.

A starting vLLM shape for text-only testing:

vllm serve XReyRobert/Nex-N2-mini-GPTQ-Pro \
  --served-model-name nex-n2-mini-gptq-pro \
  --language-model-only \
  --dtype float16 \
  --quantization gptq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8_e5m2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

Treat this as a starting point. Whether the checkpoint loads depends on the installed versions of vLLM, Transformers, GPTQModel, and GPTQ-Marlin, and on their Qwen3.6 MoE support.
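Once a server started with the command above is running, it can be queried through the OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on `localhost:8000` (the command above does not set a port, so this is the assumed default) and uses the `nex-n2-mini-gptq-pro` name from `--served-model-name`. The helper names `build_chat_request` and `chat` and the `max_tokens` value are illustrative choices, not part of the original setup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default host and port
MODEL = "nex-n2-mini-gptq-pro"         # matches --served-model-name above


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # With a reasoning parser enabled, the reply may carry reasoning in a
    # separate field and "content" may be empty; handle that as needed.
    return body["choices"][0]["message"]["content"] or ""


# Example (requires a running server):
# print(chat("Write a shell one-liner that counts lines in all .py files."))
```

Because the model was verbose in testing, it may be worth setting `max_tokens` explicitly rather than relying on the server default.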

The RTX 3090 image above reflects separate 262k-context serving validation.

Validation And Benchmarks

Completed artifact checks:

Local shard index inspection completed before upload.
Remote file list verified after upload.
Remote model.safetensors.index.json verified after upload.
Index metadata total size matches the local safetensor shards.
The remote artifact contains the expected five safetensor shards.
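The size check in this list can be reproduced with the standard library alone. The sketch below is one possible way to do it, not necessarily the check that was run for this upload. It assumes the safetensors file layout (an 8-byte little-endian header length followed by a JSON header that gives `data_offsets` for each tensor) and that `metadata.total_size` in the index is the sum of tensor payload bytes. The function names are hypothetical.

```python
import json
import struct
from pathlib import Path


def shard_tensor_bytes(path: Path) -> int:
    """Sum the tensor payload sizes declared in one safetensors header."""
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return sum(
        entry["data_offsets"][1] - entry["data_offsets"][0]
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


def index_matches_shards(model_dir: Path) -> bool:
    """Compare the index's total_size with the bytes declared by its shards."""
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    shards = sorted(set(index["weight_map"].values()))
    actual = sum(shard_tensor_bytes(model_dir / shard) for shard in shards)
    return actual == index["metadata"]["total_size"]
```

Calling `index_matches_shards(Path("path/to/local/snapshot"))` on a downloaded copy reads only the shard headers, so it stays fast even for multi-gigabyte files.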

The table below gives the Terminal-Bench 2.0 Smoke24 result and the associated vLLM serving measurements. This Smoke24 run used max_model_len=131072 for an apples-to-apples comparison with the other local models in this publication batch:

| Run | Score | Success rate | Wall-time | Output tokens | Observed decode | LLM API time |
| --- | --- | --- | --- | --- | --- | --- |
| `nex-n2-mini-gptq-pro` | `14/24` | `58.3%` | `314.6m` | `1670.6k` | `140.8 tok/s` | `197.4m` |

Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run. In this harness, Nex-N2-mini GPTQ-Pro tied the Qwen3.6 27B GPTQ reference on solved tasks but used more wall time and far more output tokens. That makes it a useful candidate for further serving and generation-control tuning, not an efficiency leader in this specific test.
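The derived figures can be cross-checked from the raw columns of the table. The arithmetic sketch below assumes that "Output tokens" is the total across all 24 tasks and that "LLM API time" is the total time spent in model requests. The reported observed decode rate was presumably measured differently, so the average computed here only approximates it.

```python
solved, total_tasks = 14, 24
output_tokens = 1_670_600   # 1670.6k from the table
llm_api_minutes = 197.4

success_rate = 100 * solved / total_tasks
avg_tokens_per_task = output_tokens / total_tasks
avg_tok_per_s = output_tokens / (llm_api_minutes * 60)

print(f"success rate: {success_rate:.1f}%")
print(f"avg output tokens per task: {avg_tokens_per_task:,.0f}")
print(f"avg output tok/s over LLM API time: {avg_tok_per_s:.1f}")
```

The average works out to roughly 141 tok/s, close to the reported 140.8 tok/s, and to nearly 70k output tokens per task, which is the verbosity noted above.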

Task list and harness shape:

benchmarks/terminal-bench-2.0/smoke24_task_list_20260616.md

MTP And Vision Status

mtp.* tensors are not present in this artifact.
text_config.mtp_num_hidden_layers was normalized to 0.
Do not enable MTP speculative decoding for this artifact.
Vision/visual tensors are present, but multimodal serving has not been validated for this quantized artifact.

Limitations

Experimental quantization.
Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
Nex-N2-mini was verbose and reasoning-heavy in the Smoke24 harness; generation controls may need further tuning.
MTP speculative decoding is not supported by this artifact.
Vision tensors are preserved, but vision behavior has not been validated.
Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.

Files

Key files:

model.safetensors.index.json
model-00001-of-00005.safetensors through model-00005-of-00005.safetensors
config.json
quantize_config.json
processor_config.json
tokenizer.json
UPLOAD_MANIFEST.json

UPLOAD_MANIFEST.json records the upload guardrail checks and artifact inspection summary.

References

Source model: nex-agi/Nex-N2-mini
GPTQ-Pro tooling: groxaxo/GPTQ-Pro
Reference recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
Terminal-Bench: laude-institute/terminal-bench

Individual Project Notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

Safetensors

Model size: 35B params
Tensor types: BF16, I32
