Quantization Script

by vgoklani - opened Apr 15

Discussion

vgoklani

Apr 15

Thanks for sharing! Could you please share your quantization script for ModelOpt?

Thanks!

saricles changed discussion status to closed Apr 15

saricles changed discussion status to open Apr 15

saricles

Owner Apr 15

Sure, give me a little time. I'll share it later today.

saricles

Owner Apr 15

I dropped the script in this same repo: quantize-nvfp4-gb10.py.

It's the recipe that produced this exact quant. It's configurable through environment variables and has a documented header explaining the MoE-expert calibration gotcha you'll hit on M2: with 64 calibration samples and top-K=8 routing, some of the 256 experts are never routed to, so their weight_quantizer.amax stays unset and export_hf_checkpoint fails an assertion. Phase 2.5 of the script handles that. It mirrors _calibrate_weight_quantizer_if_needed from newer modelopt versions, for folks on older releases that don't have that auto-fix yet.

If you adapt it for another architecture, the only model-specific bit is the ignore-list patterns near the bottom of Phase 2 — comments call out what to change. Hope it's useful!
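For readers who haven't opened the script yet, here is a rough sketch of the idea behind the Phase 2.5 fallback. It is not the script's code and does not use ModelOpt's API: plain Python lists stand in for weight tensors, and `weight_amax`, `fill_missing_amax`, and the expert names are hypothetical names made up for illustration.

```python
from typing import Dict, List, Optional


def weight_amax(weight: List[List[float]]) -> float:
    """Largest absolute value in a 2-D weight matrix (nested lists)."""
    return max(abs(v) for row in weight for v in row)


def fill_missing_amax(
    weights: Dict[str, List[List[float]]],
    amax: Dict[str, Optional[float]],
) -> List[str]:
    """Set amax from weight statistics wherever calibration left it unset.

    Returns the names of the experts that needed the fallback.
    """
    filled = []
    for name, w in weights.items():
        if amax.get(name) is None:
            amax[name] = weight_amax(w)
            filled.append(name)
    return filled


# Toy data (assumption): expert_1 was never routed to during calibration,
# so its amax is still unset.
weights = {
    "expert_0": [[0.5, -1.25], [0.75, 0.1]],
    "expert_1": [[-2.0, 0.3], [1.5, -0.4]],
}
amax = {"expert_0": 1.25, "expert_1": None}

filled = fill_missing_amax(weights, amax)
print(filled)            # ['expert_1']
print(amax["expert_1"])  # 2.0
```

Experts that already have a calibrated amax are left alone; only the unset ones get the weight-derived value.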

vgoklani

Apr 15

Thank you for sharing! Why do you only use 64 samples for calibration?

saricles

Owner Apr 16 • edited Apr 17

Gonna be honest with you - I'm a bit of a newb, learning as I go. As such - I've been using my AI agents to facilitate my efforts. 😁
Here's the response my agent helped me to prepare:

64 is a common default for NVFP4 calibration. The pass computes per-block scaling factors (amax), not fine-tuned weights, so the statistics converge fast. 64 samples × 2048 tokens = 131,072 tokens (roughly 131K) of calibration data, which gives a representative distribution of activation magnitudes.

More samples give diminishing returns on amax quality at linear cost: each sample is a full forward pass through a 230B model. On 8×A100 that's roughly 25-30s per sample, so 64 samples take roughly 30 min. The quality delta between 64 and 256 samples is negligible for NVFP4 weight scales.
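To make the linear-cost point concrete, here is a back-of-the-envelope sketch. It only uses the figures quoted above (2048 tokens per sample, 25-30s per sample); `calibration_budget` is a hypothetical helper, and real timings will vary with hardware and sequence length.

```python
TOKENS_PER_SAMPLE = 2048        # sequence length quoted above
SECONDS_PER_SAMPLE = (25, 30)   # rough per-sample range quoted above


def calibration_budget(num_samples: int) -> dict:
    """Token count and wall-clock range for a given calibration set size."""
    low, high = SECONDS_PER_SAMPLE
    return {
        "samples": num_samples,
        "tokens": num_samples * TOKENS_PER_SAMPLE,
        "minutes_low": num_samples * low / 60,
        "minutes_high": num_samples * high / 60,
    }


for n in (64, 128, 256):
    b = calibration_budget(n)
    print(f"{b['samples']:>3} samples: {b['tokens']:>7,} tokens, "
          f"{b['minutes_low']:.0f}-{b['minutes_high']:.0f} min")
```

Doubling the sample count doubles both the token count and the time estimate.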

The real calibration challenge on MoE isn't sample count — it's expert coverage. With 256 experts and top-K=8 routing, 64 samples × 2048 tokens drive roughly 1M expert activations per layer (approximately 4K per expert on average). Most experts see plenty of calibration, but routing is heavily skewed — popular experts dominate while tail experts may be undersampled or never fire. That's why Phase 2.5 exists: it force-populates amax from weight statistics on the never-activated experts. The amax from weight stats is slightly less precise than activation-derived amax, but for experts that were never routed during calibration, any reasonable scale is fine.
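The expert-activation arithmetic above can be checked in a few lines. The constants come straight from this post. The `toy_counts` list is made-up data for a hypothetical 8-expert router, included only to show that a healthy average does not rule out experts with zero hits.

```python
NUM_SAMPLES = 64
TOKENS_PER_SAMPLE = 2048
TOP_K = 8
NUM_EXPERTS = 256

tokens = NUM_SAMPLES * TOKENS_PER_SAMPLE
activations_per_layer = tokens * TOP_K             # each token picks TOP_K experts
mean_per_expert = activations_per_layer / NUM_EXPERTS

print(tokens)                 # 131072
print(activations_per_layer)  # 1048576
print(mean_per_expert)        # 4096.0


def never_routed(hit_counts):
    """Indices of experts that received no tokens during calibration."""
    return [i for i, count in enumerate(hit_counts) if count == 0]


# Hypothetical skewed routing counts for a toy 8-expert layer.
toy_counts = [900, 60, 30, 8, 2, 0, 0, 0]
print(sum(toy_counts) / len(toy_counts))  # 125.0 hits per expert on average
print(never_routed(toy_counts))           # [5, 6, 7]
```

In the toy layer the mean is 125 hits per expert, yet three experts were never touched. Those are the experts a weight-based amax fallback has to cover.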

NVIDIA's own ModelOpt examples use 64-128 samples. If you wanted to push quality, 128 would help expert coverage marginally, but the amax-populate fix handles the real gap.

Edit: corrected top-K value from 2 → 8 (actual MiniMax-M2.7 routing). Expert-activation math updated accordingly.

maxnbk

16 days ago

Could you also share a matching requirements.txt or pyproject deps block that satisfies the requirements for the quantization script?

I'm a huge fan of this, but I have struggled with some of my quant efforts because it has been tricky to get a reqs file with modern versions of vllm, llmcompressor, and compressed-tensors pip packages in the same venv, and these scripts are all very API sensitive.

Thank you for this, it's already a great reference for me!
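One way such a pin set could be captured, assuming it is run inside the environment that produced the quant, is to read the installed versions with the standard library. This is only a sketch: the distribution names in `PACKAGES` are my guesses at the relevant packages, not a list taken from the script, and no versions are implied here.

```python
from importlib import metadata

# Assumed distribution names; adjust to whatever the script actually imports.
PACKAGES = [
    "nvidia-modelopt",
    "torch",
    "transformers",
    "vllm",
    "llmcompressor",
    "compressed-tensors",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def as_requirements(versions):
    """Render the installed packages as requirements.txt-style pins."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]


print("\n".join(as_requirements(installed_versions(PACKAGES))))
```

Packages that are not installed are skipped, so the output reflects only what is present in that venv.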
