mxfp4_16 Quantization of MiniMaxAI/MiniMax-M2.7
Runtime: Requires tcclaviger/vllm22:latest, an RDNA 4 (gfx12xx) vLLM image with mxfp4_16 kernel support. No other vLLM build currently loads these weights.
1. Introduction
This is an MXFP4-16 (Mixed-precision 4-bit with 16-element group size) quantized variant of MiniMaxAI/MiniMax-M2.7, produced using compressed-tensors with an IQ4_NL codebook.
The quantization:
- 4-bit weights with 16-element group size, IQ4_NL codebook
- All Linear layers quantized (MoE experts, FFN, attention projections)
- Attention k/v_proj scales, router gate, norms, embeddings kept BF16
- KV cache: FP8 (e4m3), calibrated scales baked into checkpoint
The result fits in ~17.5 GiB per GPU (TP8) while retaining near-BF16 quality.
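To make the scheme above concrete, here is a minimal pure-Python sketch of group-wise 4-bit codebook quantization with 16-element groups. It is an illustration only, not the compressed-tensors code path used to produce this checkpoint. The level table is the non-uniform IQ4_NL table from llama.cpp, and the scale rule (largest magnitude mapped to 127) is a simplifying assumption; the real scale format and search may differ.

```python
# Non-uniform 4-bit levels of llama.cpp's IQ4_NL type (assumed to match the
# codebook named above; the checkpoint's exact table may differ).
IQ4_NL_LEVELS = [-127, -104, -83, -65, -49, -35, -22, -10,
                 1, 13, 25, 38, 53, 69, 89, 113]
GROUP_SIZE = 16
ZERO_CODE = 8  # index of the level closest to zero


def quantize_group(weights):
    """Quantize one group of GROUP_SIZE floats to (scale, 4-bit codes)."""
    assert len(weights) == GROUP_SIZE
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        # An all-zero group: a zero scale restores exact zeros.
        return 0.0, [ZERO_CODE] * GROUP_SIZE
    # Simplified scale choice (assumption): largest magnitude maps to 127.
    scale = max_abs / 127.0
    codes = [
        min(range(16), key=lambda i: abs(w / scale - IQ4_NL_LEVELS[i]))
        for w in weights
    ]
    return scale, codes


def dequantize_group(scale, codes):
    """Rebuild approximate float weights from a scale and 4-bit codes."""
    return [scale * IQ4_NL_LEVELS[c] for c in codes]


if __name__ == "__main__":
    group = [i / 16 - 0.5 for i in range(GROUP_SIZE)]
    scale, codes = quantize_group(group)
    restored = dequantize_group(scale, codes)
    print(max(abs(a - b) for a, b in zip(group, restored)))
```

Each group stores sixteen 4-bit codes plus one scale, which is why the group size directly controls the overhead on top of the nominal 4 bits per weight.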
2. Model Architecture
- 229B total params (BF16), ~12B activated per token (top-8)
- 256 experts per MoE layer, top-8 routing, 62 transformer layers
- 200k context window
- Native tool-calling support
3. Runtime Requirements
- GPU: 8× RX 9700 (RDNA 4 / gfx12xx)
- Memory: 128GB+ system RAM
- Docker: tcclaviger/vllm22:latest (the only validated runtime)
The Docker image includes:
- Custom Triton attention kernels tuned for RDNA4
- Fixed FP8 KV-cache quantization path
- Pre-tuned GEMM configs for RX 9700
- MXFP4-16 kernels for gfx12xx
4. Deployment
Full deployment guide (RDNA4 / RX 9700): docs/vllm_deploy_guide.md
Quick-start:
docker run --name minimax-mxfp416 \
--rm --tty --ipc=host --shm-size=128g \
--device /dev/kfd:/dev/kfd \
--device /dev/dri/renderD128:/dev/dri/renderD128 \
--device /dev/dri/renderD129:/dev/dri/renderD129 \
--device /dev/dri/renderD130:/dev/dri/renderD130 \
--device /dev/dri/renderD132:/dev/dri/renderD132 \
--device /dev/dri/renderD137:/dev/dri/renderD137 \
--device /dev/dri/renderD138:/dev/dri/renderD138 \
--device /dev/dri/renderD139:/dev/dri/renderD139 \
--device /dev/dri/renderD140:/dev/dri/renderD140 \
-e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e TRUST_REMOTE_CODE=1 \
-v /path/to/models:/app/models:ro \
-p 8000:8000 \
tcclaviger/vllm22:latest \
bash -c "cp /app/models/vllm22_minimax_m2.py /app/vllm/vllm/model_executor/models/minimax_m2.py && \
pip install -q sentencepiece && \
exec vllm serve /app/models/MiniMax-M2.7-MXFP416 \
--served-model-name minimax-m2.7-mxfp416 \
--host 0.0.0.0 --port 8000 --trust-remote-code \
--tensor-parallel-size 8 --enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice --tool-call-parser minimax_m2 \
--enable-prefix-caching --gpu-memory-utilization 0.93 \
--max-model-len 180000 --max-num-seqs 48 --max-num-batched-tokens 2048 \
--kv-cache-dtype fp8_e4m3 --attention-backend TRITON_ATTN \
--override-generation-config '{\"max_tokens\": 16384}'"
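Loading a checkpoint of this size across eight GPUs takes a while, so it can help to poll the server before sending traffic. The sketch below assumes the quick-start command above is running on localhost:8000 and uses the /v1/models route of vLLM's OpenAI-compatible server. The function names, timeout, and poll interval are hypothetical choices for illustration.

```python
import json
import time
import urllib.request


def served_model_ids(models_response):
    """Extract model ids from a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def wait_until_ready(base_url="http://localhost:8000/v1",
                     timeout_s=1800, poll_s=10):
    """Poll /v1/models until the server answers, then return the model ids.

    timeout_s and poll_s are illustrative defaults, not measured values.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as reply:
                return served_model_ids(json.load(reply))
        except OSError:
            # Connection refused, reset, or timed out: server not up yet.
            time.sleep(poll_s)
    raise TimeoutError(f"server at {base_url} did not become ready")
```

Calling wait_until_ready() should return a list containing minimax-m2.7-mxfp416 once the weights are loaded, since that is the name passed to --served-model-name.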
Performance (8× RX 9700, 210W power limit)
| Metric | Value |
|---|---|
| Generation throughput | ~30–35 tokens/s |
| Prefill throughput | up to 2,190 tokens/s (w/ prefix cache) |
| Prefix cache hit rate | ~93% |
| KV cache memory | 11.35 GiB |
| KV cache capacity | 767,856 tokens |
| Max context per request | 180,000 tokens |
| Max concurrent (180k) | 4 requests |
| Model weight memory (TP8) | ~17.5 GiB/GPU |
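The concurrency row follows from the two KV cache rows above it. A small worked calculation, using only the figures in the table (the table does not say whether the KV cache memory figure is per GPU or total, so the per-token number is only meaningful relative to that figure):

```python
# Figures copied from the performance table.
KV_CACHE_GIB = 11.35
KV_CACHE_TOKENS = 767_856
MAX_MODEL_LEN = 180_000

# How many requests fit if every one uses the full context window.
full_length_requests = KV_CACHE_TOKENS // MAX_MODEL_LEN

# Average KV cache footprint per cached token, in bytes.
bytes_per_token = KV_CACHE_GIB * 2**30 / KV_CACHE_TOKENS

print(full_length_requests)  # 4
print(round(bytes_per_token))
```

Shorter requests share the same pool, so --max-num-seqs 48 is reachable only when the average context is well below the 180,000-token maximum.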
Power tip: Set rocm-smi --setpowerlimit <i> 210 per GPU. At 210W, sustained throughput is higher than at the full 300W because of reduced thermal throttling.
5. API Usage
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="minimax-m2.7-mxfp416",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=1.0,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
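Because the quick-start enables --enable-auto-tool-choice with the minimax_m2 tool-call parser, the same endpoint should also accept OpenAI-style tool definitions. The sketch below uses only the standard library. The get_weather tool and the helper names are hypothetical, chosen purely for illustration.

```python
import json
import urllib.request

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question, model="minimax-m2.7-mxfp416"):
    """Build a chat completion body that offers the tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]


def post_chat(body, base_url="http://localhost:8000/v1"):
    """POST a chat completion body to the server and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running, extract_tool_calls(post_chat(build_tool_request("What is the weather in Paris?"))) returns the calls the model chose to make, or an empty list if it answered directly.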
6. Chat Template
The model uses a Jinja chat template supporting system messages, tool calls (<minimax:tool_call>/</minimax:tool_call>), reasoning content (<think>/</think>), and tool responses (<response>).
from transformers import AutoProcessor, AutoModelForCausalLM
processor = AutoProcessor.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416",
    device_map="auto", dtype="auto", trust_remote_code=True
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
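When decoding raw text like this, outside vLLM's reasoning parser, the reasoning span arrives inline with the answer. A minimal sketch for separating the two, assuming the <think>/</think> tags described above survive decoding and that the opening tag may already have been emitted as part of the prompt; split_reasoning is a hypothetical helper name:

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Handles output that starts with the opening tag as well as output where
    only the closing tag appears. Returns an empty reasoning string when no
    closing tag is present.
    """
    if close_tag not in text:
        return "", text.strip()
    reasoning, answer = text.split(close_tag, 1)
    reasoning = reasoning.strip()
    if reasoning.startswith(open_tag):
        reasoning = reasoning[len(open_tag):]
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    sample = "<think>\nThe user greeted me.\n</think>\nHello! I'm doing well."
    print(split_reasoning(sample))
```

When serving through vLLM with --reasoning-parser minimax_m2, this step should not be needed, since the server separates reasoning from the final content itself.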
7. Inference Parameters
- temperature: 1.0
- top_p: 0.95
- top_k: 40
- max_tokens: 16384 (default)
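The OpenAI client has no named top_k argument, so with vLLM that setting is usually passed through extra_body. The sketch below shows one way to apply the values above explicitly. The helper names are hypothetical, and the openai import is deferred so the parameter-splitting logic can be used without the package installed.

```python
# Recommended sampling settings from this section.
RECOMMENDED = {"temperature": 1.0, "top_p": 0.95, "top_k": 40, "max_tokens": 16384}

# Parameters the OpenAI client accepts as named arguments.
OPENAI_NATIVE = {"temperature", "top_p", "max_tokens"}


def to_client_kwargs(params):
    """Split params into native kwargs plus an extra_body dict for the rest."""
    kwargs = {k: v for k, v in params.items() if k in OPENAI_NATIVE}
    extra = {k: v for k, v in params.items() if k not in OPENAI_NATIVE}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


def ask(prompt, base_url="http://localhost:8000/v1"):
    """Send one user message with the recommended sampling settings."""
    from openai import OpenAI  # deferred import; requires `pip install openai`

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    completion = client.chat.completions.create(
        model="minimax-m2.7-mxfp416",
        messages=[{"role": "user", "content": prompt}],
        **to_client_kwargs(RECOMMENDED),
    )
    return completion.choices[0].message.content
```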
8. Acknowledgments
- Base model: MiniMaxAI/MiniMax-M2.7
- Quantization inspiration: tcclaviger/Step-3.7-Flash-240REAP-MXFP416
- Runtime: tcclaviger/vllm22
9. License
Apache 2.0 — inherits from base model.