Question About KV Cache Efficiency

#2 · opened by ineedquants

Since I didn't know my way around Hugging Face yet, I accidentally posted my comment under a different topic, so this is a follow-up to it. As I mentioned there, the KV cache is much larger than with the Qwen 3.5 models. After the suggestion to set the KV cache to Q8, I became curious: is the MiniMax architecture inherently less efficient at KV caching than Qwen 3.5, or is there an optimization that hasn't been implemented yet? (I don't mean to imply that anything was done incorrectly.)

You're entirely right. The difference you're noticing is architectural, not a missing implementation (a back-of-the-envelope size comparison follows the list):

  • MiniMax-M2.7 uses traditional GQA at a 6:1 ratio (48 query heads, 8 KV heads), giving n_embd_k_gqa = 1024
  • Qwen 3 MoE models use GQA at a 16:1 ratio, giving n_embd_k_gqa = 512, roughly half the KV cache per layer
  • DeepSeek V3 goes further with MLA (Multi-head Latent Attention), cutting the KV cache by roughly 5× versus standard GQA

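To make the per-layer numbers concrete, here is a quick sizing sketch. The head geometry comes from the list above; the layer counts (62 for MiniMax, 94 for the Qwen 3 MoE flagship) are my own assumptions, so treat the totals as illustrative rather than exact:

```python
# Back-of-the-envelope KV-cache sizing. Head geometry is from the list
# above; layer counts are assumptions for illustration only.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elt: float = 2.0) -> float:
    """Bytes of K + V stored per token across all layers (f16 by default)."""
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_elt

# MiniMax: 8 KV heads x 128 head dim -> n_embd_k_gqa = 1024 per layer
minimax = kv_bytes_per_token(n_layers=62, n_kv_heads=8, head_dim=128)
# Qwen 3 MoE: 4 KV heads x 128 head dim -> n_embd_k_gqa = 512 per layer
qwen3 = kv_bytes_per_token(n_layers=94, n_kv_heads=4, head_dim=128)

print(f"MiniMax  : {minimax / 1024:.0f} KiB/token")   # ~248 KiB/token
print(f"Qwen3 MoE: {qwen3 / 1024:.0f} KiB/token")     # ~188 KiB/token

# At the 196K native context, the f16 KV cache alone is substantial:
print(f"MiniMax @ 196K ctx: {minimax * 196_608 / 2**30:.1f} GiB")  # ~46.5 GiB
```

The per-layer halving (n_embd_k_gqa 1024 vs 512) is the architectural point; the totals differ less here only because the assumed layer counts differ.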
MiniMax chose to prioritize MoE expert expansion (256 experts, top-8 routing) and wide native context (196K positions), and stuck with standard GQA at a moderate compression ratio. That's a training-time decision. llama.cpp (or any inference framework) just implements what the architecture specifies.

The only real knob you have on the inference side is KV cache quantization: `--cache-type-k q8_0 --cache-type-v q8_0` halves the KV footprint with essentially no quality loss. For this model that's especially worthwhile at longer context.
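To put a number on "halves": llama.cpp's q8_0 stores 32 int8 values plus one f16 scale per block, about 1.06 bytes per element versus 2 bytes for f16. A rough sketch, reusing the same assumed geometry as above:

```python
# Rough KV-cache footprint with f16 vs q8_0 cache types.
# Assumes q8_0 = 34 bytes per 32-element block (~1.0625 bytes/element)
# and the same illustrative model geometry as the sketch above.

F16_BPE = 2.0
Q8_0_BPE = 34 / 32  # 32 int8 values + one f16 scale per block

def kv_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
           bytes_per_elt: float) -> float:
    """Total K + V cache size in GiB at a given context length."""
    return ctx * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elt / 2**30

for name, bpe in (("f16 ", F16_BPE), ("q8_0", Q8_0_BPE)):
    print(name, f"{kv_gib(196_608, 62, 8, 128, bpe):.1f} GiB")
# f16  46.5 GiB
# q8_0 24.7 GiB
```

On the command line that looks something like `llama-server -m model.gguf -c 196608 -fa --cache-type-k q8_0 --cache-type-v q8_0` (model path and context size are placeholders); as far as I know, llama.cpp needs flash attention (`-fa`) enabled to use a quantized V cache, so keep that flag.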

Ah, thank you very much; I'm still learning. I just noticed that the RAM allocation works differently than with other models: this architecture seems to allocate a fixed amount of memory, while Qwen 3.5 appears to use a somewhat hybrid approach.
Just for my understanding: how does IQ4_NL compare to Q4_K_M in real-world scenarios? I'm considering replacing Qwen 3.5 122B with a reasonably capable MiniMax setup, mainly for coding tasks.
