fix(modeling): add training-path MoE dispatch and KV cache API compat

by delock - opened Apr 13

base: refs/heads/main ← from: refs/pr/9

Files changed: +37 -6

delock

Apr 13

Fixes #8 (UnboundLocalError in DeepseekV3MoE.forward during training).

Three changes in `modeling_deepseek.py`:

1. Add training-path MoE dispatch (`DeepseekV3MoE.forward`)
   The original code only had an inference branch, leaving `y` undefined during training (causing `UnboundLocalError`). Added a proper training branch using sort-based dispatch:
   - Expand each token `top_k` times, sort by expert ID
   - Single GPU->CPU sync for all expert boundaries
   - Call each expert on its contiguous slice, unsort, apply routing weights
2. Remove the assert in `moe_infer`
   Commented out so eval steps inside a training loop do not crash.
3. KV cache API compatibility (`get_usable_length` -> `get_seq_length`)
   `past_key_value.get_usable_length()` and `past_key_values.seen_tokens` are deprecated in transformers >= 4.40. Replaced with `get_seq_length()`.
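
The sort-based dispatch in the first change can be sketched roughly as follows. This is a minimal illustration of the three listed steps, not the code from this PR: the function name `sorted_moe_dispatch`, its argument layout, and the use of plain modules as experts are assumptions, and the actual patch to `modeling_deepseek.py` may differ in detail.

```python
import torch


def sorted_moe_dispatch(x, topk_idx, topk_weight, experts):
    """Hypothetical sort-based MoE dispatch that keeps autograd intact.

    x:           (num_tokens, hidden) flattened hidden states
    topk_idx:    (num_tokens, top_k) expert IDs chosen by the router
    topk_weight: (num_tokens, top_k) routing weights
    experts:     sequence of modules mapping (n, hidden) -> (n, hidden)
    """
    num_tokens, top_k = topk_idx.shape
    flat_idx = topk_idx.reshape(-1)  # one entry per (token, slot) pair

    # Sort the expanded pairs by expert ID so each expert owns one
    # contiguous slice of the sorted batch.
    order = torch.argsort(flat_idx)
    token_of = torch.div(order, top_k, rounding_mode="floor")
    x_sorted = x[token_of]

    # Single host sync: per-expert counts become Python ints for slicing.
    counts = torch.bincount(flat_idx, minlength=len(experts)).tolist()

    outputs = []
    start = 0
    for expert, n in zip(experts, counts):
        if n:  # skip experts that received no tokens
            outputs.append(expert(x_sorted[start:start + n]))
        start += n
    y_sorted = torch.cat(outputs, dim=0)

    # Unsort back to (token, slot) order, then mix with routing weights.
    inverse = torch.argsort(order)
    y_flat = y_sorted[inverse]
    y = y_flat.view(num_tokens, top_k, -1) * topk_weight.unsqueeze(-1)
    return y.sum(dim=1)
```

Because nothing here runs under `torch.no_grad()` and the unsort is done by indexing rather than in-place writes, gradients reach both the expert parameters and the routing weights. The only device-to-host transfer is the `.tolist()` call on the per-expert counts.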

fix(modeling): add training-path MoE dispatch and KV cache API compat (1cacceb6)

delock

Apr 13

This PR fixes the issue reported in #8.

Root cause: `DeepseekV3MoE.forward` only had an inference branch, so `y` was never assigned during training, causing `UnboundLocalError` when the shared-expert accumulation was reached.
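
The failure mode can be reproduced with a toy class that mirrors this control flow. This is only an illustration: `ToyMoE` is hypothetical, the arithmetic stands in for the real expert computation, and the `not self.training` guard is an assumption about how the inference branch was written.

```python
class ToyMoE:
    """Toy stand-in for the control flow described above (not the real module)."""

    def __init__(self, training):
        self.training = training

    def forward(self, x):
        if not self.training:  # assumed guard for the inference-only branch
            y = x * 2  # the only place y is assigned
        # The shared-expert accumulation reads y on every path, so in
        # training mode this line raises UnboundLocalError.
        y = y + x
        return y
```

Calling `ToyMoE(training=False).forward(1)` returns a value, while `ToyMoE(training=True).forward(1)` raises `UnboundLocalError`, which is the same class of error reported in #8.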

Changes:

1. Training-path MoE dispatch: Added a proper training branch using sort-based dispatch. Tokens are expanded `top_k` times, sorted by expert ID so each expert receives a contiguous slice, processed with a single GPU→CPU sync (instead of one per expert), then unsorted and aggregated with routing weights. Supports gradient flow.
2. Remove the assert in `moe_infer`: Commented out so eval steps inside a training loop do not accidentally crash.
3. KV cache API compatibility: `past_key_value.get_usable_length()` and `past_key_values.seen_tokens` are deprecated in transformers >= 4.40. Replaced with `get_seq_length()`.
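
For code that has to run against both older and newer cache objects, one alternative to a straight replacement is a small shim. The sketch below is not what this PR does (the PR replaces the calls outright): `cached_length` is a hypothetical helper, and the fallback assumes the older `get_usable_length(new_seq_length, layer_idx)` signature.

```python
def cached_length(past_key_values, layer_idx=0):
    """Return the number of cached positions, preferring the newer accessor."""
    if past_key_values is None:
        return 0
    if hasattr(past_key_values, "get_seq_length"):
        return past_key_values.get_seq_length(layer_idx)
    # Fallback for cache objects that only expose the older method
    # (assumed signature: get_usable_length(new_seq_length, layer_idx)).
    return past_key_values.get_usable_length(0, layer_idx)
```

The two accessors are not strictly equivalent for caches with a maximum length, so a shim like this is only a stopgap while pinning or upgrading transformers.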

Validated with 100-step fine-tuning of Moonlight-16B-A3B using DeepSpeed ZeRO-2 + AutoEP + Muon optimizer; loss decreases correctly throughout.

fix: remove commented-out assert (delete instead of comment out) (c80aed29)

fix: rebuild patch from original with zero formatting changes (e06ad388)


Ready to merge

This branch is ready to get merged automatically.
