Instructions to use stepfun-ai/Step-3.7-Flash-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use stepfun-ai/Step-3.7-Flash-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="stepfun-ai/Step-3.7-Flash-NVFP4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("stepfun-ai/Step-3.7-Flash-NVFP4", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use stepfun-ai/Step-3.7-Flash-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "stepfun-ai/Step-3.7-Flash-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.7-Flash-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
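
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_payload` and `send_chat_request` are hypothetical and are not part of vLLM.

```python
import json
import urllib.request

MODEL = "stepfun-ai/Step-3.7-Flash-NVFP4"


def build_payload(prompt, image_url, model=MODEL):
    """Build an OpenAI-style chat body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `send_chat_request(build_payload("Describe this image in one sentence.", image_url))` returns the parsed response. In OpenAI-compatible APIs the reply text is normally found under `choices[0]["message"]["content"]`.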

Use Docker

docker model run hf.co/stepfun-ai/Step-3.7-Flash-NVFP4

SGLang

How to use stepfun-ai/Step-3.7-Flash-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "stepfun-ai/Step-3.7-Flash-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.7-Flash-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "stepfun-ai/Step-3.7-Flash-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.7-Flash-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use stepfun-ai/Step-3.7-Flash-NVFP4 with Docker Model Runner:
```
docker model run hf.co/stepfun-ai/Step-3.7-Flash-NVFP4
```

RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

by dhlee - opened 14 days ago

Discussion

dhlee, 14 days ago (edited 14 days ago)

MTP drafter weight shape mismatch when using speculative decoding with `--load-format safetensors`

Environment

Hardware: 2× DGX Spark (clustered, TP=2)
Docker image: eugr/spark-vllm-docker (main branch, as of 2026-06-05)
Model: stepfun-ai/Step-3.7-Flash-NVFP4
vLLM version: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (cu132)
Launch flags: --load-format safetensors --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Error

The worker crashes during MTP drafter weight loading with the following error:

RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

Traceback summary:

File ".../vllm/model_executor/models/step3p5_mtp.py", line 273, in load_weights
    weight_loader(param, loaded_weight)
File ".../vllm/model_executor/layers/vocab_parallel_embedding.py", line 474, in weight_loader
    param[: loaded_weight.shape[0]].data.copy_(loaded_weight)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

The main model (14 safetensors shards) loads successfully. The crash happens only when the MTP drafter subsequently attempts to load its own weights.

Observation

vLLM's step3p5_mtp.py appears to assume a vocab embedding shape with dimension 1 = 2048, but the actual weight in this NVFP4 checkpoint has dimension 1 = 4096. This suggests the drafter architecture assumed by vLLM may not match the weight layout of this specific quantized checkpoint.

huangyu-nv (StepFun org), 11 days ago

A few things to check. The mismatch most likely comes from the runtime loader, the vLLM build, a stale checkpoint revision, or the --load-format safetensors override, not from the checkpoint itself:

Refresh the local checkpoint to the latest HF revision, especially config.json, model.safetensors.index.json, and model-mtp-bf16.safetensors.
Verify the text_config per-layer lists are length 48:

python3 -c "import json; c=json.load(open('config.json'))['text_config']; print({k:len(c[k]) for k in ['layer_types','partial_rotary_factors','swiglu_limits','swiglu_limits_shared','rope_theta']})"

Try without forcing --load-format safetensors, so vLLM uses its default loader path.
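
The length check in the one-liner above can also be written as a small script that reports only the offending keys. This is a sketch under the same assumptions as the one-liner: config.json is in the current directory and has a `text_config` section. The function names are hypothetical.

```python
import json

# Per-layer keys taken from the one-liner above.
PER_LAYER_KEYS = [
    "layer_types",
    "partial_rotary_factors",
    "swiglu_limits",
    "swiglu_limits_shared",
    "rope_theta",
]


def per_layer_length_mismatches(text_config, expected=48, keys=PER_LAYER_KEYS):
    """Return {key: actual_length} for each key whose list length differs from expected.

    A key that is missing or is not a list is reported with length None.
    """
    mismatches = {}
    for key in keys:
        value = text_config.get(key)
        length = len(value) if isinstance(value, list) else None
        if length != expected:
            mismatches[key] = length
    return mismatches


def check_config_file(path="config.json", expected=48):
    """Load config.json and check the per-layer lists under text_config."""
    with open(path) as f:
        text_config = json.load(f)["text_config"]
    return per_layer_length_mismatches(text_config, expected)
```

An empty dictionary from `check_config_file()` means every listed key has the expected 48 entries. Any other result names the keys to look at after refreshing the checkpoint.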

dhlee, 11 days ago (edited 11 days ago)

The result is the same.
Many other people in the DGX Spark community are running into the same issue:

https://forums.developer.nvidia.com/t/step-3-7-flash-is-supported-in-community-docker-on-dgx-spark/371652/49

(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [gpu_model_runner.py:5116] Loading drafter model...
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0_EP0 pid=174) WARNING 06-08 03:30:36 [modelopt.py:1022] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [__init__.py:962] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0_EP0 pid=174) WARNING 06-08 03:30:37 [vllm.py:2203] `torch.compile` is turned on, but the model 0xSero/Step-3.7-Flash-173B does not support it. Please open an issue on GitHub if you want it to be supported.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:37 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 107.20 GiB. Available RAM: 49.01 GiB.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:37 [weight_utils.py:952] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (107.20 GiB) exceeds 90% of available RAM (49.01 GiB).
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:04<00:52,  4.06s/it]
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] WorkerProc failed to start.
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] Traceback (most recent call last):
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 855, in worker_main
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 634, in __init__
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     self.worker.load_model()
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 349, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5118, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     self.drafter.load_model(self.model)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 1199, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     self.model = self._get_model()
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]                  ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 1184, in _get_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     model = get_model(
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]             ^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 143, in get_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     return loader.load_model(
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]            ^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     self.load_weights(model, model_config)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 394, in load_weights
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/step3p5_mtp.py", line 273, in load_weights
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     weight_loader(param, loaded_weight)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 474, in weight_loader
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]     param[: loaded_weight.shape[0]].data.copy_(loaded_weight)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
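
Worker logs like the one above are easier to read once the per-line prefix is stripped, which leaves an ordinary Python traceback. The following is a minimal sketch. The regular expression is written against the prefix format seen in this log and may not match other vLLM versions, and `strip_error_lines` is a hypothetical helper, not a vLLM utility.

```python
import re

# Matches prefixes such as:
#   (Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888]
LOG_PREFIX = re.compile(
    r"^\((?P<worker>[^)\s]+) pid=(?P<pid>\d+)\) "
    r"(?P<level>[A-Z]+) \d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\[[^\]]+\] ?"
)


def strip_error_lines(log_text, level="ERROR"):
    """Return the messages of all lines at the given level, with the prefix removed.

    Lines without a recognizable prefix (for example progress bars) are skipped.
    """
    messages = []
    for line in log_text.splitlines():
        match = LOG_PREFIX.match(line)
        if match and match.group("level") == level:
            messages.append(line[match.end():])
    return messages
```

Printing `"\n".join(strip_error_lines(log_text))` on the log above gives the traceback from `worker_main` down to the final `RuntimeError` line, with the original indentation of the frames preserved.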
