RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
MTP drafter weight shape mismatch when using speculative decoding with --load-format safetensors
Environment
- Hardware: 2× DGX Spark (clustered, TP=2)
- Docker image: eugr/spark-vllm-docker (main branch, as of 2026-06-05)
- Model: stepfun-ai/Step-3.7-Flash-NVFP4
- vLLM: 0.22.1rc1.dev124+gace95c9cf.d20260603.cu132 (cu132)
- Launch flags: --load-format safetensors --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
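For anyone trying to reproduce or bisect this, here is a minimal sketch that assembles the launch arguments from the flags listed above. It assumes the server is started with `vllm serve` and `--tensor-parallel-size 2`; the actual entrypoint inside the eugr/spark-vllm-docker image may differ. `build_serve_args` is a hypothetical helper name, not part of vLLM.

```python
import json
import shlex

MODEL = "stepfun-ai/Step-3.7-Flash-NVFP4"


def build_serve_args(model=MODEL, tensor_parallel_size=2, load_format=None, mtp_tokens=None):
    """Return an argv list for `vllm serve` (hypothetical helper, not a vLLM API)."""
    args = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if load_format is not None:
        args += ["--load-format", load_format]
    if mtp_tokens is not None:
        spec = {"method": "mtp", "num_speculative_tokens": mtp_tokens}
        # Compact separators reproduce the JSON string used in the report.
        args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    return args


if __name__ == "__main__":
    # The reported configuration, followed by two variants for bisecting.
    variants = [
        ("reported", {"load_format": "safetensors", "mtp_tokens": 1}),
        ("default loader", {"mtp_tokens": 1}),
        ("no drafter", {"load_format": "safetensors"}),
    ]
    for label, kwargs in variants:
        print(f"# {label}")
        print(shlex.join(build_serve_args(**kwargs)))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids shell-quoting mistakes in the `--speculative-config` value, and the "no drafter" variant makes it easy to confirm that the main model loads on its own.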
Error
The worker crashes during MTP drafter weight loading with the following error:
RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
Traceback summary:
File ".../vllm/model_executor/models/step3p5_mtp.py", line 273, in load_weights
weight_loader(param, loaded_weight)
File ".../vllm/model_executor/layers/vocab_parallel_embedding.py", line 474, in weight_loader
param[: loaded_weight.shape[0]].data.copy_(loaded_weight)
RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
The main model (14 safetensors shards) loads successfully. The crash happens only when the MTP drafter subsequently attempts to load its own weights.
Observation
vLLM's step3p5_mtp.py appears to allocate the drafter's vocab embedding with size 2048 along dimension 1, while the corresponding weight in this NVFP4 checkpoint has size 4096 along that dimension. This suggests the drafter architecture assumed by vLLM may not match the weight layout of this specific quantized checkpoint.
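One way to check the checkpoint side of this comparison without loading any weights is to read the safetensors header directly. The sketch below relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the `"embed"` name filter and the function names are assumptions for illustration, since the exact tensor name the drafter loads is not shown in the traceback.

```python
import json
import struct
import sys


def read_safetensors_shapes(path):
    """Return {tensor_name: shape} from a .safetensors header without reading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length,
        # followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional entry that does not describe a tensor.
    return {
        name: tuple(info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


def find_shapes(path, needle="embed"):
    """Keep only tensors whose name contains `needle` (assumed filter, adjust as needed)."""
    return {n: s for n, s in read_safetensors_shapes(path).items() if needle in n}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_shapes.py <shard.safetensors> [name-substring]
    needle = sys.argv[2] if len(sys.argv) > 2 else "embed"
    for name, shape in sorted(find_shapes(sys.argv[1], needle).items()):
        print(name, shape)
```

The shapes printed are the full, unsharded shapes stored on disk, so they can be compared directly against the 4096 reported in the error message.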
A few things to check. The mismatch most likely comes from the runtime loader, the vLLM build, a stale checkpoint revision, or the --load-format safetensors override, not from the checkpoint itself:
- Refresh the local checkpoint to the latest HF revision, especially config.json, model.safetensors.index.json, and model-mtp-bf16.safetensors.
- Verify that the text_config per-layer lists all have length 48:
python3 -c "import json; c=json.load(open('config.json'))['text_config']; print({k:len(c[k]) for k in ['layer_types','partial_rotary_factors','swiglu_limits','swiglu_limits_shared','rope_theta']})"
- Try without forcing --load-format safetensors, so that vLLM uses its default loader path.
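The per-layer length check can also be written as a small script that reports problems instead of raising. This is a sketch under the same assumptions as the one-liner (the five key names and the expected length of 48 come from the suggestion above); `check_per_layer_lists` and `check_config_file` are hypothetical helper names.

```python
import json
import os

PER_LAYER_KEYS = [
    "layer_types",
    "partial_rotary_factors",
    "swiglu_limits",
    "swiglu_limits_shared",
    "rope_theta",
]


def check_per_layer_lists(text_config, expected_len=48, keys=PER_LAYER_KEYS):
    """Return {key: problem} for each key that is missing, not a list, or the wrong length."""
    problems = {}
    for key in keys:
        value = text_config.get(key)
        if value is None:
            problems[key] = "missing"
        elif not isinstance(value, list):
            problems[key] = f"not a list ({type(value).__name__})"
        elif len(value) != expected_len:
            problems[key] = f"length {len(value)}, expected {expected_len}"
    return problems


def check_config_file(path="config.json", expected_len=48):
    """Load config.json and check the per-layer lists under its text_config section."""
    with open(path, encoding="utf-8") as f:
        return check_per_layer_lists(json.load(f)["text_config"], expected_len)


if __name__ == "__main__" and os.path.exists("config.json"):
    problems = check_config_file()
    print("OK" if not problems else problems)
```

Unlike the one-liner, this version does not stop with a KeyError or TypeError when a key is absent or holds a scalar (as an older config revision might), which makes a stale config.json easier to spot.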
The result is the same.
Many other people in the DGX Spark community are running into this same issue.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [gpu_model_runner.py:5116] Loading drafter model...
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0_EP0 pid=174) WARNING 06-08 03:30:36 [modelopt.py:1022] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:36 [__init__.py:962] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(Worker_TP0_EP0 pid=174) WARNING 06-08 03:30:37 [vllm.py:2203] `torch.compile` is turned on, but the model 0xSero/Step-3.7-Flash-173B does not support it. Please open an issue on GitHub if you want it to be supported.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:37 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 107.20 GiB. Available RAM: 49.01 GiB.
(Worker_TP0_EP0 pid=174) INFO 06-08 03:30:37 [weight_utils.py:952] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (107.20 GiB) exceeds 90% of available RAM (49.01 GiB).
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:04<00:52, 4.06s/it]
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] WorkerProc failed to start.
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] Traceback (most recent call last):
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 855, in worker_main
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] worker = WorkerProc(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 634, in __init__
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.worker.load_model()
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 349, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5118, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.drafter.load_model(self.model)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 1199, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.model = self._get_model()
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 1184, in _get_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] model = get_model(
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 143, in get_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return loader.load_model(
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] self.load_weights(model, model_config)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 394, in load_weights
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/step3p5_mtp.py", line 273, in load_weights
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] weight_loader(param, loaded_weight)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 474, in weight_loader
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] param[: loaded_weight.shape[0]].data.copy_(loaded_weight)
(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1
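Because every traceback line carries the worker and logger prefix, the failing frame is hard to pick out by eye. The following sketch strips that prefix and collects the stack frames and the final exception. The regular expressions are written against the log format shown above and are an assumption; other vLLM builds may format lines differently. `summarize_traceback` is a hypothetical helper name.

```python
import re
import sys

# "(Worker_TP0_EP0 pid=174) ERROR 06-08 03:30:44 [multiproc_executor.py:888] <message>"
PREFIX = re.compile(
    r"^\((?P<worker>[^)]*)\)\s+(?P<level>[A-Z]+)\s+"
    r"\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+\[[^\]]+\]\s?(?P<msg>.*)$"
)
FRAME = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')
EXC = re.compile(r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<text>.*)$")


def summarize_traceback(lines):
    """Collect (file, line, func) frames and the last exception from prefixed ERROR lines."""
    frames, exception = [], None
    for raw in lines:
        m = PREFIX.match(raw.strip())
        if not m or m["level"] != "ERROR":
            continue
        msg = m["msg"].strip()
        frame = FRAME.search(msg)
        if frame:
            frames.append((frame["file"], int(frame["line"]), frame["func"]))
            continue
        exc = EXC.match(msg)
        if exc:
            exception = (exc["type"], exc["text"])
    return {"frames": frames, "exception": exception}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_log.py <vllm-worker.log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        summary = summarize_traceback(f)
    for file, line, func in summary["frames"]:
        print(f"{file}:{line} in {func}")
    print(summary["exception"])
```

Run against the log above, the last frame listed is the `weight_loader` call in vocab_parallel_embedding.py, reached from `load_weights` in step3p5_mtp.py, which matches the traceback summary earlier in the report.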