Instructions for using Intel/DeepSeek-V4-Flash-W4A16-AutoRound with libraries and local apps.

Libraries

How to use Intel/DeepSeek-V4-Flash-W4A16-AutoRound with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Intel/DeepSeek-V4-Flash-W4A16-AutoRound")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Intel/DeepSeek-V4-Flash-W4A16-AutoRound")
model = AutoModelForCausalLM.from_pretrained("Intel/DeepSeek-V4-Flash-W4A16-AutoRound")

Local Apps

vLLM

How to use Intel/DeepSeek-V4-Flash-W4A16-AutoRound with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Intel/DeepSeek-V4-Flash-W4A16-AutoRound"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Intel/DeepSeek-V4-Flash-W4A16-AutoRound",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
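
The curl call above can also be sketched in Python using only the standard library, which may be handy on machines without curl. The helper names `build_completion_request` and `send` are hypothetical; the URL, model name, and sampling parameters are copied from the curl example, and the server is assumed to be listening on localhost:8000.

```python
import json
import urllib.request


def build_completion_request(
    base_url="http://localhost:8000",
    model="Intel/DeepSeek-V4-Flash-W4A16-AutoRound",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the parsed JSON body (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, once the server is up:
#   result = send(build_completion_request())
#   print(result)
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made. Passing `base_url="http://localhost:30000"` would target a server on a different port.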

Use Docker

docker model run hf.co/Intel/DeepSeek-V4-Flash-W4A16-AutoRound

SGLang

How to use Intel/DeepSeek-V4-Flash-W4A16-AutoRound with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Intel/DeepSeek-V4-Flash-W4A16-AutoRound" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Intel/DeepSeek-V4-Flash-W4A16-AutoRound",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Intel/DeepSeek-V4-Flash-W4A16-AutoRound" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Intel/DeepSeek-V4-Flash-W4A16-AutoRound",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Intel/DeepSeek-V4-Flash-W4A16-AutoRound with Docker Model Runner:
```
docker model run hf.co/Intel/DeepSeek-V4-Flash-W4A16-AutoRound
```

RuntimeError: Shape mismatch: a.size(1) = 8192, size_k = 4096

by LIFengJu - opened 12 days ago

Discussion

LIFengJu

12 days ago

My environment:
GPU: 8x H200
Python: 3.10
CUDA: 13.3
OS: Linux x86_64
PyTorch: 2.12.0 (torch2.12.0+cu130)
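
The file paths in the traceback below point at a `python3.11` directory inside the conda environment, while the list above says Python 3.10, so it may help to record versions from the interpreter that actually runs the job. A minimal sketch, assuming the package names `torch` and `gptqmodel` (both taken from the traceback paths); `environment_report` is a hypothetical helper, not part of any library.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("torch", "gptqmodel")):
    """Collect interpreter, OS, and package versions from the active environment."""
    report = {
        "python": platform.python_version(),
        "executable": sys.executable,
        "os": f"{platform.system()} {platform.machine()}",
    }
    for name in packages:
        try:
            # Reads the installed distribution's metadata without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running this with the same interpreter that torchrun launches should show which Python and package versions are really in use.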

The error occurs during 8-GPU distributed inference launched with torchrun.

Error description
In interactive mode, entering any prompt (e.g., "hello") makes the program raise an error and exit. Every rank reports the same RuntimeError: Shape mismatch; the traceback from rank 0 is shown below.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/generate.py", line 159, in <module>
[rank0]:     main(args.ckpt_path, args.config, args.input_file, args.interactive, args.max_new_tokens, args.temperature)
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/generate.py", line 130, in main
[rank0]:     completion_tokens = generate(model, [prompt_tokens], max_new_tokens, tokenizer.eos_token_id, temperature)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/generate.py", line 53, in generate
[rank0]:     logits = model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/model.py", line 972, in forward
[rank0]:     h = layer(h, start_pos, input_ids)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/model.py", line 842, in forward
[rank0]:     x = self.attn(x, start_pos)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/model.py", line 646, in forward
[rank0]:     o = self.wo_a(o.flatten(2)).view(bsz, seqlen, self.n_local_groups, self.o_lora_rank)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/model.py", line 260, in forward
[rank0]:     return Linear.forward(self, x)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/all_shared/models/Intel/DeepSeek-V4-Pro-W4A16-AutoRound/inference/model.py", line 241, in forward
[rank0]:     y = self._woq(x.to(torch.bfloat16))
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/gptqmodel/nn_modules/qlinear/marlin.py", line 301, in forward
[rank0]:     out = apply_gptq_marlin_linear(
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/gptqmodel/utils/marlin.py", line 216, in apply_gptq_marlin_linear
[rank0]:     output = gptq_marlin_gemm(reshaped_x,
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/envs/dsv4/lib/python3.11/site-packages/gptqmodel/utils/marlin.py", line 299, in gptq_marlin_gemm
[rank0]:     return gptqmodel_marlin_kernels.gptq_marlin_gemm(a, c, b_q_weight, b_bias, b_scales,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Shape mismatch: a.size(1) = 8192, size_k = 4096
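
The last line says the activation handed to the Marlin kernel has 8192 columns while the quantized weight expects 4096 input features, a factor of exactly 2. The arithmetic can be restated in plain Python as a minimal sketch; `check_gemm_shapes` is a hypothetical helper, not part of gptqmodel, and the row count of 1 is a placeholder because the traceback does not report it. The check does not say which side is wrong. An integer ratio like this would be consistent with the activation and the weight having been partitioned differently, but nothing in the thread confirms that.

```python
def check_gemm_shapes(a_shape, size_k):
    """Compare the activation's K dimension (a_shape[1]) with the weight's size_k."""
    a_k = a_shape[1]
    if a_k == size_k:
        return "ok"
    larger, smaller = max(a_k, size_k), min(a_k, size_k)
    if larger % smaller == 0:
        ratio = f"{larger // smaller}x"
    else:
        ratio = f"{larger / smaller:.2f}x"
    return f"Shape mismatch: a.size(1) = {a_k}, size_k = {size_k} (ratio {ratio})"


# Shapes from the traceback; the row count (1) is an assumed placeholder.
print(check_gemm_shapes((1, 8192), 4096))
# prints "Shape mismatch: a.size(1) = 8192, size_k = 4096 (ratio 2x)"
```

Logging the shape of the tensor entering `wo_a` on each rank, together with that layer's expected input size, would show which of the two values differs from what the model configuration implies.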

LIFengJu

12 days ago

By the way, the model is "Intel/DeepSeek-V4-Pro-W4A16-AutoRound".
