persadian/DeepSeek-V4-Flash-GGUF
Shard-Based GGUF Distribution Repository
Repository Status
This repository contains the original shard-based GGUF distribution artifacts for DeepSeek-V4-Flash.
It is intended primarily for:
- shard-level reconstruction workflows
- archival distribution
- low-level artifact access
- alternative deployment experimentation
Canonical DFQS Deployment
The canonical DFQS reference implementation is provided separately as:
→ persadian/DeepSeek-V4-Flash-IQ1_S-XL
The IQ1_S-XL repository defines the reference single-file deployment architecture for constrained-memory inference environments.
Recommended Deployment Path
For standard DFQS deployment workflows, use:
→ persadian/DeepSeek-V4-Flash-IQ1_S-XL
DFQS Compatibility
The shard artifacts contained in this repository are compatible with the DFQS-IQ1_S-XL reference deployment workflow.
These artifacts may be used for:
- deterministic reconstruction procedures
- GGUF merge experimentation
- deployment validation workflows
- compatibility testing across inference runtimes
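One way to make a reconstruction or merge workflow checkable is to record a checksum for each shard before and after the operation. The sketch below only illustrates that idea and is not a procedure defined by this repository: `sha256_of`, `verify_shards`, and the `expected` mapping are hypothetical names, and no reference checksums are published here, so you would record the digests yourself.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shards(expected):
    """Map each shard path to True/False depending on whether its digest matches.

    `expected` is a hypothetical {path: hex_digest} mapping that you record
    yourself; missing files are reported as False.
    """
    results = {}
    for path, want in expected.items():
        file_path = Path(path)
        results[str(path)] = file_path.is_file() and sha256_of(file_path) == want
    return results
```

Reading in chunks keeps memory use flat, which matters for shards that are tens of gigabytes in size.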
Model Overview
| Property | Value |
|---|---|
| Model Name | DeepSeek-V4-Flash-GGUF |
| Base Model | DeepSeek-V4-Flash (DeepSeek AI) |
| Quantization | IQ1_S-XL |
| Total Parameters | 284 Billion |
| Active Parameters (per token) | 13 Billion |
| Total Size | 61.5 GB (2 shards) |
| Architecture | Mixture-of-Experts (MoE) |
| Number of Experts | 256 |
| Context Length | 1,048,576 tokens (1M) |
| Format | GGUF (llama.cpp compatible) |
Abstract
This work presents a quantized version of DeepSeek-V4-Flash, a 284-billion-parameter Mixture-of-Experts (MoE) language model developed by DeepSeek AI. Using IQ1_S-XL quantization, the model is compressed from its original 500 GB FP8 size to 57.3 GB, an ~89% reduction in storage requirements, while retaining inference capabilities suitable for research and development applications. The quantized model is distributed as a two-shard GGUF file set, enabling deployment with llama.cpp and other GGUF-compatible inference engines.
Intended Use
Primary Use Cases:
- Research on large language model efficiency
- Development of AI applications requiring long-context understanding (up to 1M tokens)
- Academic experimentation with MoE architectures
- Resource-constrained deployment environments
Limitations:
- Reduced fidelity compared to FP16/FP8 versions (typical for IQ1_S quantization)
- Requires substantial RAM (80 GB or more) for inference
- Optimized for text completion and chat; fine-tuning not directly supported
- Use of the custom V4 architecture requires a modified llama.cpp fork
Technical Specifications
Quantization Details
| Metric | Value |
|---|---|
| Original Format | FP8 (500 GB) |
| Quantized Format | IQ1_S-XL (GGUF) |
| Compression Ratio | ~8.7x |
| File Shards | 2 (50 GB + 11.6 GB) |
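The compression figures can be re-derived from the sizes quoted in this card. The snippet below is a small arithmetic sketch, assuming decimal gigabytes and the 284B total parameter count; `compression_stats` is a hypothetical helper. It evaluates both quantized sizes that appear in the card (57.3 GB in the abstract, 61.5 GB in the overview table).

```python
def compression_stats(original_gb, quantized_gb, total_params):
    """Return (ratio, percent_reduction, bits_per_weight) for a quantized model."""
    ratio = original_gb / quantized_gb
    percent_reduction = (1 - quantized_gb / original_gb) * 100
    # Decimal gigabytes -> bits, divided by the parameter count.
    bits_per_weight = quantized_gb * 1e9 * 8 / total_params
    return ratio, percent_reduction, bits_per_weight


# Figures quoted in this card: 500 GB FP8 source, 284B parameters.
for label, size_gb in [("abstract (57.3 GB)", 57.3), ("overview table (61.5 GB)", 61.5)]:
    ratio, reduction, bpw = compression_stats(500, size_gb, 284e9)
    print(f"{label}: {ratio:.1f}x, {reduction:.1f}% smaller, {bpw:.2f} bits/weight")
```

The ~8.7x ratio and ~89% reduction correspond to the 57.3 GB figure; the 61.5 GB figure gives roughly 8.1x. One possible explanation, which the card does not confirm, is that 57.3 is a GiB value, since 57.3 GiB is about 61.5 GB.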
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 80 GB | 128 GB |
| VRAM (GPU offload) | 22 GB | 24 GB (RTX 3090) |
| Storage | 60 GB | 150 GB |
| GPU Compute | CUDA 11.0+ | CUDA 12.0+ |
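As a quick planning aid, the table can be turned into a comparison against a machine's capacity. This is a minimal sketch: the numbers are copied from the table, while `shortfalls` and the example `machine` values are hypothetical and must be filled in by hand (the snippet does not probe the hardware).

```python
# Minimum and recommended values copied from the table above (all in GB).
REQUIREMENTS = {
    "ram": {"minimum": 80, "recommended": 128},
    "vram": {"minimum": 22, "recommended": 24},
    "storage": {"minimum": 60, "recommended": 150},
}


def shortfalls(available, level="minimum"):
    """Return {component: missing_gb} for every component below the chosen level."""
    missing = {}
    for component, tiers in REQUIREMENTS.items():
        gap = tiers[level] - available.get(component, 0)
        if gap > 0:
            missing[component] = gap
    return missing


# Hypothetical workstation: 96 GB RAM, one 24 GB GPU, 100 GB of free disk.
machine = {"ram": 96, "vram": 24, "storage": 100}
print(shortfalls(machine))                 # {}
print(shortfalls(machine, "recommended"))  # {'ram': 32, 'storage': 50}
```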
Usage Instructions
Local Server (llama.cpp)
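# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/arishma108/llama.cpp
cd llama.cpp
# Build with CUDA support (binaries are written to build/bin)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-server
# Start the server
./build/bin/llama-server \
  -hf persadian/DeepSeek-V4-Flash-GGUF \
  --jinja \
  --ctx-size 393216 \
  --n-gpu-layers 999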
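Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on llama-server's default address (`http://localhost:8080`); `build_chat_request` and `ask` are hypothetical helper names, and the long timeout is a guess that reflects the low throughput expected at this model size.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default address.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=600):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(ask("Explain attention mechanisms."))
```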
Python (llama-cpp-python)
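# Requires: pip install llama-cpp-python huggingface-hub
from llama_cpp import Llama

# Fetch the named shard from the Hub and load the model
llm = Llama.from_pretrained(
    repo_id="persadian/DeepSeek-V4-Flash-GGUF",
    filename="DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",
    n_ctx=8192,        # context window in tokens
    n_gpu_layers=35,   # number of layers offloaded to the GPU
    verbose=False,
)

# Run a single-turn chat completion and print the reply
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain attention mechanisms."}]
)
print(response["choices"][0]["message"]["content"])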
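At a few tokens per second, waiting for a complete response can take minutes, so streaming the reply is often more practical. The sketch below assumes the chunk layout that llama-cpp-python yields when `create_chat_completion` is called with `stream=True` (OpenAI-style chunks with a `delta` per choice); `stream_reply` is a hypothetical helper, not part of the library.

```python
def stream_reply(chunks, sink=print):
    """Emit each content delta as it arrives and return the full reply text."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # role-only and final chunks carry no content
            sink(text)
            parts.append(text)
    return "".join(parts)


# With a loaded model (`llm` as created above), usage would look like:
# reply = stream_reply(
#     llm.create_chat_completion(
#         messages=[{"role": "user", "content": "Explain attention mechanisms."}],
#         stream=True,
#     ),
#     sink=lambda text: print(text, end="", flush=True),
# )
```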
Docker
docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF
Performance Benchmarks
| Hardware Configuration | Tokens/Second | Notes |
|---|---|---|
| CPU only (128 GB RAM) | 0.2–0.5 | Not recommended |
| RTX 3090 (24 GB) + 80 GB RAM | 1–3 | ~35 GPU layers |
| 2× RTX 3090 + 128 GB RAM | 5–8 | NVLink beneficial |
| H100 80 GB + 256 GB RAM | 15–25 | Optimal configuration |

Benchmarks were conducted with an 8,192-token context.
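To translate these throughput ranges into wall-clock time, divide the number of tokens to generate by the tokens-per-second figure. The sketch below does only that; the configuration keys and `generation_time_seconds` are hypothetical names, and prompt-processing time is ignored, so real latencies will be higher.

```python
# (low, high) tokens/second ranges copied from the benchmark table.
THROUGHPUT = {
    "cpu_only": (0.2, 0.5),
    "rtx3090": (1, 3),
    "2x_rtx3090": (5, 8),
    "h100": (15, 25),
}


def generation_time_seconds(config, n_tokens):
    """Return (best_case, worst_case) seconds to generate n_tokens."""
    low, high = THROUGHPUT[config]
    return n_tokens / high, n_tokens / low


best, worst = generation_time_seconds("rtx3090", 300)
print(f"300 tokens on one RTX 3090: {best:.0f}-{worst:.0f} s")  # 100-300 s
```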
Links & Resources
| Resource | Link |
|---|---|
| Model Repository | DeepSeek-V4-Flash-GGUF |
| DOI | DOI Reference |
| GitHub | @arishma108 |
| Hugging Face Profile | @persadian |
| Base Model | DeepSeek-V4-Flash |
Evaluation Results
Coming soon. Initial qualitative testing shows:
- Coherent text generation at 1-3 tokens/second on RTX 3090
- Maintains long-context understanding up to 32K tokens in testing
- Perplexity within expected range for IQ1_S quantization
Formal evaluation metrics will be added as benchmarking completes.
- ✅ Model file is valid and correctly formatted
- ✅ llama.cpp build is working correctly
- ✅ Shard detection works (first shard 00001: True; second shard 00002: True)
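A shard-detection check of this kind can be approximated by generating the expected split-GGUF filenames and testing whether each one exists locally. This is a sketch under stated assumptions rather than the check the author ran: the filename pattern is inferred from the first shard's name used in the Python example, and `shard_names` and `detect_shards` are hypothetical helpers.

```python
from pathlib import Path


def shard_names(prefix, total):
    """Return the expected split-GGUF filenames: <prefix>-0000N-of-0000M.gguf."""
    return [f"{prefix}-{index:05d}-of-{total:05d}.gguf" for index in range(1, total + 1)]


def detect_shards(directory, prefix="DeepSeek-V4-Flash-IQ1_S-XL", total=2):
    """Map each expected shard filename to whether it exists in `directory`."""
    root = Path(directory)
    return {name: (root / name).is_file() for name in shard_names(prefix, total)}
```

Calling `detect_shards` on the download directory returns one True/False entry per shard, which makes a missing or misnamed file easy to spot before loading.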
Version: 1.0
Last Updated: 2026-05-17
Citation
If you use this model in your research, please cite:
@misc{persadian2026deepseek,
  author    = {Persadh, Darshani},
  title     = {DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model},
  year      = {2026},
  publisher = {Hugging Face},
  version   = {IQ1_S-XL},
  doi       = {10.57967/hf/8828},
  url       = {https://doi.org/10.57967/hf/8828}
}
APA: Persadh, D.R. (2026). DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model (IQ1_S-XL) [persadian/DeepSeek-V4-Flash-GGUF]. Hugging Face. https://doi.org/10.57967/hf/8828
Environmental Impact
This model's development and hosting have been carbon-offset through reforestation initiatives by:
Total CO2 offset: 50 kg · Offset Project Code: 9162366
This model is part of sustainable AI practices.
License & Acknowledgments
License: MIT
Acknowledgments:
- DeepSeek AI for developing the base model
- The llama.cpp community for GGUF format and V4 support
- teamblobfish for the original IQ1_S-XL quantization work
- Hugging Face for hosting infrastructure
Author: Darshani Persadh (@drpersadh)
Hugging Face Handle: @persadian
GitHub: arishma108
DOI: 10.57967/hf/8828
Publication Date: May 17, 2026