Instructions for using persadian/DeepSeek-V4-Flash-GGUF with libraries and local apps.

Libraries

How to use persadian/DeepSeek-V4-Flash-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="persadian/DeepSeek-V4-Flash-GGUF",
	filename="DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
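
The `filename` above names only the first of two shards. The following is a minimal sketch, assuming the shards follow the usual `-NNNNN-of-NNNNN.gguf` split naming, of deriving the full shard list from the first name. `shard_filenames` is a hypothetical helper, not part of llama-cpp-python.

```python
import re

# Matches the split-GGUF suffix, e.g. "-00001-of-00002.gguf".
_SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def shard_filenames(first_shard: str) -> list[str]:
    """Return every shard name implied by the first shard's filename."""
    match = _SHARD_RE.match(first_shard)
    if match is None:
        # Not a split file: treat the model as a single GGUF.
        return [first_shard]
    total = int(match["total"])
    return [
        f"{match['stem']}-{i:05d}-of-{total:05d}.gguf"
        for i in range(1, total + 1)
    ]


shards = shard_filenames("DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf")
print(shards)
```

Recent llama-cpp-python releases appear to accept the remaining names through an `additional_files` argument to `Llama.from_pretrained`; check the installed version before relying on it.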

Local Apps

llama.cpp

How to use persadian/DeepSeek-V4-Flash-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S
# Run inference directly in the terminal:
llama-cli -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S
# Run inference directly in the terminal:
llama-cli -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S
# Run inference directly in the terminal:
./llama-cli -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S
# Run inference directly in the terminal:
./build/bin/llama-cli -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Use Docker

docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

vLLM

How to use persadian/DeepSeek-V4-Flash-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "persadian/DeepSeek-V4-Flash-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "persadian/DeepSeek-V4-Flash-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
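
The same request can be issued from Python with only the standard library. This is a sketch under the assumption that the server from the previous step is listening on `localhost:8000`; `build_chat_request` and `send` are illustrative names, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: the address used by the curl call above
MODEL = "persadian/DeepSeek-V4-Flash-GGUF"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build the POST request that the curl command sends."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
# With the server running: print(send(request))
```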

Use Docker

docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Ollama
How to use persadian/DeepSeek-V4-Flash-GGUF with Ollama:
```
ollama run hf.co/persadian/DeepSeek-V4-Flash-GGUF:IQ1_S
```

Unsloth Studio

How to use persadian/DeepSeek-V4-Flash-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for persadian/DeepSeek-V4-Flash-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for persadian/DeepSeek-V4-Flash-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for persadian/DeepSeek-V4-Flash-GGUF to start chatting

Pi

How to use persadian/DeepSeek-V4-Flash-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "persadian/DeepSeek-V4-Flash-GGUF:IQ1_S"
        }
      ]
    }
  }
}
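
If `models.json` already lists other providers, pasting the block above over it would drop them. The following is a hedged sketch of merging the `llama-cpp` provider into an existing file instead. `add_provider` is a hypothetical helper, and the file layout is assumed to match the JSON shown above.

```python
import json
from pathlib import Path

# Provider entry copied from the JSON above.
LLAMA_CPP_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "persadian/DeepSeek-V4-Flash-GGUF:IQ1_S"}],
}


def add_provider(config_path: Path, name: str, provider: dict) -> dict:
    """Insert or replace one provider, keeping any others already present."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("providers", {})[name] = provider
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Usage (path taken from the comment above):
# add_provider(Path.home() / ".pi" / "agent" / "models.json", "llama-cpp", LLAMA_CPP_PROVIDER)
```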

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use persadian/DeepSeek-V4-Flash-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Run Hermes

hermes

Docker Model Runner
How to use persadian/DeepSeek-V4-Flash-GGUF with Docker Model Runner:
```
docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF:IQ1_S
```

Lemonade

How to use persadian/DeepSeek-V4-Flash-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull persadian/DeepSeek-V4-Flash-GGUF:IQ1_S

Run and chat with the model

lemonade run user.DeepSeek-V4-Flash-GGUF-IQ1_S

List all available models

lemonade list

persadian/DeepSeek-V4-Flash-GGUF

Shard-Based GGUF Distribution Repository

Repository Status

This repository contains the original shard-based GGUF distribution artifacts for DeepSeek-V4-Flash.

It is intended primarily for:

shard-level reconstruction workflows
archival distribution
low-level artifact access
alternative deployment experimentation

Canonical DFQS Deployment

The canonical DFQS reference implementation is provided separately as:

→ persadian/DeepSeek-V4-Flash-IQ1_S-XL

The IQ1_S-XL repository defines the reference single-file deployment architecture for constrained-memory inference environments.

Recommended Deployment Path

For standard DFQS deployment workflows, use:

→ persadian/DeepSeek-V4-Flash-IQ1_S-XL

DFQS Compatibility

The shard artifacts contained in this repository are compatible with the DFQS-IQ1_S-XL reference deployment workflow.

These artifacts may be used for:

deterministic reconstruction procedures
GGUF merge experimentation
deployment validation workflows
compatibility testing across inference runtimes

Model Overview

Property	Value
Model Name	DeepSeek-V4-Flash-GGUF
Base Model	DeepSeek-V4-Flash (DeepSeek AI)
Quantization	IQ1_S-XL
Total Parameters	284 Billion
Active Parameters (per token)	13 Billion
Total Size	61.5 GB (2 shards)
Architecture	Mixture-of-Experts (MoE)
Number of Experts	256
Context Length	1,048,576 tokens (1M)
Format	GGUF (llama.cpp compatible)

Abstract

This work presents a quantized version of DeepSeek-V4-Flash, a 284-billion-parameter Mixture-of-Experts (MoE) language model developed by DeepSeek AI. IQ1_S-XL quantization compresses the model from its original 500 GB FP8 size to 57.3 GB, a reduction of roughly 89% in storage, while maintaining inference capabilities suitable for research and development applications. The quantized model is distributed as two GGUF shards and can be deployed with llama.cpp and other GGUF-compatible inference engines.

Intended Use

Primary Use Cases:

Research on large language model efficiency
Development of AI applications requiring long-context understanding (up to 1M tokens)
Academic experimentation with MoE architectures
Resource-constrained deployment environments

Limitations:

Reduced fidelity compared to FP16/FP8 versions (typical for IQ1_S quantization)
Requires substantial RAM (80GB+) for inference
Optimized for text completion and chat; fine-tuning not directly supported
Use of the custom V4 architecture requires a modified llama.cpp fork

Technical Specifications

Quantization Details

Metric	Value
Original Format	FP8 (500GB)
Quantized Format	IQ1_S-XL (GGUF)
Compression Ratio	~8.7x
File Shards	2 (50GB + 11.6GB)
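
The size figures on this page can be cross-checked with a few lines of arithmetic. The sketch below assumes that 57.3 is a binary figure (GiB) and 61.5 a decimal one (GB). That is one reading that makes the numbers agree, not something the card states. It also takes the 500 GB original size at face value.

```python
GIB = 2**30  # bytes per gibibyte
GB = 10**9   # bytes per gigabyte

quantized_gib = 57.3   # size quoted in the abstract (assumed to be GiB)
original_size = 500.0  # FP8 size quoted above (unit taken as written)

# Convert the quantized size to decimal gigabytes.
quantized_gb = quantized_gib * GIB / GB

# Ratio and reduction computed the way the card appears to (500 / 57.3).
compression_ratio = original_size / quantized_gib
reduction_pct = (1 - quantized_gib / original_size) * 100

print(f"{quantized_gb:.1f} GB")     # 61.5 GB
print(f"{compression_ratio:.1f}x")  # 8.7x
print(f"{reduction_pct:.1f}%")      # 88.5%
```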

Hardware Requirements

Component	Minimum	Recommended
RAM	80 GB	128 GB
VRAM (GPU offload)	22 GB	24 GB (RTX 3090)
Storage	60 GB	150 GB
GPU Compute	CUDA 11.0+	CUDA 12.0+
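
A quick preflight check against the minimums in the table can save a long download. This is a rough sketch: `meets_minimum` is a hypothetical helper, the thresholds are copied from the table, and the RAM probe relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os
import shutil

MIN_RAM_GB = 80      # "Minimum" RAM from the table above
MIN_STORAGE_GB = 60  # "Minimum" storage from the table above


def meets_minimum(ram_gb: float, free_disk_gb: float) -> bool:
    """True when both RAM and free disk reach the table's minimums."""
    return ram_gb >= MIN_RAM_GB and free_disk_gb >= MIN_STORAGE_GB


def total_ram_gb() -> float:
    """Total physical memory in GB (Unix-like systems only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9


def free_disk_gb(path: str = ".") -> float:
    """Free space in GB on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


# Usage: meets_minimum(total_ram_gb(), free_disk_gb())
```

The two shards total about 61.5 GB, so in practice the free space should exceed that figure rather than sit exactly at the 60 GB minimum.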

Usage Instructions

Local Server (llama.cpp)

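# Clone the V4-aware fork
git clone -b feat/v4-port-cuda https://github.com/arishma108/llama.cpp
cd llama.cpp
# Configure and build with CUDA; binaries are written to build/bin
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-server

# Start the server
./build/bin/llama-server \
  -hf persadian/DeepSeek-V4-Flash-GGUF \
  --jinja \
  --ctx-size 393216 \
  --n-gpu-layers 999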
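
Loading a model of this size can take several minutes, so a client should wait until the server reports ready before sending requests. The sketch below polls llama-server's `/health` endpoint. It assumes the default address `http://localhost:8080`, and `server_is_healthy` and `wait_until_ready` are illustrative names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(base_url: str = "http://localhost:8080") -> bool:
    """Probe llama-server's /health endpoint once; True on HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503 while loading.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = server_is_healthy,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Passing the probe as a parameter keeps the polling loop testable without a running server.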

Python (llama-cpp-python)

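from llama_cpp import Llama

# Download the first shard from the Hub and load the model.
llm = Llama.from_pretrained(
    repo_id="persadian/DeepSeek-V4-Flash-GGUF",
    filename="DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",
    n_ctx=8192,        # context window in tokens (matches the benchmark setting)
    n_gpu_layers=35,   # layers offloaded to the GPU (~35 on a 24 GB card)
    verbose=False,
)

# Send a single-turn chat request and print the reply text.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain attention mechanisms."}]
)
print(response["choices"][0]["message"]["content"])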
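
At the 1–3 tokens per second reported for a single RTX 3090, waiting for a full response can take minutes, so streaming is usually preferable. This is a sketch, assuming `create_chat_completion(..., stream=True)` yields OpenAI-style chunks whose text sits under `choices[0]["delta"]["content"]`. `stream_reply` is an illustrative helper.

```python
from typing import Iterable, Iterator


def stream_reply(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming chunks."""
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # the first chunk may carry only the role, the last may be empty
            yield text


# With the `llm` object from the example above (not run here):
# chunks = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Explain attention mechanisms."}],
#     stream=True,
# )
# for piece in stream_reply(chunks):
#     print(piece, end="", flush=True)
```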

Docker

docker model run hf.co/persadian/DeepSeek-V4-Flash-GGUF

Performance Benchmarks

Hardware Configuration	Tokens/Second	Notes
CPU only (128GB RAM)	0.2–0.5	Not recommended
RTX 3090 (24GB) + 80GB RAM	1–3	~35 GPU layers
2× RTX 3090 + 128GB RAM	5–8	NVLink beneficial
H100 80GB + 256GB RAM	15–25	Optimal configuration
Benchmarks were conducted with an 8,192-token context.
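
To compare other hardware against these figures, throughput can be measured from a single completion. The sketch assumes the response carries an OpenAI-style `usage.completion_tokens` count, as llama-cpp-python chat completions do. `tokens_per_second` is a hypothetical helper, and its timing includes prompt processing, so it slightly understates pure generation speed.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], dict]) -> float:
    """Time one completion call and return generated tokens per second."""
    start = time.perf_counter()
    response = generate()
    elapsed = time.perf_counter() - start
    return response["usage"]["completion_tokens"] / elapsed


# With an `llm` object from llama-cpp-python (not run here):
# rate = tokens_per_second(lambda: llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Explain attention mechanisms."}],
#     max_tokens=128,
# ))
```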

Links & Resources

Resource	Link
Model Repository	DeepSeek-V4-Flash-GGUF
DOI	DOI Reference
GitHub	@arishma108
Hugging Face Profile	@persadian
Base Model	DeepSeek-V4-Flash

Evaluation Results

Coming soon. Initial qualitative testing shows:

Coherent text generation at 1-3 tokens/second on RTX 3090
Maintains long-context understanding up to 32K tokens in testing
Perplexity within expected range for IQ1_S quantization

Formal evaluation metrics will be added as benchmarking completes.

✅ Model file is valid and correctly formatted
✅ llama.cpp build is working correctly
✅ Shard detection works (first shard 00001: True; second shard 00002: True)
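
A check along these lines can be reproduced locally. The sketch below is an assumption about what such a check might look like, not the author's script: it confirms that each shard exists and begins with the four-byte `GGUF` magic. `check_shards` is a hypothetical helper.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def check_shards(directory: Path, names: list[str]) -> dict[str, bool]:
    """Map each shard name to True if it exists and starts with the GGUF magic."""
    results = {}
    for name in names:
        path = directory / name
        ok = False
        if path.is_file():
            with path.open("rb") as handle:
                ok = handle.read(4) == GGUF_MAGIC
        results[name] = ok
    return results


# Usage (the directory is wherever the shards were downloaded; the second
# shard name is inferred from the split naming pattern, not listed on this page):
# check_shards(Path("models"), [
#     "DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf",
#     "DeepSeek-V4-Flash-IQ1_S-XL-00002-of-00002.gguf",
# ])
```

A matching magic only shows that the header is plausible; it does not verify the tensor data or that the download completed.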

Version: 1.0
Last Updated: 2026-05-17

Citation

If you use this model in your research, please cite:

@misc{persadian2026deepseek,
  author = {Persadh, Darshani},
  title = {DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model},
  year = {2026},
  publisher = {Hugging Face},
  version = {IQ1_S-XL},
  doi = {10.57967/hf/8828},
  url = {https://doi.org/10.57967/hf/8828}
}

APA: Persadh, D.R. (2026). DeepSeek-V4-Flash-GGUF: A Quantized 284B-Parameter Mixture-of-Experts Language Model (IQ1_S-XL) [persadian/DeepSeek-V4-Flash-GGUF]. Hugging Face. https://doi.org/10.57967/hf/8828

Environmental Impact

This model's development and hosting have been carbon-offset through reforestation initiatives. Total CO2 offset: 50 kg · Offset Project Code: 9162366

This model is part of sustainable AI practices.

License & Acknowledgments

License: MIT

Acknowledgments:

DeepSeek AI for developing the base model
The llama.cpp community for GGUF format and V4 support
teamblobfish for the original IQ1_S-XL quantization work
Hugging Face for hosting infrastructure

Author: Darshani Persadh (@drpersadh)
Hugging Face Handle: @persadian
GitHub: arishma108
DOI: 10.57967/hf/8828
Publication Date: May 17, 2026
