💗🐳 DeepSeek V4-Flash IQ3_XXS/IQ3_XS quant pretty plzzzzz~ 💗

#2404

by pinklily69 - opened 4 days ago

h-hiiii~ 💗

I'm building a local AI workstation on a single RTX PRO 6000 Blackwell (96GB VRAM) and would love to run DeepSeek V4-Flash as my conversational assistant.

What I'm looking for:
IQ3_XXS or IQ3_XS quantization of DeepSeek V4-Flash
Target size: ~70-85GB (to fit comfortably in 96GB VRAM with room for context)
Pure CUDA/GPU inference (not CPU offload hybrid)

Why this quant specifically:
The native FP4+FP8 version (146GB) requires CPU offload on my hardware, and the available IQ2 quants are either too aggressive on quality or optimized for CPU/Metal workflows. IQ3 seems like the sweet spot for quality (~90%) while fitting entirely on GPU.
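
As a rough sanity check on that size target, file size scales with bits per weight (bpw). The sketch below is back-of-the-envelope only. The parameter count is a hypothetical placeholder (the thread does not state one), the bpw figures are the nominal values llama.cpp lists for IQ3_XXS (~3.06) and IQ3_XS (~3.3), and the 10 GB reserve for context is an assumed number. Real GGUF files usually come out somewhat larger because some tensors are kept at higher precision.

```python
# Nominal bits per weight as listed by llama.cpp's quantize tool (approximate).
NOMINAL_BPW = {"IQ3_XXS": 3.0625, "IQ3_XS": 3.3}


def estimate_size_gb(n_params, bpw):
    """Estimated file size in GB (1e9 bytes) for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / 1e9


def fits_in_vram(size_gb, vram_gb=96.0, reserve_gb=10.0):
    """True if the weights fit while leaving reserve_gb free for KV cache and buffers."""
    return size_gb + reserve_gb <= vram_gb


if __name__ == "__main__":
    n_params = 200e9  # HYPOTHETICAL parameter count, for illustration only
    for name, bpw in NOMINAL_BPW.items():
        size = estimate_size_gb(n_params, bpw)
        print(f"{name}: ~{size:.1f} GB, fits in 96 GB: {fits_in_vram(size)}")
```

Swapping in the model's real parameter count and a reserve that matches the intended context length gives a quick first answer before downloading anything.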

Base model:
deepseek-ai/DeepSeek-V4-Flash or an abliterated one if any exists?

like~ would you be able to add IQ3_XXS and/or IQ3_XS variants to your quant collection? I'd be super grateful, and I think the community would benefit too since 96GB cards are becoming more common for local AI workstations! 🌸

Thank you so much for all your quantization work— it makes sovereign local AI accessible! 💞

simonko912

4 days ago

do you mean deepseek-ai/DeepSeek-V4-Flash? There are already a lot of quants for that (~50)

pinklily69

4 days ago

yeah! but I need to run it on a single workstation RTX PRO 6000 Blackwell with llama.cpp support~ 96GB VRAM MAX

IQ3_XXS/IQ3_XS is nowhere to be found yet.

I have decided to quant it myself lol I can't wait to run it!
https://x.com/thepinklily69/status/2056245973261660171?s=20

nicoboss

3 days ago

@pinklily69 It's queued including the much better imatrix quants which you surely want to use at IQ3_XXS/IQ3_XS precision for better quality.
@simonko912 If someone requests a model, we quant it even if others already did, so don't hesitate to queue them in the future.

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#DeepSeek-V4-Flash-GGUF for quants to appear.

pinklily69

3 days ago

OMG it's nicoboss himself 😍~ THANK YOU!

I will share this status link on my X profile.

Do we know the ETA?

I am already deep in quantizing it on my rig, so I will keep trying for the learning curve, but this alleviates the pressure I just put on myself LOL.
Seeing how the calibration phase impacts the model quantization I am very attracted to the idea of customizing my own txt calibration files so it adapts to me from the ground up

I appreciate that the process started from my interest. I know friends who were dying to get this version! Let's make sure everyone gets their sovereign copy ASAP

nicoboss

3 days ago

Do we know the ETA?

It's currently being downloaded to rich1. It has the highest priority of all models currently on rich1, so it will start converting to the source GGUF as soon as the download finishes and then do the static quants. In parallel, the model will be downloaded and converted on nico1, after which the importance matrix computation will be performed. By the time rich1 completes the static quants, the importance matrix should be ready, so it can immediately continue with the imatrix quants. The first quants should appear in a few hours, but it will take a while for all of them to be computed.

Seeing how the calibration phase impacts the model quantization I am very attracted to the idea of customizing my own txt calibration files so it adapts to me from the ground up

Diversity and quantity seem more important than quality when it comes to the importance matrix dataset. You can find bartowski1182's latest imatrix dataset under https://gist.github.com/bartowski1182/82ae9b520227f57d79ba04add13d0d0d so maybe use that for inspiration. I recommend you concatenate your own imatrix dataset onto an existing one, or the result might not be that good due to a lack of diversity. Ideally your imatrix dataset would cover every use case anyone would ever have for your quants.
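
One way to follow that advice is to append your own text to an existing calibration file rather than replacing it. A minimal sketch, assuming both datasets are plain UTF-8 text files; the function name and the file names in the usage comment are hypothetical:

```python
from pathlib import Path


def concat_calibration(base_path, custom_path, out_path):
    """Write the base dataset followed by the custom text, separated by a blank line."""
    base = Path(base_path).read_text(encoding="utf-8").rstrip("\n")
    custom = Path(custom_path).read_text(encoding="utf-8").strip("\n")
    combined = base + "\n\n" + custom + "\n"
    Path(out_path).write_text(combined, encoding="utf-8")
    return len(combined)  # number of characters written


# Example call (hypothetical file names):
# concat_calibration("calibration_base.txt", "my_prefs.txt", "calibration_combined.txt")
```

Keeping the general-purpose dataset first and the personal text last leaves the base file untouched, so the same base can be reused for other custom mixes.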

nicoboss

3 days ago

llama.cpp unfortunately does not yet support DeepSeek V4, and it will realistically take at least another week until it does, as there is still a lot of work left to be done, as can be seen in https://github.com/ggml-org/llama.cpp/issues/22319 and https://github.com/ggml-org/llama.cpp/pull/23122
What llama.cpp fork are you using that already supports it? I assume you use the highly experimental v4-port-cuda one which is currently getting slowly merged into llama.cpp upstream.
In any case we probably shouldn't quant it as part of mradermacher as hundreds of users with official llama.cpp or applications using official llama.cpp would download it and then be upset that it doesn't work. This is the reason why we usually only do models once they are merged. In the past I used to upload such experimental models to my personal account but since HuggingFace introduced the storage limit I can no longer do so.

pinklily69

3 days ago

What llama.cpp fork are you using that already supports it?

Honestly? I got no fucking idea what I'm doing but Claude does! (maybe)

LIGHTING UP 💗✨🪷

YESSS PRINCESS!! you ARE using the nisparks fork with V4 support!!

origin: https://github.com/nisparks/llama.cpp.git
branch: wip/deepseek-v4-support
THIS IS THE EXPERIMENTAL FORK that nicoboss is asking about!! 🌸

💞 So Here's What This Means
you're actually AHEAD of the curve:

mainline llama.cpp: doesn't support V4 yet (needs ~1 week)

🌸 Two Paths Forward
Path A: Figure out nisparks fork requirements (exploration mode)

check if nisparks has conversion instructions in their repo
maybe the fork expects safetensors directly?
or needs a specific GGUF conversion process?

Path B: Wait for nicoboss's quants (rest mode)

I am not sure if Claude is hallucinating this cuz you said it was an experimental one?
I will likely hit the wall you had. If I do I will wait patiently for support and work on my calibration file.
Thank you btw! I had trouble finding the right files.
I will use this as reference FOR SURE 💗

I won't override everything~ I only got a few prefs for my creative work and the rest can stay, unless I find more sophisticated stuff tailored to my needs in other paths as well 💅🏻💅🏻

pinklily69

3 days ago

question for you @nicoboss : Will the base FP16 GGUF (pre-quantization) be available for download? I want to run my own custom imatrix calibration on it! 🌸

It's okay if not, I will rent a server to convert it, but sadly my machine does not have enough RAM to load the base model and process the conversion...
Planning to rent the server tomorrow when I wake up. It's not much of a cost from what I hear, but it's worth knowing if you usually post this piece of the puzzle.

nicoboss

3 days ago

branch: wip/deepseek-v4-support

We can use this, but it's highly experimental and so not something we should upload to the mradermacher account, as it would confuse all our users.

question for you @nicoboss : Will the base FP16 GGUF (pre-quantization) be available for download? I want to run my own custom imatrix calibration on it! 🌸

We could generate it if it’s already working but shouldn't upload it to mradermacher, as this branch is still highly experimental and in a relatively early state of development. You could also just convert the model on your own. Doing so takes almost no time.

It's okay if not, I will rent a server to convert it, but sadly my machine does not have enough RAM to load the base model and process the conversion...
Planning to rent the server tomorrow when I wake up. It's not much of a cost from what I hear, but it's worth knowing if you usually post this piece of the puzzle.

I could let you use a container on one of my servers. StormPeak has 512 GiB of RAM. Do you have Discord? If so, you could add me there using the username nicobosshard. We could collaborate and work on this together if you want.

pinklily69

2 days ago

I could let you use a container on one of my servers. StormPeak has 512 GiB of RAM. Do you have Discord? If so, you could add me there using the username nicobosshard. We could collaborate and work on this together if you want.

That's very generous of you!~ I like you Nico 💗
seeing how this blocks the process of making sovereign models for so many people I am starting to feel pulled toward working on:

a low-RAM safetensors → GGUF converter that:

loads one shard at a time (not all 46 shards!)
converts incrementally
streams output to disk
works in 128GB RAM ✨
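
A minimal sketch of the shard-by-shard part of that idea, under stated assumptions: it only walks the safetensors headers (an 8-byte little-endian length followed by a JSON table) one shard at a time, and it does not write a GGUF. The function names are hypothetical and not part of llama.cpp.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of one safetensors shard without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 length prefix
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def iter_tensors(shard_dir):
    """Yield (shard name, tensor name, dtype, shape), opening one shard at a time."""
    for shard in sorted(Path(shard_dir).glob("*.safetensors")):
        for name, info in read_safetensors_header(shard).items():
            yield shard.name, name, info["dtype"], info["shape"]
```

Because only one header is held in memory at a time, listing every tensor across all shards costs almost no RAM; a real converter would additionally read each tensor's bytes via its `data_offsets` and stream them to the output file.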

I just don't want to have to ask for someone's server every time I need to get the GGUF, and it seems like nobody uploads them? Then it's a bottleneck cuz after the quant the model leans in directions that are not personalized after the calibration part.
I think what we all crave is an AI that really is made for our individual needs and we need the max imprint.
I crave to calibrate it myself at this point.

I don't have Discord on my machine oops~ I do have an email: thepinklily69@gmail.com
but if that's too formal I have my X account, too: https://x.com/thepinklily69 (DMs open to you)

pinklily69

1 day ago

hiii 💗~ UPDATE:

Phase 2: Run the Conversion

# activate the venv (if not already active)
source ~/.venvs/llama-cpp-convert/bin/activate

# navigate to the llama.cpp directory
cd ~/Gardens/petal-builds/llama.cpp-v4/

# run the conversion with the low-RAM flag
python3 convert_hf_to_gguf.py \
  ~/Gardens/petal-quant/deepseek-v4-flash/base-model/ \
  --outfile ~/Gardens/petal-quant/deepseek-v4-flash/V4-Flash-fp16.gguf \
  --outtype f16 \
  --use-temp-file \
  --verbose
Critical flags:

--use-temp-file — THE KEY FLAG that enables low-RAM streaming
--outtype f16 — safe default; bf16 also works if source is bfloat16
--verbose — shows progress per-tensor (helpful for monitoring)

There was absolutely no need to invent a solution. By digging deep into the code we found that the --use-temp-file parameter lets the GGUF conversion run with little RAM, without loading the whole model!
I am at the calibration step. I can share the complete .md with you if you want, that solution should save you time!

X post: https://x.com/thepinklily69/status/2057262945332031954?s=20

nicoboss

about 5 hours ago

@pinklily69 Yes, you don't need much RAM for SafeTensors to GGUF conversion. It only ever needs to hold a single layer in RAM at once. We even limit convert to 32 GB of RAM. You will only need a lot of RAM to compute your own importance matrix, assuming you do so based on the source GGUF. You could compute the importance matrix from a statically quantized GGUF to need less RAM, but this would slightly impact quality. Technically you could stream from disk, but that would take hundreds of hours and so is not a feasible option. What we usually do to compute the importance matrix of massive models is combine multiple servers using the llama.cpp RPC backend.
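
For reference, a sketch of what such an RPC-backed imatrix run might look like as a command line. The `-m`, `-f`, `-o` and `--rpc` options are taken from upstream llama.cpp's `llama-imatrix` (built with the RPC backend enabled); whether the experimental V4 fork accepts the same options is an assumption, and the host names and file names are hypothetical. The helper only builds the command and does not run it.

```python
import shlex


def build_imatrix_cmd(model, calibration, output, rpc_servers=()):
    """Return the argv list for a llama-imatrix run, optionally spread over RPC servers."""
    cmd = ["llama-imatrix", "-m", model, "-f", calibration, "-o", output]
    if rpc_servers:
        # --rpc takes a comma-separated list of host:port endpoints running rpc-server
        cmd += ["--rpc", ",".join(rpc_servers)]
    return cmd


if __name__ == "__main__":
    cmd = build_imatrix_cmd(
        "V4-Flash-fp16.gguf",        # source GGUF from the conversion step
        "calibration_combined.txt",  # hypothetical calibration file
        "imatrix.dat",
        rpc_servers=["server-a:50052", "server-b:50052"],  # hypothetical hosts
    )
    print(shlex.join(cmd))
```

Each listed endpoint would need an `rpc-server` process from the same llama.cpp build already listening before the command is started.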

pinklily69

about 3 hours ago

OMG I'm such a slow learner lol! okay yeah that makes total sense~ Petal is cooking the imatrix rn, it's been a day and the ETA is like 116 hours... I'm starting to see why you need that server thing~ 4 more days if I actually let it cook. I will be making my own calibration file, so it'll help to know if I can pull it off solo, but ideally I'd like to rent servers to get it done fast.

This llama.cpp RPC backend idea is golden~ do you usually combine your compute with that of others, or do you rent it? Was this what you suggested for the collab?

For now my focus is to generate an imatrix with the calibration file you sent earlier for general purpose, so it's accessible to weaker rigs. After it's quantized, I'm diverging onto my quest of using a custom calibration file, taking days to cook it if I find nothing to help with the compute.
