Can it be converted to GGUFs? What about ComfyUI support? Thanks. Looking good, but not sure if it's for a single-GPU consumer...
title
This. With a GGUF it might run on a high-end system (;
It's been out for 25 minutes or so... I mean, I doubt more than a few people have downloaded the full model yet.
Seems very unprudish, thanks for that. Keep up the good work.
just have a look! bye~
Yeah, a 4-bit model would fit snugly in my Ryzen AI Max+ 395, but I imagine it'll be pretty slow. They are planning distilled versions, though.
It's a MoE, so it will run far faster than if it were a dense 70B (70B models at 4-bit fit into two 24 GB GPUs, by the way). If you don't have enough memory, you can use dynamic CPU offload for some of the experts, which speeds up generation significantly. After all, people run GPT-OSS 120B and the Qwen 80B MoEs on their consumer hardware just fine (quantized, of course).
This. You'll probably need 64 GB of RAM, but this model should run at a decent speed on a reasonably good GPU. For a Q4 quant you can fit the active parameters plus ~8k context in around 12 GB of VRAM, so it might actually run at acceptable speeds on 12 GB+ cards, though of course the more VRAM the better (; (only if it's optimized correctly, though).
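Rough back-of-envelope for that 12 GB figure (the bits-per-weight and KV-cache numbers below are assumptions for a typical Q4-style quant, not measurements):
# 13B active params at ~4.5 bits/weight effective (Q4_K-ish), plus an assumed
# ~1-2 GiB of KV cache for 8k context
active_params = 13e9
bits_per_weight = 4.5
weights_gib = active_params * bits_per_weight / 8 / 2**30   # ~6.8 GiB
kv_cache_gib = 1.5                                           # rough guess
print(f"~{weights_gib:.1f} GiB active weights + ~{kv_cache_gib} GiB KV cache")
# -> roughly 8-9 GiB of "hot" data, which is how a 12 GB card could plausibly
#    cope while the rest of the 80B sits in system RAM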
It looks like a similar 80B-A13B MoE architecture to the earlier text LLM that was added to llama.cpp a while back: https://github.com/ggml-org/llama.cpp/pull/14425 (I worked on the ik_llama.cpp port, and the text model had some really strange issues with high perplexity, likely due to the MoE router implementation being "unique")...
With only 13B active, a somewhat quantized version should hopefully be easy enough to run on hybrid CPU+GPU, yeah...
Unfortunately, it's not as simple as changing a few lines in llama.cpp's convert_hf_to_gguf.py, as this model has differently named tensors:
$ numactl -N 1 -m 1 \
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/data/models/ubergarm/HunyuanImage-3.0-GGUF \
/mnt/data/models/tencent/HunyuanImage-3/
INFO:hf-to-gguf:Loading model: HunyuanImage-3
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-0001-of-0032.safetensors'
Traceback (most recent call last):
File "/home/w/projects/llama.cpp/convert_hf_to_gguf.py", line 8466, in modify_tensors
return [(self.map_tensor_name(name), data_torch)]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/w/projects/llama.cpp/convert_hf_to_gguf.py", line 259, in map_tensor_name
raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'final_layer.model.0.emb_layers.1.bias'
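For context on where that blows up: the per-model modify_tensors() hook in convert_hf_to_gguf.py maps HF tensor names onto GGUF names, and the image-generation tensors (final_layer.* and friends) have no entry in the text-LLM tensor map. Purely as an illustration of where the hook lives (not a working conversion), a naive hack would be something like:
# Fragment for llama.cpp's convert_hf_to_gguf.py -- illustrative only.
# Dropping the image-head tensors would at best give a text-only GGUF; a real
# conversion needs proper GGUF names (and llama.cpp support) for the DiT/VAE side.
def modify_tensors(self, data_torch, name, bid):
    if name.startswith("final_layer."):   # image head, no mapping in the text tensor map
        return []                         # skip instead of raising ValueError
    return [(self.map_tensor_name(name), data_torch)]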
Plan B for me is to try to get the demo running on a big CPU-only AMD EPYC rig (thanks Wendell of level1techs for the hardware!!) with triton-cpu and search-replacing "cuda" with "cpu"...
lol... looks like 2 hours 45 minutes to generate a single 1024x1024 image so far...
4%|█▏        | 2/50 [06:20<2:31:38, 189.55s/it]
This is probably much easier if you have ~180GB VRAM or so.
Here is my procedure:
# modified from https://huggingface.co/tencent/HunyuanImage-3.0#%F0%9F%8F%A0-local-installation--usage
$ mkdir hi3 && cd hi3
$ uv venv ./venv --python 3.12 --python-preference=only-managed
$ source venv/bin/activate
$ git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
$ cd HunyuanImage-3.0/
$ uv pip install -r requirements.txt
$ uv pip install loguru torchvision
# I'll replace triton with triton-cpu for my use case, but if you have a GPU just try it
$ uv pip uninstall triton
# install triton-cpu from source following: https://github.com/triton-lang/triton-cpu/issues/237#issuecomment-2878180022
$ cd ..
$ git clone https://github.com/triton-lang/triton-cpu --recursive
$ cd triton-cpu
$ uv pip install ninja cmake wheel setuptools pybind11
$ MAX_JOBS=32 uv pip install -e python --no-build-isolation
$ cd ../HunyuanImage-3.0/
$ uv pip install tencentcloud-sdk-python # sketchy lol
$ export SOCKET=1
$ numactl -N "$SOCKET" -m "$SOCKET" \
python3 run_image_gen.py \
--model-id /mnt/data/models/tencent/HunyuanImage-3/ \
--verbose 1 \
--rewrite False \
--prompt "A cybernetic beaver is chewing on an ai robotic tree."
I also had to comment out the code that rewrites the prompt via the DeepSeek API, as it doesn't seem to listen to --rewrite 0, etc...
UPDATE
I ran a smaller 5-step gen just to test. It doesn't seem to honor passing in a size, e.g. --image-size 512x512, and always does 1024x1024... The main issue, though, is that it fails on decode: RuntimeError: mixed dtype (CPU): expect parameter to have scalar type of Float, so I've gotta fuss with it some more.
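If anyone wants to poke at that decode error: "mixed dtype (CPU): expect parameter to have scalar type of Float" usually means some module is still in bf16/fp16 while the CPU kernel wants fp32. An untested workaround is to force the decode-side module to fp32 before calling it; "vae" below is just a guess at the attribute name, so inspect the actual pipeline object first:
import torch

def force_fp32_decode(pipe):
    # untested sketch -- 'vae' is a guess at the decoder attribute on the
    # HunyuanImage-3 pipeline object; check dir(pipe) for the real name
    decoder = getattr(pipe, "vae", None)
    if decoder is not None:
        decoder.to(dtype=torch.float32)   # cast decode path to fp32 for CPU
    return pipe
# heavier hammer (roughly doubles memory vs bf16): pipe.to(dtype=torch.float32)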
You can do it another way, but that depends on what the support ends up looking like: if it gets ComfyUI support it's straightforward to quant; if it only gets llama.cpp support, that's a different story.