HiDream-O1-Image-BF16

This is a bfloat16 (BF16) conversion of the original HiDream-O1-Image model. All weights were cast from FP32 to BF16, which halves storage size and lowers the memory needed to hold the weights, with negligible reported quality loss.

Overview

HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It encodes raw pixels, text, and task-specific conditions in a single shared token space, and supports text-to-image generation, image editing, and subject-driven personalization at resolutions up to 2,048 × 2,048 pixels.

This repository contains the BF16-converted version for optimized storage and deployment.

Key Benefits of BF16 Conversion

📦 50% Smaller Storage: Reduced from ~32 GB (FP32) to ~16 GB (BF16)
⚡ Faster Inference: ~1.5-2x speedup over FP32 inference on GPUs with native BF16 support
💾 Lower VRAM Usage: Weights occupy ~16 GB of VRAM instead of ~32 GB when loaded as FP32
✅ Near-Identical Quality: BF16 keeps the FP32 exponent range but stores fewer mantissa bits; the reported quality loss for image generation is negligible (<0.1%)
🔧 Ready to Use: Compatible with original inference scripts and pipelines
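
The storage figures above follow from the parameter count and the width of each dtype. A minimal sketch of that arithmetic, assuming the rounded "8B" parameter count quoted in this card and ignoring file metadata (`checkpoint_size_gb` is a hypothetical helper, not part of the original repository):

```python
# Back-of-the-envelope checkpoint size from parameter count and dtype width.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def checkpoint_size_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring file metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 8e9  # assumption: the rounded "8B" figure quoted in this card

fp32_gb = checkpoint_size_gb(NUM_PARAMS, "float32")
bf16_gb = checkpoint_size_gb(NUM_PARAMS, "bfloat16")
print(f"FP32: ~{fp32_gb:.0f} GB, BF16: ~{bf16_gb:.0f} GB")
```

The same estimate applies to the memory needed to hold the weights on the GPU; activations and other runtime buffers come on top of it.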

Conversion Details

Property	Original (FP32)	Converted (BF16)
Storage Size	~32 GB	~16 GB
Weight Precision	Float32	BFloat16
Inference Precision	BF16 (via `torch_dtype=torch.bfloat16`)	BF16 (native)
VRAM for Weights	~32 GB as stored (~16 GB when cast to BF16 at load)	~16 GB
Quality Loss	N/A	<0.1% (negligible)

Conversion Method

All safetensors files were converted using direct tensor manipulation:

tensor = tensor.to(torch.bfloat16)  # FP32 → BF16; .to() returns a new tensor

Configuration files (config.json, tokenizer_config.json, etc.) were updated to reflect dtype: "bfloat16".
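
To make the effect of the cast concrete, the sketch below emulates FP32 → BF16 rounding in pure Python. It assumes the cast rounds to nearest with ties to even, which is the usual behavior for this conversion; `round_to_bf16` is a hypothetical helper written for illustration and is not the code used to produce this repository.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even).

    The result is returned as a Python float so it can be compared directly.
    """
    # Reinterpret the value as its 32-bit FP32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 keeps the top 16 bits of FP32 (sign, 8 exponent bits, 7 mantissa bits).
    # Adding this bias before truncating rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bf16_bits = ((bits + bias) >> 16) & 0xFFFF
    # Widen back to FP32 by zero-filling the dropped low 16 bits.
    return struct.unpack("<f", struct.pack("<I", bf16_bits << 16))[0]


for value in (1.0, 0.1, 3.14159, 1e-3):
    rounded = round_to_bf16(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value} -> {rounded} (relative error {rel_err:.2e})")
```

For normal values the per-weight relative rounding error is bounded by roughly 2**-8 (about 0.39%). This is a bound on individual values and does not by itself say anything about image quality metrics.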

Original Model Information

Project Updates

🚀 May 14, 2026: HiDream-O1-Image-Dev-2604 with prompt refiner
🛠️ May 13, 2026: Inference & pipeline updates — accelerated IP inference; IP pipeline now supports layout and skeleton conditioning
🤗 May 10, 2026: Try online on Hugging Face Spaces — 🤗 HiDream-O1-Image
📕 May 10, 2026: Technical report — 📑 HiDream-O1-Image.pdf
🚀 May 8, 2026: Open-sourced HiDream-O1-Image (8B) with undistilled and distilled Dev variants

Key Features (from Original Model)

🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder
🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, storyboard generation
🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent for layout, attributes, physical logic, text-rendering
🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048
⚡ Exceptional Efficiency at 8B Scale — 8B parameters, performance parity with larger models

Usage

Installation

Clone the original repository:

git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image

Install dependencies:

pip install -r requirements.txt

Download this BF16 model from the Hugging Face Hub:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="morikomorizz/HiDream-O1-Image-BF16",
    local_dir="./HiDream-O1-Image-BF16"
)
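
After the download finishes, a quick sanity check is to sum the size of the weight files and compare it with the ~16 GB figure quoted above. This is a minimal sketch using only the standard library; `safetensors_size_gb` is a hypothetical helper, and the directory layout inside the repository is not assumed beyond the presence of `*.safetensors` files.

```python
from pathlib import Path


def safetensors_size_gb(model_dir: str) -> float:
    """Sum the size of all *.safetensors files under model_dir, in GB (10**9 bytes)."""
    total_bytes = sum(
        p.stat().st_size for p in Path(model_dir).rglob("*.safetensors")
    )
    return total_bytes / 1e9


local_dir = "./HiDream-O1-Image-BF16"  # same local_dir as in the download call
print(f"{safetensors_size_gb(local_dir):.1f} GB of safetensors weights in {local_dir}")
```

A result far below the expected size usually points to an interrupted or partial download.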

1. Text-to-Image Generation

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "your prompt here" \
    --output_image results/output.png \
    --height 2048 \
    --width 2048

2. Image Editing

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect

3. Subject-Driven Personalization

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --prompt "A young boy with blonde hair..." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg \
    --output_image results/subject.png

4. Multi-Reference Subject-Driven Personalization with Skeleton

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "Create a realistic try-on image of the person wearing the provided clothing." \
    --ref_images assets/IP_skeleton/0.face.jpg assets/IP_skeleton/0.bg.jpg assets/IP_skeleton/0.openpose.jpg assets/IP_skeleton/0.part_1.jpg assets/IP_skeleton/0.part_2.jpg assets/IP_skeleton/0.part_3.jpg \
    --output_image results/subject.png

5. Multi-Reference Subject-Driven Personalization with Layout

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "City council members pose with relaxed smiles on a sunlit terrace, warm approachable mood, golden hour, cinematic soft glow." \
    --ref_images assets/IP_layout/0.jpg assets/IP_layout/1.jpg \
    --layout_bboxes "[[0.20507812, 0.43945312, 0.48828125, 0.7421875 ], [0.57617188, 0.80078125, 0.08789062, 0.34179688]]" \
    --output_image results/ip_layout.png
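
The `--layout_bboxes` value in the example above is a JSON-style nested list, so it can be checked before launching a long run. The sketch below assumes, based only on that example, that each box has four numbers normalized to the range 0 to 1. It deliberately makes no assumption about the coordinate order, which this card does not document. `parse_layout_bboxes` is a hypothetical helper, not part of the original repository.

```python
import json


def parse_layout_bboxes(raw: str) -> list[list[float]]:
    """Parse a --layout_bboxes string and run basic shape and range checks."""
    boxes = json.loads(raw)
    if not isinstance(boxes, list):
        raise ValueError("expected a list of boxes")
    for i, box in enumerate(boxes):
        # Assumption from the example: four numbers per box.
        if not isinstance(box, list) or len(box) != 4:
            raise ValueError(f"box {i} must contain exactly 4 numbers")
        # Assumption from the example: coordinates are normalized to [0, 1].
        if not all(isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in box):
            raise ValueError(f"box {i} has values outside the range [0, 1]")
    return [[float(v) for v in box] for box in boxes]


example = (
    "[[0.20507812, 0.43945312, 0.48828125, 0.7421875 ], "
    "[0.57617188, 0.80078125, 0.08789062, 0.34179688]]"
)
boxes = parse_layout_bboxes(example)
print(f"{len(boxes)} boxes parsed")
```

In the example command the number of boxes matches the number of reference images; whether that is required is best confirmed against the original README.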

Command Line Arguments

--model_path: Path to this BF16 model directory
--prompt: Text prompt for generation or editing
--ref_images: Paths to reference images (optional, space-separated)
--output_image: Path to save generated image (default: output.png)
--height / --width: Output dimensions (default: 2048 × 2048)
--model_type: full or dev (default: full)
--seed: Random seed (default: 32)
--guidance_scale: Guidance scale (default: 5.0, only for full model)

See the original README for complete documentation.

Model Architecture

Component	Configuration
Base Architecture	Qwen3VLForConditionalGeneration
Vision Encoder	Qwen3VLVisionModel (27 layers, hidden_size=1152)
Language Model	Qwen3VLTextModel (36 layers, hidden_size=4096, 8B parameters)
Vocabulary Size	151,936
Attention	Multi-Head Attention with RoPE
Total Parameters	~8B

Evaluation

See the original model page for detailed benchmarks. Headline results reported there:

GenEval: 0.90 Overall (2nd best)
DPG-Bench: 89.83 Overall (2nd best)
HPSv3: 10.37 All (2nd best)
CVTG-2K: 0.9128 Average (2nd best)
LongText-Bench: 0.979 EN, 0.978 ZH (2nd best)

License

This converted model inherits the MIT License from the original HiDream-O1-Image model.

Citation

If you use this model, please cite the original work:

@article{hidreamolimage,
  title={HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer},
  author={Cai, Qi and Chen, Jingwen and Gao, Chengmin and Gong, Zijian and Li, Yehao and Mei, Tao and Pan, Yingwei and Peng, Yi and Qiu, Zhaofan and Yao, Ting and Yu, Kai and Zhang, Yiheng and others},
  journal={arXiv preprint arXiv:2605.11061},
  year={2026}
}

Acknowledgments

Original model by HiDream.ai
BF16 conversion by morikomorizz
Based on HiDream-O1-Image

Note: This is an unofficial conversion. For the official model, visit HiDream-ai/HiDream-O1-Image.
