HiDream-O1-Image-BF16

This is a bfloat16 (BF16) conversion of the original HiDream-O1-Image model. All weights have been cast from FP32 to BF16, which halves storage size and reduces memory use with negligible loss in output quality.

Overview

HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders. It natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.

This repository contains the BF16-converted version for optimized storage and deployment.

Key Benefits of BF16 Conversion

  • 📦 50% Smaller Storage: Reduced from ~32 GB (FP32) to ~16 GB (BF16)
  • Faster Inference: ~1.5-2x speedup over FP32 inference on modern GPUs with BF16 support
  • 💾 Lower VRAM Usage: Weights need ~16 GB VRAM instead of ~32 GB when loaded as FP32
  • Same Quality: BF16 keeps the FP32 exponent range with a shorter mantissa; quality loss for image generation is negligible (<0.1%)
  • 🔧 Ready to Use: Compatible with original inference scripts and pipelines
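
The storage figures above follow from the parameter count and the bytes per weight. A minimal sketch of that arithmetic, assuming roughly 8 billion parameters (the function name is illustrative, and real checkpoints add a small amount of header and metadata overhead):

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8e9  # assumption: ~8B parameters, as stated for the original model

fp32_gb = checkpoint_size_gb(NUM_PARAMS, 4)  # FP32 uses 4 bytes per weight
bf16_gb = checkpoint_size_gb(NUM_PARAMS, 2)  # BF16 uses 2 bytes per weight

print(f"FP32: ~{fp32_gb:.0f} GB, BF16: ~{bf16_gb:.0f} GB")
```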

Conversion Details

| Property | Original (FP32) | Converted (BF16) |
|---|---|---|
| Storage Size | ~32 GB | ~16 GB |
| Weight Precision | Float32 | BFloat16 |
| Inference Precision | BF16 (via torch_dtype=torch.bfloat16) | BF16 (native) |
| VRAM Requirement | ~32 GB | ~16 GB |
| Quality Loss | N/A | <0.1% (negligible) |
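
To make the precision trade-off concrete: a BF16 value is the upper 16 bits of the FP32 bit pattern, so it keeps the 8-bit exponent and shortens the mantissa from 23 bits to 7. The pure-Python sketch below emulates that with round-to-nearest-even for finite inputs. The helper names are illustrative and are not part of any library. Note that the resulting per-weight relative error (at most 2^-8, about 0.39%) is a different quantity from the image-level quality loss quoted in the table.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round a finite float to BF16 and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def bf16_round(x: float) -> float:
    """Value of x after a round trip through BF16 (finite inputs only)."""
    return bf16_bits_to_float(fp32_to_bf16_bits(x))


for value in (1.0, 0.1, 3.14159):
    rounded = bf16_round(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value!r} -> {rounded!r} (relative error {rel_err:.2e})")
```

Powers of two such as 1.0 survive unchanged, while 0.1 becomes 0.10009765625.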

Conversion Method

All safetensors files were converted using direct tensor manipulation:

tensor.to(torch.bfloat16)  # FP32 → BF16

Configuration files (config.json, tokenizer_config.json, etc.) were updated to reflect dtype: "bfloat16".

Original Model Information

Project Updates

  • 🚀 May 14, 2026: HiDream-O1-Image-Dev-2604 with prompt refiner
  • 🛠️ May 13, 2026: Inference & pipeline updates — accelerated IP inference; IP pipeline now supports layout and skeleton conditioning
  • 🤗 May 10, 2026: Try online on Hugging Face Spaces — 🤗 HiDream-O1-Image
  • 📕 May 10, 2026: Technical report — 📑 HiDream-O1-Image.pdf
  • 🚀 May 8, 2026: Open-sourced HiDream-O1-Image (8B) with undistilled and distilled Dev variants

Key Features (from Original Model)

  • 🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder
  • 🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, storyboard generation
  • 🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent for layout, attributes, physical logic, text-rendering
  • 🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048
  • Exceptional Efficiency at 8B Scale — 8B parameters, performance parity with larger models

Usage

Installation

  1. Clone the original repository:
git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image
  2. Install dependencies:
pip install -r requirements.txt
  3. Download this BF16 model or use it directly from Hugging Face:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="morikomorizz/HiDream-O1-Image-BF16",
    local_dir="./HiDream-O1-Image-BF16"
)
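
After downloading, you can confirm that the shards really are stored as BF16 without loading them into memory. The sketch below relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the function name and the directory path in the usage comment are illustrative.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def safetensors_dtypes(path) -> Counter:
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


# Hypothetical usage against the downloaded directory:
# for shard in sorted(Path("./HiDream-O1-Image-BF16").glob("*.safetensors")):
#     print(shard.name, dict(safetensors_dtypes(shard)))
```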

1. Text-to-Image Generation

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "your prompt here" \
    --output_image results/output.png \
    --height 2048 \
    --width 2048

2. Image Editing

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect

3. Subject-Driven Personalization

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --prompt "A young boy with blonde hair..." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg \
    --output_image results/subject.png

4. Multi-Reference Subject-Driven Personalization with Skeleton

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "Create a realistic try-on image of the person wearing the provided clothing." \
    --ref_images assets/IP_skeleton/0.face.jpg assets/IP_skeleton/0.bg.jpg assets/IP_skeleton/0.openpose.jpg assets/IP_skeleton/0.part_1.jpg assets/IP_skeleton/0.part_2.jpg assets/IP_skeleton/0.part_3.jpg \
    --output_image results/subject.png

5. Multi-Reference Subject-Driven Personalization with Layout

python inference.py \
    --model_path /path/to/HiDream-O1-Image-BF16 \
    --shift 1 \
    --seed 42 \
    --prompt "City council members pose with relaxed smiles on a sunlit terrace, warm approachable mood, golden hour, cinematic soft glow." \
    --ref_images assets/IP_layout/0.jpg assets/IP_layout/1.jpg \
    --layout_bboxes "[[0.20507812, 0.43945312, 0.48828125, 0.7421875 ], [0.57617188, 0.80078125, 0.08789062, 0.34179688]]" \
    --output_image results/ip_layout.png

Command Line Arguments

  • --model_path: Path to this BF16 model directory
  • --prompt: Text prompt for generation or editing
  • --ref_images: Paths to reference images (optional, space-separated)
  • --output_image: Path to save generated image (default: output.png)
  • --height / --width: Output dimensions (default: 2048 × 2048)
  • --model_type: full or dev (default: full)
  • --seed: Random seed (default: 32)
  • --guidance_scale: Guidance scale (default: 5.0, only for full model)
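
If you drive inference.py from Python (for example, to batch several prompts), a small wrapper can assemble the command from the flags listed above. This is a sketch: the function name is hypothetical, it passes only flags documented in this section, and any option left as None falls back to the script's own default.

```python
import shlex
from typing import Optional, Sequence


def build_inference_command(
    model_path: str,
    prompt: str,
    output_image: Optional[str] = None,
    ref_images: Sequence[str] = (),
    height: Optional[int] = None,
    width: Optional[int] = None,
    model_type: Optional[str] = None,
    seed: Optional[int] = None,
    guidance_scale: Optional[float] = None,
) -> list:
    """Return an argv list for inference.py using the documented flags."""
    cmd = ["python", "inference.py", "--model_path", model_path, "--prompt", prompt]
    if ref_images:
        cmd += ["--ref_images", *ref_images]  # space-separated paths
    optional = {
        "--output_image": output_image,
        "--height": height,
        "--width": width,
        "--model_type": model_type,
        "--seed": seed,
        "--guidance_scale": guidance_scale,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_inference_command(
    "/path/to/HiDream-O1-Image-BF16",
    "your prompt here",
    output_image="results/output.png",
    seed=42,
)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True) from the cloned repository root.
```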

See the original README for complete documentation.

Model Architecture

| Component | Configuration |
|---|---|
| Base Architecture | Qwen3VLForConditionalGeneration |
| Vision Encoder | Qwen3VLVisionModel (27 layers, hidden_size=1152) |
| Language Model | Qwen3VLTextModel (36 layers, hidden_size=4096, 8B parameters) |
| Vocabulary Size | 151,936 |
| Attention | Multi-Head Attention with RoPE |
| Total Parameters | ~8B |

Evaluation

See the original model page for detailed benchmarks:

  • GenEval: 0.90 Overall (2nd best)
  • DPG-Bench: 89.83 Overall (2nd best)
  • HPSv3: 10.37 All (2nd best)
  • CVTG-2K: 0.9128 Average (2nd best)
  • LongText-Bench: 0.979 EN, 0.978 ZH (2nd best)

License

This converted model inherits the MIT License from the original HiDream-O1-Image model.

Citation

If you use this model, please cite the original work:

@article{hidreamolimage,
  title={HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer},
  author={Cai, Qi and Chen, Jingwen and Gao, Chengmin and Gong, Zijian and Li, Yehao and Mei, Tao and Pan, Yingwei and Peng, Yi and Qiu, Zhaofan and Yao, Ting and Yu, Kai and Zhang, Yiheng and others},
  journal={arXiv preprint arXiv:2605.11061},
  year={2026}
}

Acknowledgments


Note: This is an unofficial conversion. For the official model, visit HiDream-ai/HiDream-O1-Image.
