Boogu-Image-0.1

Boosting Open-Source Unified Multimodal Understanding and Generation

Boogu-Image-0.1 Teaser

Project Page Hugging Face GitHub ModelScope

Demo-Base Demo-Edit Demo-Turbo License Paper

Welcome to the official repository for Boogu-Image-0.1 !

English | ไธญๆ–‡


๐Ÿ“– Introduction

Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing, and Chinese-English text rendering. Closed-source multimodal understanding and generation systems like Nano Banana Pro and GPT-Image-2 achieve remarkable performance not because of a single model, but through a highly unified suite of system capabilities. However, under training compute that is extremely limited compared with closed-source systems, we find that systematically improving a model's understanding ability, data quality, and training pipeline can still significantly improve image generation and editing performance. Specifically, compared with some existing open-source models, our training data scale is roughly one order of magnitude smaller. We hope our empirical study and open-source release will help advance the open-source ecosystem for multimodal generation and understanding.

This repository provides checkpoints and inference code for Boogu-Image-0.1.

๐Ÿ† Boogu Arena

Since we could not evaluate on LM Arena directly, we built Boogu Arena, an LM Arena-style preference evaluation. We use an LLM to generate diverse user personas, then ask each persona to produce image generation prompts, resulting in 1K+ test prompts that we will release publicly for community reproduction. The ELO leaderboard below spans leading closed- and open-source systems. We welcome teams with questions about the results to contact us so that we can work toward a more objective, fair, and reproducible evaluation.

Boogu Arena ELO Leaderboard

โœจ Highlights

  • ๐Ÿ“ธ Beautiful and Precise Photography โ€” Accurately understands photography prompts and generates high-quality images with natural lighting, coherent composition, and faithful details, preserving coherent subject, background, and spatial relationships even in complex real-world scenes
  • ๐Ÿ“ Diverse and Stable Text Rendering โ€” Supports a wide range of text-heavy designs โ€” posters, stamps, documents, interfaces, brand guides, and handwritten boards โ€” with readable structure, stable typography, and robust bilingual (Chinese/English) rendering across diverse layouts
  • ๐ŸŽจ Diverse and Beautiful Stylization โ€” Handles stylized generation across miniature 3D scenes, Chinese-inspired gilded aesthetics, shining fantasy visuals, anime portraits, and mythic character art โ€” not just style transfer, but stable, attractive, and prompt-aware creative generation
  • ๐Ÿ“Š Competitive General Performance โ€” Demonstrates competitive performance across many scenarios and benchmarks, with the Boogu-Image-0.1 family ranking among the very top of evaluated open- and closed-source systems in Boogu Arena

๐Ÿ“– For the full set of practical lessons and an honest account of current limitations, see Responsible AI & Limitations below.

๐Ÿ“ฃ News

  • 2026-06-17 ๐Ÿ”ฅ ComfyUI-Boogu powered by ComfyUI is released! Thank you, ComfyUI!
  • 2026-06-17 ๐Ÿ”ฅ ComfyUI-Boogu is released!
  • 2026-06-16 ๐Ÿ”ฅ Boogu-Image-0.1-Base (Text-to-Image) is released! The core text-to-image foundation model. Try the online demo.
  • 2026-06-16 ๐ŸŽจ Boogu-Image-0.1-Edit (Image-to-Image) is released! Image editing and transformation capabilities now available. Try the online demo.
  • 2026-06-16 ๐Ÿš€ Boogu-Image-0.1-Turbo is released! Four-step distilled variant for fast inference and photorealistic generation. Try the online demo.

๐Ÿ“ฅ Model Zoo

Model Params Training Steps CFG Task Hugging Face ModelScope Demo
Boogu-Image-0.1-Base 10B Joint Training 25~50 2.0๏ฝž5.0
๏ผˆe.g., 4.0๏ผ‰
T2I HF MS Demo
Boogu-Image-0.1-Base-fp8 10B Joint Training 25~50 2.0๏ฝž5.0
๏ผˆe.g., 4.0๏ผ‰
T2I HF MS โ€”
Boogu-Image-0.1-Edit 10B Joint Training 25~50 2.0๏ฝž5.0
๏ผˆe.g., 5.0๏ผ‰
TI2I HF MS Demo
Boogu-Image-0.1-Edit-fp8 10B Joint Training 25~50 2.0๏ฝž5.0
๏ผˆe.g., 5.0๏ผ‰
TI2I HF MS โ€”
Boogu-Image-0.1-Turbo 10B + Decoupled DMD 4 0.0 T2I HF MS Demo
Boogu-Image-0.1-Turbo-fp8 10B + Decoupled DMD 4 0.0 T2I HF MS โ€”
  • Boogu-Image-0.1-Base: Foundation model with strong diversity and controllability โ€” ideal for fine-tuning and downstream development. Mainly intended for ultra-dense text rendering; for photorealism, Turbo is usually the better default.
  • Boogu-Image-0.1-Edit: Image editing and transformation variant.
  • Boogu-Image-0.1-Turbo: Distilled variant with the same parameter count, typically requiring only 3~4 steps. Focuses on high-quality generation and photorealism while preserving bilingual text rendering and prompt adherence.

๐Ÿ› ๏ธ Installation

Tested environment: Python 3.10 ยท CUDA 12.6 ยท PyTorch 2.7.1

# Use a brand new conda environment
conda create -y -n boogu python=3.10
conda activate boogu
# Instal necessary dependencies
# PyTorch up to 2.11.0 with CUDA up to 12.8 is supported
# Check `requirements/<torch>_<cuda>.txt`
pip install -r requirements/torch2.7-cu126.txt
pip install -e .
python utils/get_flash_attn.py

or

bash quick_start.sh
conda activate boogu

Download Checkpoints

Download the model weights into a local models/ directory before running inference. We recommend using the official Hugging Face CLI:

pip install -U "huggingface_hub[cli]"

# Download to ./models/<model-name>
huggingface-cli download Boogu/Boogu-Image-0.1-Base --local-dir models/Boogu-Image-0.1-Base
huggingface-cli download Boogu/Boogu-Image-0.1-Turbo --local-dir models/Boogu-Image-0.1-Turbo
huggingface-cli download Boogu/Boogu-Image-0.1-Edit --local-dir models/Boogu-Image-0.1-Edit

Example layout after download:

models/
โ””โ”€โ”€ Boogu-Image-0.1-Base/
    โ”œโ”€โ”€ model_index.json
    โ”œโ”€โ”€ mllm
    โ”œโ”€โ”€ processor
    โ”œโ”€โ”€ scheduler
    โ”œโ”€โ”€ transformer
    โ””โ”€โ”€ vae

Then point inference to the local path via --model models/Boogu-Image-0.1-Base.

Flash Attention

This repository provides utils/get_flash_attn.py to automatically install a compatible flash-attn wheel for your environment.

Requirements:

  • Python and PyTorch with CUDA already installed
  • Linux x86_64
# Auto: detect environment, download a prebuilt wheel, fallback to source build
python utils/get_flash_attn.py

# Force source compilation
python utils/get_flash_attn.py --build

The script first searches mjun0812/flash-attention-prebuild-wheels, then tries official Dao-AILab/flash-attention release wheels with both cxx11abi variants, and finally falls back to source compilation via pip install flash-attn --no-build-isolation.

๐Ÿš€ Quick Start

PyTorch Native T2I Inference

export device="cuda:0" # Required

# Prompt enhancement is powered by an instruction reasoner, also called the rewriter.
# We provide two ways to use it:
#
# 1. Standalone external rewriter:
#    See utils/t2i_external_prompt_rewriter.py. This is a pure external mode example and
#    requires enough GPU memory, without advanced memory management.
#    python utils/t2i_external_prompt_rewriter.py --prompt "draw a cat" --model /path/to/Qwen3-VL-32B-Instruct --lang en
#
# 2. Pipeline-integrated rewriter:
#    See the scripts under `demo_scripts` whose names contain "reasoning".
#    For example: demo_scripts/demo_t2i_local_reasoning.sh
#    This mode supports more flexible memory management. Set the generation and
#    rewriter devices manually, then pass them to inference.py:
#    export device="cuda:0"
#    export rewriter_device="cuda:1"
#    python inference.py --device $device --rewriter_device $rewriter_device ...
#    For more details, see INFERENCE_GUIDE.md.

python inference.py \
  --pretrained_pipeline_name_or_path "models/Boogu-Image-0.1-Base" \
  --instruction "ไธ€ๅน…ๅ›ฝ้ฃŽ็‰้‡‘้ฃŽๆ ผ็š„ๅฑฑๆฐด็”ปไฝœ๏ผŒๅฑ•็Žฐไบ†ๆก‚ๆž—ๅฑฑๆฐดๅœจ้‡‘ๅ…‰ๆ™ฎ็…งไธ‹็š„ๅฃฎไธฝๆ™ฏ่ฑกใ€‚่ฟœๅฑฑๅฑ‚ๅ ๏ผŒๆฑŸๆฐดๅฆ‚้•œ๏ผŒๅฑฑๅณฐ่พน็ผ˜ๅ‹พๅ‹’็€ๅ‘ๅ…‰็š„้‡‘่‰ฒ็บฟๆกใ€‚็”ป้ข้‡‡็”จ็Ÿณ้’็Ÿณ็ปฟๅฒฉๅฝฉไธŽ้Ž้‡‘่ดจๆ„Ÿ็›ธ็ป“ๅˆ๏ผŒๅฑ€้ƒจๆœ‰ๅŽšๆถ‚ๆฒน็”ป็ฌ”่งฆ๏ผŒ็ฉบไธญ้ฃ˜ๆตฎ็€้‡‘่‰ฒ็ฒ’ๅญ๏ผŒ่ฅ้€ ๅ‡บๆขฆๅนปๆœฆ่ƒง่€Œๅˆ็ฃ…็คดๅคงๆฐ”็š„ๆ„ๅขƒใ€‚" \
  --num_inference_steps 50 \
  --height 1024 --width 1024 \
  --text_guidance_scale 4.0 \
  --output_image_path "outputs/test_base/out_1.png" \
  --device "$device"

Hardware Notes

๐Ÿ“– For full CLI options, device setup, offload strategies, caching acceleration, Torch Compile, FP8, and batch inference details, see INFERENCE_GUIDE.md. Torch Compile note: --enable_torch_compile can occasionally produce all-black outputs on some GPUs/models. If that happens, disable it first.

VRAM Recommended Config (T2I 1K) Recommended Config (T2I 2K)
12GB Unquantized: --enable_sequential_cpu_offload_flag
Quantized: --enable_model_cpu_offload_flag --use_fp8_weights
Unquantized: --enable_sequential_cpu_offload_flag
Quantized: --enable_group_offload_flag --use_fp8_weights
16GB Unquantized: --enable_sequential_cpu_offload_flag
Quantized: --enable_model_cpu_offload_flag --use_fp8_weights
Unquantized: --enable_sequential_cpu_offload_flag
Quantized: --enable_model_cpu_offload_flag --use_fp8_weights
24GB Unquantized: --enable_model_cpu_offload_flag
Quantized --use_fp8_weights
--enable_model_cpu_offload_flag
32GB Unquantized: --enable_model_cpu_offload_flag
Quantized: --use_fp8_weights
Unquantized: --enable_model_cpu_offload_flag
Quantized: --use_fp8_weights
40GB Base Model Unquantized: --enable_model_cpu_offload_flag
Quantized: --use_fp8_weights
80GB Base Model Base Model

โš ๏ธ Responsible AI & Limitations

Boogu-Image-0.1 is released for research purposes and is not intended for production deployment without additional safeguards. We took responsible-AI considerations into account during data curation, training, and evaluation; however the model may still produce outputs that are inaccurate, biased, or otherwise inappropriate.

Known Limitations

๐ŸŒ World Knowledge Gap

  • For tasks requiring rich common sense, domain knowledge, real brands or people, famous landmarks, celebrities, products, or complex contextual understanding, Boogu still has a clear gap from strong closed-source systems
  • This capability is extraordinarily expensive to measure; even Arena-style evaluation struggles to assess it fully, so existing benchmarks barely quantify this dimension and the real gap is likely larger than measured scores suggest

๐Ÿ–ผ๏ธ Image-to-Image Consistency & In-Context Scenarios

  • For editing tasks requiring strict preservation of the input subject, identity, layout, or fine details, Boogu's image-to-image consistency is still not stable enough
  • Because our image-to-image capability focuses more on photography and text-generation applications, Boogu still trails Seedream 5.0 and Nano Banana Pro in some in-context generation scenarios

๐Ÿ“ Text Rendering Stability

  • Boogu can handle many Chinese and English text scenarios, but long text, dense typography, small fonts, and complex design layouts can still produce typos, missing characters, or layout drift
  • Text rendering is currently focused on Chinese and English; other languages are not specifically optimized and may degrade noticeably

๐Ÿฆด Body Structure in Complex Poses

  • In multi-person interaction, occlusion, exaggerated motion, or unusual viewpoints, hands, limbs, and body structure may still become unnatural or inconsistent

๐Ÿ‘ค Small Faces & Small Limbs

  • Because we use the open-source FLUX.1 VAE, reconstruction loss is relatively large, so details such as small faces, small limbs, eyes, and text may still show artifacts or instability

๐Ÿ“ฆ Limited Release Scope

  • Due to resource constraints, engineering complexity, and release boundaries, we are not able to open-source every training and system detail
  • The current open-source release aims to balance reproducibility, usability, and sustainable maintenance while providing a reliable starting point for community research and improvement

Downstream users are responsible for applying content moderation, validation, and compliance checks appropriate to their use case.

๐Ÿ™ Acknowledgements

Closed-source systems such as GPT-Image, Nano Banana, and the Seedream series helped us understand the frontier capabilities and practical boundaries of unified understanding-and-generation systems. We thank the Qwen-Image, Z-Image, OmniGen2, FLUX, and broader open-source communities for the foundations they provide, and DeepSeek for strong open-source understanding models that support open-source unified multimodal systems.

๐Ÿ“„ License

This project is released under the Apache-2.0 License.

Downloads last month
11
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Boogu/Boogu-Image-0.1-Edit

Finetuned
(321)
this model
Quantizations
1 model

Spaces using Boogu/Boogu-Image-0.1-Edit 2