How to use Boogu/Boogu-Image-0.1-Edit with Diffusers:
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("Boogu/Boogu-Image-0.1-Edit", dtype=torch.bfloat16, device_map="cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
Boosting Open-Source Unified Multimodal Understanding and Generation
Introduction
Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing, and Chinese-English text rendering. Closed-source multimodal understanding and generation systems like Nano Banana Pro and GPT-Image-2 achieve remarkable performance not because of a single model, but through a highly unified suite of system capabilities. However, under training compute that is extremely limited compared with closed-source systems, we find that systematically improving a model's understanding ability, data quality, and training pipeline can still significantly improve image generation and editing performance. Specifically, compared with some existing open-source models, our training data scale is roughly one order of magnitude smaller. We hope our empirical study and open-source release will help advance the open-source ecosystem for multimodal generation and understanding.
This repository provides checkpoints and inference code for Boogu-Image-0.1.
Boogu Arena
Since we could not evaluate on LM Arena directly, we built Boogu Arena, an LM Arena-style preference evaluation. We use an LLM to generate diverse user personas, then ask each persona to produce image generation prompts, resulting in 1K+ test prompts that we will release publicly for community reproduction. The Elo leaderboard below spans leading closed- and open-source systems. Teams with questions about the results are welcome to contact us so that we can work toward a more objective, fair, and reproducible evaluation.
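For readers unfamiliar with Elo-style leaderboards, the sketch below shows how pairwise preference votes can be turned into ratings. It is a minimal illustration of the standard Elo update, not a description of how Boogu Arena actually computes its scores. The model names, the starting rating of 1000, and the K-factor of 32 are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise preference result to the ratings in place."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # the winner gains exactly what the loser gives up
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical votes: (preferred model, other model)
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = defaultdict(lambda: 1000.0)  # assumed starting rating
for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on vote order, so real arena leaderboards often shuffle the votes or fit a Bradley-Terry model instead. Which of these Boogu Arena uses is not stated here.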
Highlights
- Beautiful and Precise Photography: Accurately understands photography prompts and generates high-quality images with natural lighting, coherent composition, and faithful details, preserving coherent subject, background, and spatial relationships even in complex real-world scenes
- Diverse and Stable Text Rendering: Supports a wide range of text-heavy designs (posters, stamps, documents, interfaces, brand guides, and handwritten boards) with readable structure, stable typography, and robust bilingual (Chinese/English) rendering across diverse layouts
- Diverse and Beautiful Stylization: Handles stylized generation across miniature 3D scenes, Chinese-inspired gilded aesthetics, shining fantasy visuals, anime portraits, and mythic character art; not just style transfer, but stable, attractive, and prompt-aware creative generation
- Competitive General Performance: Demonstrates competitive performance across many scenarios and benchmarks, with the Boogu-Image-0.1 family ranking among the very top of evaluated open- and closed-source systems in Boogu Arena
For the full set of practical lessons and an honest account of current limitations, see Responsible AI & Limitations below.
News
- 2026-06-17 ComfyUI-Boogu powered by ComfyUI is released! Thank you, ComfyUI!
- 2026-06-17 ComfyUI-Boogu is released!
- 2026-06-16 Boogu-Image-0.1-Base (Text-to-Image) is released! The core text-to-image foundation model. Try the online demo.
- 2026-06-16 Boogu-Image-0.1-Edit (Image-to-Image) is released! Image editing and transformation capabilities now available. Try the online demo.
- 2026-06-16 Boogu-Image-0.1-Turbo is released! Four-step distilled variant for fast inference and photorealistic generation. Try the online demo.
Model Zoo
- Boogu-Image-0.1-Base: Foundation model with strong diversity and controllability, ideal for fine-tuning and downstream development. Mainly intended for ultra-dense text rendering; for photorealism, Turbo is usually the better default.
- Boogu-Image-0.1-Edit: Image editing and transformation variant.
- Boogu-Image-0.1-Turbo: Distilled variant with the same parameter count, typically requiring only 3 to 4 steps. Focuses on high-quality generation and photorealism while preserving bilingual text rendering and prompt adherence.
Installation
Tested environment: Python 3.10 · CUDA 12.6 · PyTorch 2.7.1
# Use a brand new conda environment
conda create -y -n boogu python=3.10
conda activate boogu
# Install necessary dependencies
# PyTorch up to 2.11.0 with CUDA up to 12.8 is supported
# Check `requirements/<torch>_<cuda>.txt`
pip install -r requirements/torch2.7-cu126.txt
pip install -e .
python utils/get_flash_attn.py
or
bash quick_start.sh
conda activate boogu
Download Checkpoints
Download the model weights into a local models/ directory before running inference. We recommend using the official Hugging Face CLI:
pip install -U "huggingface_hub[cli]"
# Download to ./models/<model-name>
huggingface-cli download Boogu/Boogu-Image-0.1-Base --local-dir models/Boogu-Image-0.1-Base
huggingface-cli download Boogu/Boogu-Image-0.1-Turbo --local-dir models/Boogu-Image-0.1-Turbo
huggingface-cli download Boogu/Boogu-Image-0.1-Edit --local-dir models/Boogu-Image-0.1-Edit
Example layout after download:
models/
└── Boogu-Image-0.1-Base/
    ├── model_index.json
    ├── mllm
    ├── processor
    ├── scheduler
    ├── transformer
    └── vae
Then point inference to the local path via --model models/Boogu-Image-0.1-Base.
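Before launching inference, it can help to confirm that a download finished and the directory matches the layout shown above. The following is a small stdlib-only sketch. The helper name `missing_entries` is hypothetical and not part of this repository, and it only checks that the top-level entries exist, not that the weights inside them are complete.

```python
from pathlib import Path

# Top-level entries taken from the example layout in this README.
EXPECTED_ENTRIES = (
    "model_index.json",
    "mllm",
    "processor",
    "scheduler",
    "transformer",
    "vae",
)


def missing_entries(model_dir):
    """Return the expected top-level entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("models/Boogu-Image-0.1-Base")
    print("Layout OK" if not missing else f"Missing entries: {missing}")
```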
Flash Attention
This repository provides utils/get_flash_attn.py to automatically install a compatible flash-attn wheel for your environment.
Requirements:
- Python and PyTorch with CUDA already installed
- Linux x86_64
# Auto: detect environment, download a prebuilt wheel, fallback to source build
python utils/get_flash_attn.py
# Force source compilation
python utils/get_flash_attn.py --build
The script first searches mjun0812/flash-attention-prebuild-wheels, then tries official Dao-AILab/flash-attention release wheels with both cxx11abi variants, and finally falls back to source compilation via pip install flash-attn --no-build-isolation.
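Whichever of these routes the script ends up taking, a quick way to confirm the result is to check whether the package can be found by Python. This is a minimal sketch that only tests for the presence of the `flash_attn` module. It does not verify that the wheel matches your CUDA or PyTorch build, and the helper name is an assumption for illustration.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the `flash_attn` package can be located in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    print("flash-attn installed:", flash_attn_available())
```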
Quick Start
PyTorch Native T2I Inference
export device="cuda:0" # Required
# Prompt enhancement is powered by an instruction reasoner, also called the rewriter.
# We provide two ways to use it:
#
# 1. Standalone external rewriter:
# See utils/t2i_external_prompt_rewriter.py. This is a pure external mode example and
# requires enough GPU memory, without advanced memory management.
# python utils/t2i_external_prompt_rewriter.py --prompt "draw a cat" --model /path/to/Qwen3-VL-32B-Instruct --lang en
#
# 2. Pipeline-integrated rewriter:
# See the scripts under `demo_scripts` whose names contain "reasoning".
# For example: demo_scripts/demo_t2i_local_reasoning.sh
# This mode supports more flexible memory management. Set the generation and
# rewriter devices manually, then pass them to inference.py:
# export device="cuda:0"
# export rewriter_device="cuda:1"
# python inference.py --device $device --rewriter_device $rewriter_device ...
# For more details, see INFERENCE_GUIDE.md.
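python inference.py \
  --pretrained_pipeline_name_or_path "models/Boogu-Image-0.1-Base" \
  --instruction "A Chinese-style gilded landscape painting depicting the magnificent scenery of Guilin's mountains and rivers bathed in golden light. Distant mountains rise in layers, the river is as smooth as a mirror, and the peaks are outlined with glowing golden lines. The painting combines azurite and malachite mineral pigments with a gilded texture, with thick impasto oil-painting brushstrokes in places and golden particles floating in the air, creating a dreamy, hazy yet majestic atmosphere." \
  --num_inference_steps 50 \
  --height 1024 --width 1024 \
  --text_guidance_scale 4.0 \
  --output_image_path "outputs/test_base/out_1.png" \
  --device "$device"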
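When scripting several generations, it may be convenient to assemble the same command from Python. The sketch below builds the argument list using only the flags that appear in the command above. The function name `build_inference_command` and its defaults are assumptions for illustration, and INFERENCE_GUIDE.md remains the reference for the full option set.

```python
import shlex


def build_inference_command(
    model_path,
    instruction,
    output_path,
    device="cuda:0",
    steps=50,
    height=1024,
    width=1024,
    guidance=4.0,
):
    """Build the inference.py argument list using the flags shown in this README."""
    return [
        "python", "inference.py",
        "--pretrained_pipeline_name_or_path", model_path,
        "--instruction", instruction,
        "--num_inference_steps", str(steps),
        "--height", str(height),
        "--width", str(width),
        "--text_guidance_scale", str(guidance),
        "--output_image_path", output_path,
        "--device", device,
    ]


cmd = build_inference_command(
    "models/Boogu-Image-0.1-Base",
    "draw a cat",
    "outputs/test_base/out_1.png",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with long prompts that contain quotes or non-ASCII text.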
Hardware Notes
For full CLI options, device setup, offload strategies, caching acceleration, Torch Compile, FP8, and batch inference details, see INFERENCE_GUIDE.md.
Torch Compile note: --enable_torch_compile can occasionally produce all-black outputs on some GPUs/models. If that happens, disable it first.
| VRAM | Recommended Config (T2I 1K) | Recommended Config (T2I 2K) |
|---|---|---|
| 12GB | Unquantized: --enable_sequential_cpu_offload_flag; Quantized: --enable_model_cpu_offload_flag --use_fp8_weights | Unquantized: --enable_sequential_cpu_offload_flag; Quantized: --enable_group_offload_flag --use_fp8_weights |
| 16GB | Unquantized: --enable_sequential_cpu_offload_flag; Quantized: --enable_model_cpu_offload_flag --use_fp8_weights | Unquantized: --enable_sequential_cpu_offload_flag; Quantized: --enable_model_cpu_offload_flag --use_fp8_weights |
| 24GB | Unquantized: --enable_model_cpu_offload_flag; Quantized: --use_fp8_weights | --enable_model_cpu_offload_flag |
| 32GB | Unquantized: --enable_model_cpu_offload_flag; Quantized: --use_fp8_weights | Unquantized: --enable_model_cpu_offload_flag; Quantized: --use_fp8_weights |
| 40GB | Base Model | Unquantized: --enable_model_cpu_offload_flag; Quantized: --use_fp8_weights |
| 80GB | Base Model | Base Model |
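The table can also be expressed as a small lookup, which is handy when a launcher script needs to pick flags automatically. This is a sketch under a few assumptions: "Base Model" is read as "no extra flags", a GPU between two rows is rounded down to the nearest listed row, and the 24GB/2K cell (which lists a single configuration) is used for both modes. The function name `recommended_flags` is hypothetical.

```python
SEQ = ["--enable_sequential_cpu_offload_flag"]
MODEL = ["--enable_model_cpu_offload_flag"]
GROUP = ["--enable_group_offload_flag"]
FP8 = ["--use_fp8_weights"]

# (vram_gb, resolution) -> (unquantized flags, quantized flags)
TABLE = {
    (12, "1K"): (SEQ, MODEL + FP8), (12, "2K"): (SEQ, GROUP + FP8),
    (16, "1K"): (SEQ, MODEL + FP8), (16, "2K"): (SEQ, MODEL + FP8),
    (24, "1K"): (MODEL, FP8),       (24, "2K"): (MODEL, MODEL),
    (32, "1K"): (MODEL, FP8),       (32, "2K"): (MODEL, FP8),
    (40, "1K"): ([], []),           (40, "2K"): (MODEL, FP8),
    (80, "1K"): ([], []),           (80, "2K"): ([], []),
}


def recommended_flags(vram_gb, resolution="1K", quantized=False):
    """Return the CLI flags for the largest table row that fits in vram_gb."""
    tiers = sorted({tier for tier, _ in TABLE})
    eligible = [tier for tier in tiers if tier <= vram_gb]
    if not eligible:
        raise ValueError(f"No recommendation for {vram_gb}GB (smallest row is {tiers[0]}GB)")
    unquantized_flags, quantized_flags = TABLE[(eligible[-1], resolution)]
    return list(quantized_flags if quantized else unquantized_flags)


if __name__ == "__main__":
    print(recommended_flags(16, "2K", quantized=True))
```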
Responsible AI & Limitations
Boogu-Image-0.1 is released for research purposes and is not intended for production deployment without additional safeguards. We took responsible-AI considerations into account during data curation, training, and evaluation; however, the model may still produce outputs that are inaccurate, biased, or otherwise inappropriate.
Known Limitations
World Knowledge Gap
- For tasks requiring rich common sense, domain knowledge, real brands or people, famous landmarks, celebrities, products, or complex contextual understanding, Boogu still has a clear gap from strong closed-source systems
- This capability is extraordinarily expensive to measure; even Arena-style evaluation struggles to assess it fully, so existing benchmarks barely quantify this dimension and the real gap is likely larger than measured scores suggest
Image-to-Image Consistency & In-Context Scenarios
- For editing tasks requiring strict preservation of the input subject, identity, layout, or fine details, Boogu's image-to-image consistency is still not stable enough
- Because our image-to-image capability focuses more on photography and text-generation applications, Boogu still trails Seedream 5.0 and Nano Banana Pro in some in-context generation scenarios
Text Rendering Stability
- Boogu can handle many Chinese and English text scenarios, but long text, dense typography, small fonts, and complex design layouts can still produce typos, missing characters, or layout drift
- Text rendering is currently focused on Chinese and English; other languages are not specifically optimized and may degrade noticeably
Body Structure in Complex Poses
- In multi-person interaction, occlusion, exaggerated motion, or unusual viewpoints, hands, limbs, and body structure may still become unnatural or inconsistent
Small Faces & Small Limbs
- Because we use the open-source FLUX.1 VAE, reconstruction loss is relatively large, so details such as small faces, small limbs, eyes, and text may still show artifacts or instability
Limited Release Scope
- Due to resource constraints, engineering complexity, and release boundaries, we are not able to open-source every training and system detail
- The current open-source release aims to balance reproducibility, usability, and sustainable maintenance while providing a reliable starting point for community research and improvement
Downstream users are responsible for applying content moderation, validation, and compliance checks appropriate to their use case.
Acknowledgements
Closed-source systems such as GPT-Image, Nano Banana, and the Seedream series helped us understand the frontier capabilities and practical boundaries of unified understanding-and-generation systems. We thank the Qwen-Image, Z-Image, OmniGen2, FLUX, and broader open-source communities for the foundations they provide, and DeepSeek for strong open-source understanding models that support open-source unified multimodal systems.
License
This project is released under the Apache-2.0 License.