You are an expert in optimizing diffusers library code for different hardware configurations.

NOTE: This system includes curated optimization knowledge from HuggingFace documentation.

TASK: Generate optimized Python code for running a diffusion model with the following specifications:
- Model: LPX55/FLUX.1-merged_lightning_v2
- Prompt: "A cat holding a sign that says hello world"
- Image size: 768x1152
- Inference steps: 8

HARDWARE SPECIFICATIONS:
- Platform: Linux (manual_input)
- CPU Cores: 8
- CUDA Available: False
- MPS Available: False
- Optimization Profile: balanced
- GPU: Custom GPU (20.0 GB VRAM)

OPTIMIZATION KNOWLEDGE BASE:
# DIFFUSERS OPTIMIZATION TECHNIQUES

## Memory Optimization Techniques

### 1. Model CPU Offloading
Use `enable_model_cpu_offload()` to move models between GPU and CPU automatically:
```python
pipe.enable_model_cpu_offload()
```
- Saves significant VRAM by keeping only active models on GPU
- Automatic management, no manual intervention needed
- Compatible with all pipelines

### 2. Sequential CPU Offloading
Use `enable_sequential_cpu_offload()` for more aggressive memory saving:
```python
pipe.enable_sequential_cpu_offload()
```
- More memory efficient than model offloading
- Moves models to CPU after each forward pass
- Best for very limited VRAM scenarios

### 3. Attention Slicing
Use `enable_attention_slicing()` to reduce memory during attention computation:
```python
pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max")  # maximum slicing
pipe.enable_attention_slicing(1)      # slice_size = 1
```
- Trades compute time for memory
- Most effective for high-resolution images
- Can be combined with other techniques

### 4. VAE Slicing
Use `enable_vae_slicing()` for large batch processing:
```python
pipe.enable_vae_slicing()
```
- Decodes images one at a time instead of all at once
- Essential for batch sizes > 4
- Minimal performance impact on single images

### 5. VAE Tiling
Use `enable_vae_tiling()` for high-resolution image generation:
```python
pipe.enable_vae_tiling()
```
- Enables 4K+ image generation on 8GB VRAM
- Splits images into overlapping tiles
- Automatically disabled for 512x512 or smaller images

### 6. Memory Efficient Attention (xFormers)
Use `enable_xformers_memory_efficient_attention()` if xFormers is installed:
```python
pipe.enable_xformers_memory_efficient_attention()
```
- Significantly reduces memory usage and improves speed
- Requires xformers library installation
- Compatible with most models
## Performance Optimization Techniques

### 1. Half Precision (FP16/BF16)
Use lower precision for better memory and speed:
```python
# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
- FP16: Halves memory usage, widely supported
- BF16: Better numerical stability, requires newer GPUs
- Essential for most optimization scenarios

### 2. Torch Compile (PyTorch 2.0+)
Use `torch.compile()` for significant speed improvements:
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For some models, compile the VAE too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```
- 5-50% speed improvement
- Requires PyTorch 2.0+
- First run is slower due to compilation

### 3. Fast Schedulers
Use faster schedulers for fewer steps:
```python
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler

# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```
## Hardware-Specific Optimizations

### NVIDIA GPU Optimizations
```python
# Enable Tensor Cores
torch.backends.cudnn.benchmark = True
# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series
```

### Apple Silicon (MPS) Optimizations
```python
# Use the MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)
# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16  # Better than float16 on Apple Silicon
# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
```

### CPU Optimizations
```python
# Use float32 for CPU
torch_dtype = torch.float32
# Enable optimized attention
pipe.enable_attention_slicing()
```
## Model-Specific Guidelines

### FLUX Models
- Do NOT use the guidance_scale parameter (not needed for FLUX)
- Use 4-8 inference steps maximum
- BF16 dtype recommended
- Enable attention slicing for memory optimization (see the sketch below)
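For illustration only, a minimal sketch that follows these FLUX guidelines; the FLUX.1-schnell checkpoint, CUDA placement, prompt, and output filename are assumptions rather than part of the knowledge base:

```python
import torch
from diffusers import FluxPipeline

# Assumed checkpoint; any FLUX-family model loads the same way
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,   # BF16 per the guideline
)
pipe = pipe.to("cuda")            # assumes a CUDA GPU with enough VRAM
pipe.enable_attention_slicing()   # memory optimization per the guideline

image = pipe(
    "A cat holding a sign that says hello world",
    num_inference_steps=4,        # FLUX guideline: 4-8 steps
    # guidance_scale intentionally omitted per the FLUX guideline
).images[0]
image.save("flux_sketch.png")
```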
### Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use the refiner model sparingly to save memory
- Consider VAE tiling for >1024px images (sketched below)
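A hedged example of combining these SDXL guidelines; the base checkpoint, CUDA placement, prompt, and resolution are assumptions:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Base model only; the refiner is skipped to save memory
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")            # assumes a CUDA GPU
pipe.enable_attention_slicing()   # helps at high resolutions
pipe.enable_vae_tiling()          # recommended for >1024px outputs

image = pipe("an astronaut riding a horse", height=1536, width=1536).images[0]
```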
### Stable Diffusion 1.5/2.1
- Very memory efficient base models
- Can often run without optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing

## Memory Usage Estimation
- FLUX.1: ~24GB for full precision, ~12GB for FP16
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32
## Optimization Combinations by VRAM

### 24GB+ VRAM (High-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

### 12-24GB VRAM (Mid-range)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
```

### 8-12GB VRAM (Entry-level)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
```

### <8GB VRAM (Low-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
IMPORTANT: For FLUX.1-schnell models, do NOT include the guidance_scale parameter, as it is not needed.

Using the OPTIMIZATION KNOWLEDGE BASE above, generate Python code that:
1. **Selects the best optimization techniques** for the specific hardware profile
2. **Applies appropriate memory optimizations** based on available VRAM
3. **Uses optimal data types** for the target hardware (see the selection sketch after this list):
   - User-specified dtype (if provided): use exactly as specified
   - Apple Silicon (MPS): prefer torch.bfloat16
   - NVIDIA GPUs: prefer torch.float16 or torch.bfloat16
   - CPU only: use torch.float32
4. **Implements hardware-specific optimizations** (CUDA, MPS, CPU)
5. **Follows model-specific guidelines** (e.g., FLUX guidance_scale handling)
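A minimal sketch of the dtype/device selection described in point 3, assuming no user-specified dtype:

```python
import torch

# Fallback order from guideline 3: CUDA -> MPS -> CPU
if torch.cuda.is_available():
    device, torch_dtype = "cuda", torch.float16    # NVIDIA: FP16 (or BF16 on newer GPUs)
elif torch.backends.mps.is_available():
    device, torch_dtype = "mps", torch.bfloat16    # Apple Silicon: BF16 preferred
else:
    device, torch_dtype = "cpu", torch.float32     # CPU only: full precision
```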
IMPORTANT GUIDELINES:
- Reference the OPTIMIZATION KNOWLEDGE BASE to select appropriate techniques
- Include all necessary imports
- Add brief comments explaining optimization choices
- Generate compact, production-ready code
- Inline values where possible for concise code
- Generate ONLY the Python code, no explanations before or after the code block
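For reference, a minimal sketch of the kind of output this prompt could elicit for the CPU-only profile above; the width/height mapping of 768x1152 and the output filename are assumptions:

```python
import torch
from diffusers import DiffusionPipeline

# CPU-only target (no CUDA, no MPS): float32 and attention slicing per the CPU guidelines
pipe = DiffusionPipeline.from_pretrained(
    "LPX55/FLUX.1-merged_lightning_v2",
    torch_dtype=torch.float32,    # CPU guideline: full precision
)
pipe.enable_attention_slicing()   # reduce peak memory during attention

image = pipe(
    "A cat holding a sign that says hello world",
    width=768,                    # 768x1152 from the task; width/height mapping assumed
    height=1152,
    num_inference_steps=8,        # per the task specification
    # guidance_scale omitted per the FLUX guideline
).images[0]
image.save("output.png")
```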