
Diffusers examples with Intel optimizations

This research project is not actively maintained by the diffusers team. For any questions or comments, please make sure to tag @hshen14.

This project aims to provide diffusers examples with Intel optimizations, such as Bfloat16 for training and fine-tuning acceleration and 8-bit integer (INT8) for inference acceleration on Intel platforms.

Accelerating the fine-tuning for textual inversion

We accelerate the fine-tuning for textual inversion with Intel Extension for PyTorch. The examples enable both single-node and multi-node distributed training with Bfloat16 support on Intel Xeon Scalable processors.

Accelerating the inference for Stable Diffusion using Bfloat16

We start the inference acceleration with Bfloat16 using Intel Extension for PyTorch. The inference_bf16.py script is designed to run standard Stable Diffusion models in Bfloat16.

pip install diffusers transformers accelerate scipy safetensors

export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0

# Intel OpenMP
export OMP_NUM_THREADS=< Cores to use >
export LD_PRELOAD=${LD_PRELOAD}:/path/to/lib/libiomp5.so
# Jemalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.
export LD_PRELOAD=${LD_PRELOAD}:/path/to/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:9000000000"
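
The `LD_PRELOAD` lines above assume that both libraries exist at the given paths. As a minimal sketch, assuming a POSIX shell, a guard like the following appends a library only when the file is present and avoids a leading colon when `LD_PRELOAD` starts out empty. The helper name `preload_if_exists` is hypothetical and not part of the original scripts, and the paths remain placeholders.

```shell
# Hypothetical helper (not from the original README): append a shared library
# to LD_PRELOAD only if the file exists.
preload_if_exists() {
    lib="$1"
    if [ -f "$lib" ]; then
        # ${VAR:+...} expands to "VAR:" only when VAR is set and non-empty,
        # so no stray leading colon is produced.
        export LD_PRELOAD="${LD_PRELOAD:+${LD_PRELOAD}:}${lib}"
    else
        echo "warning: $lib not found, skipping" >&2
    fi
}

# Placeholder paths, as in the commands above; adjust to your installation.
preload_if_exists /path/to/lib/libiomp5.so
preload_if_exists /path/to/lib/libjemalloc.so
```

With this guard, a missing library produces a warning on stderr instead of a loader error on every subsequent command.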

# Launch with default DDIM
numactl --membind <node N> -C <cpu list> python inference_bf16.py
# Launch with DPMSolverMultistepScheduler
numactl --membind <node N> -C <cpu list> python inference_bf16.py --dpm
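
The `<node N>` and `<cpu list>` placeholders can be filled in from the machine's NUMA topology. The sketch below assumes Linux, where sysfs exposes each node's CPUs, and uses a hypothetical helper, `build_launch_cmd`, that only prints the command so it can be inspected before running. The `NODE` and `CPULIST_FILE` variables are illustrative as well.

```shell
# Hypothetical helper (not part of the original scripts): print the numactl
# command line for a given NUMA node and CPU list, plus optional script flags.
build_launch_cmd() {
    node="$1"
    cpus="$2"
    shift 2
    echo numactl --membind "$node" -C "$cpus" python inference_bf16.py "$@"
}

# On Linux, the kernel lists each node's logical CPUs in sysfs as ranges.
# Note that this list includes hyperthread siblings if they are enabled.
NODE=0
CPULIST_FILE="/sys/devices/system/node/node${NODE}/cpulist"
if [ -r "$CPULIST_FILE" ]; then
    build_launch_cmd "$NODE" "$(cat "$CPULIST_FILE")" --dpm
fi
```

If only physical cores should be used, the CPU list may need to be trimmed by hand; `lscpu` or `numactl --hardware` can help confirm the layout.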

Accelerating the inference for Stable Diffusion using INT8

Coming soon ...