VQDiffusionScheduler
VQDiffusionScheduler converts the transformer model’s output into a sample for the unnoised image at the previous diffusion timestep. It was introduced in Vector Quantized Diffusion Model for Text-to-Image Synthesis by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo.
The abstract from the paper is:
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.
VQDiffusionScheduler
class diffusers.VQDiffusionScheduler
( num_vec_classes: int, num_train_timesteps: int = 100, alpha_cum_start: float = 0.99999, alpha_cum_end: float = 9e-06, gamma_cum_start: float = 9e-06, gamma_cum_end: float = 0.99999 )
Parameters
- num_vec_classes (int): The number of classes of the vector embeddings of the latent pixels. Includes the class for the masked latent pixel.
- num_train_timesteps (int, defaults to 100): The number of diffusion steps used to train the model.
- alpha_cum_start (float, defaults to 0.99999): The starting cumulative alpha value.
- alpha_cum_end (float, defaults to 9e-06): The ending cumulative alpha value.
- gamma_cum_start (float, defaults to 9e-06): The starting cumulative gamma value.
- gamma_cum_end (float, defaults to 0.99999): The ending cumulative gamma value.
A scheduler for vector quantized diffusion.
This scheduler inherits from SchedulerMixin and ConfigMixin. Check the superclass documentation for the generic methods the library implements for all schedulers, such as loading and saving.
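To illustrate how the four cumulative start and end values could define a noise schedule, here is a minimal sketch in plain Python. It assumes the cumulative values are interpolated linearly across the training timesteps, that cumulative alpha is the running product of per-step keep rates, and that the replace probability is spread evenly over the non-masked classes. All three are assumptions made for illustration, and the helper names `linear_cumulative_schedule` and `per_step_rates` are hypothetical, not part of diffusers.

```python
import math


def linear_cumulative_schedule(start, end, num_steps):
    """Linearly interpolate a cumulative value from `start` to `end` (assumed shape)."""
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def per_step_rates(num_vec_classes, num_train_timesteps=100,
                   alpha_cum_start=0.99999, alpha_cum_end=9e-06,
                   gamma_cum_start=9e-06, gamma_cum_end=0.99999):
    """Derive per-step keep (alpha), replace (beta), and mask (gamma) rates.

    Assumption: cumulative alpha is the running product of per-step alphas, and
    one minus cumulative gamma is the running product of (1 - per-step gamma).
    """
    alpha_cum = linear_cumulative_schedule(alpha_cum_start, alpha_cum_end, num_train_timesteps)
    gamma_cum = linear_cumulative_schedule(gamma_cum_start, gamma_cum_end, num_train_timesteps)

    # Prepend the "no noise yet" state so the first step also has a ratio to take.
    alpha_prev = [1.0] + alpha_cum[:-1]
    gamma_prev = [0.0] + gamma_cum[:-1]

    alphas = [a / p for a, p in zip(alpha_cum, alpha_prev)]
    gammas = [1.0 - (1.0 - g) / (1.0 - p) for g, p in zip(gamma_cum, gamma_prev)]

    # The masked class is one of num_vec_classes, so replacement is spread
    # over the remaining num_vec_classes - 1 ordinary classes.
    num_ordinary = num_vec_classes - 1
    betas = [(1.0 - a - g) / num_ordinary for a, g in zip(alphas, gammas)]
    return alpha_cum, gamma_cum, alphas, betas, gammas


alpha_cum, gamma_cum, alphas, betas, gammas = per_step_rates(num_vec_classes=5)
recovered = math.prod(alphas)  # running product of the per-step alphas
```

Under these assumptions, `recovered` matches the final cumulative alpha up to floating-point error, and at every step the keep, replace, and mask rates sum to one.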
log_Q_t_transitioning_to_known_class
( t: torch.int32, x_t: LongTensor, log_onehot_x_t: Tensor, cumulative: bool ) → torch.Tensor of shape (batch size, num classes - 1, num latent pixels)
Parameters
- t (torch.long): The timestep that determines which transition matrix is used.
- x_t (torch.LongTensor of shape (batch size, num latent pixels)): The classes of each latent pixel at time t.
- log_onehot_x_t (torch.Tensor of shape (batch size, num classes, num latent pixels)): The log one-hot vectors of x_t.
- cumulative (bool): If cumulative is False, the single-step transition matrix t-1 -> t is used. If cumulative is True, the cumulative transition matrix 0 -> t is used.
Returns
torch.Tensor of shape (batch size, num classes - 1, num latent pixels)
Each column of the returned matrix is a row of log probabilities of the complete probability transition matrix.
When non-cumulative, returns self.num_classes - 1 rows because the initial latent pixel cannot be masked.
Where:
- q_n is the probability distribution for the forward process of the n-th latent pixel.
- C_0 is a class of a latent pixel embedding.
- C_k is the class of the masked latent pixel.
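non-cumulative result (omitting logarithms): the entry in row i and column n is q_n(x_t | x_{t-1} = C_i).
cumulative result (omitting logarithms): the entry in row i and column n is q_n(x_t | x_0 = C_i).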
Calculates the log probabilities of the rows from the (cumulative or non-cumulative) transition matrix for each latent pixel in x_t.
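The "each column is a row of the transition matrix" layout can be sketched in plain Python for a single image. The sketch assumes a mask-and-replace transition matrix with a keep rate `alpha`, a uniform replace rate `beta`, and a mask rate `gamma`. The rate values and the helper names `transition_matrix` and `log_rows_for_pixels` are hypothetical, and the code is not the library's implementation.

```python
import math


def transition_matrix(alpha, beta, gamma, num_classes):
    """Build Q where Q[m][n] = q(x_t = C_m | x_{t-1} = C_n); the last class is the mask."""
    k = num_classes - 1  # index of the masked class
    q = [[0.0] * num_classes for _ in range(num_classes)]
    for n in range(k):                 # source is an ordinary class
        for m in range(k):
            q[m][n] = alpha + beta if m == n else beta
        q[k][n] = gamma                # ordinary -> masked
    q[k][k] = 1.0                      # masked pixels stay masked
    return q


def log_rows_for_pixels(q, x_t):
    """For each latent pixel, return the log of row x_t of Q as one column.

    Only unmasked source classes are kept, giving num_classes - 1 rows.
    """
    num_ordinary = len(q) - 1
    return [[math.log(q[pixel_class][n]) for pixel_class in x_t]
            for n in range(num_ordinary)]


# Hypothetical rates for 3 ordinary classes plus 1 mask class.
alpha, gamma = 0.7, 0.06
beta = (1.0 - alpha - gamma) / 3
Q = transition_matrix(alpha, beta, gamma, num_classes=4)

x_t = [0, 2, 3]                        # three latent pixels; the last one is masked
log_q = log_rows_for_pixels(Q, x_t)    # shape: (num classes - 1, num latent pixels)
```

Here the column for the masked pixel holds `log(gamma)` in every row, because any ordinary class reaches the mask with the same probability under this parameterization.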
q_posterior
( log_p_x_0, x_t, t ) → torch.Tensor of shape (batch size, num classes, num latent pixels)
Parameters
- log_p_x_0 (torch.Tensor of shape (batch size, num classes - 1, num latent pixels)): The log probabilities for the predicted classes of the initial latent pixels. Does not include a prediction for the masked class, as the initial unnoised image cannot be masked.
- x_t (torch.LongTensor of shape (batch size, num latent pixels)): The classes of each latent pixel at time t.
- t (torch.long): The timestep that determines which transition matrix is used.
Returns
torch.Tensor of shape (batch size, num classes, num latent pixels)
The log probabilities for the predicted classes of the image at timestep t-1.
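As a rough illustration of what such a posterior looks like, the sketch below applies Bayes' rule to a single latent pixel: p(x_{t-1} | x_t) = sum over x_0 of q(x_t | x_{t-1}) q(x_{t-1} | x_0) p(x_0 | x_t) / q(x_t | x_0). It works in probability space rather than log space, assumes a mask-and-replace transition matrix with hypothetical rates, and uses hypothetical helper names (`transition_matrix`, `matmul`, `posterior`). It should not be read as the library's implementation.

```python
def transition_matrix(alpha, beta, gamma, num_classes):
    """Q[m][n] = q(x_t = C_m | x_{t-1} = C_n); the last class is the mask."""
    k = num_classes - 1
    q = [[0.0] * num_classes for _ in range(num_classes)]
    for n in range(k):
        for m in range(k):
            q[m][n] = alpha + beta if m == n else beta
        q[k][n] = gamma
    q[k][k] = 1.0
    return q


def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    size = len(a)
    return [[sum(a[i][j] * b[j][c] for j in range(size)) for c in range(size)]
            for i in range(size)]


def posterior(p_x_0, x_t, q_t, q_cum_prev):
    """Return p(x_{t-1} = C_j | x_t) for every class j, including the mask.

    p_x_0 holds predicted probabilities over the unmasked classes only.
    """
    q_cum_t = matmul(q_t, q_cum_prev)  # 0 -> t is (t-1 -> t) applied after (0 -> t-1)
    result = []
    for j in range(len(q_t)):
        total = 0.0
        for i, p_i in enumerate(p_x_0):
            total += p_i * q_t[x_t][j] * q_cum_prev[j][i] / q_cum_t[x_t][i]
        result.append(total)
    return result


# Hypothetical two-step chain with 3 ordinary classes plus 1 mask class.
Q1 = transition_matrix(alpha=0.90, beta=0.02, gamma=0.04, num_classes=4)
Q2 = transition_matrix(alpha=0.80, beta=0.03, gamma=0.11, num_classes=4)
p_x_0 = [0.6, 0.3, 0.1]

post_unmasked = posterior(p_x_0, x_t=1, q_t=Q2, q_cum_prev=Q1)
post_masked = posterior(p_x_0, x_t=3, q_t=Q2, q_cum_prev=Q1)
```

Both results are distributions over all four classes. When x_t is an ordinary class, the masked entry of the posterior is zero, since a masked pixel cannot become unmasked in the forward process.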
set_timesteps
( num_inference_steps: int, device: typing.Union[str, torch.device] = None )
Sets the discrete timesteps used for the diffusion chain (to be run before inference).
step
( model_output: Tensor, timestep: torch.int64, sample: LongTensor, generator: typing.Optional[torch._C.Generator] = None, return_dict: bool = True ) → VQDiffusionSchedulerOutput or tuple
Parameters
- model_output (torch.Tensor of shape (batch size, num classes - 1, num latent pixels)): The log probabilities for the predicted classes of the initial latent pixels (log_p_x_0). Does not include a prediction for the masked class, as the initial unnoised image cannot be masked.
- timestep (torch.long): The timestep t that determines which transition matrices are used.
- sample (torch.LongTensor of shape (batch size, num latent pixels)): The classes of each latent pixel at time t (x_t).
- generator (torch.Generator, or None): A random number generator for the noise applied to p(x_{t-1} | x_t) before it is sampled from.
- return_dict (bool, optional, defaults to True): Whether or not to return a VQDiffusionSchedulerOutput or tuple.
Returns
VQDiffusionSchedulerOutput or tuple
If return_dict is True, VQDiffusionSchedulerOutput is returned, otherwise a tuple is returned where the first element is the sample tensor.
Predict the sample from the previous timestep by the reverse transition distribution. See q_posterior() for more details about how the distribution is computed.
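One common way to add noise to log probabilities before sampling is the Gumbel-max trick: add independent Gumbel noise to each log probability and take the argmax. The sketch below shows that technique in plain Python. Whether this scheduler uses exactly this scheme is an assumption, `random.Random` stands in for a `torch.Generator`, and the helper name `sample_with_gumbel_noise` is hypothetical.

```python
import math
import random


def sample_with_gumbel_noise(log_probs, generator):
    """Draw one class index from log-probabilities using the Gumbel-max trick."""
    best_index, best_value = 0, -math.inf
    for index, log_p in enumerate(log_probs):
        u = generator.random()                              # uniform in [0, 1)
        gumbel = -math.log(-math.log(u + 1e-30) + 1e-30)    # epsilons guard against log(0)
        if log_p + gumbel > best_value:
            best_index, best_value = index, log_p + gumbel
    return best_index


generator = random.Random(0)                                # seeded for reproducibility
log_p_x_t_min_1 = [math.log(0.2), math.log(0.8)]
draws = [sample_with_gumbel_noise(log_p_x_t_min_1, generator) for _ in range(10000)]
frequency_of_class_1 = sum(draws) / len(draws)              # should land near 0.8
```

Passing a seeded generator makes the draws repeatable, which mirrors the role of the generator argument described above.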
VQDiffusionSchedulerOutput
class diffusers.schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput
( prev_sample: LongTensor )
Output class for the scheduler’s step function output.