Understanding InstaFlow/Rectified Flow

Published October 6, 2023

Hi! I usually do posts on Medium here, but I wanted to test out the Hugging Face blog, so this will be my first blog post here! The reasons I am interested in InstaFlow/rectified flows are:

  1. We talked about this in the Eleuther Diffusion Reading group and it sounded interesting
  2. I wanted to make a PR for this in diffusers (in this issue). I'll add code once the PR is done!

What is InstaFlow/Rectified Flow?

Rectified flow is a method of fine-tuning diffusion models so that you can generate images in just 1 step, while traditionally you need around 12 steps. InstaFlow is just that applied to Stable Diffusion. If you want to test it out, check out the demo here!

So, let's first look into how rectified flows work.

Rectified Flows


Rectified Flows were introduced in the paper "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow".

The paper deals with the idea of not just generating images in one step but, more generally, having two image distributions, $\pi_0$, which can be horses, and $\pi_1$, which can be zebras, and then making the flow between them as straight as possible. The extension to one-step generation is that if we make the flow between a noisy distribution $\pi_0$ and the real image distribution $\pi_1$ as short and straight as possible, instant generation of images becomes possible!

This is called an image-to-image translation/transport mapping problem.

Transport Mapping Problem in GANs

The transport mapping problem has seen its fair share of interest in almost any field that deals with images. For example, CycleGAN

[Figure: CycleGAN image-to-image translation]

And its subsequent improvements, StarGAN and StarGAN v2, use GANs to learn the mapping between images of one domain and another! The main method used to make this work is called cycle consistency loss: given an image of a zebra, you turn it into a horse, then back into a zebra, and make sure it's still the same zebra. For more details, check out the link!
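As a rough illustration, a cycle consistency loss in PyTorch might look like the minimal sketch below. `gen_horse2zebra` and `gen_zebra2horse` are hypothetical generator networks, not code from CycleGAN itself:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(gen_horse2zebra, gen_zebra2horse, real_zebra):
    # Translate zebra -> horse -> zebra and check we recover the input.
    fake_horse = gen_zebra2horse(real_zebra)
    reconstructed_zebra = gen_horse2zebra(fake_horse)
    # CycleGAN penalizes the reconstruction with an L1 distance.
    return F.l1_loss(reconstructed_zebra, real_zebra)
```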

However, GANs by themselves have severe training instability, and the generated images do not have as good quality as those from diffusion models, so that's where this method comes in!

Back to Rectified Flows

Now, let us sample $X_0 \sim \pi_0$ and $X_1 \sim \pi_1$. Back to our analogy, this means we get a particular image of a horse, $X_0$, from the distribution of horse images, $\pi_0$, and a particular zebra image, $X_1$, from the distribution of zebra images, $\pi_1$.

Now, given the pair of images $(X_0, X_1)$, we define a parameter $t$ between 0 and 1 (inclusive) which says how far between $X_0$ and $X_1$ we are. So $X_{0.5}$ is exactly halfway there.

Now, while traditional diffusion models can do this too, the pathway is not exactly straight. In fact, it can be very roundabout, as can be seen below.

[Figure: diffusion model sampling trajectories, which can be quite roundabout]

So now, let's move from pathway $X$ to pathway $Z$. We still want the same endpoints, in that $X_0 = Z_0$ and $X_1 = Z_1$, but we want the path between the points to be as straight as possible. This is written as

$$dZ_t = v(Z_t, t)\,dt$$

Now, first of all, $d$ means a very tiny step, and $v$ here means the velocity at the point $Z_t$ at timestep $t$. What this means is that if we keep pushing $Z_t$ in the direction of $v(Z_t, t)$, recomputing $v$ each time, then we will reach $Z_1$ and we will successfully have a zebra.
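In code, "keep pushing $Z_t$ in the direction of $v$" is just Euler integration of the ODE. Here is a minimal sketch, assuming `velocity_model(z, t)` is a hypothetical network that predicts $v(Z_t, t)$:

```python
import torch

@torch.no_grad()
def integrate_flow(velocity_model, z0, num_steps=100):
    # Euler method: repeatedly push z along the predicted velocity.
    z = z0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((z.shape[0],), i * dt, device=z.device)
        z = z + velocity_model(z, t) * dt  # dZ_t = v(Z_t, t) dt
    return z  # approximately Z_1
```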

Now, as we mentioned plenty of times before, we want a straight line. What does this mean for $v$? First of all, $v$ should be constant, because a straight line should push $Z_t$ the same amount and in the same direction regardless of what $t$ is. Second, it should be the shortest path. This means that when we do integration (which here just means adding up the steps along the entire path of $Z$), the total displacement should be $X_1 - X_0$.

Now, in formal terms, this will mean

$$\min_v \int_0^1 \mathbb{E}\left[\lVert (X_1 - X_0) - v(X_t, t) \rVert^2\right] dt$$

with

$$X_t = tX_1 + (1-t)X_0$$

This can also be written as

$$dX_t = (X_1 - X_0)\,dt$$

too!
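In code, this objective is just an MSE between the model's predicted velocity and the constant direction $X_1 - X_0$, evaluated at a random point on the straight line. A minimal sketch, again with the hypothetical `velocity_model`:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(velocity_model, x0, x1):
    # Sample a random time t in [0, 1] per example.
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, 1, 1, 1)  # broadcast over (C, H, W)
    # Linear interpolation: X_t = t * X_1 + (1 - t) * X_0
    xt = t_ * x1 + (1 - t_) * x0
    # The target velocity is the constant direction X_1 - X_0.
    return F.mse_loss(velocity_model(xt, t), x1 - x0)
```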

A quick sidenote here, for DDIMs, this will be

$$X_t = \alpha_t X_0 + \beta_t X_1$$

If people are interested, I can link some theory background for this here!

Now, the paper goes into some very interesting math which I'll skip in this blog, but I recommend you check it out if you like math and differential equations and want to see why the above won't fall into certain pitfalls.

Now, in practice, as you may have guessed, $v$ will be our Stable Diffusion model, $X_0$ will be the initial noise, and $X_1$ will be the output image. So one strategy, as I understand it, is that we can record a huge dataset of initial noise and output images from Stable Diffusion. Then, we can fine-tune a Stable Diffusion model so that the predicted velocity always follows the straight line between the two, given $X_t = tX_1 + (1-t)X_0$. The overall algorithm is

[Figure: rectified flow training algorithm from the paper]

One great thing about this, as can be seen from the algorithm, is that since we defined $t$ to be between 0 and 1, we can just add $v$ times 1 to $X_0$ to get $X_1$!
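Concretely, once the flow is straight, sampling collapses to a single Euler step of size 1 (a sketch with the same hypothetical `velocity_model`):

```python
import torch

@torch.no_grad()
def one_step_sample(velocity_model, x0):
    # With a perfectly straight flow, v is constant along the path,
    # so one Euler step of size dt = 1 takes us from X_0 to X_1.
    t = torch.zeros(x0.shape[0], device=x0.device)
    return x0 + velocity_model(x0, t) * 1.0
```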

Reflow

Now, one problem: is this straightening really such a trivial solution? Is there no error associated with it? The answer is that there is error! The solution is that once you get your best possible path from $Z_0$ to $Z_1$, you just apply rectified flow on that path again and again until it finally becomes straight, as you can see below.

[Figure: reflow progressively straightening the flow]

The algorithm is

[Figure: reflow algorithm from the paper]

However, the paper mentions that while doing reflow makes the line straighter and shorter, it comes at the cost of getting a proper $X_1$, as the outputs deviate more and more with each round.
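A sketch of one reflow round, under my reading of the algorithm: use the current model to simulate the full ODE and produce new (noise, image) pairs, then retrain on those pairs with the same loss. `integrate_flow` and `rectified_flow_loss` are the hypothetical helpers sketched above:

```python
def reflow_step(velocity_model, optimizer, noise_batches, ode_steps=100):
    for x0 in noise_batches:
        # Re-pair each noise sample with the *current* model's output,
        # so the new pairs lie on the model's own trajectories.
        x1 = integrate_flow(velocity_model, x0, num_steps=ode_steps)  # runs without gradients
        # Train on the straight line between the new pair.
        loss = rectified_flow_loss(velocity_model, x0, x1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```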

Distillation

Now, given that we have a reflowed model that can predict velocity, we can distill it. For this, the InstaFlow paper gives the following objective:

[Figure: InstaFlow distillation objective]

Essentially, what this does is that instead of trying to predict the velocity that, when added to $X_0$, will become $X_1$, we are trying to directly predict $X_1$, which is pretty interesting.
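Here is a sketch of that distillation step, assuming the student starts as a copy of the reflowed teacher and we compare the student's one-step output against the teacher's full ODE output. The paper uses an LPIPS-style similarity loss; plain MSE is shown here just to keep the sketch simple:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, x0, ode_steps=100):
    # Teacher: simulate the full flow to get the reference X_1.
    with torch.no_grad():
        x1_teacher = integrate_flow(teacher, x0, num_steps=ode_steps)
    # Student: predict X_1 directly with a single step from X_0.
    t = torch.zeros(x0.shape[0], device=x0.device)
    x1_student = x0 + student(x0, t)
    # InstaFlow uses a perceptual (LPIPS) loss here; MSE is a stand-in.
    return F.mse_loss(x1_student, x1_teacher)
```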

Now, the paper also has math for applying this to DDIMs, given their non-linear paths, but since we are mainly concerned with getting a PR done, let's move on to InstaFlow!

InstaFlow

InstaFlow is pretty much rectified flow applied to Stable Diffusion. For some statistics, they trained with 199 A100 days = 4,776 A100 GPU hours, which should cost around 5,000 dollars for institutions with deals or 10,000 dollars for those without. This is pretty cheap considering Stable Diffusion 2.1 was trained with 200,000 GPU hours, which translates to pretty much 200k dollars. It can generate images in 0.12 seconds on an A100, which makes sense as it is a 1-step model.

Training algorithm

[Figure: InstaFlow training algorithm]

As can be seen above, the algorithm is pretty much exactly the same, except we condition on text while the original rectified flow was unconditional. Then, there's an extra step for distillation. The authors observed that reflow was very important for good quality.

InstaFlow training setup

They used a subset of prompts from laion2B-en. The teacher images are generated with 25 steps of the DPM solver with a guidance scale of 6.0. For distillation, they used an LPIPS loss, with a network which I assume is VGG, to capture the high-level similarities of images (faces, objects, etc.). Finally, they used a batch size of 32 and 8 A100 GPUs for training with the AdamW optimizer.
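For the PR, the pair-generation script might look roughly like this with diffusers. This is a sketch, not the final implementation: the model ID, prompt, and what exactly gets saved are placeholders matching the paper's reported settings (DPM solver, 25 steps, guidance scale 6.0):

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Match the paper's setup: DPM solver, 25 steps, guidance scale 6.0.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "a photo of a horse in a field"  # in practice, a laion2B-en prompt
# Record the exact starting latent so we can save the (noise, image) pair.
latents = torch.randn((1, 4, 64, 64), dtype=torch.float16, device="cuda")
image = pipe(
    prompt, num_inference_steps=25, guidance_scale=6.0, latents=latents
).images[0]
# Save (latents, prompt, image) as one training triple for rectified flow.
```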

TODO list

Overall, this is it! So as a TODO list for the PR, we need to:

  • Figure out how to map epsilon to velocity. My understanding is that we ignore the DDPM/epsilon objectives during rectified flow and just have the UNet output $v$ directly
  • Make a script to generate the latent noise, images, and text to save to the dataset
  • Make rectified flow/reflow script
  • Make distillation script