RaiFlow

RaiFlow is a 0.7B parameter single-stream transformer model for text-to-image generation.
RaiFlow has no text encoder; it handles text tokens directly in the image transformer.
RaiFlow uses x0 prediction with Rectified Flow sampling.
RaiFlow uses the Flux2 VAE and the Qwen tokenizer.

This version of RaiFlow is a very rough pre-training run with a high learning rate.
It is rough to use and only works at resolutions around 256x256 pixels.
It was trained at home on 2x Intel Arc Pro B60 24 GB GPUs for around 550K training steps with an effective batch size of 128, which cost $50 in electricity.
It has had no aesthetic training or RL applied; it is the raw pre-trained model straight from the oven.

The training data is a curated version of Danbooru plus some other sources, totaling around 6.7M images.
Captions combine natural language from Florence2 with Danbooru tags from Danbooru and WD Tagger.

Example Code

import torch
import diffusers

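# Pick the first available accelerator: Intel XPU, then Apple MPS, then CUDA, else CPU.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
elif hasattr(torch, "mps") and hasattr(torch.mps, "is_available") and torch.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
# Use half precision on accelerators and full precision on CPU.
torch_dtype = torch.float16 if device != "cpu" else torch.float32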

model_path = "Disty0/RaiFlow-v0_01-256px-rough-pre-train"
pipe = diffusers.DiffusionPipeline.from_pretrained(model_path, custom_pipeline=model_path, trust_remote_code=True, torch_dtype=torch_dtype).to(device)

prompt = "1girl, character raifu, blush, upper body, blue eyes, woman, white sweater, solo, light smile, cat ears, looking at viewer, small medium breasts, long hair, dark red hair, simple background, white background"
aes_prompt = "very aesthetic, best quality, year 2025, sfw rating, art by aviary, art by demcheg, art by woogi (duguddnr), art by honon, art by rynzfrancis"
negative_prompt = "worst aesthetic, bad aesthetic, worst quality, bad quality, low quality, monochrome, sketch, compression artifacts"
add_aes = True

with torch.no_grad():
    pipeline_output = pipe(
        width=192,
        height=320,
        guidance_scale=7.5,
        num_inference_steps=30,
        pad_to_multiple_of=256,
        max_sequence_length=512,
        prompt=prompt if not add_aes else (aes_prompt + ", " + prompt),
        negative_prompt=negative_prompt,
    )
pipeline_output.images[0].save("raiflow.png")

Training details:

Effective Batch Size 128
Batch Size 16
Gradient Accumulation Steps 4
Number of GPUs 2
GPUs Intel Arc Pro B60 24 GB
Learning Rate 4e-5
Optimizer sdnq.optim.AdamW
Optimizer Final Norm Mode rms_clip
Optimizer States Dtype SDNQ FP16 Quantization with Group Size 32
Model Weights Dtype SDNQ FP16 Quantization with Group Size 32
Activations and Gradients Dtype FP16 with Grad Scaler
EMA Weights Dtype FP32
Use EMA True
Self Correct Rate 0.25
Mask Rate 0.25
Image Resolution Buckets Around 256x256
Text Token Length 256 or 512 depending on prompt length
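As a quick sanity check on the values above, the effective batch size follows from the per-GPU batch size, the gradient accumulation steps and the number of GPUs. The sketch below also estimates how many images the run saw. It assumes, which the card does not state, that the roughly 550K training steps are optimizer steps at the full effective batch size and that all of the roughly 6.7M images were used.

```python
# Values taken from the training details and the overview.
batch_size_per_gpu = 16
gradient_accumulation_steps = 4
num_gpus = 2

effective_batch_size = batch_size_per_gpu * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128

# Assumption: one "training step" is one optimizer step at the effective batch size.
training_steps = 550_000
dataset_size = 6_700_000  # around 6.7M images

images_seen = training_steps * effective_batch_size
approx_passes = images_seen / dataset_size
print(f"{images_seen / 1e6:.1f}M images seen, about {approx_passes:.1f} passes over the data")
```

Under those assumptions the run corresponds to roughly ten passes over the dataset.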

Self Correct Rate:

We run a forward pass on the image inputs, then take the model's x0 outputs as a new starting point and create a new target that moves away from the model's prediction and toward the real image.
In theory, this lets the model correct itself during inference instead of continuing straight down a wrong path with rectified flow.
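One way to read this is sketched below in plain Python. It is not the actual training code. It assumes the usual rectified-flow interpolation x_t = (1 - t) * x0 + t * noise, treats the Self Correct Rate as a per-sample probability, and reuses the same noise when rebuilding the input. `toy_model`, `noisy_input` and `build_training_pair` are hypothetical names, and `toy_model` is only a stand-in for the transformer.

```python
import random

def noisy_input(x0, noise, t):
    """Rectified-flow interpolation between a clean sample (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]

def toy_model(x_t, t):
    """Hypothetical stand-in for the transformer: returns an imperfect x0 guess."""
    return [0.9 * v for v in x_t]

def build_training_pair(x0, t, self_correct_rate=0.25, rng=random):
    """Return (model_input, target) for one sample.

    With probability self_correct_rate the input is rebuilt from the model's
    own x0 prediction, while the target stays the real image.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = noisy_input(x0, noise, t)
    if rng.random() < self_correct_rate:
        x0_pred = toy_model(x_t, t)           # extra forward pass on the noisy input
        x_t = noisy_input(x0_pred, noise, t)  # restart from the model's own guess
    return x_t, x0                            # the target is always the real image

x0 = [0.2, -0.5, 0.8, 0.1]
model_input, target = build_training_pair(x0, t=0.5, rng=random.Random(0))
print(model_input, target)
```

Because the target is still the real image, the loss on a self-corrected sample pushes the model from its own earlier guess back toward the data.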

Mask Rate:

We mask the model inputs with ones and make the model guess the missing chunks.
In theory, this will make the model better at inpainting.
We mask with ones instead of zeros because zero is a valid input value.
Since we use x0 prediction and don't have to worry about modeling the noise, we can safely use masking.
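A minimal sketch of the masking idea follows, using a flat list in place of a latent tensor. It assumes the Mask Rate is applied as a per-chunk probability. The card does not say how chunks are chosen or sized, so `mask_chunks` and `chunk_size` are hypothetical.

```python
import random

MASK_VALUE = 1.0  # ones rather than zeros, because zero is a valid input value

def mask_chunks(latent, chunk_size, mask_rate=0.25, rng=random):
    """Replace random chunks of a flat latent with ones.

    Returns the masked copy and a per-element boolean mask (True = hidden).
    """
    masked = list(latent)
    hidden = [False] * len(latent)
    for start in range(0, len(latent), chunk_size):
        if rng.random() < mask_rate:
            for i in range(start, min(start + chunk_size, len(latent))):
                masked[i] = MASK_VALUE
                hidden[i] = True
    return masked, hidden

latent = [0.0, 0.3, -0.7, 0.0, 0.5, -0.1, 0.9, 0.0]
masked, hidden = mask_chunks(latent, chunk_size=2, rng=random.Random(0))
# The training target stays the original latent, so the model has to fill in the hidden chunks.
print(masked, hidden)
```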

Tag formats:

Notes:

Underscores are removed from tags.
We used 4 separate text files for each image.
The 1st is natural language from Florence2.
The 2nd is the full combined WD Tagger + Danbooru tags without any dropout.
The 3rd is the combined WD Tagger + Danbooru tags with 0.5 dropout.
The 4th includes only the rating, artist and character tags, plus 0 to 8 randomly selected general tags.
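The three tag-based files can be sketched as below. This is an illustration, not the dataset tooling. It assumes the 0.5 dropout is applied independently to every tag, including rating, artist and character tags, and that the inputs are already formatted tag strings. `build_tag_variants` and its arguments are hypothetical names.

```python
import random

def clean(tag):
    """Remove underscores, as described in the notes."""
    return tag.replace("_", " ")

def build_tag_variants(rating, artists, characters, general, rng=random):
    """Return the three tag-based caption variants (files 2 to 4) as tag lists."""
    general = [clean(t) for t in general]
    fixed = [clean(t) for t in [rating, *artists, *characters]]

    full = fixed + general                              # 2nd file: no dropout
    dropped = [t for t in full if rng.random() >= 0.5]  # 3rd file: 0.5 dropout
    extra_count = rng.randint(0, min(8, len(general)))  # 4th file: 0 to 8 general tags
    sparse = fixed + rng.sample(general, extra_count)
    return full, dropped, sparse

full, dropped, sparse = build_tag_variants(
    rating="sfw rating",
    artists=["art by ru zhai"],
    characters=["character blaze (arknights)"],
    general=["1girl", "cat_ears", "long_hair", "looking_at_viewer"],
    rng=random.Random(0),
)
print(", ".join(full))
print(", ".join(dropped))
print(", ".join(sparse))
```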

Tag order:

aesthetic tags, quality tags, year tags, style age tags, special tags, artist tags, medium tags, rating tags, no shuffle tags, character tags, copyright tags, general tags, meta tags

Aesthetic tags:

Images are scored with 4 different aesthetic scorers, and the aesthetic tag is assigned based on the average score across all of them.

very aesthetic # around 80K images
highly aesthetic # around 230K images
moderate aesthetic # around 2.7M images
low aesthetic # around 2M images
bad aesthetic # around 680K images
worst aesthetic # around 220K images

The image counts here include only the curated Danbooru dataset.
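The averaging step can be illustrated as follows. The card does not give the score scale or the cut-offs, so the 0 to 1 scale and every threshold below are hypothetical placeholders. Only the tag names and the idea of averaging several scorers come from the text.

```python
from statistics import mean

# Hypothetical cut-offs on a 0..1 scale; the real thresholds are not given in the source.
AESTHETIC_THRESHOLDS = [
    (0.90, "very aesthetic"),
    (0.75, "highly aesthetic"),
    (0.50, "moderate aesthetic"),
    (0.30, "low aesthetic"),
    (0.15, "bad aesthetic"),
]

def aesthetic_tag(scores):
    """Average the scorer outputs and map the mean to an aesthetic tag."""
    avg = mean(scores)
    for threshold, tag in AESTHETIC_THRESHOLDS:
        if avg >= threshold:
            return tag
    return "worst aesthetic"

# Four scores, one per aesthetic scorer.
print(aesthetic_tag([0.95, 0.91, 0.88, 0.93]))  # very aesthetic
```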

Quality tags:

Images are scored with rating-normalized scores from Danbooru and 2 different quality scorers, and the quality tag is assigned based on the average score across all of them.

best quality # around 150K images
high quality # around 420K images
great quality # around 720K images
normal quality # around 1.75M images
low quality # around 1.4M images
bad quality # around 1M images
worst quality # around 550K images

The image counts here include only the curated Danbooru dataset.

Year tags:

year number

Example:

year 2025

Style Age tags:

This tag is added when the tagged style age of the image is older than its year tag.

1970s (style) # around 13K images
1980s (style) # around 90K images
1990s (style) # around 100K images
2000s (style) # around 550K images
2010s (style) # around 500K images
2015s (style) # around 1.3M images

Special tags:

Images from the curated VNCG dataset get visual novel cg added as a special tag.

Artist tags:

art by name_of_the_artist

Example:

art by ru zhai

Medium tags:

Meta tags that contain (medium) are placed here.
The one exception is photoshop (medium), which is removed.

Rating tags:

Rating tags come from WD Tagger.
I used the WD Tagger rating instead of the Danbooru rating because Danbooru ratings are very unreliable.

sfw rating # around 2M images
sensitive rating # around 3M images
nsfw rating # around 550K images
explicit nsfw rating # around 420K images

No Shuffle tags:

Tags such as 1girl, 2girls, etc. are moved here from the general tags.

Character tags:

character name_of_the_character

Example:

character blaze (arknights)

Copyright tags:

from name_of_the_series

Example:

from arknights

General tags:

General tags in random order.

Meta tags:

Meta tags in random order.
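Putting the tag order and the per-category formats together, a prompt could be assembled roughly like this. It is a sketch rather than the actual captioning script: the dictionary keys and `build_prompt` are hypothetical, while the order, the prefixes and the shuffling of general and meta tags follow the descriptions above.

```python
import random

TAG_ORDER = [
    "aesthetic", "quality", "year", "style_age", "special", "artist", "medium",
    "rating", "no_shuffle", "character", "copyright", "general", "meta",
]
# Prefixes from the year, artist, character and copyright tag formats.
PREFIXES = {"year": "year ", "artist": "art by ", "character": "character ", "copyright": "from "}
SHUFFLED = {"general", "meta"}  # these two groups are emitted in random order

def build_prompt(tags, rng=random):
    """Join the tag groups in the documented order, shuffling general and meta tags."""
    parts = []
    for category in TAG_ORDER:
        group = [PREFIXES.get(category, "") + t.replace("_", " ") for t in tags.get(category, [])]
        if category in SHUFFLED:
            rng.shuffle(group)
        parts.extend(group)
    return ", ".join(parts)

tags = {
    "aesthetic": ["very aesthetic"],
    "quality": ["best quality"],
    "year": ["2025"],
    "artist": ["ru zhai"],
    "rating": ["sfw rating"],
    "no_shuffle": ["1girl"],
    "character": ["blaze (arknights)"],
    "copyright": ["arknights"],
    "general": ["cat_ears", "long_hair", "looking_at_viewer"],
}
prompt = build_prompt(tags, rng=random.Random(0))
print(prompt)
```

Writing inference prompts in the same order as the training captions is likely to match what the model saw during training most closely.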

Custom General tags:

Images that are smaller than 300KB after lossless JXL compression get the compression artifacts tag.
PNG images are an exception and do not get this tag.
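The rule can be written as a small predicate. This sketch assumes that "300KB" means 300,000 bytes, that the PNG exception refers to the original file format, and that the JXL size is measured elsewhere. `needs_compression_artifacts_tag` is a hypothetical helper name.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 300 * 1000  # assumption: "300KB" read as 300,000 bytes

def needs_compression_artifacts_tag(source_path, jxl_size_bytes):
    """Return True if the image should get the "compression artifacts" tag.

    source_path: original file name, used only to detect PNG sources.
    jxl_size_bytes: file size after lossless JXL re-encoding (computed elsewhere).
    """
    if Path(source_path).suffix.lower() == ".png":
        return False  # PNG images are exempt
    return jxl_size_bytes < SIZE_LIMIT_BYTES

print(needs_compression_artifacts_tag("image.jpg", 120_000))  # True
print(needs_compression_artifacts_tag("image.png", 120_000))  # False
```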
