RaiFlow
RaiFlow is a 0.7B Single Stream Transformer model for text-to-image generation.
RaiFlow has no text encoder; it handles text tokens directly in the image transformer.
RaiFlow uses x0 prediction with Rectified Flow sampling.
RaiFlow uses the Flux2 VAE and the Qwen tokenizer.
This version of RaiFlow is a very rough, high learning rate pre-training run.
The current RaiFlow version is very rough to use and only supports resolutions around 256x256 pixels.
The current RaiFlow version was trained at home on 2x Intel Arc Pro B60 24 GB GPUs for around 550K training steps with an effective batch size of 128, and cost $50 in electricity.
The current RaiFlow version has no aesthetic training or RL applied. It is the raw pre-trained model, straight from the oven.
It was trained on a curated version of Danbooru and some other sources, totaling around 6.7M images.
Captions use natural language from Florence2 and Danbooru tags from Danbooru and WD taggers.
Example Code
import torch
import diffusers

# Pick the first available accelerator: Intel XPU, then Apple MPS, then CUDA; fall back to CPU.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "mps" if hasattr(torch, "mps") and hasattr(torch.mps, "is_available") and torch.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision on accelerators and full precision on CPU.
torch_dtype = torch.float16 if device != "cpu" else torch.float32

model_path = "Disty0/RaiFlow-v0_01-256px-rough-pre-train"
# The pipeline code is loaded from the model repository, so remote code must be trusted.
pipe = diffusers.DiffusionPipeline.from_pretrained(model_path, custom_pipeline=model_path, trust_remote_code=True, torch_dtype=torch_dtype).to(device)

prompt = "1girl, character raifu, blush, upper body, blue eyes, woman, white sweater, solo, light smile, cat ears, looking at viewer, small medium breasts, long hair, dark red hair, simple background, white background"
aes_prompt = "very aesthetic, best quality, year 2025, sfw rating, art by aviary, art by demcheg, art by woogi (duguddnr), art by honon, art by rynzfrancis"
negative_prompt = "worst aesthetic, bad aesthetic, worst quality, bad quality, low quality, monochrome, sketch, compression artifacts"
add_aes = True  # prepend the aesthetic prompt to the main prompt

with torch.no_grad():
    pipeline_output = pipe(
        width=192,
        height=320,
        guidance_scale=7.5,
        num_inference_steps=30,
        pad_to_multiple_of=256,
        max_sequence_length=512,
        prompt=prompt if not add_aes else (aes_prompt + ", " + prompt),
        negative_prompt=negative_prompt,
    )

pipeline_output.images[0].save("raiflow.png")
Training details:
| Setting | Value |
| --- | --- |
| Effective Batch Size | 128 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 4 |
| Number of GPUs | 2 |
| GPUs | Intel Arc Pro B60 24 GB |
| Learning Rate | 4e-5 |
| Optimizer | sdnq.optim.AdamW |
| Optimizer Final Norm Mode | rms_clip |
| Optimizer States Dtype | SDNQ FP16 Quantization with Group Size 32 |
| Model Weights Dtype | SDNQ FP16 Quantization with Group Size 32 |
| Activations and Gradients Dtype | FP16 with Grad Scaler |
| EMA Weights Dtype | FP32 |
| Use EMA | True |
| Self Correct Rate | 0.25 |
| Mask Rate | 0.25 |
| Image Resolution | Buckets Around 256x256 |
| Text Token Length | 256 or 512 depending on prompt length |
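The batch-size rows in the table are related by simple arithmetic. The sketch below reproduces the effective batch size and gives a rough estimate of how many samples the run saw. It assumes the listed batch size of 16 is per GPU and that the roughly 550K training steps are counted at the effective batch size; the card states neither explicitly, so treat the epoch figure as an estimate only.

```python
# Values taken from the training table and the overview of this card.
per_gpu_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 2

effective_batch_size = per_gpu_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: the ~550K training steps are counted at the effective batch size.
training_steps = 550_000
dataset_size = 6_700_000  # "around 6.7M images"

samples_seen = training_steps * effective_batch_size
approx_epochs = samples_seen / dataset_size

print(effective_batch_size)      # 128
print(samples_seen)              # 70400000
print(round(approx_epochs, 1))   # 10.5
```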
Self Correct Rate:
We run a forward pass on the image inputs, take the model's x0 outputs as the starting point, and create a new target that moves away from the model prediction and toward the real images.
In theory, this lets the model correct itself during inference instead of continuing straight down a wrong path with rectified flow.
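One possible reading of the self-correction step is sketched below in plain Python. It is an interpretation, not the training code: `predict_x0` is a hypothetical stand-in for the model, the linear interpolation is the usual rectified flow mix, and the assumption that the target stays the real image (because the model predicts x0) is mine.

```python
import random

def rectified_flow_mix(x0, noise, t):
    """Interpolate between clean data (t=0) and pure noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def build_training_pair(x0_real, predict_x0, t, rng, self_correct_rate=0.25):
    """Return (model_input, target) for one sample.

    predict_x0 is a hypothetical callable (noisy_input, t) -> x0 estimate.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0_real]
    model_input = rectified_flow_mix(x0_real, noise, t)
    if rng.random() < self_correct_rate:
        # Extra forward pass (without gradients in a real trainer): the
        # model's own, possibly wrong, x0 estimate becomes the starting point.
        x0_pred = predict_x0(model_input, t)
        model_input = rectified_flow_mix(x0_pred, noise, t)
    # Assumption: the target always points back to the real image.
    return model_input, x0_real
```

Under this reading, the model sometimes sees inputs built from its own mistakes and is still asked to recover the real image, which is what would let it steer back during sampling.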
Mask Rate:
We mask the model inputs with ones and make the model guess the missing chunks.
In theory, this should make the model better at inpainting.
We mask with ones instead of zeros because zero is a valid input value.
Since we use x0 prediction and don't have to worry about modeling the noise, we can safely use masking.
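The masking idea can be illustrated on a flat list of latent values. This is a minimal sketch under stated assumptions: the mask rate is treated as a per-sample probability, and a "chunk" is a single contiguous span of up to `max_chunk` values. Both choices and the function name are hypothetical; the card does not describe the chunk shape.

```python
import random

MASK_VALUE = 1.0  # ones, because zero is a valid latent value

def mask_chunk_with_ones(latent_values, rng, mask_rate=0.25, max_chunk=4):
    """Return (masked_values, mask) where mask marks the overwritten positions."""
    masked = list(latent_values)
    mask = [False] * len(masked)
    if masked and rng.random() < mask_rate:
        length = rng.randint(1, min(max_chunk, len(masked)))
        start = rng.randint(0, len(masked) - length)
        for i in range(start, start + length):
            masked[i] = MASK_VALUE
            mask[i] = True
    return masked, mask
```

The training target would remain the original, unmasked values, so the model has to fill in the overwritten span.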
Tag formats:
Notes:
Underscores in tags are replaced with spaces.
We used 4 separate text files for each image.
The 1st is natural language from Florence2.
The 2nd is the full set of WD Tagger + Danbooru tags combined, without any dropout.
The 3rd is the WD Tagger + Danbooru tags combined, with 0.5 dropout.
The 4th includes only the rating, artist, and character tags, plus 0 to 8 randomly selected general tags.
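The four caption variants described above could be produced roughly as follows. This is a sketch, not the dataset tooling: the function and argument names are hypothetical, and "0.5 dropout" is assumed to mean that each tag is dropped independently with probability 0.5.

```python
import random

def clean(tag):
    """Replace underscores with spaces, as the notes describe."""
    return tag.replace("_", " ")

def build_caption_variants(caption, all_tags, core_tags, general_tags, rng):
    """Return the 4 text variants for one image as a list of strings.

    core_tags holds the rating, artist and character tags.
    """
    all_tags = [clean(t) for t in all_tags]
    full = ", ".join(all_tags)
    # Assumption: every tag is dropped independently with probability 0.5.
    kept = [t for t in all_tags if rng.random() >= 0.5]
    dropped = ", ".join(kept)
    # 0 to 8 general tags, picked at random.
    count = rng.randint(0, min(8, len(general_tags)))
    sampled = rng.sample(general_tags, count)
    minimal = ", ".join(clean(t) for t in core_tags + sampled)
    return [caption, full, dropped, minimal]
```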
Tag order:
aesthetic tags, quality tags, year tags, style age tags, special tags, artist tags, medium tags, rating tags, no shuffle tags, character tags, copyright tags, general tags, meta tags
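When writing prompts by hand, it can help to assemble them in the same order. The helper below is an illustrative sketch: the category keys and the function name are hypothetical, and only the general and meta groups are shuffled, matching the "random order" notes for those two groups further down.

```python
import random

TAG_ORDER = [
    "aesthetic", "quality", "year", "style age", "special", "artist",
    "medium", "rating", "no shuffle", "character", "copyright",
    "general", "meta",
]
SHUFFLED_CATEGORIES = {"general", "meta"}

def assemble_prompt(tags_by_category, rng):
    """Join tags in the documented category order; missing categories are skipped."""
    parts = []
    for category in TAG_ORDER:
        tags = list(tags_by_category.get(category, []))
        if category in SHUFFLED_CATEGORIES:
            rng.shuffle(tags)
        parts.extend(tags)
    return ", ".join(parts)

example = {
    "general": ["cat ears"],
    "rating": ["sfw rating"],
    "aesthetic": ["very aesthetic"],
    "quality": ["best quality"],
    "year": ["year 2025"],
    "artist": ["art by ru zhai"],
    "no shuffle": ["1girl"],
    "character": ["character blaze (arknights)"],
    "copyright": ["from arknights"],
}
print(assemble_prompt(example, random.Random(0)))
```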
Aesthetic tags:
Images are scored with 4 different aesthetic scorers, and the aesthetic tag is assigned based on the average score across all scorers.
very aesthetic # around 80K images
highly aesthetic # around 230K images
moderate aesthetic # around 2.7M images
low aesthetic # around 2M images
bad aesthetic # around 680K images
worst aesthetic # around 220K images
Image counts here include only the curated Danbooru dataset.
Quality tags:
Images are scored with rating-normalized scores from Danbooru and 2 different quality scorers, and the quality tag is assigned based on the average score across all of them.
best quality # around 150K images
high quality # around 420K images
great quality # around 720K images
normal quality # around 1.75M images
low quality # around 1.4M images
bad quality # around 1M images
worst quality # around 550K images
Image counts here include only the curated Danbooru dataset.
Year tags:
year number
Example:
year 2025
Style Age tags:
If the tagged style age of the image is older than the year tag, then this tag will be added.
1970s (style) # around 13K images
1980s (style) # around 90K images
1990s (style) # around 100K images
2000s (style) # around 550K images
2010s (style) # around 500K images
2015s (style) # around 1.3M images
Special tags:
Images from the curated VNCG dataset get visual novel cg added as a special tag.
Artist tags:
art by name_of_the_artist
Example:
art by ru zhai
Medium tags:
Meta tags that contain (medium) are placed here.
The one exception is photoshop (medium), which is removed.
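The medium rule is a simple filter over the meta tags. A minimal sketch, assuming underscores have already been replaced with spaces; the function name and the sample tags in the usage line are hypothetical.

```python
def split_medium_tags(meta_tags):
    """Return (medium_tags, remaining_meta_tags)."""
    medium, remaining = [], []
    for tag in meta_tags:
        if tag == "photoshop (medium)":
            continue  # the documented exception: dropped entirely
        if "(medium)" in tag:
            medium.append(tag)
        else:
            remaining.append(tag)
    return medium, remaining

print(split_medium_tags(["watercolor (medium)", "photoshop (medium)", "highres"]))
```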
Rating tags:
Rating tags from WD Tagger.
I used the rating from WD Tagger instead of Danbooru because Danbooru ratings are very unreliable.
sfw rating # around 2M images
sensitive rating # around 3M images
nsfw rating # around 550K images
explicit nsfw rating # around 420K images
No Shuffle tags:
Count tags such as 1girl, 2girls, etc. are taken out of the general tags and placed here.
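Pulling these tags out of the general tags might look like the sketch below. The regular expression is an assumption: the card only names 1girl and 2girls, so the pattern also accepting boy/other variants and forms such as 6+girls is a guess based on common Danbooru count tags.

```python
import re

# Assumption: count tags look like 1girl, 2girls, 6+boys, 1other.
COUNT_TAG = re.compile(r"^\d+\+?(girl|boy|other)s?$")

def split_no_shuffle(general_tags):
    """Return (no_shuffle_tags, remaining_general_tags), preserving order."""
    no_shuffle = [t for t in general_tags if COUNT_TAG.match(t)]
    remaining = [t for t in general_tags if not COUNT_TAG.match(t)]
    return no_shuffle, remaining
```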
Character tags:
character name_of_the_character
Example:
character blaze (arknights)
Copyright tags:
from name_of_the_series
Example:
from arknights
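The artist, character, and copyright prefixes shown in the examples can be applied with a small helper. This is an illustrative sketch; the function name and category keys are hypothetical, while the prefixes and expected strings come from the examples in this card.

```python
PREFIXES = {
    "artist": "art by ",
    "character": "character ",
    "copyright": "from ",
}

def format_named_tag(category, name):
    """Add the category prefix and replace underscores with spaces."""
    return PREFIXES[category] + name.replace("_", " ")

print(format_named_tag("artist", "ru_zhai"))               # art by ru zhai
print(format_named_tag("character", "blaze_(arknights)"))  # character blaze (arknights)
print(format_named_tag("copyright", "arknights"))          # from arknights
```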
General tags:
General tags in random order.
Meta tags:
Meta tags in random order.
Custom General tags:
Images that are smaller than 300KB after lossless JXL compression get the compression artifacts tag added.
As an exception, PNG images never get this tag.
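The compression artifacts rule can be written as a single predicate. A minimal sketch: the function name is hypothetical, and reading "300KB" as 300 * 1024 bytes is an assumption (the card does not say whether it means KB or KiB).

```python
SIZE_THRESHOLD_BYTES = 300 * 1024  # assumption: "300KB" read as 300 KiB

def needs_compression_artifacts_tag(jxl_size_bytes, source_extension):
    """Decide whether an image gets the compression artifacts tag.

    jxl_size_bytes is the file size after lossless JXL compression;
    source_extension is the original file extension, e.g. "jpg" or ".png".
    """
    if source_extension.lower().lstrip(".") == "png":
        return False  # PNG sources are exempt
    return jxl_size_bytes < SIZE_THRESHOLD_BYTES
```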