Pokemon Stable Diffusion v1.5

A fine-tuned version of Stable Diffusion v1.5 specifically trained to generate high-quality Pokémon images in various artistic styles.

Model Details

Base Model: Stable Diffusion v1.5
Developed by: RekklesAI
Model Type: Latent Diffusion Model for Text-to-Image generation
Language(s): English
License: CreativeML OpenRAIL-M
Training Data: reach-vb/pokemon-blip-captions
Training Steps: 15,000 steps at resolution 512x512
Model Architecture: Same as Stable Diffusion v1.5 (UNet with cross-attention layers)
Diffusers Version: 0.33.0.dev0
Scheduler: PNDMScheduler
Safety Checker: StableDiffusionSafetyChecker (can be disabled during inference)

Model Description

This model is a fine-tuned version of Stable Diffusion v1.5 trained to generate Pokémon images. It can render Pokémon in a range of artistic styles, including photorealistic, cartoon, cyberpunk, and watercolor.

The model was fine-tuned for 15,000 steps on the reach-vb/pokemon-blip-captions dataset, allowing it to learn the distinctive features and characteristics of different Pokémon species while maintaining the generative capabilities of the base model.

Training Details

The model was trained using the following configuration:

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --dataset_name="reach-vb/pokemon-blip-captions" \
  --caption_column="text" \
  --image_column="image" \
  --resolution=512 \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --report_to tensorboard

Key training parameters:

Learning rate: 1e-5
Optimizer: AdamW (the training script's default)
LR scheduler: Constant (no warmup, no decay)
Batch size: 1 per device with 4 gradient accumulation steps (effective batch size of 4 per device)
Resolution: 512x512
Data augmentation: Random horizontal flip
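
The parameters above determine how many samples the model saw during fine-tuning. The following is a minimal sketch of that arithmetic, assuming a single training device and assuming that `--max_train_steps` counts optimizer updates rather than individual forward passes. The `dataset_size` value is a hypothetical placeholder, not the actual size of reach-vb/pokemon-blip-captions, and should be replaced with the real row count.

```python
def training_budget(batch_size, grad_accum_steps, max_train_steps, num_devices=1):
    """Return (effective_batch_size, total_samples_seen).

    Assumes max_train_steps counts optimizer updates, so each step
    consumes batch_size * grad_accum_steps * num_devices samples.
    """
    effective_batch = batch_size * grad_accum_steps * num_devices
    total_samples = effective_batch * max_train_steps
    return effective_batch, total_samples


def approx_epochs(total_samples, dataset_size):
    """Approximate number of passes over a dataset of dataset_size images."""
    return total_samples / dataset_size


# Values taken from the training command above (single device assumed).
effective_batch, total_samples = training_budget(
    batch_size=1, grad_accum_steps=4, max_train_steps=15000
)
print(effective_batch)  # 4
print(total_samples)    # 60000

# dataset_size=1000 is a hypothetical placeholder, not the real dataset size.
print(approx_epochs(total_samples, dataset_size=1000))  # 60.0
```

With a small dataset, this many samples corresponds to many passes over the same images, which is worth keeping in mind when judging how closely outputs follow the training captions.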

Sample Images

Below are the prompts used for the sample images generated with this model, grouped by style:

Epic Photorealistic Style

Prompt: A majestic Charizard with iridescent scales soaring through a dramatic sunset sky, casting long shadows over an ancient volcanic landscape. Intricate details of ember particles floating around its powerful wings, reflecting the golden-crimson light. Hyper-realistic texturing, volumetric lighting, cinematic composition with rule of thirds, depth of field focusing on its determined expression, 8K resolution, photorealistic rendering with ray-traced shadows, award-winning digital art, trending on ArtStation.

Prompt: A majestic Rayquaza with iridescent scales soaring through a dramatic sunset sky, casting long shadows over an ancient cloud kingdom landscape. Intricate details of cosmic energy particles flowing along its serpentine body, reflecting the golden-crimson light. Hyper-realistic texturing, volumetric lighting, cinematic composition with rule of thirds, depth of field focusing on its ancient and wise expression, 8K resolution, photorealistic rendering with ray-traced shadows, award-winning digital art, trending on ArtStation.

Cyberpunk Style

Prompt: A cybernetic Mewtwo floating in a neon-drenched futuristic cityscape at night. Bioluminescent purple energy coursing through transparent tubes connected to its body. Holographic interfaces surrounding it, reflecting on wet asphalt streets. Cyberpunk aesthetic with glowing technological implants, sharp contrasts between shadows and vibrant neon lights. Blade Runner inspired atmosphere, digital distortion effects, lens flares, and electric particle effects. Ultramodern sci-fi concept art with intricate mechanical details.

Gothic Style

Prompt: A haunting Gengar lurking in a decrepit Victorian mansion. High contrast black and white photography style with dramatic chiaroscuro lighting. Gothic architecture with ornate details fading into deep shadows. Film noir aesthetic with grainy texture and vignette edges. Eerie atmosphere enhanced by fog tendrils and moonlight streaming through broken stained glass windows. Reminiscent of classic horror cinema, with stark silhouettes and ominous negative space. Timeless monochromatic art with haunting emotional depth.

Watercolor Art Style

Prompt: A serene Gardevoir in an enchanted forest glade, surrounded by luminescent butterflies and delicate wildflowers. Soft watercolor style with gentle pastel hues, flowing brushstrokes creating an ethereal atmosphere. Dappled sunlight filtering through the canopy, creating a dreamy bokeh effect. Impressionistic details, emotional color palette with teal and lavender accents, artistic composition inspired by Studio Ghibli, whimsical fantasy illustration.

Kawaii/Chibi Style

Prompt: An adorable chibi-style Eevee and its evolutions having a tea party in a candy-colored meadow. Kawaii anime style with exaggerated expressions and oversized eyes. Pastel rainbow palette with soft shading and cute decorative elements like hearts and stars. Playful composition with rounded shapes and simplified forms. Cheerful atmosphere with cartoon sparkles and emotion symbols. Inspired by children's animation, with clean outlines and flat color blocks. Whimsical and heartwarming illustration style perfect for merchandise.

Prompt: A delightful tea party hosted by Bulbasaur, Chikorita, and Rowlet in a blooming flower garden. Cute storybook illustration style with soft rounded shapes. Tiny teacups and miniature pastries served on lily pad tables. Pastel green and pink color scheme with dainty flower patterns. Chibi proportions with oversized heads and stubby limbs. Cheerful expressions with sparkling eyes and happy smiles. Whimsical details like butterfly waiters and ladybug guests. Heartwarming scene rendered in a children's picture book style with gentle outlines and soft textures.

Usage

You can use this model with the Diffusers library:

import torch
from diffusers import StableDiffusionPipeline

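# Load the model
model_path = "path/to/PokemonStable-v1-5"  # Replace with actual path

# Use half precision on the GPU; fall back to float32 on CPU, where float16 is poorly supported
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    torch_dtype=dtype,
    safety_checker=None  # Optional: omit this argument to keep the safety checker enabled
)

# Move the pipeline to the selected device
pipe = pipe.to(device)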

# Generate image
prompt = "A cute Pikachu playing in a grassy field, high resolution, detailed"
image = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

# Save image
image.save("generated_pokemon.png")

Advanced Usage with Custom Scheduler

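You can also swap the scheduler to trade off speed against quality. The example below replaces the default PNDMScheduler with DPM-Solver++, which typically needs fewer inference steps: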

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

model_path = "path/to/PokemonStable-v1-5"  # Replace with actual path

# Load model
pipe = StableDiffusionPipeline.from_pretrained(
    model_path, 
    torch_dtype=torch.float16,
    safety_checker=None
)

# Replace scheduler for faster inference
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    algorithm_type="dpmsolver++",
    solver_order=2
)

# Move to GPU (this example assumes a CUDA device because the weights are loaded in float16)
pipe = pipe.to("cuda")

# Generate with fewer steps
prompt = "A majestic Charizard in battle stance, fire breathing, detailed scales, epic lighting"
image = pipe(
    prompt=prompt,
    num_inference_steps=25,  # Fewer steps needed with DPM-Solver++
    guidance_scale=7.5
).images[0]

image.save("charizard_dpm_solver.png")

Prompt Engineering Tips

For optimal results, consider the following prompt engineering techniques:

Specify Pokémon Names: Include specific Pokémon names like "Pikachu", "Bulbasaur", "Charizard", etc.
Add Scene Descriptions: Describe the environment, such as "in a forest", "in battle", "sleeping", etc.
Include Style Descriptors: Add style terms like "high resolution", "detailed", "cartoon style", etc.
Emotional Context: Include emotional states like "happy", "angry", "cute", etc.
Artistic Techniques: Specify art styles like "watercolor", "oil painting", "digital art", etc.
Lighting and Atmosphere: Describe lighting conditions like "sunset", "moonlight", "studio lighting", etc.
Composition Guidelines: Add terms like "rule of thirds", "dynamic pose", "close-up shot", etc.
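
The tips above can be combined mechanically. The following is a minimal sketch of a prompt builder; `build_prompt` and its parameter names are hypothetical helpers introduced here for illustration and are not part of this model or of Diffusers.

```python
def build_prompt(pokemon, scene=None, mood=None, art_style=None,
                 lighting=None, composition=None, extras=()):
    """Join the non-empty parts into one comma-separated prompt.

    The Pokémon name is placed first so it is kept even if a long
    prompt is truncated by the text encoder.
    """
    first_words = f"{mood} {pokemon}" if mood else pokemon
    article = "An" if first_words[0].lower() in "aeiou" else "A"
    subject = f"{article} {first_words}"
    if scene:
        subject = f"{subject} {scene}"
    parts = [subject, art_style, lighting, composition, *extras]
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    pokemon="Bulbasaur",
    scene="sleeping in a forest",
    mood="happy",
    art_style="watercolor",
    lighting="moonlight",
    composition="close-up shot",
    extras=("detailed",),
)
print(prompt)
# A happy Bulbasaur sleeping in a forest, watercolor, moonlight, close-up shot, detailed
```

The resulting string can be passed as `prompt` to the pipeline call shown in the Usage section. The CLIP text encoder used by Stable Diffusion v1.5 reads at most 77 tokens and the pipeline truncates anything beyond that, so putting the Pokémon name and the most important style terms first is a reasonable precaution for long prompts.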

Limitations

The model may occasionally generate Pokémon with anatomical inaccuracies
Text rendering within images may be illegible or distorted
Complex compositions with multiple Pokémon may not always position them correctly
The model performs best with English prompts
As with all Stable Diffusion models, it inherits certain biases and limitations from the base model
The safety checker may occasionally filter legitimate content; it can be disabled, but do so with caution
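
One commonly used mitigation for anatomical and text artifacts is a negative prompt, which `StableDiffusionPipeline` accepts through its `negative_prompt` argument. The sketch below only assembles the call arguments. The helper name `generation_kwargs` is hypothetical, and the negative terms are illustrative assumptions that have not been validated for this model, so their effect should be checked empirically.

```python
# Illustrative negative terms (assumptions, not tuned for this model).
DEFAULT_NEGATIVE_TERMS = ("extra limbs", "deformed", "blurry", "text", "watermark")


def generation_kwargs(prompt, negative_terms=DEFAULT_NEGATIVE_TERMS,
                      num_inference_steps=50, guidance_scale=7.5):
    """Build keyword arguments for a StableDiffusionPipeline call."""
    return {
        "prompt": prompt,
        "negative_prompt": ", ".join(negative_terms),
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
    }


kwargs = generation_kwargs("A cute Pikachu playing in a grassy field")
print(kwargs["negative_prompt"])
# extra limbs, deformed, blurry, text, watermark

# With a pipeline loaded as in the Usage section:
# image = pipe(**kwargs).images[0]
```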

Ethical Considerations

This model is intended for creative and artistic purposes only. Users should:

Respect the intellectual property rights of The Pokémon Company and Nintendo
Avoid generating harmful, offensive, or inappropriate content
Refrain from using generated images for commercial purposes without proper licensing
Be transparent about AI-generated content when sharing

License

This model is based on Stable Diffusion v1.5 and follows the CreativeML OpenRAIL-M license of the original model.

Acknowledgements

Thanks to all artists and creators who have contributed to the Pokémon franchise, and to Stability AI for developing the Stable Diffusion model. Special thanks to the creators of the reach-vb/pokemon-blip-captions dataset used for training this model.

Citation

If you use this model in your research, please cite:

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}
