
Proteus Rundiffusion x DPO (Direct Preference Optimization) SDXL

An Experimental Merge. I run through a brief summary of both models below. Image results are at the end of the post if you're not interested in - or are already familiar with - the idea behind Proteus & DPO.

Preface

This model came about just as I started looking into the science behind merging models - starting with block-weighted merging, and how the different methods influence generation results. Because I'm still new to the concept, my process thus far has been far from structured or scientific. I'm in no way a seasoned expert in the training and merging of models, knowing just enough to achieve stable merges and make some logical adjustments and judgements along the way.

I say this because - while I found a merge that I'm happy enough with to share - better variations probably exist.

The models are licensed under openrail++ (DPO) and GPL3 (Proteus), both allowing for free use and redistribution within the original model use-case. The merge is published under GPL3.

The two variants of this model, and the original base models (converted to transformers-base for use with stable diffusion, etc) can be found in this repo.

The Goal

When I started - artistic curiosity.

What it turned into - creating a model that has a good balance of artistic freedom while staying true to the original prompt.

The balance of accuracy and creativity when working with diffusion AI models is something we've all gotten used to - sacrificing some accuracy to add a bit more pop, and vice versa. This model aims to narrow that window - using Proteus to broaden the creative scope, and DPO to keep the result anchored to the prompt.

The Models

mhdang/dpo-sdxl-text2image-v1

Note: DPO's influence is way more apparent in SDXL models. This project had me focused on the SDXL version - there is probably room for exploration in merging some of the stronger 1.5 models with DPO.

Fine-tuned from stable-diffusion-xl-base-1.0 and stable-diffusion-v1-5 respectively, using offline human preference data pickapic_v2.
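If you want to try the DPO UNet on its own, the usual diffusers pattern is to load the fine-tuned UNet and drop it into the stock SDXL pipeline - a minimal sketch, assuming the repo follows the standard `unet` subfolder layout:

```python
import torch
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel

# Load the DPO fine-tuned UNet and swap it into the stock SDXL pipeline.
unet = UNet2DConditionModel.from_pretrained(
    "mhdang/dpo-sdxl-text2image-v1", subfolder="unet", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16
).to("cuda")

image = pipe("a robotic chicken, sci-fi concept art", guidance_scale=5).images[0]
```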

What makes DPO interesting?

Direct Preference Optimization proposes fine-tuning models by optimizing directly on human comparison data, instead of the widely used RLHF (Reinforcement Learning from Human Feedback) approach.
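Schematically, the DPO objective from the original paper trains the model to rank the human-preferred sample $y_w$ above the rejected sample $y_l$, regularized against a frozen reference model (the diffusion adaptation replaces the likelihood terms with denoising losses):

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]
$$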

The authors claim significantly improved visual appeal and prompt alignment, and - keeping subjectivity in mind - general consensus mostly agrees with this claim.

Research Paper on DPO

dataautogpt3/Proteus-RunDiffusion

While investigating ways in which image generation can be improved within existing frameworks, the team at RunDiffusion stumbled upon a new approach in retraining CLIP - claiming to have unlocked and broadened their model's potential for character recognition, natural language processing and, most notably, its versatility in artistic expression.

What makes Proteus interesting?

Extract from their model card:

When you start using Proteus-RunDiffusion, be ready for it to behave differently from other AI art models you've used. It's been designed in a unique way, which means it will respond to your prompts and commands in its own style. This difference is part of what makes it special, but it also means there's a learning curve. You'll need some time to get familiar with how it works and what it can do. So, as you begin, keep an open mind and be prepared to adjust your approach.

This paragraph told me that a merge would either be chaos, or could at the very least output some interesting results - a welcome variance from the nuances we've gotten used to by now.

The Merging

I happened across the two models only a few days apart, both piquing my interest in their own way. As I was in the middle of my exploration of merging methods, I naturally wondered what a merge of these two ideas would look like - and I was pleasantly surprised by the results (comparison at the end of the post).

Unfortunately I didn't think to document the entire process, and only had all the information on my final contenders by the time I decided I wanted to share the model.

The Challenge

The biggest challenge was finding a merge that could even be considered close to what I wanted to achieve. Most merging attempts would result in a model outputting wildly different or inaccurate images, or there would be little to no change from the base model.

The fact that the two models' focuses were so far separated they could almost be considered opposites actually worked in my favor - my only real challenge was finding a base_alpha and block-weighting strategy that checked all the right boxes on each side.

My Approach

I decided to use Proteus as a base, retaining its CLIP and VAE. I started with a 1-to-1 conversion from the original Proteus UNET model, including their original CLIP models, tokenizers and VAE, to a transformers-based model - doing the same with the DPO models. I then brute-force tested my way through 16 block-weighted merging strategies (list at the end of this section), running prompts and comparing the results to the original model's output. I did base_alpha 0.4, 0.5 and 0.6 merges of each, landing on 0.4 as my preferred base weighting.
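To make the block-weighted part concrete: the idea is to interpolate the two UNets tensor-by-tensor, with the interpolation factor depending on which block a tensor belongs to. The merges themselves were done with existing merge tooling; the sketch below, with hypothetical block prefixes and alpha values, just illustrates the mechanism:

```python
import torch

def block_weighted_merge(proteus_sd, dpo_sd, block_alphas, base_alpha=0.4):
    """Linearly interpolate two UNet state dicts, block by block.

    alpha = 0 keeps the Proteus (base) weight, alpha = 1 takes the DPO weight;
    keys matching no listed prefix fall back to base_alpha.
    """
    merged = {}
    for key, base_tensor in proteus_sd.items():
        alpha = base_alpha
        for prefix, block_alpha in block_alphas.items():
            if key.startswith(prefix):
                alpha = block_alpha
                break
        merged[key] = torch.lerp(base_tensor.float(), dpo_sd[key].float(), alpha)
    return merged

# Hypothetical per-block weights: a "reverse" curve leans the early (down)
# blocks toward DPO and the late (up) blocks toward Proteus.
alphas = {"down_blocks.0": 0.9, "down_blocks.1": 0.7, "mid_block": 0.5,
          "up_blocks.0": 0.3, "up_blocks.1": 0.1}
```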

Keeping in mind the subjective nature of art, I tried to keep the judgment of each resulting image as objective as possible by looking only at the criteria below:

  • Prompt Accuracy

    • How much of the original prompt was adhered to?
    • The amount of artistic liberties taken
    • How much these liberties strayed from the prompt.
  • Resemblance to the output of the base model (Proteus)

    • By using Proteus as the base, I was, in theory, attempting to add structure to a broad creative scope
    • It was therefore important for me that a good balance was struck between the artistry and the accuracy of the final result

Most of the merging results either strayed way too far from the base image, disqualifying them given my goal, or resembled the original too closely to be worth keeping. It's worth noting that while I kept my metrics as objective as possible, subjective measurements/judgements are unavoidable - the acceptable travel distance from the base image being one of them.
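I judged both criteria by eye, but for anyone wanting a rough quantitative proxy, CLIP similarity can approximate them - image-to-text similarity for prompt accuracy, image-to-image similarity for resemblance to the base output. A sketch (not something I used for the published merges):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(prompt, merged_image, base_image):
    """Return (prompt accuracy, base resemblance) as cosine similarities."""
    inputs = processor(text=[prompt], images=[merged_image, base_image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    prompt_accuracy = (img[0] @ txt[0]).item()   # merged image vs. prompt
    base_resemblance = (img[0] @ img[1]).item()  # merged vs. base image
    return prompt_accuracy, base_resemblance
```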

I was left with two models that produced fairly similar results, but had slight, noticeable output variance on the creativity-accuracy scale - the ReverseSmoothstep and TrueReverseCubicHermite merging variants. I couldn't decide between the final two, because they both achieved my goal, either with a bit of artistic flair or an impressive accuracy - like two sides of the same coin. The margin and variance are small enough that the final judgement will vary with your preference on a per-image basis.

So I decided to share both.

Merging Strategies used: GradV, GradA, Flat_25, Wrap2, Mid2_50, Out07, Ring08Soft, Ring08_5, Smoothstep, ReverseSmoothstep, Cosine, ReverseCosine, TrueCubicHermite, TrueReverseCubicHermite, FakeCubicHermite, FakeReverseCubicHermite
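The names refer to the curves used to generate the per-block weights. The exact definitions live in the merge tooling, but as an illustration (the slot count and formulas here are my assumption), a smoothstep-family preset maps block position to alpha roughly like this:

```python
import math

NUM_BLOCKS = 19  # hypothetical; merge tools expose different slot counts per architecture

def smoothstep(t: float) -> float:
    return t * t * (3.0 - 2.0 * t)

def preset(name: str) -> list:
    # Block position runs 0..1 from the first input block to the last output block.
    ts = [i / (NUM_BLOCKS - 1) for i in range(NUM_BLOCKS)]
    curves = {
        "Smoothstep": [smoothstep(t) for t in ts],
        "ReverseSmoothstep": [1.0 - smoothstep(t) for t in ts],
        "Cosine": [0.5 - 0.5 * math.cos(math.pi * t) for t in ts],
        "ReverseCosine": [0.5 + 0.5 * math.cos(math.pi * t) for t in ts],
    }
    return curves[name]
```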

Testing Parameters and Result Interpretation

As previously mentioned, I mostly cowboy'd the process before I considered sharing the models. Below are some comparison images between each variant. Both models show a healthy deviation from the base model, while staying within the constraints of the prompt.

The ReverseSmoothStep variant seems to add a bit more artistic flair while remaining close enough to what was asked, giving it the edge in artistic variance and possible outcomes while still checking the majority of boxes in the prompt.

The TrueReverseCubicHermite variant tracks the ReverseSmoothStep model very closely, but seems to add more detail that anchors it back to the prompt.

My take on the variants:

If you're a seasoned prompt engineer, and/or know exactly what you want in an image, give Proteus-RunDiffusion-DPO_TrueReverseCubicHermite a go. Throughout all my testing for the merge, and in subsequent tests, I've definitely noticed some unique interpretations, ranging from interesting to awesome.

If you prefer being more loose with how the prompt will be interpreted, or you want the AI to do a bit more heavy lifting in detailing and style choice, Proteus-RunDiffusion-DPO_ReverseSmoothstep ticks many of the same boxes. While it's fairly stable and consistent, it does make more decisions that are purely artistic, which increases the risk of small errors or inconsistencies. In the same breath, this did often result in the preferable output given my personal expectations for the prompt.

I tested multiple sampler/scheduler/CFG/Step combinations before deciding on a foundation for the tests. This was another subjective choice, and I'm sure other parameters could produce very different - and probably even better - outcomes. These were sufficient for testing my goal though:

Tests were run in ComfyUI with the model's CLIP and VAE, no Loras and with no pre/post-processing. Worth noting is that DataPulse recommends a CLIP skip of -2. I did not make any modification to the CLIP skip layer during my merge variant testing. My only adjustment was setting the CLIP scaling to 4 to improve the final quality and clarity of the images. The workflow was kept as simple as possible, running only base sampling in a single KSampler node. Any errors or small issues were left as is for consistency.

The workflow is available in this repository. The comparison images are also the original, unaltered output images, and have their workflows embedded. They're all the same, of course, the only variable being the model used and the prompt text for each test. If you plan on loading the workflow, Use Everywhere (UE Nodes) was used to neaten the flow (CLIP weight and seed distribution), and OneButtonPrompt's preset node was used to generate some random base prompts - so you'll either need to install these or make the necessary tweaks to bypass them.

Universal Sampling Parameters:

Resolution: 1344x768
Steps: 45
CFG: 4
Sampler: dpmpp_3m_sde_gpu
Scheduler: exponential
Denoise: 1
CLIP Scaling: 4
Negative Prompt: text, watermark, logo, blurry, out of focus
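These settings are ComfyUI-specific. For readers who prefer diffusers, a rough equivalent looks like the sketch below - the model path is a placeholder, `use_exponential_sigmas` needs a recent diffusers release, diffusers' SDE solver here is a second-order stand-in for `dpmpp_3m_sde_gpu`, and seeds won't reproduce ComfyUI outputs:

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "./Proteus-RunDiffusion-DPO_ReverseSmoothstep",  # placeholder local path
    torch_dtype=torch.float16,
).to("cuda")

# Approximate dpmpp_3m_sde + exponential with a 2nd-order SDE DPM-Solver++.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config,
    algorithm_type="sde-dpmsolver++",
    solver_order=2,
    use_exponential_sigmas=True,
)

image = pipe(
    prompt="your prompt here",
    negative_prompt="text, watermark, logo, blurry, out of focus",
    width=1344, height=768,
    num_inference_steps=45,
    guidance_scale=4,
    generator=torch.Generator("cuda").manual_seed(123),
).images[0]
```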

The parameters above gave me the best results - according to my preference, but also for the sake of comparison. They do lean a bit more to the tame side of things to keep random variance down, so I'd recommend experimenting and pushing the boundaries of these parameters if you decide to play around with these models yourself. I've gotten some good results with CFG set to 1.5, making use of the Dynamic Thresholding extension (available for both Stable Diffusion and ComfyUI) to boost some of the CFG effects significantly.

If you're not familiar with the extension and would like to try it out, it can produce some amazing results on its own. Below is a relatively safe setup for fairly consistent results:

  • Set your CFG to a very low value; I usually stick with 1.5, as 1 tends to turn to chaos very easily, especially with the CFG effects boosted this much
  • Set the "mimic scale" to 30 and "threshold percentile" to 0.9 (if you lower the CFG to a range of 18-26 0.95 works well)
  • Set both mimic modes to "Half Cosine Up", both scale minimum values to 4 and the scheduler value to 4

The rest of your inputs and processing steps can be used as per usual. I've only hit walls with some Loras that can't handle the extremes, resulting in either a jumbled mess or pure black/blue/grey outputs. So if you experience this, look at removing Loras before lowering your mimic level, etc.
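For intuition, the core trick behind the extension is Imagen-style dynamic thresholding: clamp the model's prediction at a per-sample percentile and renormalize, so a strong guidance push can't blow out the value range. A simplified sketch of that idea (the extension layers mimic scales, modes and schedules on top):

```python
import torch

def dynamic_threshold(x0_pred: torch.Tensor, percentile: float = 0.9) -> torch.Tensor:
    """Clamp each sample at the `percentile` quantile of |x0| and renormalize."""
    s = torch.quantile(x0_pred.flatten(1).abs().float(), percentile, dim=1)
    s = s.clamp(min=1.0).view(-1, *([1] * (x0_pred.dim() - 1)))
    return x0_pred.clamp(-s, s) / s
```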

I'll give two brief (subjective) comments on the first and fourth comparisons, as they tie in with earlier statements; the rest I'll leave to individual interpretation (as it should be).

1. Robotic Chicken is an example where I prefer the 'TrueReverseCubicHermite' variant. While both models moved away from the half-real chicken the base model opted for, 'TrueReverseCubicHermite' added subtle details to the head that hinted more at it being a chicken. Both have too little chicken for my taste and simpler backgrounds, and I prefer the base model over both variants in this case. Outside of testing, I would tweak the parameters to find that balance more closely, but it does a good enough job of showing the nuance between the variants.

4. Japanese Cherry Blossom is a result where I prefer the 'ReverseSmoothStep' variant over 'TrueReverseCubicHermite'. The house with Japanese-inspired architecture is a welcome addition for me - adding more depth to the story told and giving life to a prompt at risk of being lifeless, as shown by the 'TrueReverseCubicHermite' variant's more accurate-to-the-prompt interpretation. A good example where a bit more artistic freedom can add to the image without moving outside of the prompt boundaries.

1. Robotic Chicken - Seed 40212733440049

Prompt:

sci-fi, art by Iris van Herpen, digital art, Rule of Thirds, robotic (Chicken:1.1) , the Chicken is wearing a Iron and Velvet Jacquard cybernetics that was designed by Apple computers, It is Hyperdetailed, Avant-garde, futuristic, 3D printing, science fiction, fashionable

Cropped Comparison:

robotchicken_cropped

Full images:

ProteusBase: robotchicken_base
ReverseSmoothStep: robotchicken_smooth
TrueReverseCubicHermite: robotchicken_cubichermite

2. Cyberpunk Cyborg Female - Seed 754467864831453

Prompt:

cyberpunk, concept art, cyborg Female Troll, background is night city, neon-lit, vogue pose, her hair is Silver, futuristic, womanly, D&D, neon hue, shadowrun

Images:

ProteusBase: cyberfemale_base
ReverseSmoothStep: cyberfemale_smooth
TrueReverseCubicHermite: cyberfemale_cubichermite

3. Fashion Photograph - Seed 321237786

Prompt:

Fashion photography of a supermodel, laughing, she is wearing a fashion outfit, her fashion outfit is Smart, It has Anthemion patterns on it, Tattoos, Fomapan 400, Depth of field 270mm, photorealism

Images:

ProteusBase: fashionphoto_base
ReverseSmoothStep: fashionphoto_smooth
TrueReverseCubicHermite: fashionphoto_cubichermite

4. Japanese Cherry Blossom - Seed 123 (not a typo)

Prompt:

(nature art by Yuko Tatsushima:1.0) , wabi-sabi (Cherry blossom tree:1.1) , trees, deep focus, Japanese watercolor painting, traditional motives

Images:

ProteusBase: cherryblossom_base
ReverseSmoothStep: cherryblossom_smooth
TrueReverseCubicHermite: cherryblossom_cubichermite

Final Word

Impatience and the limits of my knowledge of the mathematics and science behind the process almost certainly mean I've done some things suboptimally, or incorrectly.

That said, I'm content with the merges I found, and have been pleasantly surprised by some of the results I've gotten thus far.

Ultimately, both the base models - and by extension the merged models - are about exploring what is possible, exploring the fringes of an already well established framework.

If this sounds like something you'd like to explore, I hope these models (and the individual base models) prove valuable to you.

If you read this post in its entirety - kudos! I probably wouldn't have. A big thanks to everyone in open-source land who makes these things possible for common folk like me.

Credits

  • Alexander Izquierdo and the team at RunDiffusion for their time, effort and investment in developing Proteus.

  • Meihua Dang, and the folks that worked with him, for their efforts in infusing DPO into the Stable Diffusion models - making a merge like this one a breeze

  • All the wonderful folks in open-source land who invest so much time and effort into development, training and educating.

If you have dollars burning holes in your pockets and any of these projects are of value to you, show these people some love by saying thanks with those spare bucks.
