I wrote a plugin to try to solve the issue of anima artist tag mixing

#161

by R3DeK - opened 8 days ago

I can read English but struggle to write long articles in it, so I used an LLM to help me write this post.

This plugin is currently in an experimental stage and requires further testing; I will officially release it once this approach is confirmed to be fully viable.

The Challenge with LLM Text Encoders
In traditional models like SDXL, the CLIP text encoder treats tokens relatively independently, allowing users to easily mix artist styles by simply chaining names together. However, next-generation anime models like Anima utilize an LLM text encoder. Because LLMs process text contextually, chaining artist names forces the model to merge their semantic meanings into a single, often confused context. This fundamentally breaks the traditional artist-mixing workflow.

The Solution
The Anima Style Blend (Multi) node solves this by intercepting the diffusion model's preprocessing stage. It encodes each artist's prompt entirely independently through the LLM adapter, bypassing the contextual interference, and then intelligently merges them before they reach the core generation layers.

Current Status: Delta Mode Only
Please note: As this plugin is currently in active development, only the Delta mode is fully functional and recommended for use.
Other modes included in the node (Concat, Slerp, Linear) are highly experimental. Because they alter sequence lengths or push embeddings off the trained manifold, they currently cause severe image degradation, attention dilution, or CFG collapse. Please stick to Delta mode.
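
As a rough illustration of two of these failure causes, here is a toy sketch in plain Python. The two-dimensional vectors and four-token sequences are invented for the example and are not real Anima embeddings. It only shows that concatenation changes the sequence length and that plain linear averaging shrinks the vector norm; Slerp is not modeled here.

```python
import math

def norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def lerp(a, b, t):
    """Plain linear interpolation between two vectors of equal length."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

# Hypothetical stand-ins for two artists' embeddings (unit length, orthogonal).
artist_a = [1.0, 0.0]
artist_b = [0.0, 1.0]

# Averaging two unit vectors gives a shorter vector (about 0.707 here),
# i.e. a point the model never saw at that scale during training.
averaged = lerp(artist_a, artist_b, 0.5)
print(norm(artist_a), norm(averaged))

# Concatenating two 4-token sequences doubles the sequence length.
seq_a = [artist_a] * 4
seq_b = [artist_b] * 4
concatenated = seq_a + seq_b
print(len(seq_a), len(concatenated))
```

Whether these two effects fully explain the degradation seen in the node is not something this toy example can confirm.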

How Delta Mode Works
Instead of clumsily concatenating sequences or averaging embeddings, the Delta mode uses a non-destructive mathematical approach:

The Anchor: It takes your conditioning_base and establishes it as the structural foundation.
The Style Deviation: It calculates the mathematical difference (the delta) between your base concept and your secondary artist conditions.
The Application: It gently layers these style differences over the base output based on your weights.
Smart Safeguard: It intelligently isolates positive prompts, ensuring negative conditions remain untouched to prevent latent corruption.
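
The steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the plugin's actual code: `delta_blend` is a hypothetical name, conditionings are represented as nested lists (tokens x channels), and every conditioning is assumed to have the same shape. How the real node handles prompts with different token counts, and how it detects and skips negative conditioning, is not described in the post and is not modeled here.

```python
def delta_blend(base, styles, weights):
    """Return base + sum_i weights[i] * (styles[i] - base), element by element.

    base    -- list of token vectors (the anchor conditioning)
    styles  -- list of conditionings with the same shape as base
    weights -- one float per entry in styles
    """
    if len(styles) != len(weights):
        raise ValueError("need exactly one weight per style conditioning")

    # Start from a copy of the anchor so the input is left unchanged.
    blended = [list(token) for token in base]

    for style, weight in zip(styles, weights):
        if len(style) != len(base):
            raise ValueError("sequence length mismatch (assumed equal here)")
        for t, (base_tok, style_tok) in enumerate(zip(base, style)):
            for d, (b, s) in enumerate(zip(base_tok, style_tok)):
                # The style deviation (delta) is s - b; layer it on by weight.
                blended[t][d] += weight * (s - b)
    return blended


# Toy values only; real conditionings are much larger tensors.
base_cond = [[1.0, 0.0], [0.0, 1.0]]
artist_b_cond = [[2.0, 0.0], [0.0, 3.0]]
result = delta_blend(base_cond, [artist_b_cond], [0.5])
print(result)
```

With a weight of 0 the base comes back unchanged, and with a single style at weight 1 the result equals that style's conditioning, which is one way to read the "anchor plus deviation" description.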

Best Practices & Prompting Strategy
To get the best results and avoid generic, "AI-looking" outputs, you must follow this specific prompting structure:

The Base Must Have an Artist: Your conditioning_base cannot be a generic content description. It must contain a primary artist to firmly anchor the model in a high-quality, stylized latent space.
Consistent Content: The subject and scene descriptions must be absolutely identical across all condition nodes. Only the artist tags should change.

Example Setup:

Base Node: masterpiece, 1girl, outdoors, by artist_A (Weight: 1.0)
Cond 2: masterpiece, 1girl, outdoors, by artist_B (Weight: 0.4)
Cond 3: masterpiece, 1girl, outdoors, by artist_C (Weight: 0.3)

synta

7 days ago

... so let us test it

R3DeK

7 days ago

> ... so let us test it

I have used up my Claude credits for today. I don't have a programming background, but I do have some relevant knowledge. I will try to improve this plugin and release it in the next few days.

Nuke1229

7 days ago

edited 7 days ago

In my opinion, going with (@artistname1:1), (@artistname2:0.3), 1girl, ..... is probably better than using multiple conditioning blocks for different art styles. The method in your screenshot looks a bit messy to me. (Regarding the style order, just sort them as main style, sub style 1, sub style 2.)

As for the CLIP Text Encoder, I suggest that your 'anima style blend' node accept two more inputs: text (prompt) and clip. In the backend, if the node receives a prompt like '(@A:1), (@B:0.2), 1girl, ...', it should loop the encoding by calling the Comfy API. For example, the first pass runs '@A, 1girl, ...', the second pass runs '@B, 1girl, ...', and finally it blends them together before outputting to the conditioning slot of KSampler or whatever node comes next. Well, something like that. You don't have to take my idea too seriously; it's just my two cents.
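
To make the suggestion concrete, here is a small sketch of just the prompt-splitting step in plain Python. The `(@name:weight)` syntax is taken from the post above; the function name `split_artist_prompt` and the regular expression are hypothetical, and the actual encoding call into ComfyUI is deliberately left out because the thread does not specify it.

```python
import re

# Matches weighted artist tags such as "(@A:1)" or "(@artist name:0.3)".
ARTIST_TAG = re.compile(r"\(\s*@([^:()]+?)\s*:\s*([0-9]*\.?[0-9]+)\s*\)")

def split_artist_prompt(prompt):
    """Split one combined prompt into (per-artist prompt, weight) pairs.

    Every returned prompt carries the same shared content tags and
    exactly one artist tag, in the order the artists appeared.
    """
    artists = [(m.group(1), float(m.group(2)))
               for m in ARTIST_TAG.finditer(prompt)]

    # Remove the artist tags, then tidy up the leftover commas and spaces.
    content = ARTIST_TAG.sub("", prompt)
    shared = ", ".join(tag.strip() for tag in content.split(",") if tag.strip())

    passes = []
    for name, weight in artists:
        text = f"@{name}, {shared}" if shared else f"@{name}"
        passes.append((text, weight))
    return passes


passes = split_artist_prompt("(@A:1), (@B:0.2), 1girl, outdoors")
for text, weight in passes:
    # Each `text` would be encoded in its own pass, then blended by `weight`.
    print(weight, text)
```

The first entry would serve as the main style and the rest as sub styles, matching the ordering proposed in the post.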

R3DeK

6 days ago

I ran some more tests today. For now, without modifying the model or training an adapter (if that is even feasible), the only workable method is norm_blend within Delta mode: encode several prompts that differ only in the artist tag, then blend them. This method has a weakness, though. If the artists' styles differ significantly, the blended result is very poor. To address this, I added a step that measures the difference between the styles before the actual blending. If the difference is too large, the blending coefficient is reduced automatically. I also added a slider to control the strength of that reduction.
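
One possible way to implement such a safeguard is sketched below in plain Python. Everything here is an assumption for illustration: the post does not say which difference measure, threshold, or reduction formula the plugin uses, and norm_blend itself is not reproduced. The sketch uses cosine distance between two flattened conditioning vectors, a hypothetical `threshold`, and a hypothetical `strength` parameter standing in for the slider.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def damped_weight(base_vec, style_vec, weight, threshold=0.5, strength=1.0):
    """Reduce `weight` when the style is far from the base.

    threshold -- cosine distance below which the weight is left alone (assumed)
    strength  -- slider-like control; 0 disables the reduction (assumed)
    """
    distance = 1.0 - cosine_similarity(base_vec, style_vec)
    excess = max(0.0, distance - threshold)
    return weight / (1.0 + strength * excess)


# Toy vectors: one style close to the base, one orthogonal to it.
base = [1.0, 0.0]
similar_style = [1.0, 0.1]
distant_style = [0.0, 1.0]
print(damped_weight(base, similar_style, 0.6))
print(damped_weight(base, distant_style, 0.6))
```

In this sketch a style close to the base keeps its full weight, while a very different style is scaled down smoothly rather than cut off.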

R3DeK

6 days ago

Actually, I've had an idea for a while. Since the adapter already converts prompts, could we train an adapter in a similar way to achieve the effect of mixing artist tags? Also, Anima's adapter approach does save costs, but it makes developing ControlNet and plugins more difficult. Circlestone-Labs likely made this choice due to financial constraints.

R3DeK

6 days ago

> In my opinion, going with (@artistname1:1), (@artistname2:0.3), 1girl, ..... is probably better than using multiple conditioning blocks for different art styles. The method in your screenshot looks a bit messy to me. (Regarding the style order, just sort them as main style, sub style 1, sub style 2.)
>
> As for the CLIP Text Encoder, I suggest that your 'anima style blend' node accept two more inputs: text (prompt) and clip. In the backend, if the node receives a prompt like '(@A:1), (@B:0.2), 1girl, ...', it should loop the encoding by calling the Comfy API. For example, the first pass runs '@A, 1girl, ...', the second pass runs '@B, 1girl, ...', and finally it blends them together before outputting to the conditioning slot of KSampler or whatever node comes next. Well, something like that. You don't have to take my idea too seriously; it's just my two cents.

I am still exploring the feasibility of various solutions. If I find one that works well, I will consider your suggestion.