Anima Control — Pose (Preview-1)

⚠️ Preview-1: experimental. This is an early, lightly trained release of pose control for Anima v1.0. It has not had enough training or refinement yet, so it will often generate bad images: deformed or split bodies, wrong or extra limbs, and similar artifacts. Treat it as a proof of concept, not a production-ready tool. Non-commercial use only (inherits the Anima base model license). Behavior and weights may change.

A native pose control adapter for the Anima v1.0 image model: condition generation on a skeleton pose map so the subject follows a target pose.

Abstract

Anima Control adds spatial conditioning to the frozen Anima v1.0 diffusion transformer without retraining the base model. Preview-1 ships the first component, pose, as a channel-concat control-LoRA. A small, zero-initialized embedder injects a VAE-encoded skeleton at the model input, and a low-rank adapter on the transformer blocks lets the frozen network act on that signal. Because the embedder starts at zero, an untrained or zero-strength adapter is an exact no-op: the output is identical to the base model. Control is optional and off by default.

Highlights

Pose conditioning for Anima v1.0: drive the subject's pose from a skeleton map.
Zero-init no-op: at strength = 0 the output is exactly the base model.
Small trainable footprint: only the control embedder and a rank-16 adapter train; the base stays frozen.
Part of a shared control platform: the same harness is meant to host other control types (see roadmap below).

Method

The adapter has one fusion point at the model input and trains two small parts.

Conditioning. The base VAE encodes the skeleton pose map into a control latent in the same latent space as the noisy image, so the control stays spatially aligned with the generation.

Fusion (ControlEmbedder + ControlInitialLayer). A zero-initialized ControlEmbedder (a patch projection on the control latent) produces control tokens that are added to the frozen base patch-embed output inside an overridden ControlInitialLayer. This is the only place the forward pass changes. Zero-initialization means training starts as an exact no-op (output == base), and the control contribution grows only insofar as it reduces the training loss.
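
The fusion step can be illustrated with a minimal NumPy sketch. It is not the shipped ControlEmbedder or ControlInitialLayer: the patch size, channel count, hidden width, the linear form of the projection, and the point where strength multiplies the control tokens are all assumptions made for illustration.

```python
import numpy as np

def patchify(latent, patch):
    """Split a (C, H, W) latent into (num_patches, C * patch * patch) rows."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

class ControlEmbedderSketch:
    """Hypothetical stand-in: a zero-initialized linear patch projection."""
    def __init__(self, in_dim, hidden_dim):
        self.weight = np.zeros((in_dim, hidden_dim))  # zero-init => no-op at start
        self.bias = np.zeros(hidden_dim)

    def __call__(self, control_patches):
        return control_patches @ self.weight + self.bias

def control_initial_layer(noisy, control, base_weight, embedder, patch, strength):
    """Add scaled control tokens to the frozen patch-embed output."""
    base_tokens = patchify(noisy, patch) @ base_weight  # frozen base path
    control_tokens = embedder(patchify(control, patch))
    return base_tokens + strength * control_tokens      # assumed fusion rule

# Illustrative sizes only; the real model's dimensions are not stated here.
rng = np.random.default_rng(0)
C, H, W, P, D = 16, 8, 8, 2, 32
noisy = rng.normal(size=(C, H, W))
control = rng.normal(size=(C, H, W))
base_weight = rng.normal(size=(C * P * P, D))
base_only = patchify(noisy, P) @ base_weight

embedder = ControlEmbedderSketch(C * P * P, D)
fused_untrained = control_initial_layer(noisy, control, base_weight, embedder, P, 1.0)

# Pretend training moved the embedder away from zero.
embedder.weight = rng.normal(scale=0.02, size=embedder.weight.shape)
fused_zero_strength = control_initial_layer(noisy, control, base_weight, embedder, P, 0.0)
fused_full = control_initial_layer(noisy, control, base_weight, embedder, P, 1.0)
```

In this sketch both the untrained embedder and the trained embedder at strength 0 reproduce the base tokens exactly, which mirrors the no-op property described above.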

Trainable parameters. The ControlEmbedder plus a rank-16 low-rank adapter on the transformer Blocks. The base transformer, text encoder, and VAE are frozen.
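
As a rough illustration of why the trainable footprint stays small, here is a sketch of a rank-16 low-rank adapter on a single frozen linear weight. It follows the common LoRA convention (random A, zero B, update scaled by alpha / rank); the layer width, the alpha value, and the class name are assumptions, and the adapter in this release may be wired differently.

```python
import numpy as np

class LoRALinearSketch:
    """Frozen weight plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                        # frozen
        self.lora_a = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.lora_b = np.zeros((out_dim, rank))                     # trainable, zero-init
        self.scale = alpha / rank                                   # alpha is an assumed value

    def __call__(self, x):
        low_rank = (x @ self.lora_a.T) @ self.lora_b.T
        return x @ self.weight.T + self.scale * low_rank

    def trainable_params(self):
        return self.lora_a.size + self.lora_b.size

rng = np.random.default_rng(0)
dim = 256  # illustrative width, not the real block size
frozen = rng.normal(size=(dim, dim))
layer = LoRALinearSketch(frozen, rank=16)
x = rng.normal(size=(4, dim))

base_out = x @ frozen.T
initial_out = layer(x)          # B is zero, so this equals the frozen output

layer.lora_b = rng.normal(scale=0.01, size=layer.lora_b.shape)  # pretend training happened
trained_out = layer(x)
```

With these illustrative sizes the adapter adds 2 × 16 × 256 = 8,192 trainable values next to a 65,536-value frozen weight, and the saving grows with layer width.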

Objective. The base model's native flow-matching objective, unchanged.

skeleton ─▶ VAE ─▶ control latent ─┐
                                   ▼  (+ zero-init ControlEmbedder)
noisy latent ─▶ patch-embed ─▶ [ControlInitialLayer] ─▶ Block×N (+ rank-16 LoRA) ─▶ output

Training

Data. 3,914 (image, skeleton, caption) triples. Images are generated by Anima from a broad prompt distribution; skeletons are rendered from each image's detected keypoints (DWPose, COCO-WholeBody, black background); captions are quality-prefixed tag lists.

Configuration.

Setting	Value
Resolution	512, aspect-ratio bucketed (AR 0.5–2.0, 7 buckets)
Adapter rank	16
Learning rate	1e-4
Epochs	10
Control dropout	0.1
Precision	bf16
Optimizer	adamw (optimi)
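
The resolution row can be made concrete with a small sketch of aspect-ratio bucketing. It assumes the 7 buckets are spaced evenly in log space between AR 0.5 and 2.0, that each bucket keeps roughly 512² pixels, and that sides snap to a multiple of 16. The function names and the snapping rule are hypothetical, and the training framework's actual bucketing logic may differ.

```python
import math

def ar_buckets(min_ar=0.5, max_ar=2.0, num_buckets=7):
    """Aspect ratios (width / height), evenly spaced in log space."""
    ratio = max_ar / min_ar
    return [min_ar * ratio ** (i / (num_buckets - 1)) for i in range(num_buckets)]

def bucket_size(ar, resolution=512, multiple=16):
    """Width and height with about resolution**2 pixels at the given ratio."""
    def snap(value):
        # Snapping to a multiple of 16 is an assumption, not a documented rule.
        return max(multiple, round(value / multiple) * multiple)
    return snap(resolution * math.sqrt(ar)), snap(resolution / math.sqrt(ar))

def nearest_bucket(width, height, buckets):
    """Pick the bucket whose ratio is closest to the image's, in log space."""
    ar = width / height
    return min(buckets, key=lambda b: abs(math.log(b) - math.log(ar)))

buckets = ar_buckets()
sizes = [bucket_size(ar) for ar in buckets]
for ar, (w, h) in zip(buckets, sizes):
    print(f"AR {ar:.3f} -> {w}x{h}")
```

Under these assumptions the middle bucket is the square 512×512 case, and portrait and landscape buckets mirror each other around it.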

Final training loss: ≈ 0.13 (denoising MSE, mean over the final 400 optimizer steps); stable across all 10 epochs.

Results

Preliminary — measured on 10 held-out full-body poses (fresh generations not seen in training). Pose agreement is scored by re-detecting keypoints on each generated output and comparing to the target skeleton (PCK@0.1 = fraction of keypoints within 10% of the pose bounding-box diagonal).
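
For readers who want to reproduce the score, here is a minimal sketch of PCK@0.1 as defined above. Some details are assumptions rather than the evaluation script's confirmed behavior: the bounding box is taken over the target keypoints, keypoints the detector fails to return count as misses, and a distance exactly at the threshold counts as a hit.

```python
import math

def pck(pred, target, alpha=0.1):
    """Fraction of target keypoints matched within alpha * bbox diagonal.

    pred, target: equal-length lists of (x, y) tuples, or None where a
    keypoint is missing. Missing targets are skipped; missing predictions
    count as misses (an assumption of this sketch).
    """
    visible = [t for t in target if t is not None]
    if not visible:
        return 0.0
    xs = [x for x, _ in visible]
    ys = [y for _, y in visible]
    diagonal = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    threshold = alpha * diagonal

    hits = 0
    for p, t in zip(pred, target):
        if t is None:
            continue
        if p is not None and math.dist(p, t) <= threshold:
            hits += 1
    return hits / len(visible)

# Toy example: the target box is 30 x 40, so the diagonal is 50 and the
# threshold is 5 pixels.
target = [(0, 0), (30, 40), (30, 0), (0, 40)]
pred = [(2, 2), (30, 40), (40, 0), None]
score = pck(pred, target)
```

In the toy example two of four keypoints land within 5 pixels, one is 10 pixels off, and one is undetected, so the score is 0.5.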

Each row: the input reference, the DWPose skeleton extracted from it, base Anima at the same prompt and seed, then the pose-controlled output. Base ignores the skeleton; control follows it.

Metric (PCK@0.1)	Control off (0.0)	Control on (1.0)
Body pose — 17 COCO joints	0.20	0.59
Whole-body — 133 keypoints	0.32	0.59

Control roughly triples body-pose agreement (0.20 → 0.59); several held-out poses (standing, waving, hands-on-hips) reach near-perfect adherence.
With control off (strength 0.0) the output matches base Anima whatever the skeleton: the adapter is a clean no-op when disabled.
Strength sweep (0.0 / 0.5 / 1.0): raising strength moves the subject from the base pose toward the target skeleton; 0.5 is a partial blend, 1.0 follows the skeleton.
Limitations: adherence is strongest on upright / static poses; highly dynamic poses (running, jumping) are followed only partially at this preview's scale and resolution.

Usage (ComfyUI)

Preview-1 runs as a small ComfyUI custom node, Anima Control Apply (AnimaControlApply, category AnimaControl).

Install

Download adapter_model.safetensors from this repo into ComfyUI/models/loras/.
Copy the comfyui/anima_control_lora/ folder from this repo into ComfyUI/custom_nodes/, then restart ComfyUI. That folder is the whole node: __init__.py and its one helper, control_embedder.py.
Load pose_control.json from the ComfyUI workflow menu, or build the graph by hand (below).

One file does two jobs. adapter_model.safetensors holds both a low-rank adapter (lora.* keys) and the control embedder (control_embedder.* keys). LoraLoaderModelOnly reads the first set, Anima Control Apply reads the second. A bare filename in control_embedder_path is resolved against models/loras/, so both nodes just point at adapter_model.safetensors.
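
The two-jobs-in-one-file layout can be sketched as a key-prefix split plus a path fallback. Only the lora. and control_embedder. prefixes and the models/loras/ fallback come from the description above; the helper names, the example key suffixes, and the ComfyUI root are hypothetical, and the real node may resolve paths differently.

```python
import os

def split_adapter_keys(state_dict):
    """Separate low-rank adapter tensors from control embedder tensors."""
    lora = {k: v for k, v in state_dict.items() if k.startswith("lora.")}
    embedder = {k: v for k, v in state_dict.items()
                if k.startswith("control_embedder.")}
    return lora, embedder

def resolve_embedder_path(path, comfy_root="ComfyUI"):
    """Resolve a bare filename against models/loras/; keep other paths as-is."""
    if os.path.dirname(path):
        return path
    return os.path.join(comfy_root, "models", "loras", path)

# Stand-in for the loaded file; the key suffixes are invented for illustration.
fake_state = {
    "lora.blocks.0.attn.q.lora_A": "tensor",
    "lora.blocks.0.attn.q.lora_B": "tensor",
    "control_embedder.proj.weight": "tensor",
    "control_embedder.proj.bias": "tensor",
}
lora_part, embedder_part = split_adapter_keys(fake_state)
resolved = resolve_embedder_path("adapter_model.safetensors")
```

Because each consumer filters by its own prefix, both nodes can point at the same adapter_model.safetensors without interfering with each other.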

Two workflows are included:

pose_control_demo.json — the easiest way to test everything. Drop in any reference image; it runs DWPose to extract the skeleton, then generates the same prompt twice, once on base Anima and once with pose control, so you can compare side by side. The DWPose step needs the comfyui_controlnet_aux nodes.
pose_control.json — the core control graph, where you supply a skeleton image yourself.

Prepare a skeleton

The control input is a skeleton pose map: DWPose keypoints (COCO-WholeBody) drawn on a black background. You can make one two ways:

A DWPose preprocessor node (for example, from comfyui_controlnet_aux) run on a reference image inside ComfyUI.
The bundled scripts/render_skeletons.py, which needs pip install rtmlib opencv-python:
```
python scripts/render_skeletons.py --imgs <ref_dir> --out <skeleton_dir>
```

Graph

Load base Anima with your usual model loader.
LoraLoaderModelOnly: set lora_name to adapter_model.safetensors.
VAEEncode: encode the skeleton image to a LATENT (this is the control_latent).
Anima Control Apply: inputs are model, control_latent, control_embedder_path (adapter_model.safetensors), and strength. It returns a patched MODEL.
Sample from the patched model as usual.
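
The same graph can be written down as a ComfyUI API-format prompt, shown here as a Python dict. This is a sketch under assumptions: the base loader nodes (UNETLoader, VAELoader), every file name except adapter_model.safetensors, and the choice to feed the LoRA-patched model into AnimaControlApply are guesses; the AnimaControlApply input names are the ones listed above. The sampler nodes are omitted.

```python
import json

graph = {
    # Base model and VAE loaders: placeholders for "your usual model loader".
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "anima_base.safetensors",  # hypothetical file name
                     "weight_dtype": "default"}},
    "2": {"class_type": "VAELoader",
          "inputs": {"vae_name": "anima_vae.safetensors"}},  # hypothetical file name
    # Low-rank adapter, read from the lora.* keys.
    "3": {"class_type": "LoraLoaderModelOnly",
          "inputs": {"model": ["1", 0],
                     "lora_name": "adapter_model.safetensors",
                     "strength_model": 1.0}},
    # Skeleton image encoded into the control latent.
    "4": {"class_type": "LoadImage",
          "inputs": {"image": "skeleton.png"}},              # hypothetical file name
    "5": {"class_type": "VAEEncode",
          "inputs": {"pixels": ["4", 0], "vae": ["2", 0]}},
    # Control embedder, read from the control_embedder.* keys of the same file.
    "6": {"class_type": "AnimaControlApply",
          "inputs": {"model": ["3", 0],
                     "control_latent": ["5", 0],
                     "control_embedder_path": "adapter_model.safetensors",
                     "strength": 1.0}},
}

def dangling_links(prompt):
    """Return (node_id, input_name) pairs that point at a missing node."""
    missing = []
    for node_id, node in prompt.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and value[0] not in prompt:
                missing.append((node_id, name))
    return missing

payload = json.dumps({"prompt": graph})
```

The patched MODEL from node 6 would then feed whatever sampler setup you normally use with Anima.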

Strength. 0.0 is the base model with no control; 1.0 follows the skeleton. The range is 0 to 2. Higher values track the pose more closely but can cost some image quality.

Limitations

Preview quality. Lightly trained; pose adherence may be inconsistent and is still under evaluation.
512×512 only. The adapter is trained at 512, and the control latent has to line up spatially with the generation latent, so the skeleton input and the output are both fixed at 512×512. Other resolutions are not supported yet: off-512 the latents do not match and the pose is not tracked. Fine detail (hands, faces) is also limited at this resolution.
Single input injection. Control is injected once, at the model input; at low strength the model may under-follow the pose. Raising strength helps but can reduce image quality.
Pose only. Other control types are roadmap, not in this release.
Non-commercial. Inherits the Anima base model license.

The platform & its v1 components

Pose is the first component on a shared control harness for Anima. The harness owns the frozen base, the adapter machinery, and the data/caching path; each control type plugs in by supplying a conditioning encoder and a fusion point. Planned v1 components:

Pose v1: this preview, hardened for higher fidelity and reliable adherence.
IP-Adapter v1: image-prompt conditioning, to drive style and content from a reference image.
Identity v1: face-identity conditioning, to keep a subject's identity across generations.

A pose detector trained specifically for Anima is also on the table. The skeletons here come from DWPose, a general detector built for photographs, so they are noisy on anime art. An Anima-native detector would give cleaner skeletons for both training data and inference, which should directly improve how closely generations follow the target pose.

License

These weights are a derivative of the Anima base model (circlestone-labs/Anima) and inherit its terms: the CircleStone Labs Non-Commercial License and, because Anima is itself a derivative of Cosmos-Predict2, the NVIDIA Open Model License.

The model weights are for non-commercial use only. Generated images (outputs) are not restricted by these terms and may be used commercially. See the base model card and the bundled LICENSE for the full text.

Support

Building these models means mining and labeling hundreds of thousands of images and renting GPUs to train on them, which takes real time and money. If they are useful to you and you want to chip in, it is appreciated and never expected: https://ko-fi.com/claquasse

Citation

@misc{anima_control_pose_preview1,
  title  = {Anima Control --- Pose (Preview-1)},
  author = {Claquasse},
  year   = {2026},
  note   = {Preview-1 pose control adapter for Anima v1.0},
  howpublished = {\url{https://huggingface.co/Claquasse/Anima-Control-Pose}}
}

Built on Anima (CircleStone Labs), the Cosmos-Predict2 transformer architecture, and the diffusion-pipe training framework.
