TokForge — SDXL IP-Adapter (Reference Identity) bundle

The highest-fidelity reference-identity image route in the TokForge Android app, and its clean multi-subject path. Attach a photo of one person (or two), then render that person in any scene ("me as a superhero, unmasked, face visible"). The plus-face IP-Adapter transfers only the face while the prompt drives the whole scene. SDXL's stronger plus-face transfer (compared with SD1.5) gives sharper single-subject identity and lets the regional-mask two-subject path produce two distinct, recognizable faces.

This bundle runs on the on-device stable-diffusion.cpp engine (TokForge's IP-Adapter port) on CPU and Adreno OpenCL. Full SDXL is heavier than the SD1.5 IP-Adapter tier, so this bundle is offered on 16 GB-class phones. On 8 GB phones, use the lighter darkmaniac7/TokForge-SD15-IPAdapter tier instead.

Files

| File | Size | License | Contents |
|---|---|---|---|
| `realvisxl-v40-lightning-fp16.safetensors` | ~6.9 GB | OpenRAIL++ | RealVisXL V4.0 Lightning (SDXL photoreal finetune): dual CLIP text encoders, UNet, and VAE in one self-contained f16 sd.cpp safetensors |
| `ip-adapter-plus-face_sdxl_vit-h.safetensors` | ~848 MB | Apache-2.0 | IP-Adapter plus-face SDXL (`h94/IP-Adapter`): 16-token `image_proj` Resampler plus 70 decoupled cross-attention layers (`cross_attention_dim` 2048) |
| `ip_adapter_clip_vision_vith.safetensors` | ~2.5 GB | MIT | OpenCLIP ViT-H-14 image encoder (1280 hidden, 32 layers). The plus-face path needs ViT-H, not bigG; this is the same encoder the SD1.5 plus-face bundle ships |

`manifest.json` and `MD5SUMS` carry the integrity hashes and render defaults.
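A minimal sketch for checking a download against `MD5SUMS`. It assumes that file uses the standard `md5sum` line format (`<hex digest>  <filename>`, optionally `*<filename>` for binary mode). On Linux, `md5sum -c MD5SUMS` does the same job. The sketch streams each file in 1 MiB chunks, so multi-gigabyte weights are never held in memory at once. `md5_of` and `verify_md5sums` are illustrative helper names, not part of TokForge.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large weights never sit fully in RAM."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_md5sums(sums_file: str) -> dict[str, bool]:
    """Check every '<hex>  <name>' line; names resolve next to the sums file."""
    sums_path = Path(sums_file)
    results: dict[str, bool] = {}
    for line in sums_path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # drop md5sum's binary-mode marker
        target = sums_path.parent / name
        # A missing file counts as a failure rather than raising.
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Any `False` entry means that file should be re-downloaded before the engine loads it.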

Why this base, and why f16 (not Q4)

The base is RealVisXL V4.0 Lightning, the same self-contained SDXL safetensors that the TokForge "RealVisXL SDXL Quality" tier ships. It is a strong photoreal SDXL finetune distilled for a few-step Lightning floor (6 steps). It is kept at f16 (unquantized half precision) so that identity transfer through the IP-Adapter's decoupled cross-attention and face Resampler stays strong. A q4_0 base measurably weakens the transferred identity, so this bundle deliberately uses f16, matching the SD1.5 tier's quality choice.
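The memory cost of that choice can be estimated with back-of-envelope arithmetic. This assumes the ~6.9 GB file is essentially all f16 weights (2 bytes per parameter) and uses ggml's q4_0 block layout (one f16 scale plus 32 four-bit weights, 18 bytes per block):

$$
N \approx \frac{6.9 \times 10^{9}\ \text{bytes}}{2\ \text{bytes/param}} \approx 3.45 \times 10^{9}\ \text{params}
$$

$$
\text{q4\_0: } \frac{18\ \text{bytes}}{32\ \text{params}} = 4.5\ \text{bits/param}
\;\Rightarrow\; 3.45 \times 10^{9} \times \frac{18}{32}\ \text{bytes} \approx 1.94 \times 10^{9}\ \text{bytes}
$$

So f16 costs roughly 3.5× the weight memory of a fully q4_0 base. Real q4_0 files come out somewhat larger, because some tensors (norms, for example) usually stay unquantized. That extra memory is the price of the stronger identity transfer, and it is why this tier targets 16 GB-class phones.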

Why plus-face + ViT-H (not the base SDXL adapter + bigG)

The standard `ip-adapter_sdxl` projects the whole pooled CLIP-bigG image embedding, so it drags the reference's entire scene into the output. The plus-face variant (`ip-adapter-plus-face_sdxl_vit-h`) instead runs a 16-token Resampler over the ViT-H penultimate hidden state. That extracts only the face, so identity is preserved while the prompt controls the scene. Because this adapter is the `_vit-h` build (its Resampler consumes 1280-wide ViT-H hidden states; `image_proj.latents` has shape [1, 16, 1280]), it pairs with the ViT-H encoder (1280 hidden). It does not pair with the bigG encoder (1664 hidden) that the base SDXL adapter uses. The TokForge sd.cpp IP-Adapter loader auto-detects plus-face from the presence of `image_proj.latents`. It selects the SDXL adapter config (2048-dim cross-attention, 70 layers) when the base model is SDXL.
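The same checks can be approximated offline by reading the safetensors header, without loading any weights. A safetensors file starts with an 8-byte little-endian header length, followed by a JSON table of tensor names, dtypes, and shapes.

This is a sketch, not the TokForge loader's code. Only `image_proj.latents` is named in this card. The key names `image_proj.proj_in.weight` (whose input width is the encoder width) and `*.to_k_ip.weight` (one per adapter layer, input width = `cross_attention_dim`) are assumptions based on h94's IP-Adapter naming.

```python
import json
import struct

# Hidden widths from this card: ViT-H-14 = 1280, bigG = 1664.
ENCODER_BY_WIDTH = {1280: "ViT-H-14", 1664: "ViT-bigG-14"}


def read_safetensors_header(path: str) -> dict:
    """Return the tensor table (name -> dtype/shape/offsets) without reading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def describe_ip_adapter(path: str) -> dict:
    header = read_safetensors_header(path)
    latents = header.get("image_proj.latents")  # present only for Resampler (plus) adapters
    proj_in = header.get("image_proj.proj_in.weight")  # assumed key: [dim, encoder_width]
    k_ip = [v for k, v in header.items() if k.endswith(".to_k_ip.weight")]  # assumed key
    width = proj_in["shape"][1] if proj_in is not None else None
    return {
        "plus_resampler": latents is not None,
        "num_tokens": latents["shape"][1] if latents is not None else None,
        "encoder": ENCODER_BY_WIDTH.get(width, "unknown") if latents is not None else "n/a",
        "ip_layers": len(k_ip),
        # Linear weight is [out, in]; "in" is the cross-attention context width.
        "cross_attention_dim": k_ip[0]["shape"][1] if k_ip else None,
    }
```

If those key-name assumptions hold, `describe_ip_adapter("ip-adapter-plus-face_sdxl_vit-h.safetensors")` should report a Resampler adapter with 16 tokens, a ViT-H-width input, 70 layers, and a 2048-dim context. A bigG-width result would indicate the wrong encoder pairing.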

How TokForge uses it

In the app (16 GB+ phones): Image model picker → download "SDXL IP-Adapter (Reference Identity)" → attach a face photo as a reference in the chat → prompt the scene. The engine is then invoked as:

```bash
sd -M img_gen \
   -m realvisxl-v40-lightning-fp16.safetensors \
   -p "as a superhero, unmasked, face visible, detailed face" \
   -n "<strong negative>" \
   --clip_vision ip_adapter_clip_vision_vith.safetensors \
   --ip-adapter ip-adapter-plus-face_sdxl_vit-h.safetensors \
   --ip-adapter-image <your_face.jpg> \
   --ip-adapter-scale 0.75 \
   --cfg-scale 2.0 --sampling-method euler --scheduler discrete \
   --steps 6 -H 1024 -W 1024
```
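For scripting, the same invocation can be assembled as an argument list and passed to `subprocess.run`. A typical use is a fixed-seed sweep over `--ip-adapter-scale`. Using a list also sidesteps shell quoting of prompts.

This is a sketch. Every flag except `--seed` and `-o` is copied from the command above. `--seed` and `-o` are upstream stable-diffusion.cpp options and are *assumed*, not confirmed by this card, to exist in TokForge's port. `build_sd_argv` and `scale_sweep` are hypothetical helpers.

```python
import shlex

# Bundle files, as named in this card.
BASE = "realvisxl-v40-lightning-fp16.safetensors"
CLIP_VISION = "ip_adapter_clip_vision_vith.safetensors"
IP_ADAPTER = "ip-adapter-plus-face_sdxl_vit-h.safetensors"

# Recommended Lightning render settings from this card.
RENDER_DEFAULTS = [
    ("--cfg-scale", "2.0"),
    ("--sampling-method", "euler"),
    ("--scheduler", "discrete"),
    ("--steps", "6"),
    ("-H", "1024"),
    ("-W", "1024"),
]


def build_sd_argv(prompt: str, negative: str, face_image: str,
                  ip_scale: float = 0.75, seed: int = 42,
                  output: str = "output.png", sd_binary: str = "sd") -> list[str]:
    """Return argv for one reference-identity render (hypothetical helper)."""
    argv = [
        sd_binary, "-M", "img_gen",
        "-m", BASE,
        "-p", prompt,
        "-n", negative,
        "--clip_vision", CLIP_VISION,
        "--ip-adapter", IP_ADAPTER,
        "--ip-adapter-image", face_image,
        "--ip-adapter-scale", f"{ip_scale:g}",
        # Assumption: upstream sd.cpp's --seed and -o flags survive in the port.
        "--seed", str(seed),
        "-o", output,
    ]
    for flag, value in RENDER_DEFAULTS:
        argv += [flag, value]
    return argv


def scale_sweep(prompt: str, negative: str, face_image: str,
                scales=(0.5, 0.75, 0.9)) -> list[list[str]]:
    """Same prompt and seed at several adapter scales, one output file per scale."""
    return [
        build_sd_argv(prompt, negative, face_image, ip_scale=s,
                      output=f"sweep_{s:g}.png")
        for s in scales
    ]


if __name__ == "__main__":
    for argv in scale_sweep("as a superhero, unmasked, face visible, detailed face",
                            "blurry, deformed", "face.jpg"):
        # Shell-safe preview; pass argv to subprocess.run(argv, check=True) to render.
        print(shlex.join(argv))
```

The sweep values 0.5 and 0.9 are illustrative choices around the recommended 0.75. Holding the seed fixed means only the adapter scale changes between images.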

Recommended render settings

| Setting | Value |
|---|---|
| sampler | `euler` |
| scheduler | `discrete` |
| steps | `6` (Lightning few-step floor) |
| cfg-scale | `2.0` (Lightning low CFG) |
| ip-adapter-scale | `0.75` (keeps the scene while giving strong, recognizable identity; lower values give the prompt more scene freedom, higher values stay closer to the reference) |
| resolution | `1024×1024` (SDXL native) |
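In the original IP-Adapter formulation, the adapter scale is the weight on the image branch of each decoupled cross-attention layer. The card does not say whether TokForge's port follows that formulation exactly. Assuming it does, each of the 70 adapter layers computes:

$$
Z^{\text{new}} = \operatorname{Attention}(Q, K_t, V_t) + \lambda\,\operatorname{Attention}(Q, K_i, V_i)
$$

$$
Q = Z W_q,\quad K_t = c_t W_k,\quad V_t = c_t W_v,\quad K_i = c_i W_k',\quad V_i = c_i W_v'
$$

Here $Z$ are the UNet's query features, $c_t$ the text-encoder embeddings, and $c_i$ the 16 face tokens from the Resampler. $W_k'$ and $W_v'$ are the adapter's added projections, and $\lambda$ is `ip-adapter-scale`. Setting $\lambda = 0$ recovers plain text-to-image. Raising $\lambda$ weights the face branch more heavily against the prompt, which matches the scene-freedom versus identity trade-off in the table above.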

Plus-face transfers the face only, so the rendered face must stay visible and unobstructed for a recognizable identity. Keep the face in frame ("unmasked, face visible, detailed face, looking at viewer") — the app appends this cue automatically.
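The following hypothetical illustration shows how such a cue could be appended without duplicating terms the user already typed. `with_face_cue` and its case-insensitive substring rule are assumptions for illustration, not the app's actual code.

```python
FACE_CUE = "unmasked, face visible, detailed face, looking at viewer"


def with_face_cue(prompt: str, cue: str = FACE_CUE) -> str:
    """Append each comma-separated cue term the prompt does not already contain."""
    present = prompt.lower()
    missing = [t.strip() for t in cue.split(",")
               if t.strip() and t.strip().lower() not in present]
    if not missing:
        return prompt  # already fully cued: leave the prompt untouched
    base = prompt.rstrip().rstrip(",")  # avoid a doubled comma
    return ", ".join([base, *missing]) if base else ", ".join(missing)
```

Because terms are only added when absent, applying the function twice gives the same prompt as applying it once.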

Licenses

This is an aggregate of three independently licensed components; each retains its own license:

- RealVisXL V4.0 Lightning base (`realvisxl-v40-lightning-fp16.safetensors`): OpenRAIL++ (SG161222/RealVisXL_V4.0_Lightning, the SDXL openrail++ license). Use must comply with the OpenRAIL++ use-based restrictions.
- IP-Adapter plus-face (SDXL, ViT-H) (`ip-adapter-plus-face_sdxl_vit-h.safetensors`): Apache-2.0 (h94/IP-Adapter).
- OpenCLIP ViT-H-14 image encoder (`ip_adapter_clip_vision_vith.safetensors`): MIT (OpenCLIP / LAION ViT-H-14).

The non-commercial IP-Adapter-FaceID / InsightFace path is NOT used here. The bundle takes only Apache-2.0 adapter weights from h94/IP-Adapter's standard (non-FaceID) line, namely the plus-face adapter.

Provenance

- Base: `realvisxl-v40-lightning-fp16.safetensors`, the self-contained sd.cpp SDXL safetensors built from SG161222/RealVisXL_V4.0_Lightning (the same base the TokForge RealVisXL SDXL tier ships).
- Adapter and image encoder: copied verbatim from h94/IP-Adapter (`sdxl_models/ip-adapter-plus-face_sdxl_vit-h.safetensors` and `models/image_encoder/model.safetensors`, respectively).

