TokForge: SDXL IP-Adapter (Reference Identity) bundle
The highest-fidelity reference-identity image route for the TokForge Android app, and the clean multi-subject path. Attach a photo of a person (or two), then render that person in any scene ("me as a superhero, unmasked, face visible"). The plus-face IP-Adapter transfers the face only while the prompt drives the whole scene. SDXL's stronger plus-face transfer (vs SD1.5) gives sharper single-subject identity, and makes the regional-mask two-subject path produce two distinct recognizable faces.
This bundle runs on the on-device stable-diffusion.cpp engine (TokForge's IP-Adapter port)
on CPU and Adreno OpenCL. Full SDXL is heavier than the SD1.5 IP-Adapter tier; this is
offered on 16 GB-class phones. For 8 GB phones, use the lighter
darkmaniac7/TokForge-SD15-IPAdapter tier instead.
Files
| File | Size | License | Contents |
|---|---|---|---|
| realvisxl-v40-lightning-fp16.safetensors | ~6.9 GB | OpenRAIL++ | RealVisXL V4.0 Lightning (SDXL photoreal finetune): dual CLIP text encoders + UNet + VAE in one self-contained f16 sd.cpp safetensors |
| ip-adapter-plus-face_sdxl_vit-h.safetensors | ~848 MB | Apache-2.0 | IP-Adapter plus-face SDXL (h94/IP-Adapter): 16-token image_proj Resampler + 70 decoupled cross-attn layers (cross_attention_dim 2048) |
| ip_adapter_clip_vision_vith.safetensors | ~2.5 GB | MIT | OpenCLIP ViT-H-14 image encoder (1280 hidden, 32 layers). The plus-face path needs ViT-H, not bigG; this is the same encoder the SD1.5 plus-face bundle ships |
manifest.json and MD5SUMS carry the integrity hashes + render defaults.
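A download of this size is worth checking before the first render. The following is a minimal sketch of such a check, assuming MD5SUMS uses the usual `md5sum` layout (one `<hash>  <filename>` pair per line); that layout is an assumption, not something this card specifies, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB weights never sit in memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(bundle_dir: Path, sums_name: str = "MD5SUMS") -> dict[str, bool]:
    """Return {filename: matches} for every entry in the checksum file.

    Assumes md5sum-style lines: "<32 hex chars>  <filename>".
    """
    results: dict[str, bool] = {}
    for line in (bundle_dir / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()  # md5sum prefixes binary-mode names with '*'
        target = bundle_dir / name
        # A missing file counts as a failed check rather than raising.
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Calling `verify_md5sums(Path("path/to/bundle"))` returns a mapping such as `{"some-file.safetensors": True}`; any `False` entry points at a missing or corrupted file.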
Why this base, and why f16 (not Q4)
The base is RealVisXL V4.0 Lightning, the same self-contained SDXL safetensors the
TokForge "RealVisXL SDXL Quality" tier ships: a strong photoreal SDXL finetune distilled
for a few-step (6-step) Lightning floor. It is kept at f16 (unquantized half precision) so the
IP-Adapter's decoupled cross-attention and the face Resampler keep subject quality high.
A q4_0 base measurably weakens the transferred identity, so this bundle deliberately uses f16
(matching the SD1.5 tier's quality choice).
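To put a rough number on what the f16 choice costs in download size, here is a back-of-the-envelope sketch. It assumes every weight in the ~6.9 GB file were stored as ggml q4_0 (32 weights packed into 18 bytes, i.e. 4.5 bits per weight). Real quantized files usually keep some tensors at higher precision, so treat the result as a lower bound on the quantized size, not a measured figure.

```python
F16_BITS_PER_WEIGHT = 16
# ggml q4_0 block: one f16 scale (2 bytes) + 32 four-bit values (16 bytes) per 32 weights.
Q4_0_BITS_PER_WEIGHT = 18 * 8 / 32  # 4.5


def estimated_q4_0_size_gb(f16_size_gb: float) -> float:
    """Scale an f16 file size down as if all weights were q4_0 (idealised)."""
    return f16_size_gb * Q4_0_BITS_PER_WEIGHT / F16_BITS_PER_WEIGHT


estimate = round(estimated_q4_0_size_gb(6.9), 2)  # 1.94
```

Under that assumption the base would shrink from about 6.9 GB to roughly 1.9 GB, which is the saving this bundle gives up in exchange for stronger identity transfer.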
Why plus-face + ViT-H (not the base SDXL adapter + bigG)
The standard ip-adapter_sdxl projects the whole pooled CLIP-bigG embedding, so it drags the
reference's entire scene through. The plus-face variant (ip-adapter-plus-face_sdxl_vit-h)
runs a 16-token Resampler over the ViT-H penultimate hidden state and extracts the face
only, so identity is preserved while the prompt controls the scene. Because this adapter is
the _vit-h build (image_proj.latents shape [1, 16, 1280]), it pairs with the ViT-H
encoder (1280 hidden), not the bigG encoder (1664 hidden) the base SDXL adapter uses. The
TokForge sd.cpp IP-Adapter loader auto-detects plus-face from the image_proj.latents tensor
and selects the SDXL adapter config (2048-dim, 70 layers) from the SDXL base.
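The pairing can be checked offline by reading only the safetensors header, which is an 8-byte little-endian length followed by a JSON table of tensor names and shapes. The sketch below is an illustration in Python, not the TokForge loader itself (that lives in the sd.cpp port); it assumes the adapter file stores the tensor under exactly the key `image_proj.latents`, and `describe_adapter` is a hypothetical helper name.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path: Path) -> dict:
    """Parse only the JSON header; the tensor data itself is never read."""
    with path.open("rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(header_len))


def describe_adapter(path: Path) -> dict:
    """Report whether the file looks like a plus (Resampler) adapter.

    Assumption: plus variants expose 'image_proj.latents' with shape
    [1, num_tokens, encoder_hidden]; other adapters lack that key.
    """
    latents = read_safetensors_header(path).get("image_proj.latents")
    if latents is None:
        return {"plus_variant": False, "tokens": None, "encoder_hidden": None}
    _, tokens, hidden = latents["shape"]
    return {"plus_variant": True, "tokens": tokens, "encoder_hidden": hidden}
```

For the adapter in this bundle, the shape stated above would give 16 tokens and an encoder hidden size of 1280, which is the ViT-H width; a mismatch with the shipped image encoder would show up here before any render is attempted.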
How TokForge uses it
In the app (16 GB+ phones): Image model picker → download "SDXL IP-Adapter (Reference Identity)" → attach a face photo as a reference under chat → prompt the scene. The engine is invoked as:
sd -M img_gen \
-m realvisxl-v40-lightning-fp16.safetensors \
-p "as a superhero, unmasked, face visible, detailed face" \
-n "<strong negative>" \
--clip_vision ip_adapter_clip_vision_vith.safetensors \
--ip-adapter ip-adapter-plus-face_sdxl_vit-h.safetensors \
--ip-adapter-image <your_face.jpg> \
--ip-adapter-scale 0.75 \
--cfg-scale 2.0 --sampling-method euler --scheduler discrete \
--steps 6 -H 1024 -W 1024
Recommended render settings
| Setting | Value |
|---|---|
| sampler | euler |
| scheduler | discrete |
| steps | 6 (Lightning few-step floor) |
| cfg-scale | 2.0 (Lightning low-CFG) |
| ip-adapter-scale | 0.75 (keeps the scene with strong recognizable identity; lower → more scene freedom, higher → closer to the reference) |
| resolution | 1024×1024 (SDXL native) |
Plus-face transfers the face only, so the rendered face must stay visible and unobstructed for a recognizable identity. Keep the face in frame ("unmasked, face visible, detailed face, looking at viewer"); the app appends this cue automatically.
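For scripted runs outside the app, the recommended settings and the face cue can be assembled into an argument list. This is a minimal sketch: the flags and values are copied from the invocation and table in this card, while `build_sd_args`, the `sd` binary name on the path, and the way the cue is appended are illustrative assumptions rather than the app's actual logic.

```python
from pathlib import Path

# Cue quoted in this card; how the app appends it is not documented here.
FACE_CUE = "unmasked, face visible, detailed face, looking at viewer"

# Recommended render settings from the table above.
RENDER_DEFAULTS = {
    "--ip-adapter-scale": "0.75",
    "--cfg-scale": "2.0",
    "--sampling-method": "euler",
    "--scheduler": "discrete",
    "--steps": "6",
    "-H": "1024",
    "-W": "1024",
}


def build_sd_args(prompt: str, negative: str, face_image: Path,
                  bundle_dir: Path, sd_binary: str = "sd") -> list[str]:
    """Build the argument list for one reference-identity render."""
    # Append the cue only if the caller has not already included it.
    full_prompt = prompt if FACE_CUE in prompt else f"{prompt}, {FACE_CUE}"
    args = [
        sd_binary, "-M", "img_gen",
        "-m", str(bundle_dir / "realvisxl-v40-lightning-fp16.safetensors"),
        "-p", full_prompt,
        "-n", negative,
        "--clip_vision", str(bundle_dir / "ip_adapter_clip_vision_vith.safetensors"),
        "--ip-adapter", str(bundle_dir / "ip-adapter-plus-face_sdxl_vit-h.safetensors"),
        "--ip-adapter-image", str(face_image),
    ]
    for flag, value in RENDER_DEFAULTS.items():
        args += [flag, value]
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Passing a list instead of a shell string avoids quoting problems with commas and spaces in the prompt.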
Licenses
This is an aggregate of three independently-licensed components; each retains its own license:
- RealVisXL V4.0 Lightning base (realvisxl-v40-lightning-fp16.safetensors): OpenRAIL++ (SG161222/RealVisXL_V4.0_Lightning, the SDXL openrail++ license). Use must comply with the OpenRAIL++ use-based restrictions.
- IP-Adapter plus-face (SDXL, ViT-H) (ip-adapter-plus-face_sdxl_vit-h.safetensors): Apache-2.0 (h94/IP-Adapter).
- OpenCLIP ViT-H-14 image encoder (ip_adapter_clip_vision_vith.safetensors): MIT (OpenCLIP / LAION ViT-H-14).
The non-commercial IP-Adapter-FaceID / InsightFace path is NOT used here, only the Apache-2.0 base + plus-face adapters from h94/IP-Adapter.
Provenance
- Base: realvisxl-v40-lightning-fp16.safetensors, the self-contained sd.cpp SDXL safetensors built from SG161222/RealVisXL_V4.0_Lightning (the same base the TokForge RealVisXL SDXL tier ships).
- Adapter + image encoder copied verbatim from h94/IP-Adapter (sdxl_models/ip-adapter-plus-face_sdxl_vit-h.safetensors, models/image_encoder/model.safetensors).