Long-CLIP ViT-L/14 finetune: SAE-informed adversarial training
- SAE = Sparse autoencoder. All training info & code: github.com/zer0int/CLIP-SAE-finetune
- This Long-CLIP (direct download: Text Encoder) is also the best Long-CLIP to use with HunyuanVideo.
- Required: use it with my zer0int/ComfyUI-HunyuanVideo-Nyan node, which changes the relative influence of the LLM vs. CLIP; without it, the difference is very small. A loading sketch follows this list.
- ☕ Buy me a coffee
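For standalone use outside ComfyUI, here is a minimal sketch of loading the text encoder with Hugging Face transformers. It assumes this repo ships HF-format CLIP weights with the Long-CLIP 248-token text context; check the repo files before relying on it.

```python
# A minimal sketch, assuming this repo ships Hugging Face-format CLIP weights
# with the Long-CLIP 248-token text context (vs. stock CLIP's 77 tokens).
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-SAE-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

prompt = "A long, detailed prompt that runs well past stock CLIP's ~20-token effective length..."
inputs = processor(text=[prompt], return_tensors="pt", padding=True, truncation=True)

# Encode the prompt into the shared CLIP embedding space.
text_features = model.get_text_features(**inputs)
print(text_features.shape)  # torch.Size([1, 768]) for ViT-L/14
```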
The original CLIP model accepts a maximum of 77 input tokens, but its effective length is only ~20 tokens; see the original Long-CLIP paper for details. HunyuanVideo demo prompts (a token-count sketch follows them):
69 tokens, normal scene:
- Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. Lighting: Low-key with backlit silhouettes. Background: Gothic cathedral at night, stained glass windows breaking. Camera angle: Over the shoulder of a ninja, tracking her mid-air leap as she lands on a rooftop.
52 tokens, OOD (out-of-distribution) scene: superior consistency and prompt-following despite the OOD concept.
- In this surreal nightmare documentary, a sizable spider with a human face is peacefully savoring her breakfast at a diner. The spider has a spider body, but a lady's face on the front, and regular human hands at the end of the spider legs.
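To check where a prompt falls relative to the 77-token hard limit, here is a hedged sketch using the stock CLIP tokenizer. The openai/clip-vit-large-patch14 tokenizer ID is an assumption; counts can differ slightly between tokenizer implementations.

```python
# A hedged sketch: count how many CLIP BPE tokens a prompt uses, relative to
# stock CLIP's 77-token hard limit and ~20-token effective length.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = ("Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. "
          "Lighting: Low-key with backlit silhouettes. Background: Gothic "
          "cathedral at night, stained glass windows breaking. Camera angle: "
          "Over the shoulder of a ninja, tracking her mid-air leap as she "
          "lands on a rooftop.")

# Count content tokens only (no BOS/EOS special tokens).
token_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(len(token_ids))  # ~69 tokens, matching the demo prompt above
```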
Base model: BeichenZhang/LongCLIP-L