---
datasets:
- SPRIGHT-T2I/spright_coco
---
## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)\*\*.

Made possible with Geometric Parametrization (GmP):

```
"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/OqhNxW-D9c58mkZyUQlL_.png)

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any other state_dict, e.g. with ComfyUI as the SDXL / SD3 Text Encoder using the [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes!

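As a rough illustration of the decomposition described above, the following dependency-free sketch splits each row of a weight matrix into a norm `r` and a unit direction `theta`, then folds them back into a plain weight matrix, which is the kind of conversion that lets a state_dict load like any other. The function names `decompose` and `recompose` are hypothetical, plain Python lists stand in for tensors, and taking the norm per output row is an assumption; this is not the actual training code, which lives in the repository linked on this card.

```python
import math


def decompose(weight):
    """Split each row of a weight matrix into (r, theta).

    r     : L2 norm of the row (radial component)
    theta : the row scaled to unit length (angular component)
    Assumption: one norm per output row, and no all-zero rows.
    """
    r, theta = [], []
    for row in weight:
        norm = math.sqrt(sum(w * w for w in row))
        r.append(norm)
        theta.append([w / norm for w in row])
    return r, theta


def recompose(r, theta):
    """Rebuild a plain weight matrix as r * theta / ||theta||.

    theta is re-normalized here because it may drift away from
    unit length while it is being trained.
    """
    weight = []
    for radius, direction in zip(r, theta):
        norm = math.sqrt(sum(t * t for t in direction))
        weight.append([radius * t / norm for t in direction])
    return weight


# Toy 2x2 "pre-trained" weight matrix (made-up values).
weight = [[3.0, 4.0], [0.5, -1.2]]
r, theta = decompose(weight)
restored = recompose(r, theta)
```

Because only the direction of `theta` matters after normalization, the magnitude of each weight row is carried entirely by `r`, which is what allows the two components to be adjusted separately during fine-tuning.
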
🤗 \*\* For details on training and the evaluation behind these numbers, or to fine-tune the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)

```
@article{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
  journal={arXiv preprint arXiv:2403.15378},
  year={2024}
}
```

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)