
You probably do not need this unless you are training your own IP Adapters.

Modified version of the vision encoder of CLIP-ViT-H-14-laion2B-s32B-b79K to handle 448 x 448 inputs vs the original 224 x 224 inputs. It will probably not work for classification (as is), but it will work for IP+ adapters that use CLIP-ViT-H, though they will need some additional fine-tuning.

Hidden layer outputs go from (257, 1280) to (1025, 1280), which can be digested by the Resampler without modification or weight resizing.
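For anyone wiring this into an IP-Adapter training script, here is a minimal sketch of how the shapes come out. It assumes the checkpoint loads as a standard transformers CLIP vision tower and that its preprocessor is configured for 448 x 448; the repo id below is a placeholder, not the actual model path.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Placeholder repo id -- substitute the actual path of this checkpoint.
repo_id = "your-namespace/CLIP-ViT-H-448"

# Assumes the checkpoint loads as a standard CLIP vision tower in fp16.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    repo_id, torch_dtype=torch.float16
).eval()
processor = CLIPImageProcessor.from_pretrained(repo_id)

# Dummy 448 x 448 RGB image, just to exercise the shapes.
image = Image.fromarray(np.zeros((448, 448, 3), dtype=np.uint8))
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = image_encoder(
        pixel_values=inputs.pixel_values.to(torch.float16),
        output_hidden_states=True,
    )

# IP-Adapter Plus feeds the penultimate hidden state to the Resampler:
# (batch, 1025, 1280) here vs (batch, 257, 1280) for the stock 224 px encoder.
print(out.hidden_states[-2].shape)
```

Because the Resampler attends over the token dimension, the jump from 257 to 1025 tokens changes sequence length only, not feature width, which is why no weight resizing is needed.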
