sentence-transformers/clip-ViT-B-32 · cross_encoder and bi_encoder for clip fine tuning

Hello,

I tried to fine-tune the CLIP model in both bi_encoder and cross_encoder modes.
I prepared the training data as (image, text) pairs, as shown below:
from PIL import Image
from sentence_transformers import InputExample

train_examples = [
    InputExample(texts=[Image.open('175105641.jpeg'), 'grey basket'], label=0.3),
    InputExample(texts=[Image.open('175105641.jpeg'), 'vegetable washer'], label=0.2),
]
Q1. I am not sure whether this is correct.
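
To make the role of the float labels concrete, here is a minimal sketch of what a cosine-similarity objective would compare them against. It assumes the bi-encoder is trained with a loss like losses.CosineSimilarityLoss, which to my understanding takes the mean squared error between the cosine similarity of the two embeddings and the label. The 3-d vectors and the helper names cosine_similarity and cosine_mse_loss are hypothetical stand-ins, not part of sentence-transformers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between each pair's cosine similarity and its label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Hypothetical embeddings standing in for the image and text encoder outputs.
image_emb = [1.0, 0.0, 0.0]
text_emb_a = [1.0, 0.0, 0.0]  # same direction as the image embedding
text_emb_b = [0.0, 1.0, 0.0]  # orthogonal to the image embedding

pairs = [(image_emb, text_emb_a), (image_emb, text_emb_b)]
labels = [0.3, 0.2]  # the label values from the training data above
loss = cosine_mse_loss(pairs, labels)
```

Under this assumption, a label of 0.3 asks the model to push the cosine similarity of that (image, text) pair toward 0.3, so the labels need to be on the same scale as cosine similarity.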

When I trained in bi_encoder mode, there was no error. But when I passed an evaluator, no evaluation results were produced. Q2. Why are no evaluation results shown?

Q3. When I then trained a cross_encoder, I got an error. The code and the error are posted below. It seems that cross_encoder mode for the CLIP model is not yet supported in sentence_transformers.

code:
from torch.utils.data import DataLoader
from sentence_transformers import losses
from sentence_transformers.cross_encoder import CrossEncoder

model_type = 'openai/clip-vit-base-patch32'
savepath_location = "clip-ViT-B-32_cross_encoder"
model_cross = CrossEncoder(model_type, num_labels=1)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model_cross)  # not passed to fit() below
# checkpoint_path=checkpoint_path_location removed for now in case it causes issues
model_cross.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100,
    evaluation_steps=-1,
    output_path=savepath_location,
)
model_cross.save(savepath_location)

error:
ValueError: Unrecognized configuration class <class 'transformers.models.clip.configuration_clip.CLIPConfig'> for this kind of AutoModel: AutoModelForSequenceClassification.
Model type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BloomConfig, CamembertConfig, CanineConfig, ConvBertConfig, CTRLConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, ElectraConfig, FlaubertConfig, FNetConfig, FunnelConfig, GPT2Config, GPTNeoConfig, GPTJConfig, IBertConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LongformerConfig, MBartConfig, MegatronBertConfig, MobileBertConfig, MPNetConfig, NystromformerConfig, OpenAIGPTConfig, PerceiverConfig, PLBartConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RoFormerConfig, SqueezeBertConfig, TapasConfig, TransfoXLConfig, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, YosoConfig.

Thanks for your reply. Best,

Nan
