# Model Card: CLIP for Chemistry (CLIPChemistryModel)

## Model Details
- Model Name: `CLIPModel`
- Architecture: CLIP-based multimodal model for fashion images and text
- Dataset: E-commerce Products CLIP Dataset
- Batch Size: 8
- Loss Function: Contrastive Loss (see the training-step sketch after this list)
- Optimizer: Adam (learning rate = 1e-3)
- Transfer Learning: Enabled (frozen backbone layers for both image and text encoders)
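
The card lists "Contrastive Loss" without spelling out the objective. Below is a minimal sketch of a CLIP-style symmetric contrastive training step, assuming a fixed temperature of 0.07 and the Adam optimizer and batch size listed above; `contrastive_step`, the `batch` layout, and the temperature value are illustrative assumptions, not documented choices for this checkpoint.

```python
import torch
import torch.nn.functional as F

def contrastive_step(model, optimizer, batch, temperature=0.07):
    # The model returns (image_embeddings, text_embeddings), each of shape (B, 512)
    ie, te = model(batch["image"], batch["input_ids"], batch["attention_mask"])
    ie = F.normalize(ie, dim=-1)
    te = F.normalize(te, dim=-1)

    logits = ie @ te.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # settings listed above
```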
## Model Architecture
This model is based on the CLIP (Contrastive Language-Image Pretraining) framework, specifically designed to learn joint representations of text and image modalities for chemistry-related applications.
### Components
- Image Encoder (`ImageEncoderHead`)
  - Uses a Vision Transformer (ViT) backbone
  - Extracts features from images
  - Fully connected (FC) layers project the features to a 512-dimensional space
- Text Encoder (`TextEncoderHead`)
  - Uses a Transformer-based text encoder
  - Extracts text features and projects them to a 512-dimensional space
- `CLIPChemistryModel`
  - Combines the image and text encoders
  - Computes embeddings for contrastive learning
## Implementation

### Model Definition
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import PyTorchModelHubMixin


class ImageEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super(ImageEncoderHead, self).__init__()
        self.model = model
        # Freeze the vision backbone; only the projection head is trained.
        for param in self.model.parameters():
            param.requires_grad = False
        self.seq1 = nn.Sequential(
            nn.Linear(768, 1000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(1000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, pixel_values):
        # Pooled ViT features (batch, 768) -> projected embeddings (batch, 512)
        outputs = self.model(pixel_values).pooler_output
        outputs = self.seq1(outputs)
        return outputs.contiguous()


class TextEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super(TextEncoderHead, self).__init__()
        self.model = model
        # Freeze the text backbone; only the projection head is trained.
        for param in self.model.parameters():
            param.requires_grad = False
        self.seq1 = nn.Sequential(
            nn.Flatten(),
            nn.Linear(768 * 128, 2000),  # assumes sequences padded to 128 tokens
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(2000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, input_ids, attention_mask):
        # Token features (batch, 128, 768) are flattened and projected to (batch, 512)
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        outputs = self.seq1(outputs)
        return outputs.contiguous()


class CLIPModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, text_encoder, image_encoder):
        super(CLIPModel, self).__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder

    def forward(self, image, input_ids, attention_mask):
        # Returns a pair of 512-dimensional embeddings for contrastive learning
        ie = self.image_encoder(image)
        te = self.text_encoder(input_ids, attention_mask)
        return ie, te
```
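
The card does not state which backbone checkpoints the encoder heads were trained with. The following usage sketch assumes a `google/vit-base-patch16-224-in21k` image backbone and a `bert-base-uncased` text backbone with inputs padded to 128 tokens (to match the `768 * 128` flatten in `TextEncoderHead`); these checkpoint names are illustrative assumptions, not documented choices.

```python
# Hypothetical backbones chosen only to illustrate the expected input/output shapes.
import torch
from PIL import Image
from transformers import ViTModel, ViTImageProcessor, AutoModel, AutoTokenizer

image_backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
text_backbone = AutoModel.from_pretrained("bert-base-uncased")

image_encoder = ImageEncoderHead(image_backbone)
text_encoder = TextEncoderHead(text_backbone)
model = CLIPModel(text_encoder, image_encoder)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

image = Image.new("RGB", (224, 224))  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
tokens = tokenizer(
    "a red cotton t-shirt",
    padding="max_length", max_length=128, truncation=True, return_tensors="pt",
)

with torch.no_grad():
    image_emb, text_emb = model(pixel_values, tokens["input_ids"], tokens["attention_mask"])
print(image_emb.shape, text_emb.shape)  # torch.Size([1, 512]) torch.Size([1, 512])
```

Any backbone with a 768-dimensional hidden size (and, for the image side, a pooler output) should plug in the same way.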
This model has been pushed to the Hub using the `PyTorchModelHubMixin` integration.
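
A sketch of the mixin workflow with hypothetical repository IDs (the actual Hub repositories for this model are not listed here), assuming a recent `huggingface_hub` version:

```python
# Hypothetical repository IDs; replace with the actual Hub repos for this model.
image_encoder.push_to_hub("your-username/clip-image-encoder")
text_encoder.push_to_hub("your-username/clip-text-encoder")

# Reload the trained heads; in this sketch the (frozen) backbone is supplied again
# at load time, since an nn.Module __init__ argument is not stored in the saved config.
image_encoder = ImageEncoderHead.from_pretrained(
    "your-username/clip-image-encoder", model=image_backbone
)
text_encoder = TextEncoderHead.from_pretrained(
    "your-username/clip-text-encoder", model=text_backbone
)
```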