Model Card: CLIP for Chemistry (CLIPChemistryModel)

Model Details

  • Model Name: CLIPModel
  • Architecture: CLIP-based multimodal model for fashion images and text
  • Dataset: E-commerce Products CLIP Dataset
  • Batch Size: 8
  • Loss Function: Contrastive Loss
  • Optimizer: Adam (learning rate = 1e-3)
  • Transfer Learning: Enabled (backbone layers frozen for both the image and text encoders); a training-step sketch follows this list
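
The loss implementation itself is not included in this card. The sketch below shows one common formulation of the contrastive objective (a symmetric InfoNCE-style loss over the 512-dimensional embeddings; the fixed temperature of 0.07 is an assumption), together with the Adam settings listed above. Names such as contrastive_loss are illustrative, not taken from the original training code.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Both inputs: (batch_size, 512). Matching image/text pairs sit on the diagonal.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Shape check with the batch size (8) and embedding width (512) used in this card.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))

# Optimizer over the trainable (non-frozen) parameters, per the settings above:
# optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)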

Model Architecture

This model is based on the CLIP (Contrastive Language-Image Pre-training) framework and learns joint representations of the text and image modalities. Although it is named for chemistry-related applications, the checkpoint described here was trained on the fashion product images and descriptions of the E-commerce Products CLIP Dataset listed above.

Components

  • Image Encoder (ImageEncoderHead)
    • Uses a Vision Transformer (ViT) backbone
    • Feature extraction from images
    • Fully connected (FC) layers to project to a 512-dimensional space
  • Text Encoder (TextEncoderHead)
    • Uses a Transformer-based text encoder
    • Extracts text features and projects them to 512-dimensional space
  • CLIPChemistryModel (implemented as the CLIPModel class below)
    • Combines the image and text encoders
    • Computes embeddings for contrastive learning

Implementation

Model Definition

import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import PyTorchModelHubMixin

class ImageEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # Freeze the ViT backbone; only the projection head below is trained.
        for param in self.model.parameters():
            param.requires_grad = False
        # Project the 768-dim pooled ViT output into the shared 512-dim embedding space.
        self.seq1 = nn.Sequential(
            nn.Linear(768, 1000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(1000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, pixel_values):
        # Pooled ([CLS]) representation from the vision backbone.
        outputs = self.model(pixel_values).pooler_output
        outputs = self.seq1(outputs)
        return outputs.contiguous()

class TextEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # Freeze the text backbone; only the projection head below is trained.
        for param in self.model.parameters():
            param.requires_grad = False
        # Flatten the full token sequence (128 tokens x 768 dims) and project to 512 dims.
        self.seq1 = nn.Sequential(
            nn.Flatten(),
            nn.Linear(768 * 128, 2000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(2000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, input_ids, attention_mask):
        # Token-level hidden states; inputs must be padded/truncated to 128 tokens.
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        outputs = self.seq1(outputs)
        return outputs.contiguous()

class CLIPModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, text_encoder, image_encoder):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder

    def forward(self, image, input_ids, attention_mask):
        # Return one 512-dim embedding per modality for the contrastive objective.
        ie = self.image_encoder(image)
        te = self.text_encoder(input_ids, attention_mask)
        return ie, te
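
The card does not state which pretrained backbones are plugged into these heads. The sketch below assumes Hugging Face transformers checkpoints with 768-dimensional hidden states (google/vit-base-patch16-224-in21k and bert-base-uncased are placeholders) and text padded to 128 tokens so that the Flatten(768 * 128) projection matches.

import torch
from transformers import ViTModel, AutoModel, AutoTokenizer

# Placeholder checkpoints: any 768-dim ViT / text-encoder pair fits the heads above.
vit_backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
text_backbone = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

image_encoder = ImageEncoderHead(vit_backbone)
text_encoder = TextEncoderHead(text_backbone)
model = CLIPModel(text_encoder=text_encoder, image_encoder=image_encoder)

# Text must be padded/truncated to exactly 128 tokens to match the text head.
tokens = tokenizer(["a red cotton summer dress"], padding="max_length",
                   truncation=True, max_length=128, return_tensors="pt")
pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
image_emb, text_emb = model(pixel_values, tokens["input_ids"], tokens["attention_mask"])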

This model has been pushed to the Hub using the PyTorchModelHubMixin integration.
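
Because the encoder heads and the wrapper all inherit from PyTorchModelHubMixin, they expose save_pretrained, push_to_hub, and from_pretrained. The snippet below is a minimal sketch of that workflow; the repository id and local path are placeholders, and since the encoder modules are not JSON-serializable they must be passed again when reloading.

model.save_pretrained("clip-chemistry-model")          # local export (placeholder path)
model.push_to_hub("your-username/CLIPChemistryModel")  # placeholder repository id

# Reloading restores the trained weights; non-serializable constructor
# arguments (the encoder heads built above) are passed in again.
reloaded = CLIPModel.from_pretrained(
    "your-username/CLIPChemistryModel",
    text_encoder=text_encoder,
    image_encoder=image_encoder,
)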

Model Size

  • Format: Safetensors
  • Parameters: 352M
  • Tensor type: F32