MyVLM

Paper: https://arxiv.org/abs/2403.14599

Project Page: https://snap-research.github.io/MyVLM/

Code: https://github.com/snap-research/MyVLM

MyVLM Concept Heads & Concept Embeddings

As part of our MyVLM code release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper.

They can be loaded with the CLIPConceptHead class and used with our inference scripts to reproduce the paper results.

This repository contains five concept heads for each object, each trained with a different seed and a different set of training images.

Concept Heads

For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.

As mentioned in the paper, we have two types of concept heads:

A facial recognition model for recognizing individuals
A CLIP-based concept head for recognizing user-specific objects

For faces, we use the buffalo_l face detection and face recognition model from insightface. See concept_heads/face_recognition/head.py for usage.

For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (DFN5B-CLIP-ViT-H-14-384).
See concept_heads/clip/head.py for usage.
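
To illustrate the idea of a single linear layer over image features, here is a minimal sketch. It is not the CLIPConceptHead class from the repository. The class name LinearConceptHead, the feature dimension of 1024, the single sigmoid output, and the 0.5 threshold are all assumptions made for this example, and random vectors stand in for real CLIP features.

```python
import torch
import torch.nn as nn


class LinearConceptHead(nn.Module):
    """Hypothetical stand-in for a concept head: one linear layer over image features."""

    def __init__(self, feature_dim: int = 1024):
        super().__init__()
        # Assumption: a single output logit that scores "concept present".
        self.linear = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim) -> probabilities: (batch,)
        return torch.sigmoid(self.linear(features)).squeeze(-1)


head = LinearConceptHead(feature_dim=1024)
head.eval()

# Random vectors stand in for CLIP image features in this sketch.
features = torch.randn(4, 1024)
with torch.no_grad():
    probs = head(features)

# Assumption: 0.5 is an illustrative decision threshold, not a value from the paper.
is_present = probs > 0.5
```

For the actual loading and inference logic, refer to concept_heads/clip/head.py.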

Concept Embeddings

Once a user-specific concept has been identified in an image, a learned concept embedding representing the object or individual guides the LLM to incorporate the concept into its personalized textual response.

The concept embeddings are saved as .pt files in the following format:

```
{
  10: {
    "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
    "values": torch.Tensor(),  # the concept embedding itself
  },
  ...
  20: {
    "keys": torch.Tensor(),    
    "values": torch.Tensor(),  
  },
  ...
}
```

where each entry in the dictionary represents a different checkpoint during the optimization process.
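
As a rough sketch of how one might pick a checkpoint from a dictionary in this format, the helper below returns the entry for a requested key, or the entry with the largest key when none is given. The function name select_checkpoint is hypothetical, and treating the largest integer key as the latest checkpoint is an assumption. Placeholder lists are used instead of tensors so the example runs without any model files.

```python
from typing import Any, Dict, Optional


def select_checkpoint(
    embeddings: Dict[int, Dict[str, Any]],
    step: Optional[int] = None,
) -> Dict[str, Any]:
    """Return the entry stored under `step`, or under the largest key if `step` is None."""
    if not embeddings:
        raise ValueError("no checkpoints found")
    if step is None:
        # Assumption: larger keys correspond to later points in the optimization.
        step = max(embeddings)
    if step not in embeddings:
        raise KeyError(f"checkpoint {step} not found; available: {sorted(embeddings)}")
    entry = embeddings[step]
    missing = {"keys", "values"} - set(entry)
    if missing:
        raise KeyError(f"checkpoint {step} is missing fields: {sorted(missing)}")
    return entry


# Placeholder lists stand in for the tensors stored in a real file.
example = {
    10: {"keys": [0.1], "values": [1.0]},
    20: {"keys": [0.2], "values": [2.0]},
}
latest = select_checkpoint(example)
early = select_checkpoint(example, step=10)
```

With a real file, the dictionary would typically come from torch.load(path, map_location="cpu"), where path points to one of the released .pt files.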

We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.

License

This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license here.