MyVLM
Paper: https://arxiv.org/abs/2403.14599
Project Page: https://snap-research.github.io/MyVLM/
Code: https://github.com/snap-research/MyVLM
MyVLM Concept Heads & Concept Embeddings
As part of our MyVLM code release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper.
These can be loaded using the `CLIPConceptHead` class and our inference scripts for reproducing the paper results.
This repository contains five concept heads for each object, each trained with a different seed and a different set of training images.
Concept Heads
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
As mentioned in the paper, we have two types of concept heads:
- A facial recognition model for recognizing individuals
- A CLIP-based concept head for recognizing user-specific objects
For faces, we use the `buffalo_l` face detection and face recognition model from insightface.
See `concept_heads/face_recognition/head.py` for usage.
For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).
See `concept_heads/clip/head.py` for usage.
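To make the idea of a linear concept head concrete, here is a minimal sketch of a single linear layer applied to image features. It is not the repository's `CLIPConceptHead`: the class name `LinearConceptHead`, the feature dimension, the L2 normalization, the single sigmoid output, and the 0.5 threshold are all assumptions made for illustration, and random tensors stand in for real CLIP features.

```python
import torch
import torch.nn as nn

# Assumed embedding size; check the output dimension of the CLIP model you use.
FEATURE_DIM = 1024


class LinearConceptHead(nn.Module):
    """Hypothetical linear probe; NOT the repository's CLIPConceptHead."""

    def __init__(self, feature_dim: int = FEATURE_DIM):
        super().__init__()
        # A single linear layer mapping image features to one logit.
        self.linear = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # L2-normalize the features (an assumption, common for CLIP embeddings).
        features = features / features.norm(dim=-1, keepdim=True)
        # Return one probability per image that the concept is present.
        return torch.sigmoid(self.linear(features)).squeeze(-1)


torch.manual_seed(0)
head = LinearConceptHead()

# Placeholder for features extracted by a CLIP image encoder.
fake_features = torch.randn(4, FEATURE_DIM)

with torch.no_grad():
    probs = head(fake_features)

# Illustrative decision threshold; a real head would be tuned on held-out images.
is_present = probs > 0.5
```

In practice the weights would be trained on a handful of positive and negative images of the concept; see `concept_heads/clip/head.py` for the actual implementation.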
Concept Embeddings
Having identified the presence of a user-specific concept within an image, a learned concept embedding representing an object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
The concept embeddings are saved as `.pt` files in the following format:
```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```
where each entry in the dictionary represents a different checkpoint during the optimization process.
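As a rough illustration of working with this layout, the sketch below builds a dummy dictionary of the same shape, round-trips it through `torch.save` and `torch.load`, and reads the embedding stored under the largest checkpoint key. The tensor shapes are placeholders, the in-memory buffer stands in for a real `.pt` file, and treating the largest key as the latest checkpoint is an assumption.

```python
import io

import torch

# Dummy stand-in with the same layout as the released .pt files.
# Tensor shapes are placeholders, not the real ones.
dummy = {
    10: {"keys": torch.randn(1, 8), "values": torch.randn(1, 8)},
    20: {"keys": torch.randn(1, 8), "values": torch.randn(1, 8)},
}

# Save to an in-memory buffer; with a real file, pass its path to torch.load instead.
buffer = io.BytesIO()
torch.save(dummy, buffer)
buffer.seek(0)

# Load the dictionary of checkpoints onto the CPU.
checkpoints = torch.load(buffer, map_location="cpu")

# Assumption: larger keys correspond to later points in the optimization.
latest_step = max(checkpoints)
concept_keys = checkpoints[latest_step]["keys"]
concept_embedding = checkpoints[latest_step]["values"]
```

Choosing an earlier key instead selects an embedding from an earlier point in the optimization.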
We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.
License
This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license here.