FaceCLIP is a CLIP model post-trained on 80M human face images.

Trained with the TencentPretrain framework on 8 × A100 GPUs:

python3 pretrain.py --dataset_path faceclip.pt \
    --pretrained_model_path models/clip-b32.bin \
    --output_model_path models/faceclip-b32.bin \
    --config_path models/clip/base-32_config.json \
    --vocab_path vocab.json --merges_path merges.txt --tokenizer clip \
    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 --data_processor clip \
    --accumulation_steps 8 --learning_rate 2e-5 --batch_size 160 \
    --total_steps 200000 --save_checkpoint_steps 20000 --report_steps 500
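
For scale: assuming --batch_size is the per-GPU batch size (TencentPretrain's usual convention; not stated explicitly here), each optimizer update covers 160 × 8 GPUs × 8 accumulation steps = 10,240 image-text pairs.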

How to use:

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("P01son/FaceCLIP-base-32")
processor = CLIPProcessor.from_pretrained("P01son/FaceCLIP-base-32")

# Example image from the standard CLIP docs (a COCO photo of cats);
# replace with a face image for FaceCLIP's intended domain.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
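
Since this is a face-domain CLIP, a typical downstream use is extracting embeddings for image-text or image-image similarity. The sketch below continues from the example above and uses only the standard transformers CLIP API (get_image_features / get_text_features); the face image path is a placeholder, not a file shipped with this model.

import torch

# Placeholder path; substitute your own face image.
face = Image.open("face.jpg")

with torch.no_grad():
    # Image embedding from the vision tower.
    image_inputs = processor(images=face, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

    # Text embedding from the text tower.
    text_inputs = processor(text=["a photo of a smiling person"], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# L2-normalize and take the cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).item()
print(similarity)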