neko-pickscore

neko-pickscore is a fine-tuned version of PickScore_v1 (based on CLIP ViT-H/14). It was trained on a curated dataset of 11,000 pairwise image preferences to align with a specific, personal aesthetic taste.

Unlike general-purpose aesthetic scorers, this model's Vision Encoder was fully fine-tuned with a Bradley-Terry pairwise margin loss to capture specific compositional and stylistic preferences, while the Text Encoder stayed frozen to preserve zero-shot language understanding.

🚀 Usage

1. With `transformers` (Native PyTorch)

You can use this model directly with the Hugging Face transformers library to score images against text prompts.

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image

model_path = "2dameneko/neko-pickscore"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float32).to(device).eval()
processor = AutoProcessor.from_pretrained(model_path)

# Load an image and define a prompt
image = Image.open("your_image.jpg").convert("RGB")
prompt = "a beautiful landscape, highly detailed"

# Process inputs
inputs = processor(
    text=[prompt], images=[image], return_tensors="pt",
    padding=True, truncation=True, max_length=77,  # CLIP's text encoder accepts at most 77 tokens
).to(device)

# Get features and calculate score
with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs.pixel_values)
    txt_feat = model.get_text_features(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
    
    # Normalize
    img_feat = img_feat / img_feat.norm(p=2, dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(p=2, dim=-1, keepdim=True)
    
    # Calculate PickScore
    logit_scale = model.logit_scale.exp()
    score = (logit_scale * (img_feat * txt_feat).sum(dim=-1)).item()

print(f"neko-pickscore: {score:.4f}")
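
When several candidate images are scored against the same prompt, the raw scores can be turned into relative preference probabilities with a softmax, which is how the base PickScore_v1 model card compares images. The following is a minimal sketch, assuming `scores` holds one float per image as produced by the snippet above; the helper name `preference_probs` and the example values are hypothetical.

```python
import math

def preference_probs(scores):
    """Softmax over per-image scores for a single prompt."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate images of the same prompt.
scores = [21.3, 19.8, 20.5]
probs = preference_probs(scores)

# Index of the highest-scoring image.
best = max(range(len(scores)), key=scores.__getitem__)
print(f"best image index: {best}, probabilities: {[round(p, 3) for p in probs]}")
```

The probabilities are only meaningful relative to the other images in the same list; a single score on its own has no absolute scale.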

2. With `nitpick-chan` Image Scorer

This model is fully compatible with the custom batch-scoring and hierarchical clustering tool nitpick-chan.

python nitpick-chan.py /path/to/your/images \
  --models neko_pickscore \
  --mode both \
  --tiers 5

🧠 Training Details

Base Model: yuvalkirstain/PickScore_v1 (CLIP ViT-H/14)
Dataset: 11,000 custom preference pairs (hierarchical, tier-based comparisons).
Loss Function: Bradley-Terry Pairwise Margin Loss (-log(sigmoid(score_a - score_b))).
Trainable Parameters: Full Vision Encoder (~632M params). The Text Encoder (~354M params) was kept frozen to prevent catastrophic forgetting.
Optimizer: 8-bit AdamW (via bitsandbytes).
Optimizations:
- Scaled Dot-Product Attention (SDPA)
- Gradient Checkpointing (Vision Encoder only)
- OpenCV-based fast image decoding
- GPU-offloaded normalization
Learning Rate: 1e-5 (Cosine schedule with warmup) to gently shift the decision boundary without destroying base CLIP knowledge.
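
The loss formula listed above can be sketched in plain Python for a single pair. This illustrates the stated formula only and is not the actual training code: the function name `bradley_terry_loss` and the example scores are hypothetical, and image A is assumed to be the preferred one. No explicit margin term appears because the formula as given has none.

```python
import math

def bradley_terry_loss(score_a, score_b):
    """-log(sigmoid(score_a - score_b)), where image A is the preferred one."""
    diff = score_a - score_b
    # -log(sigmoid(d)) equals softplus(-d); this form avoids overflow for large |d|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Hypothetical scores: the loss is small when the preferred image scores higher,
# and grows roughly linearly when the ranking is inverted.
correct_order = bradley_terry_loss(22.0, 20.0)
wrong_order = bradley_terry_loss(20.0, 22.0)
print(f"correct order: {correct_order:.4f}, wrong order: {wrong_order:.4f}")
```

In a training loop the same quantity would typically be computed on batched tensors, for example with `torch.nn.functional.logsigmoid`, and averaged over the batch.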

🔗 Resources

GitHub Repository (Training & Scoring Scripts): https://github.com/2dameneko/nitpick-chan
Base Model: yuvalkirstain/PickScore_v1


Model size: 1.0B params (Safetensors, tensor type BF16)
