neko-pickscore

neko-pickscore is a custom fine-tuned version of PickScore_v1 (itself based on CLIP ViT-H/14). It was trained on a curated dataset of 11,000 pairwise preference comparisons to align with a specific, personalized aesthetic taste.

Unlike general-purpose aesthetic scorers, this model's vision encoder was fully fine-tuned with a Bradley-Terry pairwise margin loss to capture specific compositional and stylistic preferences. The text encoder stayed frozen to preserve the base model's zero-shot language understanding.

🚀 Usage

1. With transformers (Native PyTorch)

You can use this model directly with the Hugging Face transformers library to score images against text prompts.

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image

model_path = "2dameneko/neko-pickscore"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float32).to(device).eval()
processor = AutoProcessor.from_pretrained(model_path)

# Load an image and define a prompt
image = Image.open("your_image.jpg").convert("RGB")
prompt = "a beautiful landscape, highly detailed"

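# Process inputs (truncate to CLIP's 77-token text context so long prompts do not raise an error)
inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True, truncation=True, max_length=77).to(device)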

# Get features and calculate score
with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs.pixel_values)
    txt_feat = model.get_text_features(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
    
    # Normalize
    img_feat = img_feat / img_feat.norm(p=2, dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(p=2, dim=-1, keepdim=True)
    
    # Calculate PickScore
    logit_scale = model.logit_scale.exp()
    score = (logit_scale * (img_feat * txt_feat).sum(dim=-1)).item()

print(f"neko-pickscore: {score:.4f}")
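
When several candidate images are scored against the same prompt, the raw scores can be turned into relative preference probabilities with a softmax, as the base PickScore_v1 usage example does. The following is a minimal sketch in plain Python, assuming the per-image scores were already computed with the snippet above. The function name preference_probs and the example score values are illustrative assumptions, not part of this model's API or real outputs.

```python
import math

def preference_probs(scores):
    """Softmax over per-image scores for a single prompt.

    Returns one probability per image; higher means more preferred
    relative to the other candidates in the list.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for two candidate images (not real model outputs)
scores = [21.5, 19.5]
probs = preference_probs(scores)
print([round(p, 4) for p in probs])
```

Because the softmax depends only on score differences, these probabilities are meaningful only among images scored against the same prompt.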

2. With nitpick-chan Image Scorer

This model is fully compatible with the custom batch-scoring and hierarchical clustering tool nitpick-chan.

python nitpick-chan.py /path/to/your/images \
  --models neko_pickscore \
  --mode both \
  --tiers 5

🧠 Training Details

  • Base Model: yuvalkirstain/PickScore_v1 (CLIP ViT-H/14)
  • Dataset: 11,000 custom preference pairs (hierarchical tier-based comparisons).
  • Loss Function: Bradley-Terry Pairwise Margin Loss, -log(sigmoid(score_a - score_b)), where score_a is the score of the preferred image in each pair.
  • Trainable Parameters: Full Vision Encoder (632M params). Text Encoder (354M params) was kept frozen to prevent catastrophic forgetting.
  • Optimizer: 8-bit AdamW (via bitsandbytes).
  • Optimizations:
    • Scaled Dot-Product Attention (SDPA)
    • Gradient Checkpointing (Vision Encoder only)
    • OpenCV-based fast image decoding
    • GPU-offloaded normalization
  • Learning Rate: 1e-5 (Cosine schedule with warmup) to gently shift the decision boundary without destroying base CLIP knowledge.
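
The loss and learning-rate schedule listed above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not the actual training code. The function names, the warmup length, and the total step count are hypothetical, since the card does not state them. The schedule assumes a linear warmup followed by cosine decay to zero, which is one common reading of "cosine schedule with warmup".

```python
import math

def bradley_terry_loss(score_a, score_b):
    """-log(sigmoid(score_a - score_b)), with score_a the preferred image.

    Written as softplus(-(score_a - score_b)) in a numerically stable form.
    """
    diff = score_a - score_b
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def cosine_lr(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Loss is small when the preferred image already scores higher,
# and grows roughly linearly when the ranking is inverted.
print(bradley_terry_loss(5.0, 0.0), bradley_terry_loss(0.0, 5.0))

# Hypothetical run: 1000 steps with 100 warmup steps
print([cosine_lr(s, 1000, 100) for s in (0, 50, 100, 550, 1000)])
```

The loss depends only on the score difference, so it pushes the preferred image's score above the other's without constraining the absolute scale of the scores.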

🔗 Resources
