Smash or Transformer -- ViT-Small checkpoints

Vision Transformers that predict a Pokemon's crowd "smash" fraction (0-100) from a single image -- i.e. how attractive the internet finds it. Trained on official artwork, in-game sprites, and Safebooru fan-art, with labels from aggregate votes on pokesmash.xyz.

Code, docs, and full reproduction guides: https://github.com/byrte1024/SmashOrTransformer

Checkpoints

Each checkpoint is a timm ViT-Small/16 at 224x224 input resolution with a scalar regression head, fine-tuned with soft-label BCE. Each .pt file holds model_state, config, and metrics.
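As a rough illustration of what soft-label BCE means here, the sketch below computes binary cross-entropy against a fractional target rather than a hard 0/1 label. It assumes the head emits a single logit and that the 0-100 smash value is divided by 100 to form the target; neither detail is confirmed on this page, and the function names are made up for the example. See the repository for the actual training loss.

```python
import math


def soft_label_bce(logit: float, smash_percent: float) -> float:
    """BCE between sigmoid(logit) and a soft target in [0, 1].

    Assumption: the target is smash_percent / 100 (hypothetical scaling).
    """
    target = smash_percent / 100.0
    # Numerically stable form of
    #   -t * log(sigmoid(z)) - (1 - t) * log(1 - sigmoid(z))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def predict_percent(logit: float) -> float:
    """Map a logit back to the 0-100 scale with a stable sigmoid."""
    if logit >= 0.0:
        return 100.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return 100.0 * e / (1.0 + e)


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    for z in (-2.0, 0.0, 2.0):
        print(z, round(predict_percent(z), 2), round(soft_label_bce(z, 70.0), 4))
```

With a soft target the loss is smallest when sigmoid(logit) equals the target fraction, so the model is pushed toward the crowd's vote share rather than toward 0 or 1.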

File                        Sources                       Spearman (all_avg)   Notes
vit_small_mixed_v1.pt       portrait + in-game + booru    0.770                recommended
vit_small_portraits_v1.pt   portrait + in-game            0.690                sprite-only baseline
vit_small_mixed_v2.pt       + heavy booru aug             0.734                deprecated (regression)

Spearman values come from a cross-evaluation in which every checkpoint is scored on the same held-out set of 102 Pokemon, so the numbers are directly comparable. The *.calibration.json files hold the isotonic calibration maps (mixed_v2 has none).
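For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Spearman rank correlation: rank both lists (ties get the average rank), then take the Pearson correlation of the ranks. The helper names are hypothetical, and how the repository aggregates per-source scores into all_avg is not covered here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(predicted, truth):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(truth))


if __name__ == "__main__":
    # Made-up scores for four images, purely for illustration.
    print(spearman([12.0, 55.0, 40.0, 90.0], [10.0, 35.0, 60.0, 80.0]))
```

Because only ranks matter, the metric rewards ordering Pokemon correctly and is unaffected by any monotonic rescaling of the predictions, which is why it can be reported independently of the calibration maps.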

Usage

git clone https://github.com/byrte1024/SmashOrTransformer && cd SmashOrTransformer
uv sync
uv run python download_models.py            # fetches vit_small_mixed_v1 into runs/
uv run python -m model.infer --checkpoint runs/vit_small_mixed_v1/checkpoints/best.pt img.png

Dataset: supernovayuli/smash-or-transformer-data

License

other -- the training data includes third-party fan-art and official assets that are not ours to relicense. Weights are provided for research use.
