Smash or Transformer -- ViT-Small checkpoints

Vision Transformers that predict a Pokemon's crowd "smash" fraction, expressed as a percentage (0-100), from a single image -- i.e. how attractive the internet finds it. The models are trained on official artwork, in-game sprites, and Safebooru fan-art, with labels taken from aggregate votes on pokesmash.xyz.

Code, docs, and full reproduction guides: https://github.com/byrte1024/SmashOrTransformer

Checkpoints

Each checkpoint is a timm ViT-Small/16 at 224x224 input resolution with a scalar regression head, fine-tuned with soft-label BCE. Each .pt file holds model_state, config, and metrics.
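As a rough illustration of what "soft-label BCE" means for a scalar head, the sketch below computes binary cross-entropy between a single logit and a target in [0, 1]. It assumes the 0-100 smash fraction is divided by 100 to form the target and that a sigmoid maps the logit back to a percentage; both are assumptions, and the function names are hypothetical rather than taken from the repository.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_label_bce(logit: float, target: float) -> float:
    """Binary cross-entropy between sigmoid(logit) and a soft target in [0, 1].

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)), which equals
    -(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(z).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def to_percent(logit: float) -> float:
    """Map a raw logit to a 0-100 score (assumed output convention)."""
    return 100.0 * sigmoid(logit)


# Hypothetical example: a Pokemon with a 77% smash fraction.
target = 77.0 / 100.0                       # assumed label scaling
best_logit = math.log(target / (1.0 - target))
print(soft_label_bce(best_logit, target))   # loss at the optimum
print(soft_label_bce(0.0, target))          # loss when predicting 50%
print(to_percent(best_logit))
```

Unlike hard-label BCE, the loss at the optimum is not zero: it bottoms out at the entropy of the target, which is why the prediction settles on the vote fraction itself rather than on 0 or 1.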

File	Sources	Spearman (all_avg)	Notes
`vit_small_mixed_v1.pt`	portrait + in-game + booru	0.770	recommended
`vit_small_portraits_v1.pt`	portrait + in-game	0.690	sprite-only baseline
`vit_small_mixed_v2.pt`	+ heavy booru aug	0.734	deprecated (regression)

Spearman is the rank correlation from a fair cross-evaluation: every checkpoint is scored on the same held-out set of 102 Pokemon. The *.calibration.json files are the isotonic calibration maps (mixed_v2 has none).
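For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Spearman rank correlation: the Pearson correlation of the two rank vectors, with tied values given their average rank. It is illustrative only; the repository's evaluation code may use a library routine instead, and the example numbers are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(pred, target):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(target))


# Made-up predictions and labels for five held-out Pokemon.
predicted = [12.0, 35.5, 48.0, 80.2, 61.0]
labels = [10.0, 40.0, 45.0, 90.0, 55.0]
print(spearman(predicted, labels))
```

Because only the ordering matters, a model can score well on this metric while still being off in absolute terms, which is the gap a calibration map is meant to close.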

Usage

git clone https://github.com/byrte1024/SmashOrTransformer && cd SmashOrTransformer
uv sync
uv run python download_models.py            # fetches vit_small_mixed_v1 into runs/
uv run python -m model.infer --checkpoint runs/vit_small_mixed_v1/checkpoints/best.pt img.png
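The isotonic calibration maps mentioned under Checkpoints are monotone lookup functions from raw score to calibrated score. The sketch below shows how such a map is typically applied, as piecewise-linear interpolation between knots with clamping at both ends. The knot values are invented for illustration, the JSON schema of the *.calibration.json files is not described here, and whether model.infer already applies calibration is not stated, so treat this as a hypothetical post-processing step and check the repository before relying on it.

```python
from bisect import bisect_right


def apply_calibration(raw, xs, ys):
    """Interpolate a raw score through a monotone piecewise-linear map.

    xs: strictly increasing raw-score knots (assumed layout).
    ys: non-decreasing calibrated values, one per knot.
    Scores outside the knot range are clamped to the end values.
    """
    if raw <= xs[0]:
        return ys[0]
    if raw >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, raw)   # first knot strictly greater than raw
    lo = hi - 1
    t = (raw - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])


# Invented knots, NOT taken from any shipped calibration file.
xs = [0.0, 40.0, 70.0, 100.0]
ys = [5.0, 30.0, 80.0, 95.0]

for raw in (-3.0, 20.0, 55.0, 120.0):
    print(raw, apply_calibration(raw, xs, ys))
```

Since the map is monotone, applying it changes the absolute scores but never the ranking, so Spearman correlation is unaffected.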

Dataset: supernovayuli/smash-or-transformer-data

License

other -- the training data includes third-party fan-art and official assets that are not ours to relicense. Weights are provided for research use.
