V-JEPA 2 (ViT-L, SSv2 action recognition) - Apple Core AI

V-JEPA 2 (Meta AI) running natively on the Apple Core AI engine. It is the zoo's first world model: a self-supervised video encoder that learns by predicting in representation space (JEPA), here with the Something-Something v2 action head (174 classes of physical interactions: put/lift/push/roll/cover/pretend…).

  • One bundle: ViT-L backbone (3D RoPE attention) + attentive pooler + classifier, ~375M params, fp16 ~675 MB.
  • I/O: pixel_values_videos [1,16,3,256,256] (16 frames, RGB scaled to 0..1, then normalized with ImageNet mean/std) → logits [1,174] (class names in labels.json).
  • Verified: engine vs PyTorch reference cosine 0.999996, top-5 identical; a synthetic motion probe (square moving up vs down) flips the predicted direction correctly.
  • Speed: ~150–180 ms per 16-frame clip on an M4 Max (GPU), fast enough for real-time video understanding.
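
If you consume the [1,174] logits directly rather than through the kit, a softmax followed by a top-k sort turns them into ranked probabilities. The following is a minimal sketch of that standard step, not the kit's implementation; the names softmax and topK are hypothetical helpers, and mapping an index to a class name via labels.json is left out.

```swift
import Foundation

/// Numerically stable softmax over a flat logits vector.
/// Subtracting the maximum keeps exp() from overflowing.
func softmax(_ logits: [Float]) -> [Float] {
    guard let maxLogit = logits.max() else { return [] }
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

/// Indices and probabilities of the k most likely classes, best first.
/// (Hypothetical helper; the index would be looked up in labels.json.)
func topK(_ logits: [Float], k: Int) -> [(index: Int, probability: Float)] {
    let probs = softmax(logits)
    let ranked = probs.enumerated().sorted { $0.element > $1.element }
    return ranked.prefix(k).map { (index: $0.offset, probability: $0.element) }
}
```

For example, calling topK(logits, k: 3) on the 174 values of one clip yields the three entries a top-3 display would show.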

Use it

▶️ Run it (source) with the ActionCamera runner (live camera action recognition; one app for every video model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ActionCamera/ActionCamera.xcodeproj
# → Run, then pick "V-JEPA 2 ViT-L (SSv2)" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ActionCamera
swift run action-cli --model vjepa2-vitl-ssv2 --video sample.mp4

💻 Build with it. The snippet is complete: the glue is kit API, so it runs as copy-pasted:

import Foundation
import CoreAIKitVision

// Any local video file; the path here is only a placeholder
let videoURL = URL(fileURLWithPath: "sample.mp4")

let recognizer = try await ActionRecognizer(catalog: "vjepa2-vitl-ssv2")
let actions = try await recognizer.classify(videoAt: videoURL, topK: 3)
// actions: ranked [Prediction], each with .label (e.g. "Pushing [something] from left to right")
// and .probability; 174 SSv2 classes, fully on-device

The take-home is Examples/ActionCamera/Sources/QuickStart.swift: this exact code as one typed function, with no UI. The CLI is an argument shell over it, and the GUI classifies a rolling 16-frame clip from CameraFeed. For a live camera, keep the last 16 CameraFeed frames and call classify(frames:); other frame counts are uniformly resampled to 16. The bundled sample.mp4 is a synthetic clip (a hand pushing a block); point --video at real footage for real results.
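
The "keep the last 16 frames" and "uniformly resampled to 16" steps can be sketched independently of the kit. This is an illustration under assumptions: RollingClip and uniformIndices are hypothetical names, not kit API, and the resampling rule shown (evenly spaced indices with both endpoints included, rounded to the nearest frame) is one common choice, since the kit's exact rule is not stated here.

```swift
/// Fixed-capacity buffer that keeps only the most recent `capacity` frames.
/// (Hypothetical helper; `Frame` stands in for whatever the camera delivers.)
struct RollingClip<Frame> {
    let capacity: Int
    private(set) var frames: [Frame] = []

    init(capacity: Int = 16) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Appends a frame and drops the oldest ones beyond `capacity`.
    mutating func append(_ frame: Frame) {
        frames.append(frame)
        if frames.count > capacity {
            frames.removeFirst(frames.count - capacity)
        }
    }

    var isFull: Bool { frames.count == capacity }
}

/// Evenly spaced source indices (first and last frame included) that map a
/// clip of `sourceCount` frames onto `targetCount` frames.
/// Assumption: nearest-index rounding; shorter clips repeat frames.
func uniformIndices(sourceCount: Int, targetCount: Int = 16) -> [Int] {
    precondition(sourceCount > 0 && targetCount > 0)
    guard targetCount > 1 else { return [0] }
    let step = Double(sourceCount - 1) / Double(targetCount - 1)
    return (0..<targetCount).map { Int((Double($0) * step).rounded()) }
}
```

Once isFull is true, the buffer's frames are what you would hand to classify(frames:). A 31-frame clip maps to every second frame, and a 16-frame clip maps to itself.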

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKitVision
  • Info.plist: NSCameraUsageDescription (only for the live camera; the snippet needs none)
  • Entitlements: none needed
  • First run downloads the model (0.7 GB on Mac / 1.4 GB on iPhone); later runs load it from the local cache (Application Support; progress via the downloadProgress callback)
  • Measure in Release: Debug is ~3× slower on per-token host work

Files

Path                                 What
macos/vjepa2_ssv2_fp16.aimodel       fp16 bundle (macOS / JIT)
ios/vjepa2_ssv2_fp16.h18p.aimodelc   iOS AOT bundle (iPhone, A18 Pro+ GPU)
macos/labels.json, ios/labels.json   174 SSv2 class names
macos/metadata.json                  I/O + preprocessing spec

Live demo app: coreai-video (camera → live top-3 actions). iPhone 17 Pro: ~0.34 s per 16-frame clip.

Preprocessing

Sample 16 frames uniformly from the clip, resize and center-crop each to 256×256, scale to 0..1, normalize with ImageNet mean [0.485,0.456,0.406] / std [0.229,0.224,0.225], and lay the result out as [1,16,3,256,256].
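
The per-value arithmetic and the tensor layout can be written down as a short sketch. It assumes 8-bit RGB input and a row-major (C-order) buffer in the stated [1,16,3,256,256] order; normalize and offset are hypothetical helper names, and resizing and cropping are left to your image pipeline.

```swift
// ImageNet statistics from the preprocessing spec, in R, G, B order.
let imageNetMean: [Float] = [0.485, 0.456, 0.406]
let imageNetStd: [Float] = [0.229, 0.224, 0.225]

/// Normalizes one 8-bit channel value: scale to 0..1, then (x - mean) / std.
func normalize(_ value: UInt8, channel: Int) -> Float {
    let scaled = Float(value) / 255
    return (scaled - imageNetMean[channel]) / imageNetStd[channel]
}

/// Flat offset of element (frame, channel, y, x) in a row-major
/// [1, 16, 3, 256, 256] buffer. Assumption: the batch dimension is 1,
/// so it contributes nothing to the offset.
func offset(frame: Int, channel: Int, y: Int, x: Int, size: Int = 256) -> Int {
    ((frame * 3 + channel) * size + y) * size + x
}
```

A full clip holds 16 × 3 × 256 × 256 = 3,145,728 values, so the last valid offset is 3,145,727.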

Credits
