V-JEPA 2 (ViT-L, SSv2 action recognition) - Apple Core AI
V-JEPA 2 (Meta AI) running natively on the Apple Core AI engine. It is the zoo's first world model: a self-supervised video encoder that learns by predicting in representation space (JEPA), shipped here with the Something-Something v2 action head (174 classes of physical interactions: put/lift/push/roll/cover/pretend…).
- One bundle: ViT-L backbone (3D RoPE attention) + attentive pooler + classifier, ~375M params, fp16 ~675 MB.
- I/O: pixel_values_videos [1,16,3,256,256] (16 frames, RGB 0..1, ImageNet mean/std) → logits [1,174] (labels.json).
- Verified: engine vs PyTorch reference cosine 0.999996, top-5 identical; a synthetic motion probe (square moving up vs down) flips the predicted direction correctly.
- Speed: ~150–180 ms per 16-frame clip on an M4 Max (GPU), fast enough for real-time video understanding.
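The model's output is a vector of 174 raw logits, so a caller who works with the bundle directly has to turn it into ranked probabilities. A minimal sketch of that step, assuming a plain softmax followed by a sort; `topK` is a hypothetical helper written for illustration, not part of CoreAIKitVision, and the four toy logits stand in for the real 174 values:

```swift
import Foundation

/// Hypothetical helper (not a CoreAIKitVision API): converts raw logits into
/// the k most probable (class index, probability) pairs.
func topK(logits: [Float], k: Int) -> [(index: Int, probability: Float)] {
    guard let maxLogit = logits.max() else { return [] }
    // Subtract the max before exponentiating for numerical stability.
    let exps = logits.map { expf($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    let ranked = exps.enumerated()
        .map { (index: $0.offset, probability: $0.element / sum) }
        .sorted { $0.probability > $1.probability }
    return Array(ranked.prefix(max(0, k)))
}

// Toy 4-class logits stand in for the model's 174 values.
let top = topK(logits: [0.1, 2.0, -1.0, 1.0], k: 2)
for prediction in top {
    print("class \(prediction.index): \(prediction.probability)")
}
```

The class indices would then be mapped to names through labels.json; the exact layout of that file is not described here, so the lookup is left out.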
Use it
▶️ Run it (source) with the ActionCamera runner (live camera action recognition, one app for every video model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ActionCamera/ActionCamera.xcodeproj
# Run, then pick "V-JEPA 2 ViT-L (SSv2)" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ActionCamera
swift run action-cli --model vjepa2-vitl-ssv2 --video sample.mp4
💻 Build with it: the snippet is complete; the glue is kit API, so it runs as copy-pasted:
import CoreAIKitVision
let recognizer = try await ActionRecognizer(catalog: "vjepa2-vitl-ssv2")
let actions = try await recognizer.classify(videoAt: videoURL, topK: 3)
// actions: ranked [Prediction], each with .label ("Pushing [something] from left to right")
// and .probability; 174 SSv2 classes, fully on-device
The take-home is Examples/ActionCamera/Sources/QuickStart.swift: this exact code as one typed function, with no UI. The CLI is a thin argument shell over it, and
the GUI classifies a rolling 16-frame clip from CameraFeed.
Live camera? Keep the last 16 CameraFeed frames and call classify(frames:); other
frame counts are uniformly resampled to 16. The bundled sample.mp4 is a synthetic
clip (a hand pushing a block); point --video at real footage for real results.
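Keeping "the last 16 frames" amounts to a small rolling window. A minimal sketch, assuming nothing about the kit's own buffering: `RollingClip` is a hypothetical type introduced for illustration, generic over whatever frame type CameraFeed delivers, and integers stand in for frames in the demo:

```swift
/// Hypothetical rolling window (not a CoreAIKitVision type): keeps the most
/// recent `capacity` frames, dropping the oldest one when full.
struct RollingClip<Frame> {
    let capacity: Int
    private(set) var frames: [Frame] = []

    init(capacity: Int = 16) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Appends a frame and evicts the oldest once the window is full.
    mutating func push(_ frame: Frame) {
        if frames.count == capacity { frames.removeFirst() }
        frames.append(frame)
    }

    /// True once a full clip is available.
    var isFull: Bool { frames.count == capacity }
}

// Integers stand in for camera frames here.
var clip = RollingClip<Int>()
for frameNumber in 0..<20 { clip.push(frameNumber) }
print(clip.isFull, clip.frames.count)
```

Once `isFull` is true, `clip.frames` is the window one would hand to classify(frames:). With only 16 elements, the linear cost of `removeFirst()` is negligible; a ring buffer would only matter for much larger windows.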
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit, product CoreAIKitVision
- Info.plist: NSCameraUsageDescription, only for the live camera; the snippet needs none
- Entitlements: none needed
- First run downloads the model (0.7 GB on Mac, 1.4 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
Files
| path | what |
|---|---|
| macos/vjepa2_ssv2_fp16.aimodel | fp16 bundle (macOS / JIT) |
| ios/vjepa2_ssv2_fp16.h18p.aimodelc | iOS AOT bundle (iPhone, A18 Pro+ GPU) |
| macos/labels.json, ios/labels.json | 174 SSv2 class names |
| macos/metadata.json | I/O + preprocessing spec |
Live demo app: coreai-video (camera → live top-3 actions). iPhone 17 Pro: ~0.34 s per 16-frame clip.
Preprocessing
Sample 16 frames uniformly from the clip, resize and center-crop each to 256×256, scale to 0..1, normalize
with ImageNet mean [0.485,0.456,0.406] / std [0.229,0.224,0.225], and lay the result out as [1,16,3,256,256].
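Two of these steps are pure arithmetic and can be sketched without any image library. The sketch below assumes evenly spaced indices with both endpoints included (one common convention; the exact sampling rule the kit uses is not stated here) and 8-bit RGB input. `uniformFrameIndices` and `normalize` are hypothetical helper names, and resizing and center-cropping are omitted:

```swift
import Foundation

/// Evenly spaced frame indices across a clip, endpoints included.
/// Assumption: this mirrors a linspace-style rule, not necessarily the kit's.
func uniformFrameIndices(frameCount: Int, samples: Int = 16) -> [Int] {
    precondition(frameCount > 0 && samples > 0)
    guard samples > 1 else { return [0] }
    let step = Double(frameCount - 1) / Double(samples - 1)
    return (0..<samples).map { Int((Double($0) * step).rounded()) }
}

let imageNetMean: [Float] = [0.485, 0.456, 0.406]
let imageNetStd: [Float] = [0.229, 0.224, 0.225]

/// Normalizes one RGB pixel given as 8-bit channel values:
/// scale to 0..1, then subtract the mean and divide by the std per channel.
func normalize(r: UInt8, g: UInt8, b: UInt8) -> [Float] {
    let scaled: [Float] = [r, g, b].map { Float($0) / 255 }
    return (0..<3).map { (scaled[$0] - imageNetMean[$0]) / imageNetStd[$0] }
}

print(uniformFrameIndices(frameCount: 64))
print(normalize(r: 255, g: 0, b: 128))
```

A clip that already has 16 frames maps to indices 0 through 15 unchanged, and a shorter clip repeats frames, which is one plausible reading of "uniformly resampled".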
Credits
- Meta AI: V-JEPA 2 (MIT).
- Conversion + Core AI port: coreai-model-zoo.
- Base model: facebook/vjepa2-vitl-fpc64-256.