V-JEPA 2 (ViT-L, SSv2 action recognition) — Apple Core AI

V-JEPA 2 (Meta AI) running natively on the Apple Core AI engine. It is the zoo's first world model: a self-supervised video encoder that learns by predicting in representation space (JEPA), shipped here with the Something-Something v2 action head (174 classes of physical interactions: put, lift, push, roll, cover, pretend, …).

One bundle: ViT-L backbone (3D RoPE attention) + attentive pooler + classifier, ~375M params, fp16 ~675 MB.
I/O: pixel_values_videos [1,16,3,256,256] (16 frames, RGB 0..1, ImageNet mean/std) → logits [1,174] (labels.json).
Verified: engine vs PyTorch reference cosine 0.999996, top-5 identical; a synthetic motion probe (square moving up vs down) flips the predicted direction correctly.
Speed: ~150–180 ms per 16-frame clip on an M4 Max (GPU), i.e. roughly 5.5 to 6.7 clips per second, fast enough for live video understanding.
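
The verification figures above (cosine similarity plus matching top-5) can be reproduced with a check along these lines. This is a minimal sketch, not the zoo's actual test harness: the function names and the tiny logit vectors are made up for illustration, and real inputs would be the 174 logits from the engine and from the PyTorch reference.

```swift
import Foundation

/// Cosine similarity between two equal-length logit vectors.
/// Accumulates in Double so fp16/fp32 rounding does not dominate the result.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "vectors must match in length")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        dot += Double(x) * Double(y)
        normA += Double(x) * Double(x)
        normB += Double(y) * Double(y)
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    return denominator == 0 ? 0 : dot / denominator
}

/// Indices of the k largest values, highest first (ties broken by lower index).
func topKIndices(_ values: [Float], k: Int) -> [Int] {
    let order = values.indices.sorted {
        values[$0] != values[$1] ? values[$0] > values[$1] : $0 < $1
    }
    return Array(order.prefix(k))
}

// Hypothetical stand-in data; real vectors have 174 entries.
let engineLogits: [Float] = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2, -0.3]
let referenceLogits: [Float] = [0.1, 2.01, -1.0, 3.49, 0.7, 1.2, -0.3]

let similarity = cosineSimilarity(engineLogits, referenceLogits)
let sameTop5 = topKIndices(engineLogits, k: 5) == topKIndices(referenceLogits, k: 5)
print("cosine:", similarity, "top-5 identical:", sameTop5)
```

Comparing ranked indices rather than raw values keeps the top-5 check insensitive to small fp16 differences that do not change the ordering.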

Use it

▶️ Run it (source) — the ActionCamera runner (live camera action recognition, one app for every video model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ActionCamera/ActionCamera.xcodeproj
# → Run, then pick "V-JEPA 2 ViT-L (SSv2)" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ActionCamera
swift run action-cli --model vjepa2-vitl-ssv2 --video sample.mp4

💻 Build with it: the snippet below is complete. All the glue is kit API, so it runs as pasted:

import CoreAIKitVision

let recognizer = try await ActionRecognizer(catalog: "vjepa2-vitl-ssv2")
let actions = try await recognizer.classify(videoAt: videoURL, topK: 3)
// actions: ranked [Prediction] — .label ("Pushing [something] from left to right"),
// .probability; 174 SSv2 classes, fully on-device

The take-home is Examples/ActionCamera/Sources/QuickStart.swift: this exact code as one typed function, with no UI. The CLI is an argument shell over it, and the GUI classifies a rolling 16-frame clip from CameraFeed. For a live camera, keep the last 16 CameraFeed frames and call classify(frames:); other frame counts are uniformly resampled to 16. The bundled sample.mp4 is a synthetic clip (a hand pushing a block), so point --video at real footage for real results.
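
To make the rolling-clip and resampling ideas concrete, here is a minimal sketch in plain Swift. It is an illustration under assumptions, not the kit's code: `RollingClip` and `uniformFrameIndices` are hypothetical names, and the midpoint sampling rule is one common way to resample uniformly, which may differ from what `classify(frames:)` does internally.

```swift
/// Picks `target` indices spread uniformly over `count` source frames.
/// Uses midpoint sampling (an assumption; the kit's exact rule may differ).
func uniformFrameIndices(count: Int, target: Int = 16) -> [Int] {
    precondition(count > 0 && target > 0, "need at least one frame")
    return (0..<target).map { i in
        let position = (Double(i) + 0.5) * Double(count) / Double(target)
        return min(count - 1, Int(position))
    }
}

/// Fixed-capacity buffer that keeps only the most recent frames.
/// `Frame` is generic so the sketch does not depend on any camera type.
struct RollingClip<Frame> {
    let capacity: Int
    private(set) var frames: [Frame] = []

    init(capacity: Int = 16) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Appends a frame and drops the oldest ones beyond `capacity`.
    mutating func append(_ frame: Frame) {
        frames.append(frame)
        if frames.count > capacity {
            frames.removeFirst(frames.count - capacity)
        }
    }

    /// True once a full clip is available for classification.
    var isFull: Bool { frames.count == capacity }
}
```

With a buffer like this, a camera callback would append each incoming frame and hand `frames` to the recognizer once `isFull` is true. A 32-frame clip would be reduced to every second frame, and an 8-frame clip would repeat each frame twice.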

Integration checklist

SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKitVision
Info.plist: NSCameraUsageDescription — only for the live camera; the snippet needs none
Entitlements: none needed
First run downloads the model — 0.7 GB (Mac) / 1.4 GB (iPhone) — then it loads from the local cache (Application Support; progress via the downloadProgress callback)
Measure in Release — Debug is ~3× slower on per-token host work

Files

| path | what |
| --- | --- |
| `macos/vjepa2_ssv2_fp16.aimodel` | fp16 bundle (macOS / JIT) |
| `ios/vjepa2_ssv2_fp16.h18p.aimodelc` | iOS AOT bundle (iPhone, A18 Pro+ GPU) |
| `macos/labels.json`, `ios/labels.json` | 174 SSv2 class names |
| `macos/metadata.json` | I/O + preprocessing spec |
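
For anyone consuming the raw bundle without the kit, turning the `[1,174]` logits into ranked labels might look like the sketch below. Two things here are assumptions rather than facts from this card: that `labels.json` decodes to an ordered array of strings indexed by class id, and that probabilities come from a plain softmax over the logits. `RankedLabel`, `decodeLabels`, and `topLabels` are hypothetical names.

```swift
import Foundation

/// One ranked prediction (hypothetical type, not the kit's Prediction).
struct RankedLabel: Equatable {
    let label: String
    let probability: Double
}

/// Decodes labels, ASSUMING the file is a JSON array of strings in class-id order.
func decodeLabels(_ data: Data) throws -> [String] {
    try JSONDecoder().decode([String].self, from: data)
}

/// Numerically stable softmax: subtract the max logit before exponentiating.
func softmax(_ logits: [Float]) -> [Double] {
    guard let maxLogit = logits.max() else { return [] }
    let exps = logits.map { exp(Double($0 - maxLogit)) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

/// Returns the k most probable labels, highest probability first.
func topLabels(logits: [Float], labels: [String], k: Int) -> [RankedLabel] {
    precondition(logits.count == labels.count, "one label per logit")
    let probabilities = softmax(logits)
    let order = probabilities.indices.sorted {
        probabilities[$0] != probabilities[$1]
            ? probabilities[$0] > probabilities[$1]
            : $0 < $1
    }
    return order.prefix(k).map {
        RankedLabel(label: labels[$0], probability: probabilities[$0])
    }
}
```

If `labels.json` turns out to be a dictionary keyed by class id, only `decodeLabels` would need to change; the ranking logic stays the same.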

Live demo app: coreai-video — camera → live top-3 actions. iPhone 17 Pro: ~0.34 s per 16-frame clip.

Preprocessing

Sample 16 frames uniformly from the clip, resize+center-crop to 256×256, scale to 0..1, normalize with ImageNet mean [0.485,0.456,0.406] / std [0.229,0.224,0.225], layout [1,16,3,256,256].
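
The arithmetic behind these steps can be sketched in plain Swift as follows. This is a hedged illustration with hypothetical helper names, not the converter's or the kit's code. In particular, resizing the short side directly to 256 before cropping is an assumption; `metadata.json` is the authoritative spec for the resize target.

```swift
// ImageNet statistics as given above, in RGB order.
let imageNetMean: [Float] = [0.485, 0.456, 0.406]
let imageNetStd: [Float] = [0.229, 0.224, 0.225]

/// Maps one 8-bit channel value to the model's input range:
/// scale to 0..1, then subtract the mean and divide by the std.
func normalize(_ value: UInt8, channel: Int) -> Float {
    (Float(value) / 255 - imageNetMean[channel]) / imageNetStd[channel]
}

/// Flat offset of element [0, frame, channel, y, x] in a row-major
/// [1, 16, 3, 256, 256] tensor.
func tensorOffset(frame: Int, channel: Int, y: Int, x: Int, side: Int = 256) -> Int {
    ((frame * 3 + channel) * side + y) * side + x
}

/// Size after resizing the short side to `side` (ASSUMED resize rule),
/// plus the top-left origin of the centered `side` x `side` crop.
func centerCrop(width: Int, height: Int, side: Int = 256)
    -> (scaledWidth: Int, scaledHeight: Int, x: Int, y: Int) {
    precondition(width > 0 && height > 0, "frame must be non-empty")
    let scale = Double(side) / Double(min(width, height))
    let scaledWidth = max(side, Int((Double(width) * scale).rounded()))
    let scaledHeight = max(side, Int((Double(height) * scale).rounded()))
    return (scaledWidth, scaledHeight, (scaledWidth - side) / 2, (scaledHeight - side) / 2)
}
```

For a 1920×1080 frame this gives a 455×256 resized image with the crop starting at x = 99, y = 0. A full preprocessor would write `normalize(pixel, channel:)` into a `Float` buffer at `tensorOffset(...)` for each of the 16 sampled frames.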

Credits

Meta AI — V-JEPA 2 (MIT).
Conversion + Core AI port: coreai-model-zoo.

Model tree

mlboydaisuke/VJEPA2-ViTL-SSv2-CoreAI: base model facebook/vjepa2-vitl-fpc64-256 → fine-tuned facebook/vjepa2-vitl-fpc16-256-ssv2 → this model.