pplx-embed for Apple CoreML (ANE-optimized)

CoreML conversion of Perplexity's pplx-embed-v1-0.6b (a bidirectional Qwen3-0.6B encoder → masked-mean pool → tanh-int8 head) produced with the CoreML-LLM pipeline. Targets macOS 26.

Each subfolder is a fixed-shape sequence-length bucket that stays resident on the Apple Neural Engine (flexible shapes force CPU fallback). At runtime the Swift package pads each input to the smallest bucket that fits; inputs longer than the largest fixed bucket fall through to the dyn*-int8/ flexible GPU catch-all. The encoder uses native RMSNorm and a single fixed RoPE table — the ANE-fastest path on M4 Max / macOS 26.
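The routing rule above can be sketched in a few lines of Swift. This is a minimal illustration, not the package's actual implementation: the names `Route`, `route(tokenCount:)`, and `pad(_:to:padID:)` are hypothetical, and right-padding with a pad ID of 0 is an assumption. The bucket lengths are those listed in this repo.

```swift
// Hypothetical helpers for illustration; not the CoreML-LLM package API.
let fixedBuckets = [512, 1024, 2048, 4096]   // fixed ANE buckets (plain variant)
let dynamicMax = 8192                         // upper bound of dyn8192-int8/

enum Route: Equatable {
    case fixed(Int)   // pad to this length and run the fixed-shape model
    case dynamic      // run on the flexible-shape catch-all
    case tooLong      // longer than anything published here
}

/// Picks the smallest fixed bucket that fits, else the dynamic catch-all.
func route(tokenCount: Int) -> Route {
    if let bucket = fixedBuckets.sorted().first(where: { $0 >= tokenCount }) {
        return .fixed(bucket)
    }
    return tokenCount <= dynamicMax ? .dynamic : .tooLong
}

/// Right-pads token IDs to `length` and builds the matching attention mask
/// (1.0 = valid, 0.0 = pad). The model expects fp16; Float is used here for
/// portability. The pad ID of 0 is an assumption.
func pad(_ ids: [Int32], to length: Int, padID: Int32 = 0) -> (ids: [Int32], mask: [Float]) {
    precondition(ids.count <= length, "input longer than bucket")
    let padCount = length - ids.count
    let paddedIDs = ids + Array(repeating: padID, count: padCount)
    let mask = Array(repeating: Float(1), count: ids.count)
             + Array(repeating: Float(0), count: padCount)
    return (paddedIDs, mask)
}
```

Under these assumptions, a 300-token input is routed to the 512 bucket and a 5000-token input to the dynamic model.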

Buckets in this repo

| Subfolder | Variant | Bucket (L) | Kind | Size |
|---|---|---|---|---|
| `L512-int8/` | plain | 512 | fixed ANE bucket | 2.44 GB |
| `L1024-int8/` | plain | 1024 | fixed ANE bucket | 2.44 GB |
| `L2048-int8/` | plain | 2048 | fixed ANE bucket | 2.44 GB |
| `L4096-int8/` | plain | 4096 | fixed ANE bucket | 2.44 GB |
| `dyn8192-int8/` | plain | 1..8192 | dynamic GPU catch-all | 2.44 GB |
| `context/L512-int8/` | context | 512 | fixed ANE bucket | 2.44 GB |

The encoder weight.bin is byte-identical across every bucket, because a single fixed-size RoPE table makes the weights independent of bucket length. Hugging Face therefore stores the weight blob once, and the content-addressed cache downloads it once, keyed by etag. Pulling several buckets costs ~1.15 GB total, not ~1.15 GB × N.
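To make the download arithmetic concrete, here is a toy model of a content-addressed cache. It is only a sketch of the idea: `RemoteFile` and `BlobCache` are hypothetical types, not the HF Hub client API, and the file paths and etag value are illustrative placeholders.

```swift
// Toy model of a content-addressed cache; not the HF Hub client API.
struct RemoteFile {
    let path: String
    let etag: String   // identifies the content, independent of the path
    let bytes: Int
}

struct BlobCache {
    private(set) var blobs: [String: Int] = [:]   // etag -> size in bytes
    private(set) var bytesDownloaded = 0

    /// Downloads a file only if no blob with the same etag is cached yet.
    mutating func fetch(_ file: RemoteFile) {
        guard blobs[file.etag] == nil else { return }   // same content, skip
        blobs[file.etag] = file.bytes
        bytesDownloaded += file.bytes
    }
}

// Three buckets whose weight files share one etag (illustrative values).
var cache = BlobCache()
for bucket in ["L512-int8", "L1024-int8", "L2048-int8"] {
    cache.fetch(RemoteFile(path: "\(bucket)/weight.bin",
                           etag: "shared-weight",
                           bytes: 1_150_000_000))
}
print(cache.bytesDownloaded)   // one copy of the weight, not three
```

Because the key is the etag rather than the path, the second and third requests are cache hits.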

Use it

Use the CoreML-LLM Swift package. It relies on the HF Swift Hub client, so only the buckets you request are downloaded, and the shared weight is fetched once into the content-addressed cache:

```swift
import CoreMLLLM

// Download (or reuse from the shared HF cache) only the requested buckets;
// the shared weight is fetched once by etag.
let embedder = try await PplxEmbed.load(
    repo: "dokterbob/pplx-embed-coreml",
    buckets: [512, 1024, 2048])

// One int8 embedding per input string: [[Int8]]
let vecs = try embedder.embed(["On-device embeddings", "Bonjour le monde"])
```

Each bucket is published as both a .mlpackage and a precompiled .mlmodelc; pass preferCompiled: false to load the portable .mlpackage. Alternatively, download the bundle directory yourself and load it with load(bundleDir:).

I/O contract (per bucket `model_config.json`)

Inputs: input_ids (1, L) int32; attention_mask (1, L) fp16 (1.0 valid, 0.0 pad).
Output: embedding (1, 1024) int8, computed as clamp(round(tanh(x)*127), -128, 127). Derive binary/ubinary from the int8 sign (see PplxEmbed).
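The output formula and the sign-based binary form can be sketched as follows. This is a reference illustration under stated assumptions, not the code in PplxEmbed: `quantizeInt8` and `packSignBits` are hypothetical names, rounding uses Swift's default mode (ties away from zero), which may differ from the converted model's rounding on exact ties, and the bit layout (a bit is set when the value is positive, packed most-significant bit first, 8 dimensions per byte) is an assumed convention.

```swift
import Foundation

/// clamp(round(tanh(x) * 127), -128, 127), as given in the I/O contract.
/// Assumes finite inputs; rounding mode on exact ties is an assumption.
func quantizeInt8(_ x: [Float]) -> [Int8] {
    x.map { v in
        let scaled = (tanh(v) * 127).rounded()
        return Int8(max(-128, min(127, scaled)))
    }
}

/// Packs sign bits into bytes (unsigned "ubinary" form).
/// Assumed layout: bit = 1 when the value is > 0, MSB first, 8 dims per byte.
func packSignBits(_ q: [Int8]) -> [UInt8] {
    precondition(q.count % 8 == 0, "dimension must be a multiple of 8")
    return stride(from: 0, to: q.count, by: 8).map { start in
        q[start..<start + 8].reduce(UInt8(0)) { byte, v in
            (byte << 1) | (v > 0 ? 1 : 0)
        }
    }
}
```

With this layout a 1024-dimensional int8 embedding packs into 128 bytes. Check PplxEmbed for the convention the package actually uses before mixing vectors from different sources.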

License

Inherits the base model's license.
