pplx-embed for Apple CoreML (ANE-optimized)

CoreML conversion of Perplexity's pplx-embed-v1-0.6b (a bidirectional Qwen3-0.6B encoder → masked-mean pool → tanh-int8 head) produced with the CoreML-LLM pipeline. Targets macOS 26.
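To illustrate the masked-mean pool stage named above, here is a minimal Swift sketch. In the converted model this step runs inside the CoreML graph, so the function below is purely illustrative; `maskedMeanPool` and its array layout are hypothetical and not part of the CoreML-LLM package.

```swift
/// Averages hidden states over valid tokens only.
/// `hidden` is [L][D] (one row per token), `mask` is [L] with 1.0 valid, 0.0 pad.
/// Hypothetical helper for illustration; not the package's API.
func maskedMeanPool(hidden: [[Float]], mask: [Float]) -> [Float] {
    precondition(hidden.count == mask.count, "one mask value per token")
    guard let dim = hidden.first?.count else { return [] }

    var sum = [Float](repeating: 0, count: dim)
    var validCount: Float = 0

    // Padded positions (mask == 0) contribute nothing to the sum or the count.
    for (row, m) in zip(hidden, mask) where m > 0 {
        validCount += m
        for d in 0..<dim { sum[d] += row[d] * m }
    }

    // Guard against an all-padding input to avoid dividing by zero.
    let denom = max(validCount, 1e-9)
    return sum.map { $0 / denom }
}
```

Because padded rows are ignored, the pooled vector does not depend on which bucket length the input was padded to.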

Each subfolder is a fixed-shape sequence-length bucket that stays resident on the Apple Neural Engine (flexible shapes force CPU fallback). At runtime the Swift package pads each input to the smallest bucket that fits; inputs longer than the largest fixed bucket fall through to the dyn*-int8/ flexible GPU catch-all. The encoder uses native RMSNorm and a single fixed RoPE table, which is the fastest ANE path on M4 Max / macOS 26.
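The bucket routing described above can be sketched roughly as follows. The names `BucketChoice`, `chooseBucket`, and `pad` are hypothetical, and right-padding with a pad id of 0 is an assumption; the package's own implementation may differ. The mask uses Float here for portability, whereas the model input is fp16.

```swift
/// Where an input is routed. Hypothetical type for illustration.
enum BucketChoice: Equatable {
    case fixed(Int)   // fixed-shape ANE bucket of this sequence length
    case dynamic      // flexible-shape GPU catch-all
}

/// Picks the smallest fixed bucket that can hold `tokenCount` tokens,
/// falling back to the dynamic model when none is large enough.
func chooseBucket(tokenCount: Int, fixedBuckets: [Int]) -> BucketChoice {
    if let fit = fixedBuckets.sorted().first(where: { $0 >= tokenCount }) {
        return .fixed(fit)
    }
    return .dynamic
}

/// Right-pads token ids to `length` and builds the matching attention mask
/// (1.0 for valid tokens, 0.0 for padding). `padID` is an assumed value.
func pad(_ ids: [Int32], to length: Int, padID: Int32 = 0) -> (ids: [Int32], mask: [Float]) {
    precondition(ids.count <= length, "input longer than bucket")
    let padCount = length - ids.count
    let padded = ids + Array(repeating: padID, count: padCount)
    let mask = Array(repeating: Float(1), count: ids.count)
             + Array(repeating: Float(0), count: padCount)
    return (padded, mask)
}
```

For example, with buckets 512, 1024, and 2048 loaded, a 513-token input would be padded to 1024, and a 5000-token input would be routed to the dynamic model.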

Buckets in this repo

Subfolder            Variant   Bucket (L)   Kind                    Size
L512-int8/           plain     512          fixed ANE bucket        2.44 GB
L1024-int8/          plain     1024         fixed ANE bucket        2.44 GB
L2048-int8/          plain     2048         fixed ANE bucket        2.44 GB
L4096-int8/          plain     4096         fixed ANE bucket        2.44 GB
dyn8192-int8/        plain     1..8192      dynamic GPU catch-all   2.44 GB
context/L512-int8/   context   512          fixed ANE bucket        2.44 GB

The encoder weight.bin is byte-identical across every bucket, because a single fixed-size RoPE table makes the weights independent of bucket length. Hugging Face therefore stores the weight blob once, and its content-addressed cache fetches it once by etag on download: pulling N buckets costs ~1.15 GB total, not ~1.15 GB × N.

Use it

Via the CoreML-LLM Swift package. It uses the HF Swift Hub client, so only the buckets you request are downloaded and the shared weight is fetched once into the content-addressed cache:

import CoreMLLLM
let embedder = try await PplxEmbed.load(
    repo: "dokterbob/pplx-embed-coreml",
    buckets: [512, 1024, 2048])       // shared HF cache; weight fetched once by etag
let vecs = try embedder.embed(["On-device embeddings", "Bonjour le monde"])  // [[Int8]]

Each bucket is published as both an .mlpackage and a precompiled .mlmodelc; pass preferCompiled: false to use the portable package. Alternatively, download the bundle directory yourself and load it with load(bundleDir:).
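Once you have [[Int8]] vectors, a common next step is comparing them. The sketch below assumes cosine similarity is the metric you want (the model card does not prescribe one); `cosine` is a hypothetical helper, not part of the package.

```swift
/// Cosine similarity between two int8 embeddings.
/// Products are accumulated in Int so that int8 values cannot overflow.
/// Hypothetical helper for illustration.
func cosine(_ a: [Int8], _ b: [Int8]) -> Double {
    precondition(a.count == b.count, "embeddings must have the same dimension")
    var dot = 0, normA = 0, normB = 0
    for i in a.indices {
        let x = Int(a[i]), y = Int(b[i])
        dot += x * y
        normA += x * x
        normB += y * y
    }
    // An all-zero vector has no direction; report zero similarity.
    guard normA > 0, normB > 0 else { return 0 }
    return Double(dot) / (Double(normA).squareRoot() * Double(normB).squareRoot())
}
```

With the vecs from the snippet above, cosine(vecs[0], vecs[1]) would give a score in [-1, 1].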

I/O contract (per bucket model_config.json)

  • input_ids (1, L) int32, attention_mask (1, L) fp16 (1.0 valid, 0.0 pad)
  • embedding (1, 1024) int8: clamp(round(tanh(x)*127), -128, 127); derive binary/ubinary from the int8 sign (see PplxEmbed).
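The output formula and the sign-based binarization can be sketched as below. This is illustrative only: the quantization happens inside the converted model, and the function names here are hypothetical. Two details are assumptions: the rounding rule for exact halves (Swift's rounded() rounds away from zero) and the convention that a component of 0 maps to bit 0, packed most-significant bit first. PplxEmbed remains the authority on the actual binary/ubinary layout.

```swift
import Foundation

/// clamp(round(tanh(x) * 127), -128, 127) for one pooled activation.
func quantize(_ x: Float) -> Int8 {
    let scaled = (tanh(x) * 127).rounded()   // assumption: halves round away from zero
    return Int8(max(-128, min(127, scaled)))
}

/// One bit per dimension: 1 where the int8 component is positive, else 0.
/// Treating 0 as "not positive" is an assumption.
func signBits(_ v: [Int8]) -> [UInt8] {
    v.map { $0 > 0 ? 1 : 0 }
}

/// Packs bits into bytes, most-significant bit first (assumed layout).
func packBits(_ bits: [UInt8]) -> [UInt8] {
    var out = [UInt8](repeating: 0, count: (bits.count + 7) / 8)
    for (i, bit) in bits.enumerated() where bit != 0 {
        out[i / 8] |= UInt8(0x80) >> (i % 8)
    }
    return out
}
```

Under these assumptions a 1024-dimensional int8 embedding packs into 128 bytes.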

License

Inherits the base model's license.
