mxbai-embed-large-v1: MLX int6 quantization

6-bit group-quantized port of mixedbread-ai/mxbai-embed-large-v1 for the MLX framework on Apple Silicon.

What was quantized

  • Linear layers in all 24 BERT encoder blocks (attention Q/K/V/output, FFN intermediate/output) and the pooler dense layer are quantized to 6-bit affine, group_size=64.
  • Embedding tables (word, position, token type) are kept in fp16, because quantizing them tends to hurt retrieval quality more than the saved memory is worth.
  • LayerNorm weights remain in fp16.
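
To make "6-bit affine, group_size=64" concrete, here is a simplified pure-Python sketch of the scheme: each run of 64 weights gets its own scale and bias, and every weight is stored as an integer code in 0..63. The function names are illustrative only, and the min/max rule for choosing scale and bias is an assumption; MLX's real kernels differ in detail and pack the codes into 32-bit words.

```python
import random

def quantize_group(values, bits=6):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1            # 63 steps for 6-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo             # lo plays the role of the per-group bias

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w_hat = code * scale + bias."""
    return [c * scale + bias for c in codes]

def quantize_row(row, group_size=64, bits=6):
    """Split one weight row into consecutive groups and quantize each one."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]

# Hypothetical weight row: 1024 values, the hidden size of a BERT-large layer.
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]

groups = quantize_row(row)
restored = [v for g in groups for v in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

The round-trip error of any weight is bounded by half of its group's scale, which is why smaller groups or more bits give a closer match to the fp16 weights.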

Why int6

Internal benchmark across fp16 / int4 / int5 / int6 / int8 on a 200-query monolingual English retrieval set (50 fact groups × 4 paraphrases vs 100 distractor facts):

Variant  Disk    GPU peak (embed)  Embed mean  Top-1 stability vs fp16  Top-1 vs ground truth  Top-5 Jaccard  MRR drift
fp16     639 MB  1411 MB           27.6 ms     n/a                      93.6%                  n/a            n/a
int8     368 MB  538 MB            25.4 ms     99.5%                    93.1%                  0.99           +0.0033
int6     296 MB  466 MB            16.1 ms     99.0%                    93.6%                  0.97           +0.0000
int5     260 MB  430 MB            17.3 ms     99.0%                    93.6%                  0.94           +0.0008
int4     224 MB  394 MB            13.0 ms     97.5%                    95.1%                  0.87           -0.0082

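int6 matched the fp16 baseline exactly on top-1 accuracy (93.6%) and MRR (drift +0.0000). Its top-5 Jaccard (0.97) is second only to int8 (0.99) among the quantized variants. It also embeds about 1.6× faster than int8 (16.1 ms vs 25.4 ms), and about 1.7× faster than fp16 (27.6 ms), because of smaller intermediate matmul tensors.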
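
The Disk column can be roughly reproduced from the layer shapes. The sketch below rests on several assumptions that the card does not state: the standard BERT-large dimensions (hidden 1024, FFN 4096, vocabulary 30522, 512 positions), one fp16 scale and one fp16 bias per group of 64 weights (so 6-bit storage costs about 6.5 bits per weight), and "MB" in the table meaning MiB.

```python
# Assumed BERT-large shapes; not taken from the model card.
HIDDEN, FFN, LAYERS = 1024, 4096, 24
VOCAB, POSITIONS, TOKEN_TYPES = 30522, 512, 2
GROUP_SIZE = 64
MIB = 1024 ** 2

# Quantized part: Q/K/V/output and the two FFN matrices per block, plus the pooler.
linear_weights = LAYERS * (4 * HIDDEN * HIDDEN + 2 * HIDDEN * FFN) + HIDDEN * HIDDEN

# Kept in fp16: embedding tables, plus all bias vectors and LayerNorm parameters.
embedding_weights = (VOCAB + POSITIONS + TOKEN_TYPES) * HIDDEN
per_block_small = 4 * HIDDEN + FFN + HIDDEN + 4 * HIDDEN  # biases + two LayerNorms
small_params = LAYERS * per_block_small + 2 * HIDDEN + HIDDEN

def estimate_mib(bits):
    """Estimated checkpoint size in MiB for a given Linear-layer bit width."""
    if bits == 16:
        linear_bytes = linear_weights * 2
    else:
        # Each group of 64 codes carries an fp16 scale and an fp16 bias.
        bits_per_weight = bits + 2 * 16 / GROUP_SIZE
        linear_bytes = linear_weights * bits_per_weight / 8
    fp16_bytes = (embedding_weights + small_params) * 2
    return (linear_bytes + fp16_bytes) / MIB

# Disk column from the benchmark table, for comparison.
reported = {16: 639, 8: 368, 6: 296, 5: 260, 4: 224}
estimates = {bits: estimate_mib(bits) for bits in reported}
```

Under these assumptions each estimate lands within about 1 MiB of the reported size, which is consistent with only the Linear layers being quantized.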

Usage with MLXEmbedders (Swift)

import MLXEmbedders
import MLXLMCommon

let config = ModelConfiguration(
    id: .id("lorelaiassistant/mxbai-embed-large-v1-mlx-int6")
)

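// hubDownloader and huggingFaceTokenizerLoader are not defined in this snippet;
// create them in your app before making this call.
let container = try await EmbedderModelFactory.shared.loadContainer(
    from: hubDownloader,
    using: huggingFaceTokenizerLoader,
    configuration: config,
    progressHandler: { _ in }  // ignore download progress
)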

The MLXEmbedders loader auto-detects the quantization block in config.json and applies mlx.nn.quantize to the matching Linear layers at load time.

Usage with mlx.core (Python)

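The standard mlx.core.load("model.safetensors") returns the weights as stored: packed quantized weights plus per-group scales and biases. To run the model, either build a BERT module that uses mlx.nn.QuantizedLinear, or call mlx.nn.quantize(model, group_size=64, bits=6) on a freshly constructed fp16 model and load the weights afterward. In the second case, pass a class_predicate that selects only Linear layers: by default mlx.nn.quantize also converts Embedding modules, and the embedding tables in this checkpoint are fp16.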
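
A minimal sketch of the second route follows. It assumes you already have a BERT implementation as an mlx.nn.Module whose parameter names match the checkpoint; that `model` object, the `load_int6` helper, and the `linear_only` predicate are hypothetical names, not part of this repository. The predicate matches on the class name only so it can be exercised without MLX installed; `isinstance(module, nn.Linear)` is the more usual form.

```python
def linear_only(path, module):
    """class_predicate for mlx.nn.quantize: convert Linear layers only.

    Embedding modules are skipped so the embedding tables stay in fp16,
    matching how this checkpoint was produced.
    """
    return type(module).__name__ == "Linear"

def load_int6(model, weights_path="model.safetensors"):
    """Swap Linear layers for quantized ones, then load the int6 weights.

    `model` is assumed to be a freshly constructed fp16 BERT nn.Module.
    """
    # Imported lazily so the predicate above can be tested without MLX.
    import mlx.core as mx
    import mlx.nn as nn

    # Replace matching layers in place with 6-bit, group-size-64 equivalents.
    nn.quantize(model, group_size=64, bits=6, class_predicate=linear_only)
    # Parameter names and shapes now line up with the quantized checkpoint.
    model.load_weights(weights_path)
    mx.eval(model.parameters())
    return model
```

Quantizing before loading matters: until the layers are converted, the module has no slots for the packed weights, scales, and biases that the checkpoint contains.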

Caveats

  • Vector space is incompatible with the fp16 base model. If you have an existing index built with fp16 mxbai, you must re-embed it before switching.
  • Tested on a synthetic 200-query English retrieval set; before high-stakes production use, validate on your domain.
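
For validating on your own domain, the comparison can be sketched as below. The card does not define its metrics, so this assumes that "top-1 stability" is the share of queries whose first result is unchanged, that "top-5 Jaccard" is the mean overlap of the two top-5 sets, and that "MRR drift" is quantized MRR minus fp16 MRR. Each ranking is a list of document ids, best first, produced by your own retrieval code.

```python
def top1_agreement(base_rankings, quant_rankings):
    """Fraction of queries where both models return the same first result."""
    same = sum(b[0] == q[0] for b, q in zip(base_rankings, quant_rankings))
    return same / len(base_rankings)

def mean_topk_jaccard(base_rankings, quant_rankings, k=5):
    """Mean Jaccard overlap between the two models' top-k result sets."""
    scores = []
    for b, q in zip(base_rankings, quant_rankings):
        top_b, top_q = set(b[:k]), set(q[:k])
        scores.append(len(top_b & top_q) / len(top_b | top_q))
    return sum(scores) / len(scores)

def mrr(rankings, relevant_ids):
    """Mean reciprocal rank of the single relevant document per query."""
    total = 0.0
    for ranking, target in zip(rankings, relevant_ids):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(rankings)

# Toy data: two queries, three candidate documents each.
base = [["a", "b", "c"], ["d", "e", "f"]]      # fp16 rankings
quant = [["a", "c", "b"], ["e", "d", "f"]]     # quantized rankings
relevant = ["a", "d"]                          # ground-truth document per query

stability = top1_agreement(base, quant)
overlap = mean_topk_jaccard(base, quant, k=2)
drift = mrr(quant, relevant) - mrr(base, relevant)
```

Running the same queries through both the fp16 and int6 models and comparing these three numbers gives a domain-specific counterpart to the benchmark table.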

Attribution

Base model © Mixedbread AI, released under Apache 2.0. This quantization preserves the same license. See the original repository for model card, citation, and training details.
