mxbai-embed-large-v1: MLX int6 quantization
6-bit group-quantized port of mixedbread-ai/mxbai-embed-large-v1 for the MLX framework on Apple Silicon.
What was quantized
- Linear layers in all 24 BERT encoder blocks (attention Q/K/V/output, FFN intermediate/output) and the pooler dense layer are quantized to 6-bit affine, group_size=64.
- Embedding tables (word, position, token type) are kept in fp16, because quantizing them tends to hurt retrieval quality more than the saved memory is worth.
- LayerNorm weights remain in fp16.
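To make "6-bit affine, group_size=64" concrete, here is a minimal pure-Python sketch of generic min/max affine group quantization. It illustrates the general technique only: MLX's actual kernels, rounding rules, and bit packing may differ in detail, and the function names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical. The storage estimate at the end assumes one 16-bit scale and one 16-bit bias per group.

```python
import random

def quantize_group(values, bits=6):
    """Affine-quantize one group of floats to unsigned integer codes."""
    levels = (1 << bits) - 1              # 63 steps between min and max for 6 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo               # lo plays the role of the affine bias

def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float values."""
    return [c * scale + bias for c in codes]

def quantize_row(row, group_size=64, bits=6):
    """Quantize a weight row group by group; each group gets its own scale and bias."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(128)]   # toy stand-in for one weight row
groups = quantize_row(row)
restored = [v for codes, scale, bias in groups
            for v in dequantize_group(codes, scale, bias)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
max_scale = max(scale for _, scale, _ in groups)

# Storage estimate, ASSUMING one 16-bit scale and one 16-bit bias per group of 64.
bits_per_weight = 6 + 2 * 16 / 64
```

Because every group carries its own scale and bias, the reconstruction error of a weight is bounded by half of that group's step size, which is why smaller groups or more bits reduce the error at the cost of extra storage.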
Why int6
Internal benchmark across fp16 / int4 / int5 / int6 / int8 on a 200-query monolingual English retrieval set (50 fact groups × 4 paraphrases vs 100 distractor facts):
| Variant | Disk | GPU peak (embed) | Embed mean | Top-1 stability vs fp16 | Top-1 vs ground truth | Top-5 Jaccard | MRR drift |
|---|---|---|---|---|---|---|---|
| fp16 | 639 MB | 1411 MB | 27.6 ms | n/a | 93.6% | n/a | n/a |
| int8 | 368 MB | 538 MB | 25.4 ms | 99.5% | 93.1% | 0.99 | +0.0033 |
| int6 | 296 MB | 466 MB | 16.1 ms | 99.0% | 93.6% | 0.97 | +0.0000 |
| int5 | 260 MB | 430 MB | 17.3 ms | 99.0% | 93.6% | 0.94 | +0.0008 |
| int4 | 224 MB | 394 MB | 13.0 ms | 97.5% | 95.1% | 0.87 | -0.0082 |
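int6 matched the fp16 baseline exactly on top-1 accuracy against ground truth (93.6%) and on MRR (+0.0000 drift), with the highest top-5 Jaccard (0.97) among the sub-8-bit variants; only int8 scored higher (0.99). In this benchmark it also embeds about 1.6× faster than int8 (16.1 ms vs 25.4 ms mean), which we attribute to smaller intermediate matmul tensors.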
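The table's column names are not defined in this card, so the following is only a sketch of how such metrics are commonly computed, under assumed definitions: top-1 stability as the fraction of queries whose first result matches the fp16 ranking, top-5 Jaccard as the mean set overlap of the first five results, and MRR drift as the variant's MRR minus the fp16 MRR. The function names and toy rankings are hypothetical and are not the benchmark data.

```python
def top1_stability(ref_rankings, var_rankings):
    """Fraction of queries where both rankings agree on the first result."""
    same = sum(r[0] == v[0] for r, v in zip(ref_rankings, var_rankings))
    return same / len(ref_rankings)

def mean_topk_jaccard(ref_rankings, var_rankings, k=5):
    """Mean Jaccard overlap between the top-k result sets of two rankings."""
    total = 0.0
    for r, v in zip(ref_rankings, var_rankings):
        a, b = set(r[:k]), set(v[:k])
        total += len(a & b) / len(a | b)
    return total / len(ref_rankings)

def mrr(rankings, relevant):
    """Mean reciprocal rank of the first relevant document per query."""
    total = 0.0
    for ranking, gold in zip(rankings, relevant):
        rank = next((i + 1 for i, doc in enumerate(ranking) if doc in gold), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(rankings)

# Toy rankings for two queries (illustrative only).
fp16_rankings = [["a", "b", "c", "d", "e", "f"], ["x", "y", "z", "w", "v", "u"]]
int6_rankings = [["a", "b", "c", "d", "f", "e"], ["y", "x", "z", "w", "v", "u"]]
gold = [{"a"}, {"x"}]

stability = top1_stability(fp16_rankings, int6_rankings)
jaccard = mean_topk_jaccard(fp16_rankings, int6_rankings, k=5)
mrr_drift = mrr(int6_rankings, gold) - mrr(fp16_rankings, gold)
```

Stability and Jaccard compare a variant against fp16 rather than against ground truth, so a variant can score below 100% on them while matching or even exceeding fp16 accuracy.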
Usage with MLXEmbedders (Swift)
import MLXEmbedders
import MLXLMCommon
let config = ModelConfiguration(
    id: .id("lorelaiassistant/mxbai-embed-large-v1-mlx-int6")
)
// hubDownloader and huggingFaceTokenizerLoader are assumed to be created elsewhere in your app.
let container = try await EmbedderModelFactory.shared.loadContainer(
    from: hubDownloader,
    using: huggingFaceTokenizerLoader,
    configuration: config,
    progressHandler: { _ in }
)
The MLXEmbedders loader auto-detects the quantization block in config.json and applies mlx.nn.quantize to the matching Linear layers at load time.
Usage with mlx.core (Python)
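The standard mlx.core.load("model.safetensors") returns the quantized weights as a dictionary of arrays (packed weights plus per-group scales and biases). To use them, either build a BERT module whose Linear layers are mlx.nn.QuantizedLinear, or create a fresh fp16 model, call mlx.nn.quantize(model, group_size=64, bits=6) on it, and load the weights afterward. In the second case, restrict quantization to the Linear layers, because the embedding tables in this checkpoint are stored unquantized and mlx.nn.quantize would otherwise convert them as well.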
Caveats
- Vector space is incompatible with the fp16 base model. If you have an existing index built with fp16 mxbai, you must re-embed it before switching.
- Tested on a synthetic 200-query English retrieval set; before high-stakes production use, validate on your domain.
Attribution
Base model © Mixedbread AI, released under Apache 2.0. This quantization preserves the same license. See the original repository for model card, citation, and training details.