lift-MLX-oQ3

MLX conversion of datalab-to/lift — a 9B qwen3_5 vision-language model for structured extraction (PDF / image → schema-constrained JSON).

This repo is the oQ3 variant: data-driven, per-layer mixed-precision quantization (~3.5 bits/weight) produced with oMLX. It sits at the recommended low-bit floor and is the most aggressive variant in this set. Runs on Apple Silicon via mlx-vlm. Not affiliated with Datalab; this is a community conversion of their openly released weights.

Variants

| Repo | Method | bpw | Size | Peak RAM* | Gen speed* |
|---|---|---|---|---|---|
| lift-MLX-BF16 | full bf16 | 16 | 18 GB | 19.9 GB | 31 t/s |
| lift-MLX-oQ8 | oQ | ~8.6 | 9.7 GB | 12.3 GB | 58 t/s |
| lift-MLX-oQ6 | oQ | ~6 | 7.7 GB | 9.4 GB | 73 t/s |
| lift-MLX-oQ5 | oQ | ~5 | 6.7 GB | 8.4 GB | 83 t/s |
| lift-MLX-oQ4 | oQ | ~4.6 | 5.6 GB | 7.2 GB | 100 t/s |
| lift-MLX-oQ3.5 | oQ | ~4.0 | 4.9 GB | 6.5 GB | 109 t/s |
| lift-MLX-oQ3 (this repo) | oQ | ~3.5 | 4.6 GB | 6.2 GB | 119 t/s |

* Peak RAM and generation speed measured on a single-image invoice extraction on an Apple M5 Max (128 GB, 40-core GPU). Indicative, not a benchmark.
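As a rough guide to choosing among the variants, the table can be reduced to a lookup by memory budget. The sketch below copies the Peak RAM and speed figures from the table above. The `pick_variant` and `speedup_vs_bf16` helpers and the 2 GB headroom default are illustrative assumptions, not part of oMLX or mlx-vlm, and the underlying numbers are indicative single-task measurements.

```python
# Figures copied from the Variants table: (repo, peak RAM in GB, generation speed in t/s).
VARIANTS = [
    ("lift-MLX-BF16", 19.9, 31),
    ("lift-MLX-oQ8", 12.3, 58),
    ("lift-MLX-oQ6", 9.4, 73),
    ("lift-MLX-oQ5", 8.4, 83),
    ("lift-MLX-oQ4", 7.2, 100),
    ("lift-MLX-oQ3.5", 6.5, 109),
    ("lift-MLX-oQ3", 6.2, 119),
]


def pick_variant(free_ram_gb, headroom_gb=2.0):
    """Return the highest-precision variant whose peak RAM fits the budget.

    headroom_gb is an assumed safety margin for the OS and other apps.
    Returns None when even the smallest variant does not fit.
    """
    budget = free_ram_gb - headroom_gb
    # VARIANTS is ordered from highest to lowest precision, so the first fit wins.
    for repo, peak_ram_gb, _speed in VARIANTS:
        if peak_ram_gb <= budget:
            return repo
    return None


def speedup_vs_bf16(repo):
    """Generation speed of a variant relative to the BF16 baseline."""
    speeds = {name: speed for name, _ram, speed in VARIANTS}
    return speeds[repo] / speeds["lift-MLX-BF16"]


print(pick_variant(16))  # 16 GB free minus 2 GB headroom gives a 14 GB budget
print(round(speedup_vs_bf16("lift-MLX-oQ3"), 2))
```

Peak RAM here reflects a single-image invoice; larger or multiple images will likely need more, so treat the result as a starting point.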

Usage

Generate (CLI)

uvx --from mlx-vlm mlx_vlm.generate \
  --model gabfssilva/lift-MLX-oQ3 \
  --image invoice.png \
  --prompt "Extract the invoice as JSON." \
  --max-tokens 800

OpenAI-compatible server + structured outputs

lift is built for schema-constrained extraction. mlx_vlm.server enforces a JSON Schema at decode time (via llguidance), so a response that finishes within max_tokens is valid JSON that conforms to the schema.

uvx --from mlx-vlm mlx_vlm.server --model gabfssilva/lift-MLX-oQ3 --port 8080

Then query it with the OpenAI Python client:

import base64, json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
with open("invoice.png", "rb") as f:  # base64-encode the image for a data URL
    img = base64.b64encode(f.read()).decode()

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "object", "properties": {
            "description": {"type": "string"}, "amount": {"type": "number"}}}},
    },
    "required": ["invoice_number", "total"],
}

resp = client.chat.completions.create(
    model="gabfssilva/lift-MLX-oQ3",   # the server lists your whole HF cache — name the model explicitly
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract this invoice."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    ]}],
    response_format={"type": "json_schema", "json_schema": {"name": "invoice", "schema": schema}},
    temperature=0.0, max_tokens=800,
)
print(json.loads(resp.choices[0].message.content))
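Schema-constrained decoding shapes each token, but it cannot finish a document that hits max_tokens first, so a truncated reply may still fail to parse. A minimal sketch of a defensive parse follows. `parse_extraction` is a hypothetical helper, and it assumes the server reports the standard OpenAI `finish_reason` values (`"stop"`, `"length"`), which has not been verified here for mlx_vlm.server.

```python
import json


def parse_extraction(content, finish_reason, required=("invoice_number", "total")):
    """Parse a chat completion's JSON content, failing loudly on truncation.

    content and finish_reason come from resp.choices[0].message.content and
    resp.choices[0].finish_reason. The default required keys mirror the
    schema in the example above.
    """
    if finish_reason == "length":
        # Generation stopped at max_tokens, so the JSON is probably cut off.
        raise ValueError("response truncated; raise max_tokens and retry")
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data


# Stand-in values; with a live server, pass the fields from resp.choices[0].
print(parse_extraction('{"invoice_number": "INV-1", "total": 12.5}', "stop"))
```

Checking `finish_reason` before parsing keeps the error message specific: a cut-off response calls for a larger max_tokens, not a different prompt.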

Notes

  • eos fix applied. generation_config.json here sets eos_token_id: [248044, 248046]. Upstream sets only 248044, but the chat turn closes with <|im_end|> = 248046. Without the second id, MLX servers that read generation_config never stop at the end of the turn and flood the output with <|im_end|>. If you re-convert from the source, reapply this.
  • Quality. Upstream full-precision lift (9B) scores 90.2% field / 20.9% full-document on Datalab's 225-doc benchmark. Every variant in this set extracted a simple test invoice correctly; that only rules out collapse and does not rank the variants. Lower bit-widths may degrade on harder or adversarial documents, and these variants were not re-benchmarked at scale.
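If you re-convert from the source, the eos fix from the first note can be reapplied with a few lines. This is a minimal sketch: the `apply_eos_fix` helper is hypothetical, it assumes the converted model directory contains a `generation_config.json`, and the token ids are the ones quoted in the note.

```python
import json
from pathlib import Path

# Token ids quoted in the note above: the upstream eos id plus <|im_end|>.
EOS_TOKEN_IDS = [248044, 248046]


def apply_eos_fix(model_dir):
    """Set eos_token_id in generation_config.json so generation stops at <|im_end|>.

    Returns True if the file was changed, False if it already had the fix.
    """
    path = Path(model_dir) / "generation_config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("eos_token_id") == EOS_TOKEN_IDS:
        return False
    config["eos_token_id"] = EOS_TOKEN_IDS
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

Call it as `apply_eos_fix("path/to/converted-model")`, where the path is a placeholder for your local output directory. The function is idempotent, so running it on an already fixed directory leaves the file untouched.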

License

Code Apache-2.0; weights under a modified OpenRAIL-M (free for research, personal use, and startups under $5M; not for use competitive with Datalab's API). See the base model.
