lift-MLX-oQ3

MLX conversion of datalab-to/lift — a 9B qwen3_5 vision-language model for structured extraction (PDF / image → schema-constrained JSON).

This repo is the oQ3 variant: data-driven, per-layer mixed-precision quantization (~3.5 bits/weight) produced with oMLX. It sits at the recommended low-bit floor and is the most aggressive quantization in this set. Runs on Apple Silicon via mlx-vlm. Not affiliated with Datalab; this is a community conversion of their openly released weights.

Variants

Repo	Method	~Bits/weight	Size	Peak RAM*	Generation speed*
lift-MLX-BF16	full bf16	16	18 GB	19.9 GB	31 t/s
lift-MLX-oQ8	oQ	~8.6	9.7 GB	12.3 GB	58 t/s
lift-MLX-oQ6	oQ	~6	7.7 GB	9.4 GB	73 t/s
lift-MLX-oQ5	oQ	~5	6.7 GB	8.4 GB	83 t/s
lift-MLX-oQ4	oQ	~4.6	5.6 GB	7.2 GB	100 t/s
lift-MLX-oQ3.5	oQ	~4.0	4.9 GB	6.5 GB	109 t/s
lift-MLX-oQ3 (this repo)	oQ	~3.5	4.6 GB	6.2 GB	119 t/s

* Peak RAM and generation speed (t/s = tokens per second) were measured on a single-image invoice extraction on an Apple M5 Max (128 GB, 40-core GPU). Indicative, not a benchmark.
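
To make the trade-off in the table easier to read, the sketch below expresses each variant relative to the BF16 baseline. The numbers are copied from the table above; the function name `relative_to_baseline` and the rounding to two decimals are illustrative choices, and the ratios inherit the same "indicative, not a benchmark" caveat.

```python
# Rows copied from the variants table: (repo, size_gb, peak_ram_gb, tokens_per_s).
VARIANTS = [
    ("lift-MLX-BF16", 18.0, 19.9, 31),
    ("lift-MLX-oQ8", 9.7, 12.3, 58),
    ("lift-MLX-oQ6", 7.7, 9.4, 73),
    ("lift-MLX-oQ5", 6.7, 8.4, 83),
    ("lift-MLX-oQ4", 5.6, 7.2, 100),
    ("lift-MLX-oQ3.5", 4.9, 6.5, 109),
    ("lift-MLX-oQ3", 4.6, 6.2, 119),
]


def relative_to_baseline(rows, baseline="lift-MLX-BF16"):
    """Return size, peak-RAM, and speed ratios of each row versus the baseline row."""
    base = next(row for row in rows if row[0] == baseline)
    ratios = {}
    for name, size_gb, ram_gb, tps in rows:
        ratios[name] = {
            "size_ratio": round(size_gb / base[1], 2),  # below 1.0 means smaller
            "ram_ratio": round(ram_gb / base[2], 2),    # below 1.0 means less RAM
            "speedup": round(tps / base[3], 2),         # above 1.0 means faster
        }
    return ratios


if __name__ == "__main__":
    for name, r in relative_to_baseline(VARIANTS).items():
        print(f"{name}: size x{r['size_ratio']}, RAM x{r['ram_ratio']}, speed x{r['speedup']}")
```

On these figures, oQ3 is roughly a quarter of the BF16 size and generates a little under four times as fast on the single test image.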

Usage

Generate (CLI)

uvx --from mlx-vlm mlx_vlm.generate \
  --model gabfssilva/lift-MLX-oQ3 \
  --image invoice.png \
  --prompt "Extract the invoice as JSON." \
  --max-tokens 800

OpenAI-compatible server + structured outputs

lift is built for schema-constrained extraction. mlx_vlm.server enforces a JSON Schema at decode time (via llguidance), so output is guaranteed valid and well-typed.

uvx --from mlx-vlm mlx_vlm.server --model gabfssilva/lift-MLX-oQ3 --port 8080

import base64, json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
# Read the image and base64-encode it for the data: URL below.
with open("invoice.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "object", "properties": {
            "description": {"type": "string"}, "amount": {"type": "number"}}}},
    },
    "required": ["invoice_number", "total"],
}

resp = client.chat.completions.create(
    model="gabfssilva/lift-MLX-oQ3",   # the server lists your whole HF cache — name the model explicitly
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract this invoice."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    ]}],
    response_format={"type": "json_schema", "json_schema": {"name": "invoice", "schema": schema}},
    temperature=0.0, max_tokens=800,
)
print(json.loads(resp.choices[0].message.content))
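
A cheap defensive check on the client side can still be useful, for example if generation stops at max_tokens before the JSON object is closed. The sketch below is one way to do that with the standard library only; `parse_invoice` and `REQUIRED` are hypothetical names, and the assumption that a response cut off by max_tokens can be incomplete is mine, not something this card states.

```python
import json

# Required fields and their expected Python types, mirroring the example schema above.
REQUIRED = {"invoice_number": str, "total": (int, float)}


def parse_invoice(content, finish_reason="stop"):
    """Parse model output and return (data, problems); data is None if unusable."""
    problems = []
    if finish_reason == "length":
        # Assumption: hitting max_tokens may cut the JSON off mid-object.
        problems.append("generation hit max_tokens; JSON may be truncated")
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        return None, problems + [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, problems + ["top-level value is not an object"]
    for key, expected in REQUIRED.items():
        if key not in data:
            problems.append(f"missing required field: {key}")
        elif isinstance(data[key], bool) or not isinstance(data[key], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"unexpected type for field: {key}")
    return data, problems
```

With the client code above, this could be called as `parse_invoice(resp.choices[0].message.content, resp.choices[0].finish_reason)`.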

Notes

eos fix applied. generation_config.json here sets eos_token_id: [248044, 248046]. Upstream sets only 248044, but the chat turn closes with <|im_end|> (token 248046); without the second ID, MLX servers that read generation_config.json never stop and keep emitting <|im_end|>. If you re-convert from the source, reapply this fix.
Quality. Upstream full-precision (FP) lift (9B) scores 90.2% field / 20.9% full-document on Datalab's 225-doc benchmark. Every variant in this set extracted a simple test invoice correctly; that only rules out collapse and does not rank the variants. Lower bit-widths may degrade on harder or adversarial documents, and these conversions were not re-benchmarked at scale.
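
For anyone re-converting from the source, the eos fix described in the notes above can be reapplied with a few lines of standard-library Python. This is a minimal sketch, not the script used for this repo: `apply_eos_fix` is a hypothetical helper, and it assumes generation_config.json sits at the top level of the converted model directory.

```python
import json
from pathlib import Path

# Token IDs taken from the note above: upstream EOS plus <|im_end|>.
EOS_TOKEN_IDS = [248044, 248046]


def apply_eos_fix(model_dir):
    """Ensure generation_config.json lists both EOS token IDs; return the final list."""
    path = Path(model_dir) / "generation_config.json"
    config = json.loads(path.read_text())

    current = config.get("eos_token_id")
    if current is None:
        current = []
    elif isinstance(current, int):
        current = [current]  # upstream stores a single ID

    # Keep existing IDs in order and append any that are missing.
    merged = list(current) + [t for t in EOS_TOKEN_IDS if t not in current]
    config["eos_token_id"] = merged
    path.write_text(json.dumps(config, indent=2) + "\n")
    return merged
```

The helper is idempotent, so running it on a directory that already has the fix leaves the list unchanged.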

License

Code Apache-2.0; weights under a modified OpenRAIL-M (free for research, personal use, and startups under $5M; not for use competitive with Datalab's API). See the base model.
