
Libraries

How to use Reza2kn/Lance-3B-und-CoreML-palettized-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Reza2kn/Lance-3B-und-CoreML-palettized-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Reza2kn/Lance-3B-und-CoreML-palettized-4bit")
model = AutoModelForCausalLM.from_pretrained("Reza2kn/Lance-3B-und-CoreML-palettized-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Reza2kn/Lance-3B-und-CoreML-palettized-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Reza2kn/Lance-3B-und-CoreML-palettized-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Reza2kn/Lance-3B-und-CoreML-palettized-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Reza2kn/Lance-3B-und-CoreML-palettized-4bit

SGLang

How to use Reza2kn/Lance-3B-und-CoreML-palettized-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Reza2kn/Lance-3B-und-CoreML-palettized-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Reza2kn/Lance-3B-und-CoreML-palettized-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Reza2kn/Lance-3B-und-CoreML-palettized-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Reza2kn/Lance-3B-und-CoreML-palettized-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Reza2kn/Lance-3B-und-CoreML-palettized-4bit with Docker Model Runner:
```
docker model run hf.co/Reza2kn/Lance-3B-und-CoreML-palettized-4bit
```

Lance LLM (understanding path) — 4-bit kmeans palettized (CoreML-ready)

4-bit per-grouped-channel k-means palettization of the understanding-path LLM extracted from bytedance-research/Lance, via coremltools.optimize.torch.palettization.PostTrainingPalettizer.

Each Linear weight is clustered with k-means into 16 codes (2^4) per group (granularity="per_grouped_channel", group_size=32). The codes and LUT are then dequantized back to fp16 for storage, so the safetensors checkpoint loads as a normal Hugging Face model with the numerical quality of a 4-bit palettized checkpoint. This is useful for:

Quality probing: see how 4-bit kmeans palettization affects outputs without writing a custom CoreML pipeline
CoreML deployment: the same numerical scheme is what coremltools.optimize.coreml.OpPalettizerConfig(nbits=4, mode="kmeans", granularity="per_grouped_channel", group_size=32) produces inside a .mlpackage. A custom converter that traces this model into CoreML will get the same weights losslessly compressed back to 4-bit on disk.
Apple Neural Engine targeting: the kmeans LUT scheme is ANE-friendly; weight decode is hardware-accelerated.
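
As a rough illustration of the palettize-then-dequantize scheme described above, here is a minimal NumPy sketch. It is not the coremltools implementation: `kmeans_1d` and `palettize_dequantize` are hypothetical helpers, and the sketch assumes that a "group" means 32 consecutive output channels (rows) sharing one 16-entry LUT.

```python
import numpy as np


def kmeans_1d(values, n_codes=16, n_iters=20):
    """Plain Lloyd's k-means on a 1-D array; returns (lut, indices)."""
    # Initialise the codebook from evenly spaced quantiles of the data.
    lut = np.quantile(values, np.linspace(0.0, 1.0, n_codes))
    for _ in range(n_iters):
        # Assign every value to its nearest code.
        idx = np.abs(values[:, None] - lut[None, :]).argmin(axis=1)
        # Move each code to the mean of the values assigned to it.
        for k in range(n_codes):
            members = values[idx == k]
            if members.size:
                lut[k] = members.mean()
    # Final assignment against the updated codebook.
    idx = np.abs(values[:, None] - lut[None, :]).argmin(axis=1)
    return lut, idx.astype(np.uint8)


def palettize_dequantize(weight, nbits=4, group_size=32):
    """Palettize a 2-D weight per group of rows, then dequantize to fp16.

    Assumption: each block of `group_size` rows shares one LUT.
    """
    out = np.empty_like(weight, dtype=np.float16)
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]
        lut, idx = kmeans_1d(block.ravel().astype(np.float32), n_codes=2 ** nbits)
        # Look every index up in the LUT and store the result as fp16.
        out[start:start + group_size] = lut[idx].reshape(block.shape).astype(np.float16)
    return out


# Synthetic stand-in for a Linear weight (64 output rows, 128 inputs).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
w_q = palettize_dequantize(w)
print(w_q.dtype, w_q.shape)
```

The result has the shape of an ordinary fp16 weight, but each 32-row group contains at most 16 distinct values, which is why a later 4-bit palettization of the same tensor can in principle recover the same codes.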

Why fp16 storage instead of true 4-bit on disk

Storing actual 4-bit indices plus a per-group LUT requires a custom on-disk format that no standard runtime (transformers, MLX) reads directly. The CoreML .mlpackage is that format, but producing it requires tracing the model through coremltools, which currently fails on torch ops it does not implement in the mask construction of modern Qwen2 implementations (bitwise_or_, and _int on multi-dimensional tensors).

So this checkpoint ships the dequantized fp16 weights for drop-in usability, with the same quality as a true 4-bit deployment. Total size: ~6 GB, versus 6.8 GB for the bf16 source. The two are roughly the same because both store 2 bytes per weight on disk; the difference is in the effective precision of the values.
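
The size argument can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 3-billion weight count is the nominal "3B" figure, and `weights_per_lut` (32 channels times a hypothetical 2048 inputs per channel) is an assumption, not a value taken from the checkpoint.

```python
def fp16_bytes(n_weights):
    """Dense fp16 or bf16 storage: 2 bytes per weight."""
    return 2 * n_weights


def palettized_bytes(n_weights, nbits=4, weights_per_lut=32 * 2048, lut_entry_bytes=2):
    """4-bit indices plus one small LUT per group.

    weights_per_lut is an assumed group footprint (hypothetical value).
    """
    index_bytes = n_weights * nbits / 8
    n_luts = n_weights / weights_per_lut
    lut_bytes = n_luts * (2 ** nbits) * lut_entry_bytes
    return index_bytes + lut_bytes


n = 3_000_000_000  # nominal parameter count, an approximation
print(f"fp16:       {fp16_bytes(n) / 1e9:.2f} GB")
print(f"4-bit + LUT: {palettized_bytes(n) / 1e9:.2f} GB")
```

Under these assumptions the LUT overhead is negligible next to the indices, so a true 4-bit file would be roughly a quarter of the fp16 size.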

If you want true 4-bit on-disk storage for the same Lance LLM, use the MLX siblings:

Reza2kn/Lance-3B-und-MLX-4bit (~1.6 GB; runs on the Metal GPU, not the ANE)
Reza2kn/Lance-3B-und-MLX-4bit-DWQ (~1.6 GB + distilled scales)

Companion: full Lance multimodal pipeline

This checkpoint is the understanding path only. Image/video generation lives in the _moe_gen expert path, which isn't extracted here. For full multimodal inference, use:

Reza2kn/Lance-3B-AWQ-INT4 — image, AWQ INT4, 4.2 GB
Reza2kn/Lance-3B-Video-AWQ-INT4 — video, AWQ INT4, 6.0 GB
Reza2kn/Lance-3B-NVFP4 — image, NVFP4 (Blackwell), 5.1 GB
Reza2kn/Lance-3B-Video-NVFP4 — video, NVFP4, 6.9 GB

Reproduction

# scripts/palettize_weights_coreml.py from https://github.com/Reza2kn/lance-quant
python palettize_weights_coreml.py \
    --hf-path Lance_3B-und-qwen \
    --out Lance_3B-und-CoreML-palettized-4bit \
    --nbits 4 --group_size 32
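
After reproducing the checkpoint, one way to sanity-check the output is to count distinct values per group in a weight matrix. The helper below is a hypothetical sketch, not part of the lance-quant scripts. It assumes groups are 32 consecutive rows and runs on a synthetic array here; loading real tensors from the safetensors files is left out.

```python
import numpy as np


def max_codes_per_group(weight, group_size=32):
    """Return the largest number of distinct values found in any group of rows."""
    counts = []
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]
        counts.append(np.unique(block).size)
    return max(counts)


# Synthetic stand-in for a palettized-then-dequantized weight:
# every entry is drawn from a single 16-entry LUT.
rng = np.random.default_rng(0)
lut = rng.normal(size=16).astype(np.float16)
fake_weight = lut[rng.integers(0, 16, size=(64, 128))]

print(max_codes_per_group(fake_weight) <= 2 ** 4)
```

A 4-bit palettized tensor should report at most 16 distinct values per group, while an unquantized weight will typically report far more.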

License

Apache 2.0, inherited from the base model.


Safetensors

Model size: 3B params
Tensor type: I64, F16

Model tree for Reza2kn/Lance-3B-und-CoreML-palettized-4bit

Base model: Qwen/Qwen2.5-VL-3B-Instruct
Finetuned: bytedance-research/Lance
Quantized (16): this model