Lance LLM (understanding path): 4-bit k-means palettized (CoreML-ready)
4-bit per-grouped-channel k-means palettization of the understanding-path LLM extracted from bytedance-research/Lance, via coremltools.optimize.torch.palettization.PostTrainingPalettizer.
Each Linear weight is clustered with k-means to 16 codes per group (group_size=32, granularity=per_grouped_channel). The codes and LUT are then dequantized back to fp16 for storage, so this safetensors checkpoint loads as a normal Hugging Face model with the numerical quality of a 4-bit palettized checkpoint. This is useful for:
- Quality probing: see how 4-bit kmeans palettization affects outputs without writing a custom CoreML pipeline
- CoreML deployment: this is the same numerical scheme that coremltools.optimize.coreml.OpPalettizerConfig(nbits=4, mode="kmeans", granularity="per_grouped_channel", group_size=32) produces inside a .mlpackage. A custom converter that traces this model into CoreML will get the same weights losslessly compressed back to 4-bit on disk.
- Apple Neural Engine targeting: the k-means LUT scheme is ANE-friendly; weight decode is hardware-accelerated.
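The palettize-then-dequantize idea described above can be illustrated with a small NumPy sketch. This is not the coremltools implementation: it assumes plain Lloyd's k-means, assumes groups are formed along the output-channel axis, and uses a random toy matrix in place of a real Linear weight. The helper names (`kmeans_1d`, `palettize_dequantize`, `recover_codes`) are hypothetical.

```python
import numpy as np

def kmeans_1d(values, k=16, iters=25):
    """Plain Lloyd's k-means on a 1-D array; returns (centroids, assignments)."""
    # Start from evenly spaced quantiles so every centroid sits inside the data range.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:  # leave empty clusters where they are
                centroids[c] = members.mean()
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx

def palettize_dequantize(weight, nbits=4, group_size=32):
    """Cluster each group of `group_size` output channels to 2**nbits codes,
    then expand codes + LUT back to a dense fp16 matrix."""
    out = np.empty(weight.shape, dtype=np.float16)
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]
        lut, idx = kmeans_1d(block.ravel().astype(np.float64), k=2 ** nbits)
        out[start:start + group_size] = lut.astype(np.float16)[idx].reshape(block.shape)
    return out

def recover_codes(block_fp16):
    """Rebuild (LUT, indices) from a dequantized group without re-clustering."""
    lut, idx = np.unique(block_fp16.ravel(), return_inverse=True)
    return lut, idx

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # toy stand-in for a Linear weight
w_pal = palettize_dequantize(w)

# Each dequantized group holds at most 16 distinct fp16 values, so the
# LUT and indices can be read back exactly.
lut, idx = recover_codes(w_pal[:32])
print(len(lut), np.array_equal(lut[idx].reshape(32, 128), w_pal[:32]))
```

Because every group in the dequantized matrix contains at most 16 distinct values, a later 4-bit palettization pass has nothing left to approximate, which is the sense in which the compression back to 4-bit can be lossless.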
Why fp16 storage instead of true 4-bit on disk
Compressing to actual 4-bit indices plus a per-group LUT requires a custom on-disk format that no standard runtime (transformers, MLX) reads directly. The CoreML .mlpackage is that custom format, but producing it requires tracing the model through coremltools, which currently hits unimplemented torch ops in modern Qwen2's mask construction (bitwise_or_, _int of multi-dim tensors).
So this checkpoint ships the dequantized fp16 weights for drop-in usability, with the same quality as a true 4-bit deployment. Total size: ~6 GB (vs. 6.8 GB for the bf16 source). The two are roughly the same because both store 2 bytes per weight on disk; the difference is in the effective precision of the values.
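To make the storage trade-off concrete, here is a rough back-of-the-envelope calculation for a single Linear layer. The 2048 x 2048 shape is hypothetical, and the sketch assumes one 16-entry fp16 LUT per group of 32 output channels with tightly packed 4-bit indices; container metadata is ignored.

```python
def storage_bytes(out_features, in_features, nbits=4, group_size=32):
    """Return (dequantized fp16 bytes, packed palettized bytes) for one weight matrix."""
    n = out_features * in_features
    fp16_bytes = 2 * n                          # dequantized storage: 2 bytes per weight
    index_bytes = n * nbits // 8                # packed nbits-wide indices
    n_groups = -(-out_features // group_size)   # ceil division
    lut_bytes = n_groups * (2 ** nbits) * 2     # 2**nbits fp16 entries per group
    return fp16_bytes, index_bytes + lut_bytes

fp16_b, pal_b = storage_bytes(2048, 2048)  # hypothetical layer shape
print(fp16_b, pal_b, round(fp16_b / pal_b, 2))
```

Under these assumptions the LUT overhead is tiny compared with the indices, so a true 4-bit layout is close to a quarter of the fp16 size.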
If you want true 4-bit on-disk storage for the same Lance LLM, use the MLX siblings:
- Reza2kn/Lance-3B-und-MLX-4bit (~1.6 GB, ANE not used; Metal GPU)
- Reza2kn/Lance-3B-und-MLX-4bit-DWQ (~1.6 GB + distilled scales)
Companion: full Lance multimodal pipeline
This checkpoint is the understanding path only; image/video generation lives in the _moe_gen expert path, which isn't extracted here. For full multimodal inference, use:
- Reza2kn/Lance-3B-AWQ-INT4: image, AWQ INT4, 4.2 GB
- Reza2kn/Lance-3B-Video-AWQ-INT4: video, AWQ INT4, 6.0 GB
- Reza2kn/Lance-3B-NVFP4: image, NVFP4 (Blackwell), 5.1 GB
- Reza2kn/Lance-3B-Video-NVFP4: video, NVFP4, 6.9 GB
Reproduction
# scripts/palettize_weights_coreml.py from https://github.com/Reza2kn/lance-quant
python palettize_weights_coreml.py \
--hf-path Lance_3B-und-qwen \
--out Lance_3B-und-CoreML-palettized-4bit \
--nbits 4 --group_size 32
License
Apache 2.0, inherited from the base model.