
Huihui-Step3-VL-10B-abliterated MLX

MLX implementation of Huihui-Step3-VL-10B-abliterated, a vision-language model combining Qwen3-8B with the Step3 vision encoder.

Model Architecture

  • LLM Backbone: Qwen3-8B-Instruct (bf16)
  • Vision Encoder: Step3 ViT (47 layers, 1536 hidden dim, 12 heads, patch size 14)
  • Projector: MLP (1536 -> 4096 -> 4096) with GELU
  • Image Special Tokens: <im_start>, <im_patch>, <im_end>
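The dimensions listed above can be collected into a small config object. This is a hypothetical sketch: the `Step3VLDims` name and its fields are illustrative and are not this repo's API. It derives the ViT head dimension (1536 / 12 = 128) and the projector's parameter count, assuming both projector layers are plain linear layers with biases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step3VLDims:
    """Hypothetical summary of the dimensions listed under Model Architecture."""

    vit_layers: int = 47
    vit_hidden: int = 1536
    vit_heads: int = 12
    patch_size: int = 14
    llm_hidden: int = 4096  # Qwen3-8B hidden size

    @property
    def vit_head_dim(self) -> int:
        # Each attention head gets an equal slice of the ViT hidden width.
        if self.vit_hidden % self.vit_heads:
            raise ValueError("vit_hidden must be divisible by vit_heads")
        return self.vit_hidden // self.vit_heads

    def projector_params(self, bias: bool = True) -> int:
        # Assumed layout: Linear(1536 -> 4096), GELU, Linear(4096 -> 4096).
        # GELU has no parameters; each linear layer has in*out weights (+ out biases).
        layers = [(self.vit_hidden, self.llm_hidden), (self.llm_hidden, self.llm_hidden)]
        return sum(i * o + (o if bias else 0) for i, o in layers)


dims = Step3VLDims()
print(dims.vit_head_dim)        # 128
print(dims.projector_params())  # 23076864
```

Under these assumptions the projector adds roughly 23M parameters, which is small next to the 8B-parameter LLM.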

Installation

pip install -r requirements.txt

Usage

Basic Generation

from mlx_lm import load as mlx_load
from model import HuihuiStep3VL

# Load Qwen3-8B
model, tokenizer = mlx_load("mlx-community/Qwen3-8B-Instruct-bf16")

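# Create the VL model: Step3 vision encoder + projector on top of the LLM.
# vision_hidden is the ViT width (1536); llm_hidden is Qwen3-8B's hidden size (4096).
vl_model = HuihuiStep3VL(
    llm_model=model,
    vision_hidden=1536,
    llm_hidden=4096,
)

# Generate with an image. `image_tensor` (the preprocessed image) and
# `prompt_tokens` (the tokenized prompt) must be prepared beforehand;
# see sample.py for an end-to-end example.
response = vl_model.generate(
    images=image_tensor,
    prompt_tokens=prompt_tokens,
    max_tokens=256,
)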

With Base64 Image

from sample import generate_response

# `base64_encoded_image` holds the base64-encoded image data.
response = generate_response(
    model=vl_model,
    tokenizer=tokenizer,
    image_base64=base64_encoded_image,
    prompt="Describe this image.",
)
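If you start from an image file on disk, Python's standard `base64` module is enough to produce the encoded string. This is a minimal sketch: the helper names `encode_image_bytes` and `encode_image_file` are hypothetical, and it assumes `generate_response` accepts a plain base64 `str` (not bytes and not a `data:` URI), so check sample.py for the exact input format.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Return raw image bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path: str) -> str:
    """Hypothetical helper: read an image file and base64-encode its contents."""
    return encode_image_bytes(Path(path).read_bytes())


# Usage (the path is a placeholder):
# base64_encoded_image = encode_image_file("example.jpg")
```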

Chat Format

from sample import generate_with_chat_messages

messages = [
    {"role": "user", "content": "What do you see in this image?"}
]

response = generate_with_chat_messages(
    model=vl_model,
    tokenizer=tokenizer,
    messages=messages,
    image=base64_image,  # base64-encoded image data, as in the previous example
)
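Qwen3 uses the ChatML layout for conversations: each turn is wrapped in `<|im_start|>{role}` and `<|im_end|>` (these are chat-turn markers, distinct from the image tokens), and generation continues from an open assistant turn. The sketch below renders a `messages` list that way purely to show the shape of the prompt. `render_chatml` is a hypothetical name; in practice use `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`, since Qwen3's real template may add more (for example, thinking markers). Where `generate_with_chat_messages` places the image tokens is not covered here.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Illustrative ChatML rendering; not a replacement for apply_chat_template."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What do you see in this image?"}]
print(render_chatml(messages))
```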

Files

  • model.py - Model definition (VisionEncoder, ImageProjector, HuihuiStep3VL)
  • loader.py - Weight loading utilities
  • tokenizer.py - Tokenizer with Step3 special tokens
  • sample.py - Sample inference scripts
  • convert.py - Weight conversion and hub push script

Conversion

To convert the weights and push them to the Hugging Face Hub:

python convert.py

Notes

  • The abliteration bias fix (1.3K vector subtraction) is already baked into the original weights; it does not need to be applied again.
  • Image tokens: <im_start> + N×<im_patch> + <im_end> where N = (H/patch_size) × (W/patch_size)
  • For 224×224 images with patch_size=14: N = 16×16 = 256 patches
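The token arithmetic in the notes can be written out directly. This is a minimal sketch using the token spellings from the note above; the function names are hypothetical, and it assumes H and W are exact multiples of the patch size (the notes do not say how other sizes are handled, so this version rejects them).

```python
from typing import List


def num_image_patches(height: int, width: int, patch_size: int = 14) -> int:
    """N = (H / patch_size) * (W / patch_size), requiring exact division."""
    if height % patch_size or width % patch_size:
        raise ValueError(f"{height}x{width} is not a multiple of patch_size={patch_size}")
    return (height // patch_size) * (width // patch_size)


def image_token_sequence(height: int, width: int, patch_size: int = 14) -> List[str]:
    """Placeholder tokens for one image: <im_start>, N x <im_patch>, <im_end>."""
    n = num_image_patches(height, width, patch_size)
    return ["<im_start>"] + ["<im_patch>"] * n + ["<im_end>"]


tokens = image_token_sequence(224, 224)
print(num_image_patches(224, 224))  # 256
print(len(tokens))                  # 258 (256 patches + start/end)
```

Because N grows with H×W, a 448×448 image would need 32×32 = 1,024 patch tokens, four times as many as a 224×224 image.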

License

See the original model repository for license information.
