monyschuk/Huihui-Step3-VL-10B-abliterated-mlx

Huihui-Step3-VL-10B-abliterated MLX

MLX implementation of Huihui-Step3-VL-10B-abliterated, a vision-language model combining Qwen3-8B with the Step3 vision encoder.

Model Architecture

LLM Backbone: Qwen3-8B-Instruct (bf16)
Vision Encoder: Step3 ViT (47 layers, 1536 hidden dim, 12 heads, patch size 14)
Projector: MLP (1536 -> 4096 -> 4096) with GELU
Special Tokens: <|im_start|>, <|im_patch|>, <|im_end|>
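
As a rough sanity check on the dimensions listed above, the sketch below computes the per-head width of the vision encoder and a parameter count for the projector. It assumes the projector is two plain linear layers with biases and that GELU contributes no parameters; neither detail is confirmed by this card, and the function name `projector_param_count` is hypothetical.

```python
def projector_param_count(dims=(1536, 4096, 4096), bias=True):
    """Count parameters of an MLP whose layer widths are given by dims.

    Assumes plain linear layers; bias=True is an assumption, not a
    confirmed detail of the actual projector.
    """
    total = 0
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        total += d_in * d_out  # weight matrix
        if bias:
            total += d_out     # bias vector
    return total


# Per-head width of the vision encoder: 1536 hidden dim split over 12 heads.
head_dim = 1536 // 12

print(head_dim)                           # 128
print(projector_param_count())            # 23076864
print(projector_param_count(bias=False))  # 23068672
```

Under these assumptions the projector adds roughly 23 million parameters on top of the vision encoder and the LLM backbone.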

Installation

pip install -r requirements.txt

Usage

Basic Generation

from mlx_lm import load as mlx_load
from model import HuihuiStep3VL

# Load Qwen3-8B
model, tokenizer = mlx_load("mlx-community/Qwen3-8B-Instruct-bf16")

# Create VL model
vl_model = HuihuiStep3VL(
    llm_model=model,
    vision_hidden=1536,
    llm_hidden=4096,
)

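# Generate with an image.
# image_tensor and prompt_tokens are not defined in this snippet;
# prepare them beforehand (image input and tokenized prompt).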
response = vl_model.generate(
    images=image_tensor,
    prompt_tokens=prompt_tokens,
    max_tokens=256,
)

With Base64 Image

from sample import generate_response

response = generate_response(
    model=vl_model,
    tokenizer=tokenizer,
    image_base64=base64_encoded_image,
    prompt="Describe this image.",
)
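
The `base64_encoded_image` value above has to be produced by the caller. A minimal sketch using only the standard library follows. The helper name `encode_image_base64` and the file path are hypothetical, and the sketch assumes `generate_response` accepts a plain base64 string without a `data:` URI prefix, which this card does not specify.

```python
import base64
from pathlib import Path


def encode_image_base64(path):
    """Read an image file and return its bytes as an ASCII base64 string."""
    raw = Path(path).read_bytes()
    return base64.b64encode(raw).decode("ascii")


# Example usage (hypothetical path):
# base64_encoded_image = encode_image_base64("photo.jpg")
```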

Chat Format

from sample import generate_with_chat_messages

messages = [
    {"role": "user", "content": "What do you see in this image?"}
]

response = generate_with_chat_messages(
    model=vl_model,
    tokenizer=tokenizer,
    messages=messages,
    image=base64_image,
)

Files

model.py - Model definition (VisionEncoder, ImageProjector, HuihuiStep3VL)
loader.py - Weight loading utilities
tokenizer.py - Tokenizer with Step3 special tokens
sample.py - Sample inference scripts
convert.py - Weight conversion and hub push script

Conversion

To convert the weights and push them to the Hugging Face Hub:

python convert.py

Notes

The abliterated bias fix (1.3K vector subtraction) is baked into the original weights
Image tokens: <im_start> + N×<im_patch> + <im_end> where N = (H/patch_size) × (W/patch_size)
For 224×224 images with patch_size=14: N = 16×16 = 256 patches
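
The token-count rule in the notes above can be sketched as a small helper. The function names are hypothetical, and the sketch assumes image height and width are exact multiples of the patch size; how the actual preprocessing handles other sizes is not stated here.

```python
def num_image_patches(height, width, patch_size=14):
    """Return N = (H / patch_size) * (W / patch_size)."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def image_token_sequence(height, width, patch_size=14):
    """Build the placeholder layout <im_start> + N x <im_patch> + <im_end>."""
    n = num_image_patches(height, width, patch_size)
    return ["<im_start>"] + ["<im_patch>"] * n + ["<im_end>"]


tokens = image_token_sequence(224, 224)
print(num_image_patches(224, 224))  # 256
print(len(tokens))                  # 258 (256 patches plus start and end)
```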

License

See original model repository for license information.
