Codestral-ViT

A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it pairs CLIP's visual encoder with Codestral's code generation capabilities.

Overview

Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can:

  • Generate code from text descriptions
  • Understand and explain code from screenshots
  • Suggest improvements to code based on visual context
  • Process multiple images with advanced tiling strategies

Technical Details

  • Base Models:

    • Language: Codestral-22B (4-bit quantized)
    • Vision: CLIP ViT-Large/14
    • Framework: MLX (Apple Silicon)
  • Architecture:

    • Vision encoder processes images into 512-dim embeddings
    • Learned projection layer maps vision features to language space (a sketch follows this list)
    • Dynamic RoPE scaling for 32K context window (a scaling sketch also follows this list)
    • Support for overlapping image crops and tiling
  • Input Processing:

    • Images: 224x224 pixels, CLIP normalization (a preprocessing sketch follows the usage example below)
    • Text: Up to 32,768 tokens
    • Special tokens for image-text fusion
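
The learned projection is the glue between the two base models. Below is a minimal sketch of what such a layer can look like in MLX; the 6144-dim language hidden size is an assumption for illustration, not a confirmed detail of this model.

import mlx.core as mx
import mlx.nn as nn

class VisionProjection(nn.Module):
    """Maps CLIP image embeddings into the language model's hidden space."""

    def __init__(self, vision_dim: int = 512, language_dim: int = 6144):
        super().__init__()
        # language_dim is an assumed hidden size, used for illustration only
        self.proj = nn.Linear(vision_dim, language_dim)

    def __call__(self, image_embeddings: mx.array) -> mx.array:
        # (num_images, 512) -> (num_images, language_dim)
        return self.proj(image_embeddings)

# Example: project a batch of three image embeddings at once
features = VisionProjection()(mx.random.normal((3, 512)))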

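The card does not say which scheme "dynamic RoPE scaling" refers to; one common approach is NTK-aware base scaling, sketched here with illustrative constants (trained length, base, and head dimension are all assumptions).

def dynamic_ntk_base(seq_len, trained_len=4096, base=10000.0, head_dim=128):
    # Inside the trained window, leave the rotary base untouched.
    if seq_len <= trained_len:
        return base
    # Beyond it, stretch the base so rotary frequencies cover the longer
    # range; scaling the base rather than positions keeps low frequencies
    # (which encode coarse order) close to what the model saw in training.
    scale = seq_len / trained_len
    return base * scale ** (head_dim / (head_dim - 2))
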
Example Usage

from PIL import Image
from src.model import MultimodalCodestral

model = MultimodalCodestral()

# Code generation from screenshot
image = Image.open("code_screenshot.png")
response = model.generate_with_images(
    prompt="Explain this code and suggest improvements",
    images=[image]
)

# Multiple image processing
images = [Image.open(f) for f in ["img1.png", "img2.png"]]
response = model.generate_with_images(
    prompt="Compare these code implementations",
    images=images
)
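
Under the hood, each input image is resized and normalized before it reaches the vision encoder. Here is a minimal sketch of CLIP-style preprocessing with PIL and NumPy; the resize-then-center-crop policy is an assumption, while the mean and std are CLIP's published normalization constants.

import numpy as np
from PIL import Image

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image, size=224):
    # Resize the short side to `size`, then center-crop to size x size.
    w, h = image.size
    scale = size / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = image.size
    left, top = (w - size) // 2, (h - size) // 2
    image = image.crop((left, top, left + size, top + size))
    # Scale to [0, 1] and apply CLIP's channel-wise normalization.
    pixels = np.asarray(image.convert("RGB"), dtype=np.float32) / 255.0
    return (pixels - CLIP_MEAN) / CLIP_STD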

Capabilities

  • Code Understanding:

    • Analyzes code structure from screenshots
    • Identifies patterns and anti-patterns
    • Suggests contextual improvements
  • Image Processing:

    • Handles multiple image inputs
    • Supports various image formats
    • Advanced crop and resize strategies (a tiling sketch follows this list)
  • Generation Features:

    • Context-aware code completion
    • Documentation generation
    • Code refactoring suggestions
    • Bug identification and fixes
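
For images larger than the 224x224 encoder input, overlapping tiling preserves detail that a single resize would lose. Below is a minimal sketch of one such strategy; the tile size and overlap are illustrative, and the repository's actual crop logic may differ.

from PIL import Image

def tile_positions(length, tile, stride):
    # Window offsets along one axis, clamped so the final tile ends
    # exactly at the image border instead of being dropped.
    positions = list(range(0, max(length - tile, 0) + 1, stride))
    if positions[-1] + tile < length:
        positions.append(length - tile)
    return positions

def overlapping_tiles(image, tile=224, overlap=32):
    # Images smaller than the tile yield a single crop, which PIL
    # zero-pads past the border.
    stride = tile - overlap
    w, h = image.size
    return [
        image.crop((x, y, x + tile, y + tile))
        for y in tile_positions(h, tile, stride)
        for x in tile_positions(w, tile, stride)
    ]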

Requirements

  • Apple Silicon hardware (M1/M2/M3)
  • 32GB+ RAM recommended
  • MLX framework
  • Python 3.8+

Limitations

  • Apple Silicon only (no CPU/CUDA support)
  • Memory intensive for large images/codebases
  • Visual understanding bounded by CLIP's capabilities
  • Generation quality depends on input clarity

License

This model is released under the Mistral AI Non-Production License (MNPL); see the license file for details.

Citation

@software{codestral-vit,
  author = {Mike Casale},
  title = {Codestral-ViT: A Vision-Language Model for Code Generation},
  year = {2023},
  publisher = {Hugging Face},
  url = {https://huggingface.co/casale-xyz/codestral-vit}
}