
NEW - I exported and added mmproj-BF16.gguf (the vision projector) to properly support llama.cpp, ollama, and LM Studio.

Devstral-Vision-Small-2507 GGUF

Quantized GGUF versions of cognitivecomputations/Devstral-Vision-Small-2507 - the multimodal coding specialist that combines Devstral's exceptional coding abilities with vision understanding.

Model Description

This is the first vision-enabled version of Devstral, created by transplanting Devstral's language model weights into Mistral-Small-3.2's multimodal architecture. It enables:

  • Converting UI screenshots to code
  • Debugging visual rendering issues
  • Implementing designs from mockups
  • Understanding codebases with visual context

Quantization Selection Guide

| Quantization | Size | Min RAM | Recommended For | Quality | Notes |
|--------------|------|---------|-----------------|---------|-------|
| Q8_0 | 23GB | 24GB | RTX 3090/4090/A6000 users wanting maximum quality | ★★★★★ | Near-lossless, best for production use |
| Q6_K | 18GB | 20GB | High-end GPUs with focus on quality | ★★★★☆ | Excellent quality/size balance |
| Q5_K_M | 16GB | 18GB | RTX 3080 Ti/4070 Ti users | ★★★★☆ | Great balance of quality and performance |
| Q4_K_M | 13GB | 16GB | Most users - RTX 3060 12GB/3070/4060 | ★★★☆☆ | The sweet spot, minimal quality loss |
| IQ4_XS | 12GB | 14GB | Experimental - newer compression method | ★★★☆☆ | Good alternative to Q4_K_M |
| Q3_K_M | 11GB | 12GB | 8-12GB GPUs, quality-conscious users | ★★☆☆☆ | Noticeable quality drop for complex code |

Choosing the Right Quantization

For coding with vision tasks, I recommend:

  • Production/Professional use: Q8_0 or Q6_K
  • General development: Q4_K_M (best balance)
  • Limited VRAM: Q5_K_M if you can fit it, otherwise Q4_K_M
  • Experimental: Try IQ4_XS for potentially better quality at similar size to Q4_K_M

Avoid Q3_K_M unless you're VRAM-constrained - the quality degradation becomes noticeable for complex coding tasks and visual understanding.
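
If you're unsure which tier you fall into, check free GPU memory first and match it against the table above (assumes an NVIDIA GPU with nvidia-smi available):

# Report total and free VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv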

Usage Examples

With llama.cpp

# Download the model
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --local-dir .
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  mmproj-BF16.gguf \
  --local-dir .

# Run with llama.cpp's multimodal CLI (named llama-mtmd-cli in recent builds;
# older builds use llama-llava-cli - plain llama-cli does not take images)
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -p "Analyze this UI and generate React code" \
  --image screenshot.png \
  -c 8192
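
Recent llama.cpp builds can also serve the model with vision support over an OpenAI-compatible HTTP API via llama-server (a sketch; confirm the flags against your build with llama-server --help):

# Serve the quantized model plus vision projector on port 8080
./llama-server -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 8192 \
  -ngl 999 \
  --port 8080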

With LM Studio

  1. Download your chosen quantization along with mmproj-BF16.gguf
  2. Load in LM Studio
  3. Enable multimodal/vision mode in settings
  4. Drag and drop images into the chat
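
Once a vision model is loaded, LM Studio also exposes an OpenAI-compatible server on localhost (port 1234 by default), so you can script image-to-code requests. A sketch, assuming your LM Studio version supports vision over this API; the model identifier below is a placeholder - use whatever name LM Studio shows for the loaded model:

# Send a screenshot to the local LM Studio server as a base64 data URL
IMG_B64=$(base64 < screenshot.png | tr -d '\n')
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- << EOF
{
  "model": "devstral-vision-small-2507",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Analyze this UI and generate React code"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }],
  "temperature": 0.2
}
EOF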

With ollama

# Create Modelfile
cat > Modelfile << EOF
FROM ./Devstral-Small-Vision-2507-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF

# Create and run
ollama create devstral-vision -f Modelfile
ollama run devstral-vision
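
To pass an image, include its file path in the prompt when running the model (this is how ollama handles images for multimodal models; whether vision works for an imported GGUF depends on your ollama version picking up the mmproj projector, so check ollama's import docs if image input is ignored):

# Ask about a screenshot by referencing its path in the prompt
ollama run devstral-vision "Analyze this UI and generate React code ./screenshot.png"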

With koboldcpp

# Pass the vision projector with --mmproj (check koboldcpp --help for your version)
python koboldcpp.py --model Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  --contextsize 8192 \
  --gpulayers 999

Performance Tips

  1. Context Size: This model supports up to 128k context, but start with 8k-16k for better performance
  2. GPU Layers: Offload all layers to GPU if possible (--gpulayers 999 or -ngl 999)
  3. Batch Size: Increase batch size for better throughput if you have VRAM headroom
  4. Temperature: Use lower temperatures (0.1-0.3) for code generation, higher (0.7-0.9) for creative tasks
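
Putting these tips together for llama.cpp (a sketch; the multimodal CLI name and exact flags depend on your build):

# Moderate context (-c), full GPU offload (-ngl), larger batch (-b),
# and a low temperature (--temp) for code generation
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 16384 -ngl 999 -b 512 --temp 0.2 \
  --image screenshot.png \
  -p "Convert this mockup into clean HTML/CSS"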

Hardware Requirements

| Quantization | Single GPU | Partial Offload | CPU Only |
|--------------|------------|-----------------|----------|
| Q8_0 | 24GB VRAM | 16GB VRAM + 16GB RAM | 32GB RAM |
| Q6_K | 20GB VRAM | 12GB VRAM + 16GB RAM | 24GB RAM |
| Q5_K_M | 18GB VRAM | 12GB VRAM + 12GB RAM | 24GB RAM |
| Q4_K_M | 16GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| IQ4_XS | 14GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM |
| Q3_K_M | 12GB VRAM | 6GB VRAM + 12GB RAM | 16GB RAM |
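
For the partial-offload column, cap the number of GPU layers with llama.cpp's -ngl flag and let the remainder run from system RAM (the layer count below is an illustrative starting point; raise or lower it until the model fits):

# Offload part of the model to an 8GB GPU, keep the rest on the CPU
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 20 \
  -c 8192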

Model Capabilities

✅ Strengths:

  • Exceptional at converting visual designs to code
  • Strong debugging abilities with visual context
  • Maintains Devstral's 53.6% SWE-Bench performance
  • Handles multiple programming languages
  • 128k token context window

⚠️ Limitations:

  • Not specifically fine-tuned for vision-to-code tasks
  • Vision performance bounded by Mistral-Small-3.2's capabilities
  • Requires decent hardware for optimal performance
  • Quantization impacts both vision and coding quality

License

Apache 2.0 (inherited from base models)


Acknowledgments

Links


For issues or questions about these quantizations, please open an issue in the repository.