---
license: apache-2.0
---
NEW - I exported and added mmproj-BF16.gguf to properly support llama.cpp, ollama, and LM Studio.
# Devstral-Vision-Small-2507 GGUF
Quantized GGUF versions of cognitivecomputations/Devstral-Vision-Small-2507 - the multimodal coding specialist that combines Devstral's exceptional coding abilities with vision understanding.
## Model Description
This is the first vision-enabled version of Devstral, created by transplanting Devstral's language model weights into Mistral-Small-3.2's multimodal architecture. It enables:
- Converting UI screenshots to code
- Debugging visual rendering issues
- Implementing designs from mockups
- Understanding codebases with visual context
## Quantization Selection Guide

| Quantization | Size | Min RAM | Recommended For | Quality | Notes |
|---|---|---|---|---|---|
| Q8_0 | 23GB | 24GB | RTX 3090/4090/A6000 users wanting maximum quality | ★★★★★ | Near-lossless, best for production use |
| Q6_K | 18GB | 20GB | High-end GPUs with focus on quality | ★★★★★ | Excellent quality/size balance |
| Q5_K_M | 16GB | 18GB | RTX 3080 Ti/4070 Ti users | ★★★★★ | Great balance of quality and performance |
| Q4_K_M | 13GB | 16GB | Most users - RTX 3060 12GB/3070/4060 | ★★★★☆ | The sweet spot, minimal quality loss |
| IQ4_XS | 12GB | 14GB | Experimental - newer compression method | ★★★★☆ | Good alternative to Q4_K_M |
| Q3_K_M | 11GB | 12GB | 8-12GB GPUs, quality-conscious users | ★★★☆☆ | Noticeable quality drop for complex code |
### Choosing the Right Quantization
For coding with vision tasks, I recommend:
- Production/Professional use: Q8_0 or Q6_K
- General development: Q4_K_M (best balance)
- Limited VRAM: Q5_K_M if you can fit it, otherwise Q4_K_M
- Experimental: Try IQ4_XS for potentially better quality at similar size to Q4_K_M
Avoid Q3_K_M unless you're VRAM-constrained - the quality degradation becomes noticeable for complex coding tasks and visual understanding.
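Once you have picked a quantization, the model file and the vision projector can be fetched together using huggingface-cli's include filters. A minimal sketch, assuming the Q4_K_M recommendation and this repo's file naming; swap the pattern if you chose a different quant:

```bash
# Sketch: download the recommended quant plus the mmproj projector in one command.
# Adjust the include pattern for the quantization you picked.
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  --include "*Q4_K_M.gguf" "mmproj-BF16.gguf" \
  --local-dir .
```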
## Usage Examples

### With llama.cpp
```bash
# Download the model and the vision projector
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --local-dir .
huggingface-cli download cognitivecomputations/Devstral-Vision-Small-2507-GGUF \
  mmproj-BF16.gguf \
  --local-dir .

# Run with llama.cpp's multimodal CLI (named llama-mtmd-cli in recent builds).
# Pass the projector via --mmproj so image input actually works.
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -p "Analyze this UI and generate React code" \
  --image screenshot.png \
  -c 8192
```
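For repeated requests or editor integrations, the same files can be served over llama.cpp's OpenAI-compatible HTTP server. The sketch below assumes a recent llama.cpp build in which llama-server accepts --mmproj and understands image_url content parts; the port, prompt, and screenshot path are placeholders.

```bash
# Sketch: serve the model plus vision projector over llama.cpp's HTTP server
./llama-server -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 16384 -ngl 999 --port 8080

# Send a screenshot as a base64 data URI in an OpenAI-style chat request
# (base64 -w0 is the GNU coreutils form; use `base64 -i` on macOS)
IMG=$(base64 -w0 screenshot.png)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Convert this UI into a React component."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
      ]
    }],
    "temperature": 0.2
  }'
```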
### With LM Studio
- Download your chosen quantization
- Load in LM Studio
- Enable multimodal/vision mode in settings
- Drag and drop images into the chat (or call LM Studio's local server, as sketched below)
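LM Studio can also expose the loaded model through its local OpenAI-compatible server (Developer tab, default port 1234). The sketch below assumes your LM Studio version forwards image input for vision models and that the model identifier matches whatever LM Studio shows for your download; both are assumptions to verify locally.

```bash
# Sketch: call LM Studio's local server with a screenshot attached.
# The model name is whatever LM Studio lists for your download - adjust it.
IMG=$(base64 -w0 screenshot.png)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-vision-small-2507",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Implement this mockup as HTML and CSS."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
      ]
    }],
    "temperature": 0.2
  }'
```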
### With ollama
```bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./Devstral-Small-Vision-2507-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF

# Create and run
ollama create devstral-vision -f Modelfile
ollama run devstral-vision
```
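Once the model is created, images can be passed by file path on the CLI or as base64 strings through the REST API. This is a sketch assuming the devstral-vision model created above; note that image input also requires ollama to know about the vision projector, so check ollama's import documentation for how to attach mmproj-BF16.gguf when creating from a raw GGUF.

```bash
# CLI: vision-capable models accept image paths mentioned in the prompt
ollama run devstral-vision "Convert this mockup into a React component: ./mockup.png"

# REST API: images are sent as base64 strings in the "images" array
curl http://localhost:11434/api/generate -d '{
  "model": "devstral-vision",
  "prompt": "List the layout bugs visible in this screenshot.",
  "images": ["'"$(base64 -w0 screenshot.png)"'"],
  "stream": false
}'
```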
### With koboldcpp

```bash
# Pass the vision projector with --mmproj to enable image input
python koboldcpp.py --model Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  --contextsize 8192 \
  --gpulayers 999
```
## Performance Tips
- Context Size: This model supports up to 128k context, but start with 8k-16k for better performance
- GPU Layers: Offload all layers to GPU if possible (--gpulayers 999 in koboldcpp, -ngl 999 in llama.cpp)
- Batch Size: Increase batch size for better throughput if you have VRAM headroom
- Temperature: Use lower temperatures (0.1-0.3) for code generation, higher (0.7-0.9) for creative tasks (a combined example follows below)
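Putting those tips together, here is one possible llama.cpp invocation for screenshot-to-code work. The context size, batch size, and temperature are starting points rather than tuned values, and the binary name assumes a recent multimodal build.

```bash
# Sketch: a tuned single-shot run combining the tips above.
# 16k context instead of the full 128k, all layers offloaded (-ngl 999),
# a larger batch (-b 512) for throughput, and a low temperature for code.
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -c 16384 -ngl 999 -b 512 --temp 0.2 \
  --image screenshot.png \
  -p "Reproduce this screen as a React component."
```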
## Hardware Requirements
| Quantization | Single GPU | Partial Offload | CPU Only | 
|---|---|---|---|
| Q8_0 | 24GB VRAM | 16GB VRAM + 16GB RAM | 32GB RAM | 
| Q6_K | 20GB VRAM | 12GB VRAM + 16GB RAM | 24GB RAM | 
| Q5_K_M | 18GB VRAM | 12GB VRAM + 12GB RAM | 24GB RAM | 
| Q4_K_M | 16GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM | 
| IQ4_XS | 14GB VRAM | 8GB VRAM + 12GB RAM | 20GB RAM | 
| Q3_K_M | 12GB VRAM | 6GB VRAM + 12GB RAM | 16GB RAM | 
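The Partial Offload column above assumes splitting layers between GPU and CPU. In llama.cpp this is controlled with -ngl: instead of 999 (everything), offload only as many layers as your VRAM allows and let the rest run from system RAM. A rough sketch, where the layer count is something to tune for your card rather than a measured value:

```bash
# Sketch: partial offload on a smaller GPU - keep some layers on the GPU,
# the remainder runs on the CPU from system RAM. Tune -ngl for your hardware.
./llama-mtmd-cli -m Devstral-Small-Vision-2507-Q4_K_M.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 20 \
  -c 8192 \
  --image screenshot.png \
  -p "Analyze this UI and generate React code"
```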
## Model Capabilities

✅ Strengths:
- Exceptional at converting visual designs to code
- Strong debugging abilities with visual context
- Maintains Devstral's 53.6% SWE-Bench performance
- Handles multiple programming languages
- 128k token context window
⚠️ Limitations:
- Not specifically fine-tuned for vision-to-code tasks
- Vision performance bounded by Mistral-Small-3.2's capabilities
- Requires decent hardware for optimal performance
- Quantization impacts both vision and coding quality
## License
Apache 2.0 (inherited from base models)
## Acknowledgments
- Original model by Eric Hartford at Cognitive Computations
- Built on Mistral AI's Devstral and Mistral-Small models
- Quantized using llama.cpp
## Links
For issues or questions about these quantizations, please open an issue in the repository.