---
license: apache-2.0
base_model:
- mistralai/Pixtral-12B-2409
library_name: transformers
---

# Pixtral-12B Vision Encoder

## Model Overview

This repository provides direct access to the vision encoder module extracted from the Pixtral-12B multimodal model. Isolating the vision encoder lets researchers and developers use its visual feature extraction capabilities for downstream vision tasks without loading the full model.

## Key Features

- **Standalone Vision Encoder**: Extracted from the full Pixtral-12B model
- **Lightweight Architecture**: ~400M-parameter vision encoder
- **Flexible Usage**: Integrates easily into a variety of computer vision pipelines
- **No Unnecessary Decoder Weights**: Trimmed for efficient vision-specific applications

## Motivation

The Pixtral-12B vision encoder module is designed for researchers and developers who:

- Require high-quality visual feature extraction
- Want to use the vision encoder independently of the full multimodal model
- Seek to implement custom downstream vision tasks
- Want a lightweight, efficient visual representation module

## Loading the Model

```python
from transformers import AutoModel
import torch

# Load the vision encoder
vision_encoder = AutoModel.from_pretrained("your-repository/pixtral-12b-vision-encoder")
```

## Example Usage

```python
from PIL import Image
from transformers import AutoModel, AutoProcessor
import torch

# Load the encoder and its corresponding image processor
vision_encoder = AutoModel.from_pretrained("your-repository/pixtral-12b-vision-encoder")
vision_processor = AutoProcessor.from_pretrained("your-repository/pixtral-12b-vision-encoder")

# Load and preprocess an image
image = Image.open("example_image.jpg")
inputs = vision_processor(images=image, return_tensors="pt")

# Extract visual features
with torch.no_grad():
    visual_embeddings = vision_encoder(**inputs).last_hidden_state

# visual_embeddings can now be used for downstream tasks
```

## Capabilities

- High-quality visual feature extraction
- Support for various image sizes
- Robust representation learning
- Compatible with a range of downstream vision tasks

## Limitations

- Designed specifically for feature extraction
- Performance may vary depending on the specific downstream task
- Requires careful preprocessing and task-specific fine-tuning

## Acknowledgements

Special thanks to the Mistral AI team for developing the original Pixtral-12B multimodal model.

## License

Distributed under the Apache 2.0 License.

## Citation

If you use this vision encoder in your research, please cite the original Mistral AI Pixtral-12B model.
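## Example: Comparing Extracted Features

As a concrete downstream use, the patch embeddings in `last_hidden_state` can be mean-pooled into a single vector per image and compared across images, e.g. for similarity search or retrieval. The sketch below uses random tensors as stand-ins for encoder outputs so it runs without downloading weights; the patch count and hidden size shown are illustrative, and in practice the tensors would come from `vision_encoder(**inputs).last_hidden_state`.

```python
import torch
import torch.nn.functional as F


def pool_features(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Mean-pool patch embeddings [batch, num_patches, dim] into [batch, dim]."""
    return last_hidden_state.mean(dim=1)


# Stand-ins for encoder outputs of two images (illustrative shapes);
# replace with real outputs from the vision encoder in practice.
feats_a = torch.randn(1, 256, 1024)
feats_b = torch.randn(1, 256, 1024)

# Cosine similarity between the pooled image embeddings
similarity = F.cosine_similarity(pool_features(feats_a), pool_features(feats_b), dim=-1)
print(similarity.item())  # a scalar in [-1, 1]
```

For retrieval over many images, the same pooled vectors can be stacked and compared in batch with a single matrix multiplication after L2 normalization.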