metadata

license: apache-2.0
base_model:
  - mistralai/Pixtral-12B-2409
library_name: transformers

Pixtral-12B Vision Encoder

Model Overview

This repository contains the vision encoder extracted from the Pixtral-12B multimodal model. With the encoder isolated, researchers and developers can use its visual feature extraction for downstream vision tasks without loading the language-model decoder.

Key Features

Standalone Vision Encoder: Extracted from the full Pixtral-12B model
Lightweight Architecture: 400M-parameter vision encoder
Flexible Usage: Can be integrated into a variety of computer vision pipelines
No Decoder Weights: The language-model decoder is removed, so only the vision weights are loaded

Motivation

The Pixtral-12B Vision Encoder module is designed for researchers and developers who:

Require high-quality visual feature extraction
Want to use the vision encoder independently of the full multimodal model
Seek to implement custom downstream vision tasks
Desire a lightweight, efficient vision representation module

Installation

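# Requires the transformers and torch packages (for example: pip install transformers torch)
import torch
from transformers import AutoModel

# Load the vision encoder; replace the placeholder repository ID with the actual one
vision_encoder = AutoModel.from_pretrained("your-repository/pixtral-12b-vision-encoder")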

Example Usage

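from PIL import Image
import torch
from transformers import AutoImageProcessor

# Load the matching image processor.
# Assumption: the repository also ships the Pixtral preprocessing config.
vision_processor = AutoImageProcessor.from_pretrained("your-repository/pixtral-12b-vision-encoder")

# Load an image and convert it to RGB
image = Image.open("example_image.jpg").convert("RGB")

# Preprocess the image with the processor loaded above
inputs = vision_processor(images=image, return_tensors="pt")

# Extract visual features; gradients are not needed for inference
with torch.no_grad():
    visual_embeddings = vision_encoder(**inputs).last_hidden_state

# visual_embeddings holds one feature vector per image patch and can now be used for downstream tasks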
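
One common way to use per-patch embeddings downstream is to average them into a single image-level vector and compare images by cosine similarity. The sketch below illustrates that arithmetic with plain Python lists so it runs without a model download. The helper names `mean_pool` and `cosine_similarity` and the toy vectors are illustrative assumptions, not part of this repository.

```python
import math

def mean_pool(patch_embeddings):
    """Average a list of per-patch vectors into one image-level vector."""
    if not patch_embeddings:
        raise ValueError("expected at least one patch embedding")
    dim = len(patch_embeddings[0])
    count = len(patch_embeddings)
    return [sum(vec[i] for vec in patch_embeddings) / count for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins for two images' patch embeddings (3 patches, 4 dimensions each).
# Real encoder outputs would have many more patches and dimensions.
image_a = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0], [2.0, 0.0, 2.0, 0.0]]
image_b = [[0.0, 1.0, 0.0, 2.0], [0.0, 3.0, 0.0, 2.0], [0.0, 2.0, 0.0, 2.0]]

pooled_a = mean_pool(image_a)  # [2.0, 0.0, 2.0, 0.0]
pooled_b = mean_pool(image_b)  # [0.0, 2.0, 0.0, 2.0]

# The pooled vectors share no non-zero dimension, so their similarity is 0.0
print(cosine_similarity(pooled_a, pooled_b))
```

With real tensors, the same two steps would typically be written as `visual_embeddings.mean(dim=1)` followed by `torch.nn.functional.cosine_similarity`. Mean pooling is only one option; whether it suits a given task is something to verify empirically.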

Capabilities

High-quality visual feature extraction
Support for various image sizes
Robust representation learning
Compatible with multiple vision downstream tasks
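
Because the encoder accepts various image sizes, the number of patch embeddings it returns depends on the input resolution. The sketch below estimates the patch grid for a given image. It assumes a patch size of 16 pixels and a longest-edge cap of 1024 pixels, as in the published Pixtral-12B configuration, and it simplifies the resize rule. `patch_grid` is a hypothetical helper, and the actual processor may round differently, so check the real output shape before relying on these numbers.

```python
import math

def patch_grid(height, width, patch_size=16, max_edge=1024):
    """Estimate the (rows, cols) patch grid for an image of the given size.

    Assumptions (not taken from this repository): images larger than
    max_edge are scaled down with their aspect ratio preserved, and each
    side is then rounded up to a whole number of patches.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    ratio = max(height / max_edge, width / max_edge)
    if ratio > 1:
        # Scale down so that the longest edge fits within max_edge
        height = max(1, math.floor(height / ratio))
        width = max(1, math.floor(width / ratio))
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return rows, cols

for size in [(512, 768), (2048, 1024), (100, 50)]:
    rows, cols = patch_grid(*size)
    print(size, "->", rows, "x", cols, "=", rows * cols, "patches")
```

Under these assumptions, a 512x768 image yields a 32x48 grid (1536 patch embeddings), which gives a rough sense of how sequence length, and therefore memory use, grows with resolution.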

Limitations

Designed specifically for feature extraction: without the decoder, it cannot generate text on its own
Performance may vary depending on the specific downstream task
Requires careful preprocessing and task-specific fine-tuning

Acknowledgements

Special thanks to the Mistral AI team for developing the original Pixtral-12B multimodal model.

License

Distributed under the Apache 2.0 License.

Citation

If you use this vision encoder in your research, please cite the original Mistral AI Pixtral-12B model.