vladbogo posted an update Mar 5
VisionLLaMA is a new vision transformer architecture that adapts the LLaMA language model design to vision tasks. By integrating LLaMA design choices such as rotary positional embeddings and the SwiGLU activation into a LayerNorm-based transformer block, VisionLLaMA achieves strong performance across a range of vision tasks, including image generation, classification, semantic segmentation, and object detection.
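
To make the design concrete, here is a minimal sketch (my own simplification, not the authors' code) of what such a block can look like in PyTorch: pre-normalization, self-attention whose queries and keys are rotated by rotary position embeddings, and a SwiGLU feed-forward. The `rope` argument is an assumed callable that applies the 2D rotation; one possible implementation is sketched after the keypoints below.

```python
import torch.nn as nn
import torch.nn.functional as F

class VisionLLaMABlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.w_gate = nn.Linear(dim, hidden, bias=False)  # SwiGLU gate
        self.w_up = nn.Linear(dim, hidden, bias=False)    # SwiGLU value
        self.w_down = nn.Linear(hidden, dim, bias=False)  # SwiGLU output

    def forward(self, x, rope):
        # x: (batch, num_patches, dim); rope: callable rotating (..., N, head_dim)
        B, N, C = x.shape
        h = self.norm1(x)
        qkv = self.qkv(h).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (B, heads, N, head_dim)
        q, k = rope(q), rope(k)                   # positions enter via rotation only
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(B, N, C))
        h2 = self.norm2(x)
        x = x + self.w_down(F.silu(self.w_gate(h2)) * self.w_up(h2))  # SwiGLU MLP
        return x
```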

Keypoints:
* Outperforms state-of-the-art vision transformers like DiT, SiT, DeiT3, and Swin on multiple benchmarks and tasks.
* Leverages Auto-Scaled 2D Rotary Positional Embeddings (AS2DRoPE) to handle variable input resolutions efficiently (see the sketch after this list).
* Serves as a powerful, unified modeling framework for vision generation and understanding tasks.
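
The AS2DRoPE keypoint is what lets the same weights handle different resolutions. Here is a rough sketch of the idea as I read it (function names and the `anchor` parameter are my own, not from the paper's code): 1D rotary frequencies are applied separately to the x and y patch coordinates, each using half of the head dimension, and the coordinate grid is rescaled to a fixed anchor resolution so larger inputs stay within the position range seen during training.

```python
import torch

def rope_1d(positions, dim, base=10000.0):
    # positions: (N,) float tensor of (possibly rescaled) coordinates
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None] * freqs[None, :]           # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def as2drope(h, w, head_dim, anchor=16):
    # Rescale the h x w patch grid onto the anchor x anchor grid assumed to
    # have been used at training time (the "auto-scaled" part).
    # head_dim must be divisible by 4.
    ys = torch.arange(h).float() * (anchor / h)
    xs = torch.arange(w).float() * (anchor / w)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    cos_y, sin_y = rope_1d(yy.flatten(), head_dim // 2)
    cos_x, sin_x = rope_1d(xx.flatten(), head_dim // 2)
    cos = torch.cat([cos_y, cos_x], dim=-1)                 # (h*w, head_dim/2)
    sin = torch.cat([sin_y, sin_x], dim=-1)
    return cos, sin

def apply_rope(x, cos, sin):
    # x: (..., N, head_dim); rotate feature pairs by the per-position angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

With these helpers, the `rope` callable for the block sketched earlier could be `lambda t: apply_rope(t, *as2drope(h, w, head_dim))`, computed once per input resolution.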

Paper: VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (arXiv:2403.00522)
GitHub repo: https://github.com/Meituan-AutoML/VisionLLaMA

Congrats to the authors for their work!