VisionLLaMA is a vision transformer architecture that adapts the successful LLaMA language-model design to vision tasks. By carrying over LLaMA components such as rotary positional embeddings, the SwiGLU activation, and LayerNorm, VisionLLaMA achieves strong performance across a range of vision tasks, including image generation, classification, semantic segmentation, and object detection.
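For readers unfamiliar with the borrowed components, the SwiGLU feed-forward block can be sketched in a few lines of NumPy. This is a generic illustration of the SwiGLU formula (Swish-gated linear unit), not the authors' implementation; all dimensions and weight names here are made up for the example:

```python
import numpy as np

def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu(x, W, V, W2):
    """SwiGLU feed-forward: (SiLU(x @ W) * (x @ V)) @ W2.
    A SiLU-activated projection gates a second linear projection
    elementwise before the down-projection."""
    return (silu(x @ W) * (x @ V)) @ W2

# Toy dimensions, purely illustrative
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.standard_normal((4, d_model))
W = rng.standard_normal((d_model, d_hidden))
V = rng.standard_normal((d_model, d_hidden))
W2 = rng.standard_normal((d_hidden, d_model))
y = swiglu(x, W, V, W2)
print(y.shape)  # (4, 8): hidden width expands, then projects back
```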
Keypoints:
* Outperforms state-of-the-art vision transformers like DiT, SiT, DeiT3, and Swin on multiple benchmarks and tasks.
* Leverages Auto-Scaled 2D Rotary Positional Embeddings (AS2DRoPE) to handle variable input resolutions efficiently.
* Serves as a powerful, unified modeling framework for vision generation and understanding tasks.
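The AS2DRoPE keypoint above can be illustrated with a minimal NumPy sketch. The idea, under my reading of the name: apply rotary embeddings along two axes (one half of the channels rotated by the row index, the other by the column index), and auto-scale positions at inference time back into the trained coordinate range when the resolution changes. The function name, the half-and-half channel split, and the linear scaling rule are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Standard RoPE frequency schedule for one scalar position."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq  # (dim/2,) rotation angles

def as2d_rope(q, row, col, train_side, infer_side):
    """Sketch of auto-scaled 2D RoPE (illustrative, not official code).
    q: (..., d) query/key vector; row, col: patch grid coordinates.
    Positions are rescaled by train_side / infer_side so a larger
    inference grid reuses the angle range seen during training."""
    d = q.shape[-1]
    half = d // 2
    scale = train_side / infer_side  # auto-scaling of coordinates
    out = q.copy()
    for coord, sl in ((row * scale, slice(0, half)),
                      (col * scale, slice(half, d))):
        ang = rope_angles(coord, half)
        x1 = out[..., sl][..., 0::2]
        x2 = out[..., sl][..., 1::2]
        rot = np.empty_like(out[..., sl])
        rot[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
        rot[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
        out[..., sl] = rot
    return out
```

Because each channel pair undergoes a pure rotation, vector norms are preserved, and relative position falls out of the angle differences in attention dot products, which is what lets the model extrapolate to unseen resolutions.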
Paper: VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (2403.00522)
GitHub repo: https://github.com/Meituan-AutoML/VisionLLaMA
Congrats to the authors for their work!