
Mohammed Hasan Goni

hasangoni

AI & ML interests

Machine Learning

Recent Activity

liked a Space 6 days ago
nanotron/ultrascale-playbook
liked a model 10 months ago
nielsr/mobilesam

Organizations

None yet

hasangoni's activity

reacted to akhaliq's post with ❤️ about 1 year ago
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (arXiv:2403.00522)

Large language models are built on top of a transformer-based architecture to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms on a good portion of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding.