arxiv:2403.09394

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Published on Mar 14
· Featured in Daily Papers on Mar 15

Abstract

This paper proposes a simple, yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that empowers the successful auto-regressive decoding to adeptly unify various visual tasks, from image-level understanding (e.g., captioning), over sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, our GiT builds a new benchmark in generalist performance, and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar impact observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results over various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at https://github.com/Haiyang-W/GiT.
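
To make the "universal language interface" idea concrete, here is a minimal sketch of how targets from captioning, detection, and segmentation could all be serialized into a single token stream for auto-regressive decoding by one plain transformer. This is not the official GiT code: the vocabulary layout, coordinate-bin count, and task prompts below are assumptions for illustration only.

```python
# Hedged sketch: serialize heterogeneous vision-task targets into one token
# vocabulary shared by all tasks, so a single auto-regressive model can
# predict them. All token names and sizes here are illustrative assumptions.

NUM_COORD_BINS = 1000                      # discretize [0, 1] coordinates into bins
WORDS = ["<task=caption>", "<task=detect>", "<task=segment>",
         "a", "dog", "on", "grass", "person", "car", "<obj>", "<eos>"]
VOCAB = {w: i for i, w in enumerate(WORDS)}
COORD_OFFSET = len(VOCAB)                  # coordinate bins live after the word tokens


def coord_token(v: float) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete vocabulary id."""
    b = min(int(v * NUM_COORD_BINS), NUM_COORD_BINS - 1)
    return COORD_OFFSET + b


def serialize_caption(words):
    """Image-level understanding: a caption is already a word sequence."""
    return [VOCAB["<task=caption>"]] + [VOCAB[w] for w in words] + [VOCAB["<eos>"]]


def serialize_detection(boxes):
    """Sparse perception: boxes are (class_word, x1, y1, x2, y2) with normalized coords."""
    seq = [VOCAB["<task=detect>"]]
    for cls, x1, y1, x2, y2 in boxes:
        seq += [VOCAB["<obj>"], VOCAB[cls]] + [coord_token(c) for c in (x1, y1, x2, y2)]
    return seq + [VOCAB["<eos>"]]


def serialize_segmentation(point_labels):
    """Dense prediction: per-sampled-point class words, expressed as a sequence."""
    return [VOCAB["<task=segment>"]] + [VOCAB[w] for w in point_labels] + [VOCAB["<eos>"]]


if __name__ == "__main__":
    print(serialize_caption(["a", "dog", "on", "grass"]))
    print(serialize_detection([("dog", 0.12, 0.30, 0.55, 0.80)]))
    print(serialize_segmentation(["dog", "dog", "grass", "grass"]))
```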

Community

Thanks for sharing these interesting results.

I am familiar with Xdecoder (CVPR 2023), a unified framework without an LLM, even at the code level. I am also quite familiar with vision-language models.

I have two short questions after a quick glance at your paper.

  1. Is there any advantage to removing the vision encoder, such as the CLIP vision encoder, other than computation? Can a discrete image tokenizer capture all of the continuous knowledge?

  2. For segmentation: I have thought that when this task is handled directly as LLM-style generation output (e.g., enumerating pixel-wise class names such as car, car, car, person, person, sky, sky), it should take a very long time to produce the outputs. With such a prohibitive prediction time it could not become an application-level model, only an experimental one. Furthermore, I cannot find a convincing reason why vision models and language models should be unified; I am inspired by the human structure, where the eye and the brain are clearly separated and play different roles in carrying out tasks. To my knowledge, I did not find strong arguments beyond computation and efficiency.

I am really curious about these discussion points, and I cannot decide whether building a generalist model that covers vision tasks plus language tasks in one model is a promising direction. Vision models alone can already provide very good performance, and vision-language models also achieve very good performance, so why unify them rather than use them both?

Whether I turn out to be right or wrong, I think this paper can contribute a lot to the computer vision and language communities, well beyond my points.

Paper author

Dear BK,

Thank you for your question. :)

Regarding the first question about discarding the visual encoder: we want to create a unified, general, data-centric, end-to-end model that only requires non-parametric design (or at most a light tokenizer) at the input and output ends. Why? First, it greatly simplifies training. Training becomes a one-stage process that no longer requires pre-training a visual encoder. Imagine scaling an adapter-based vision-language model: you need to scale the visual encoder separately, then scale the LLM (Large Language Model) separately, and then do the vision-language fine-tuning. Three stages, which is troublesome. The data-centric model I envision has a very simple structure and keeps getting stronger simply by continuously training on data; the network itself is capable enough to handle all modalities and tasks on its own. Extensibility is very important here: the model continuously becomes stronger, covering more tasks and more modalities, all end-to-end, rather than the current non-end-to-end, multi-stage training (which requires pre-training of a visual encoder and an LLM). In short, what we want is an extremely simple end-to-end multi-layer transformer that can learn almost all tasks and data, including but not limited to 2D vision, language, and point clouds. For point clouds, see our previous work that unified images and point clouds using a plain transformer, 'UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation'.

Regarding the second question: for dense prediction tasks, the current paper indeed has limitations. However, the multiple parallel decoding described in the paper greatly mitigates this issue. The appendix reports our segmentation speed, which is roughly on par with SAM's speed at the same resolution. We have not tried more points; more points mean a higher degree of parallelism and faster speed. Efficiency is where we aim to optimize and improve in the next phase, so please stay tuned to our work. An LLM-like framework also means that all acceleration techniques applicable to LLMs can be applied to us, and I believe the speed can be significantly increased. Our current work is not perfect yet, and there are many areas for improvement that we can work on together.
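
As a rough illustration of why parallel decoding helps dense prediction, here is a minimal sketch in PyTorch with a toy stand-in model; none of it is the GiT implementation. The idea sketched: each sampled point gets its own short sequence, and all points are decoded together as one batch, so the number of decoding steps is the short per-point length rather than one pixel-by-pixel sequence over the whole image.

```python
# Hedged sketch (not the GiT code): parallel greedy decoding of many point
# queries at once. The toy GRU decoder, vocabulary size, and sequence length
# below are placeholder assumptions for illustration.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, MAX_LEN = 32, 64, 4      # tiny illustrative sizes


class ToyPointDecoder(nn.Module):
    """Stand-in for a shared sequence model: predicts the next token per point."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, tokens):                # tokens: (num_points, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])            # next-token logits for each point


@torch.no_grad()
def parallel_greedy_decode(model, num_points: int, start_token: int = 1):
    """Decode all point queries in parallel: one forward pass per step,
    regardless of how many points (pixels) are being labeled."""
    seqs = torch.full((num_points, 1), start_token, dtype=torch.long)
    for _ in range(MAX_LEN):
        next_tok = model(seqs).argmax(dim=-1, keepdim=True)
        seqs = torch.cat([seqs, next_tok], dim=1)
    return seqs                               # (num_points, 1 + MAX_LEN)


if __name__ == "__main__":
    model = ToyPointDecoder().eval()
    out = parallel_greedy_decode(model, num_points=1024)
    print(out.shape)                          # 1024 points labeled in MAX_LEN steps
```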

We want to create a framework that is simple to the extreme and data-centric, not limited to vision-language; it can also incorporate various other modalities and data types, such as point clouds and graphs. Second, why combine vision and language? Vision and language must inevitably move towards unification. Language is special; I believe it is not on the same level as vision, since it is the modality that ultimately aligns with the human understanding space. In a sense, the ground truth of all vision tasks is language, all of it is text. Segmentation appears as a mask only because people visualize segmentation labels in the form of a 2D image. For example, in perception for an autonomous-driving scenario, you segment the scene, but the segmentation result still needs to be converted into language, because only language is a program that a machine or computer can execute; the visualized segmentation mask is not. Third, if the model has good extensibility, allowing different tasks to promote each other, then it is very promising and can continuously become stronger (with better performance on individual tasks and the ability to handle more tasks, just like humans).

You also mentioned that we humans have different brain areas for different modalities, and I agree with this view. Deep learning has evolved from manually designing features to models learning features on their own, and I believe model structure should follow the same path. Previously, people manually designated which parameters were responsible for which modalities; perhaps now the network itself can learn, implicitly, which parameters handle which modalities. In summary, it is about being data-centric and continuously reducing the amount of hand-crafted human design.

Of course, I do not dismiss previous adapter-based work; on the contrary, I think it works very well at this stage. We just want to offer a new idea and solution, and whether it is the right path remains to be seen. Thank you very much for your questions; let's make the community better together.

Best,
Haiyang

Paper author

What we want to convey in the discussion above is that a good model should minimize human bias as much as possible (for example, in model design). The main contribution of deep learning's development has been the transition from hand-crafted features to networks learning features on their own. Therefore, we believe that reducing human-designed components is crucial in model development as well.

Best,
Hao Tang
