arxiv:2309.05519

NExT-GPT: Any-to-Any Multimodal LLM

Published on Sep 11, 2023
· Featured in Daily Papers on Sep 12, 2023
Authors:
Wei Ji ,

Abstract

While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they are mostly limited to input-side multimodal understanding and cannot produce content in multiple modalities. Because humans perceive the world and communicate with each other through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned by updating only a small fraction of parameters (1%), located in certain projection layers, which not only keeps training costs low but also facilitates convenient expansion to more modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
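
To make the "only about 1% of parameters are tuned" idea concrete, here is a minimal sketch of the bookkeeping behind it: the encoders, the LLM, and the diffusion decoders stay frozen, and only the projection layers are marked trainable. The module names and every parameter count below are hypothetical placeholders chosen for illustration; they are not figures from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool


# Hypothetical component list; all counts are made-up placeholders.
modules = [
    Module("multimodal_encoder", 1_000_000_000, False),   # frozen
    Module("input_projection", 30_000_000, True),         # tuned
    Module("llm", 7_000_000_000, False),                  # frozen
    Module("output_projection", 30_000_000, True),        # tuned
    Module("diffusion_decoders", 2_000_000_000, False),   # frozen
]


def trainable_fraction(mods):
    """Return the share of all parameters that would receive gradient updates."""
    total = sum(m.num_params for m in mods)
    trainable = sum(m.num_params for m in mods if m.trainable)
    return trainable / total


fraction = trainable_fraction(modules)
print(f"trainable share: {fraction:.2%}")
```

With placeholder sizes like these, the trainable share stays below 1%, which is the regime the abstract describes; the real ratio depends on the actual model sizes.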

Community

Project page: next-gpt.github.io
Any plans to add it to Hugging Face?

How do I fine-tune this model?

@cutmasta-kun, seeing as the code and data repos are empty, you can't right now.

Once they aren't empty, the same way you fine-tune any other pre-trained model. I'd recommend starting with LoRA.

Are the special tokens, like "AUD", appended to the LLM's text response afterwards, or does the LLM generate them itself? That is my main point of confusion. Also, in terms of visual question answering, how does this compare to BLIP-2? The paper shows image-text performance where it surpasses BLIP-2, but nothing specifically for visual question answering.

Paper author

Hi @mattbarr, thanks for the attention; the code base has now been published at https://github.com/NExT-GPT/NExT-GPT.

Hi @UncleanCode, thanks for the interest.

  • Regarding your first question, the special tokens are generated by the LLM itself when users ask NExT-GPT to show images, videos, or sounds. During training, we insert the pre-defined special tokens into the LLM's vocabulary. For more details, please check the code.
  • For the second question: correct, we have not yet run experiments comparing on the VQA task. I guess somebody will do this once they have our code.
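
The mechanism described above (special tokens added to the vocabulary, then emitted by the LLM to request a non-text output) can be sketched roughly as follows. This is a toy illustration in plain Python, not the NExT-GPT implementation: the token strings `[IMG]`, `[VID]`, and `[AUD]`, the toy vocabulary, and the helper names `extend_vocab` and `route_output` are all assumptions made for this example. The actual tokens and logic are in the authors' repository.

```python
import re

# Hypothetical signal tokens; the thread only mentions "AUD" as an example.
SIGNAL_TOKENS = {"image": "[IMG]", "video": "[VID]", "audio": "[AUD]"}


def extend_vocab(vocab, new_tokens):
    """Return a copy of vocab with missing tokens appended at the next free ids.

    Assumes the existing ids are contiguous and start at 0 (true for this toy vocab).
    """
    extended = dict(vocab)
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended


def route_output(generated_text):
    """Split LLM output into plain text and the list of modalities it requested."""
    requested = [m for m, tok in SIGNAL_TOKENS.items() if tok in generated_text]
    pattern = "|".join(re.escape(tok) for tok in SIGNAL_TOKENS.values())
    text = " ".join(re.sub(pattern, "", generated_text).split())
    return text, requested


vocab = extend_vocab({"hello": 0, "world": 1}, SIGNAL_TOKENS.values())
text, requested = route_output("Here is the sound of rain. [AUD]")
print(text, requested)
```

In a setup like this, each requested modality would then be handed to the matching decoder, while the remaining text is shown to the user as the normal reply.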
Paper author

Project page: next-gpt.github.io
Any plans to add it to Hugging Face?

I think yes, maybe later.

Looking forward to its inclusion in Hugging Face.

Looks very interesting; excited to try it out on Hugging Face.

Hey, could I download the model files, put them into the ckpt folder, and run demo_app.py from your GitHub repo to run it on my local machine?

Paper author

Please refer to the GitHub page for the detailed steps.
