arxiv:2309.16058

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Published on Sep 27, 2023
· Featured in Daily Papers on Sep 29, 2023

Abstract

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
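For readers wondering how the "pre-trained aligner module" described above might map modality-specific signals into the LLM's textual embedding space, here is a minimal, hypothetical sketch (not the authors' code): it assumes pooled features from a frozen modality encoder and a small learned projection that produces "soft prompt" tokens prepended to the text embeddings. All names and dimensions are illustrative.

```python
# Hypothetical sketch of the aligner idea from the abstract: a frozen modality
# encoder yields features, a learned projection maps them into the LLM's
# token-embedding space, and the resulting soft-prompt tokens are prepended
# to the embedded text prompt. This is NOT the authors' implementation.
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    def __init__(self, feat_dim: int, llm_dim: int, num_prompt_tokens: int = 8):
        super().__init__()
        # Learned projection from encoder features to a fixed number of
        # LLM-dimensional soft-prompt tokens.
        self.proj = nn.Linear(feat_dim, llm_dim * num_prompt_tokens)
        self.num_prompt_tokens = num_prompt_tokens
        self.llm_dim = llm_dim

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) pooled features from a frozen encoder
        out = self.proj(feats)  # (batch, llm_dim * num_prompt_tokens)
        return out.view(-1, self.num_prompt_tokens, self.llm_dim)

# Usage sketch: prepend the aligned tokens to the text embeddings before
# feeding them to a (frozen or adapter-tuned) LLaMA-style decoder.
image_feats = torch.randn(2, 1024)                  # e.g. pooled image-encoder features
aligner = ModalityAligner(feat_dim=1024, llm_dim=4096)
soft_prompt = aligner(image_feats)                  # (2, 8, 4096)
text_embeds = torch.randn(2, 16, 4096)              # embedded text prompt tokens
llm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)
```

Whether the actual aligner is a simple linear layer or a heavier resampler is not stated in the abstract; the point of the sketch is only that modality features are converted into tokens that live in the same space as the LLM's text embeddings.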

Community

Is there a GitHub repo?

I found this on GitHub; I'm not sure if it's legit, and it's just a bare template at the moment.
Are you still putting this repo together?
https://github.com/kyegomez/AnyMAL

That account has many repositories named after well-known papers, but they do not work and appear to have been generated with AI. Please don't use, update, or star them.

Are model weights or training instructions (for extending LLaMA-2 with audio, image, and video) available anywhere?

Models citing this paper 2

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 18