arxiv:2311.04589

TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

Published on Nov 8, 2023
· Featured in Daily Papers on Nov 9, 2023

Abstract

Although Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they still struggle to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl), an approach that treats the input from any modality as a token sequence and learns a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with an off-the-shelf tokenizer and embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs then just need to predict the multi-modal tokens autoregressively, as textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality from the predicted token sequence. With the joint embedding space, TEAL enables frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. Thus, the textual LLM can simply act as an interface while maintaining its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding and implements a simple scheme for multi-modal generation.
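
For readers who want the mechanics at a glance, here is a minimal, hedged sketch of the pipeline the abstract describes: discrete tokens from any modality share one learnable embedding matrix in front of a frozen LLM, which then predicts next tokens over the joint vocabulary. The class, the stand-in backbone, and the vocabulary sizes below are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class TEALSketch(nn.Module):
    """Joint token space over all modalities, in front of a frozen LLM backbone."""

    def __init__(self, llm: nn.Module, vocab_sizes: dict, d_model: int):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():            # the textual LLM stays frozen
            p.requires_grad = False
        total = sum(vocab_sizes.values())          # text + image + audio codebooks
        self.embed = nn.Embedding(total, d_model)  # learnable joint embedding matrix
        self.head = nn.Linear(d_model, total)      # next-token logits, any modality

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids come from off-the-shelf tokenizers (BPE for text,
        # VQ codebook indices for images/audio), offset into disjoint id ranges.
        hidden = self.llm(self.embed(token_ids))
        return self.head(hidden)

# Stand-in backbone just to exercise the shapes; a real LLM replaces it.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
model = TEALSketch(backbone, {"text": 32000, "image": 8192, "audio": 1024}, 64)
logits = model(torch.randint(0, 41216, (1, 16)))   # -> (1, 16, 41216)
```

A de-tokenizer for each modality (e.g., a VQ decoder for images) then maps the predicted non-textual tokens back to raw signals.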

Community

As far as image generation is concerned, the quality of VQ-VAE or VQ-GAN generation does not reach the current level of SD (Stable Diffusion) generation. The practice of VQ seems to be falling out of the mainstream at the moment, so is serializing images really a good idea?

> As far as image generation is concerned, the quality of VQ-VAE or VQ-GAN generation does not reach the current level of SD generation. The practice of VQ seems to be falling out of the mainstream at the moment, so is serializing images really a good idea?

This isn't accurate. SD and SDXL decode their latents with a VQGAN-style autoencoder; everybody uses an autoencoder, LOL. [Although I believe that will be replaced with a ViT over the next year, since current research demonstrates that, with a few mods, it's great at image reconstruction, meaning with a few more adjustments generation isn't far away. Make it BitNet-based, and it becomes way more computationally efficient than autoencoders. End-to-end multimodal transformers are the future, IMO.] Autoregressive image decoding used to be subpar compared to non-AR decoding, but that isn't the case anymore. Plenty of research shows how to maintain the semantics, and even how to infuse these semantics into the autoencoder itself.

FYI, great reconstruction quality doesn't equate to equal generative capability. There are plenty of methods to improve this, though, for instance decoupled autoencoder training and better codebook utilization. On top of that, CM3leon can generate photorealistic images using a causal masking pretraining objective on top of a normal autoregressive objective. Emu likewise demonstrates how to greatly enhance generative fidelity and realism. We can even leverage OCR with the autoencoder to better model text in images. Respectfully, you're bugging; this is great research. Yes, serializing images is a GREAT idea. I've said for the last 7 months that, long-term, MM-LLMs will be unbeatable at multimodal generation, because they can leverage common-sense reasoning at inference time, especially for video [imagine on-demand predictive video modeling based on situational constraints you define], unlike diffusion models. Fidelity is solvable now.
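
For concreteness, "serializing" an image here just means vector-quantizing its latent grid into codebook indices and flattening them into a sequence the LLM can consume. A hedged sketch, with a toy encoder and codebook standing in for a real, trained VQGAN/VQ-VAE:

```python
import torch
import torch.nn as nn

# Toy stand-ins; a real system uses a trained VQGAN/VQ-VAE encoder and codebook.
encoder = nn.Sequential(                           # 64x64 RGB -> 16x16 latent grid
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, 4, stride=2, padding=1),
)
codebook = nn.Embedding(8192, 256)                 # 8192 codes of dimension 256

image = torch.rand(1, 3, 64, 64)
z = encoder(image).permute(0, 2, 3, 1)             # (1, 16, 16, 256)

# Nearest-neighbor lookup: each latent vector becomes one discrete token id.
dists = torch.cdist(z.reshape(1, -1, 256), codebook.weight.unsqueeze(0))
tokens = dists.argmin(dim=-1)                      # (1, 256) token sequence
```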

> image reconstruction

Meaning decoding the pregenerated embeddings back into an image?

> As far as image generation is concerned, the quality of VQ-VAE or VQ-GAN generation does not reach the current level of SD generation. The practice of VQ seems to be falling out of the mainstream at the moment, so is serializing images really a good idea?

This is not quite accurate. As for generation quality, some LLM-based work has already beaten SD, such as "Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation".

> image reconstruction
>
> Meaning decoding the pregenerated embeddings back into an image?

Yes, you are right. Image reconstruction means transforming the generated tokens back into the image.
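
A hedged sketch of that de-tokenization step: look the predicted token ids up in the VQ codebook and decode the resulting latent grid back to pixels. The decoder below is a toy stand-in; a real system reuses the decoder of the same VQGAN/VQ-VAE whose codebook produced the tokens.

```python
import torch
import torch.nn as nn

codebook = nn.Embedding(8192, 256)                 # same codebook used to tokenize
decoder = nn.Sequential(                           # toy decoder: latents -> pixels
    nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)

token_ids = torch.randint(0, 8192, (1, 16, 16))    # 16x16 grid predicted by the LLM
latents = codebook(token_ids).permute(0, 3, 1, 2)  # (1, 256, 16, 16)
image = decoder(latents)                           # (1, 3, 64, 64) reconstructed image
```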

