A Multimodal World

Welcome to the chapter on the fundamentals of multimodality. This chapter builds the foundation for the later sections of the unit. We will explore:

  • The notion of multimodality and the different sensory inputs humans use for efficient decision making.
  • Why multimodality is important for building innovative applications and services that we can interact with and that make our lives easier.
  • Multimodality in the context of deep learning: data, tasks, and models.
  • Related applications like multimodal emotion recognition and multimodal search.

So let’s begin πŸ€—

What is Multimodality? πŸ“ΈπŸ“πŸŽ΅

A modality is a medium or way in which something exists or is done. In our daily lives, we come across many scenarios where we have to make decisions and perform tasks. For this, we use our five sense organs (eyes to see, ears to hear, nose to smell, tongue to taste, and skin to touch). Based on the information from all of our sense organs, we assess our environment, perform tasks, and make decisions for our survival. Each of these five sense organs is a different modality through which information reaches us, hence the words multimodality and multimodal.

Think about this scenario for a moment: on a windy night, you hear an eerie sound while lying in bed πŸ‘»πŸ˜¨. You feel a bit scared because you don't know the source of the sound. You gather some courage and scan your surroundings, but you still can't figure it out 😱. Daringly, you turn on the lights and discover that it was just your half-open window, through which the wind was blowing and making the sound in the first place πŸ˜’.

So what just happened here? Initially, you had a restricted understanding of the situation because of your limited knowledge of the environment: you were relying only on your ears (the eerie sound) to make sense of it. But as soon as you turned on the lights and looked around with your eyes (adding another sense organ), you had a much better understanding of the whole situation. As we added modalities, our understanding of the same scenario became better and clearer, which suggests that modalities assist each other and increase the information content. Even while taking this course, wouldn't you prefer cool infographics and video content explaining the finer concepts instead of just plain text πŸ˜‰ Here you go:

Multimodality Notion

An infographic on multimodality and why it is important to capture the overall sense of data through different modalities. The infographic is multimodal as well (image + text).

Communication between two people often gets really awkward in purely textual form, improves slightly when voices are involved, and improves greatly when you can also see body language and facial expressions. This was studied in detail by the American psychologist Albert Mehrabian, who summarized it as the 7-38-55 rule of communication: β€œIn communication, 7% of the overall meaning is conveyed through the verbal mode (spoken words), 38% through voice and tone, and 55% through body language and facial expressions.”

Loosely translated to the context of AI, 7% of the meaning is conveyed through the textual modality, 38% through the audio modality, and 55% through the vision modality. Within deep learning, we refer to each modality as a way in which data arrives at a model for processing and prediction. The most commonly used modalities in deep learning are vision, audio, and text. Other modalities can also be used for specific use cases, such as LiDAR, EEG data, and eye-tracking data.

Unimodal models and datasets are based purely on a single modality. They have been studied for a long time across many tasks and benchmarks, but they are limited in their capabilities. Relying on a single modality might not give us the complete picture, while combining modalities increases the information content and reduces the chance of missing cues. For the machines around us to be more intelligent, better at communicating with us, and capable of richer interpretation and reasoning, it is important to build applications and services around models and datasets that are multimodal in nature. Multimodality gives us a clearer and more accurate representation of the world, enabling us to develop applications that are closer to real-world scenarios.

Common combinations of modalities and real life examples:

  • Vision + Text: Infographics, Memes, Articles, Blogs.
  • Vision + Audio: A Skype call with your friend, dyadic conversations.
  • Vision + Audio + Text: Watching YouTube videos or movies with captions; social media content in general is multimodal.
  • Audio + Text: Voice notes, music files with lyrics.

Multimodal Datasets

A dataset consisting of multiple modalities is a multimodal dataset, and datasets exist for each of the common modality combinations listed above.

Now, what kind of tasks can be performed with a multimodal dataset? There are many, but we will mostly focus on tasks that involve the visual and textual modalities. A multimodal dataset requires a model that can process data from multiple modalities; such a model is a multimodal model.
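As a quick, hedged illustration of what a vision + text dataset looks like in practice, the snippet below loads an image-captioning dataset from the πŸ€— Hub with the `datasets` library. The dataset name and its `image`/`text` column names are assumptions used for illustration; any image-text dataset on the Hub would work the same way.

```python
from datasets import load_dataset

# Load an image-captioning dataset: each example pairs an image (vision) with a caption (text).
# NOTE: the dataset name below is just an illustrative assumption.
dataset = load_dataset("lambdalabs/pokemon-blip-captions", split="train")

example = dataset[0]
print(example["image"])  # a PIL image (vision modality)
print(example["text"])   # its caption (text modality)
```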

Multimodal Tasks and Models

Each modality has different tasks associated with it. For example, vision downstream tasks include image classification, image segmentation, object detection, etc., and we use models specifically designed for those tasks. So tasks and models go hand in hand. If a task involves two or more modalities, it can be termed a multimodal task. In terms of inputs and outputs, a multimodal task can generally be thought of as a single input/output arrangement with two different modalities at the input and output ends respectively.

Hugging Face supports a wide variety of multimodal tasks. Let us look into some of them; a short usage sketch follows the list below.

Some multimodal tasks supported by πŸ€— and their variants:

  1. Vision + Text:
  • Visual Question Answering (VQA): Aiding visually impaired persons, efficient image retrieval, video search, Video Question Answering, Document VQA.
  • Image to Text: Image Captioning, Optical Character Recognition (OCR), Pix2Struct.
  • Text to Image: Image Generation.
  • Text to Video: Text-to-video editing, Text-to-video search, Video Translation, Text-driven Video Prediction.
  2. Audio + Text:
  • Automatic Speech Recognition (Speech to Text): Transcription, voice assistants.
  • Text to Speech: Generating spoken audio from text.
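As a minimal sketch of one of these tasks, the snippet below runs Visual Question Answering through the πŸ€— Transformers pipeline API. The checkpoint and the image path are placeholders; any VQA checkpoint on the Hub can be swapped in.

```python
from transformers import pipeline

# Visual Question Answering: the model takes an image and a text question (two modalities)
# and returns a text answer. The checkpoint below is an illustrative choice.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(
    image="path/to/your_image.jpg",  # placeholder image path
    question="How many people are in the picture?",
)
print(result)  # a list of candidate answers with scores
```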

πŸ’‘ An amazing use case of multimodal tasks is Multimodal Emotion Recognition (MER). The MER task involves recognizing emotion from two or more modalities, such as audio + text, text + vision, audio + vision, or vision + text + audio. As we discussed in the example above, MER is more effective than unimodal emotion recognition and gives clearer insight into the emotion recognition task. Check out more on MER with this repository.

Multimodal model flow

A multimodal model is a model that can perform multimodal tasks by processing data from multiple modalities at the same time. These models combine the uniqueness and strengths of different modalities to build a complete representation of the data, enhancing performance across multiple tasks. Multimodal models are trained to integrate and process data from sources like images, videos, text, and audio. The process of combining these modalities begins with multiple unimodal models. The outputs of these unimodal models (encoded data) are then fused by a fusion module; the fusion strategy can be early fusion, late fusion, or hybrid fusion. The overall task of the fusion module is to produce a combined representation of the encoded data from the unimodal models. Finally, a classification network takes the fused representation and makes predictions.
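To make this flow concrete, here is a minimal late-fusion sketch in PyTorch. The encoder sizes, fusion by concatenation, and the classifier head are illustrative assumptions, not a prescribed architecture; in practice, the unimodal encoders would be pretrained vision and text models.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy multimodal model: two unimodal encoders, a fusion module, and a classifier."""

    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=10):
        super().__init__()
        # Unimodal encoders (stand-ins for e.g. a vision backbone and a text transformer).
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion module: simple late fusion by concatenating the two encodings.
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        # Classification head on top of the fused representation.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, image_features, text_features):
        img = self.image_encoder(image_features)
        txt = self.text_encoder(text_features)
        fused = self.fusion(torch.cat([img, txt], dim=-1))
        return self.classifier(fused)

# Dummy pre-extracted image and text features for a batch of 4 examples.
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```

Early fusion would instead combine the modalities before (or at) the encoding stage, while hybrid fusion mixes both strategies.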

A detailed section on multimodal tasks and models, with a focus on vision and text, will be discussed in the next chapter.

An application of multimodality: Multimodal Search πŸ”ŽπŸ“²πŸ’»

Internet search was one of Google's key advantages, but after OpenAI introduced ChatGPT, Microsoft started powering up its Bing search engine to take on the competition. Initially this was restricted to LLMs working over large corpora of text, but the world around us, mainly social media content, web articles, and online content in general, is largely multimodal. When we search for an image, the image pops up with corresponding text describing it. Wouldn't it be super cool to have a powerful multimodal model that handles both vision and text at the same time? This could hugely change the search landscape, and the core technology involved is multimodal learning. Many companies also have large databases that are multimodal and mostly unstructured in nature. Multimodal models can help such companies with internal search, interactive documentation (chatbots), and many similar use cases. This is another domain of Enterprise AI, where we leverage AI for organizational intelligence.

Vision Language Models (VLMs) are models that can understand and process both the vision and text modalities. This joint understanding lets VLMs perform various tasks efficiently, like Visual Question Answering, text-to-image search, etc., and thus makes them one of the best candidates for multimodal search. Overall, a VLM needs to map image and text pairs into a joint embedding space in which each image and each text is represented as an embedding. We can perform various downstream tasks with these embeddings, and they can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning lie close together, enabling us to search for images based on text (text-to-image search) or vice versa.
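As a rough sketch of such a joint embedding space, the snippet below uses CLIP from πŸ€— Transformers to score an image against a few text queries; the scores come from comparing their embeddings in the shared space. The image path is a placeholder, and CLIP is just one example of a VLM that can be used this way.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("path/to/your_image.jpg")  # placeholder image path
texts = ["a photo of a dog", "a photo of a cat", "a city skyline at night"]

# Both modalities are encoded into the same joint embedding space.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, prob in zip(texts, probs[0]):
    print(f"{text}: {prob.item():.3f}")
```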

πŸ’‘Meta released the first multimodal AI model to bind information from 6 different modalities: images and videos, audio, text, depth, thermal, and inertial measurement units (IMUs). Learn more about it here.

After going through the fundamentals of multimodality, let's now take a look at the different multimodal tasks and models available in πŸ€— and their applications via cool demos and Spaces.
