Segments objects in an image from data points.
Converts speech to text and text to speech.
Zero-shot classification of audio and images.
Image captioning, image-text matching and visual Q&A.