Running on Zero 11 11 Explainable-Vision-Language-Model 🥶 Generate a video visualizing how a multimodal model attends to an image while generating text