Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Paper • 2411.16863 • Published 28 days ago
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization Paper • 2408.14547 • Published Aug 26
Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities Paper • 2407.20337 • Published Jul 29
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs Paper • 2404.15406 • Published Apr 23
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models Paper • 2311.16254 • Published Nov 27, 2023
The (R)Evolution of Multimodal Large Language Models: A Survey Paper • 2402.12451 • Published Feb 19
itserr/scratch_2-nodes_tokenizer_latbert-original_packing_fcocchi Text Generation • Updated Jun 6
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning Paper • 2308.12383 • Published Aug 23, 2023
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing Paper • 2304.02051 • Published Apr 4, 2023
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On Paper • 2305.13501 • Published May 22, 2023
Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing Paper • 2403.14828 • Published Mar 21