Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? Paper • 2406.07546 • Published Jun 11 • 8
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13 • 18
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 13 • 19
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 24
ImagenHub: Standardizing the evaluation of conditional image generation models Paper • 2310.01596 • Published Oct 2, 2023 • 18
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models Paper • 2312.03052 • Published Dec 5, 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering Paper • 2303.11897 • Published Mar 21, 2023
Training Language Models to Generate Text with Citations via Fine-grained Rewards Paper • 2402.04315 • Published Feb 6
One Embedder, Any Task: Instruction-Finetuned Text Embeddings Paper • 2212.09741 • Published Dec 19, 2022 • 3
In-Context Learning for Few-Shot Dialogue State Tracking Paper • 2203.08568 • Published Mar 16, 2022 • 1
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training Paper • 2306.01693 • Published Jun 2, 2023 • 3
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 Paper • 2211.09699 • Published Nov 15, 2022 • 2