ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Abstract
Recent developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. In particular, text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
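To make the reported baseline numbers concrete, here is a minimal sketch of how accuracy and F1 could be computed for a five-way hallucination classifier. It assumes each video carries exactly one of the five labels and that F1 is macro-averaged; the abstract does not state the averaging scheme, so treat both as assumptions. The function names and the example labels are hypothetical and are not taken from the ViBe code.

```python
from collections import Counter

# The five hallucination categories named in the abstract.
CATEGORIES = [
    "Vanishing Subject",
    "Numeric Variability",
    "Temporal Dysmorphia",
    "Omission Error",
    "Physical Incongruity",
]


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted category matches the human label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of per-category F1 (assumed averaging scheme)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for five videos, for illustration only.
y_true = [CATEGORIES[0], CATEGORIES[1], CATEGORIES[2], CATEGORIES[3], CATEGORIES[4]]
y_pred = [CATEGORIES[0], CATEGORIES[1], CATEGORIES[1], CATEGORIES[3], CATEGORIES[0]]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

For context, uniform random guessing over five categories has an expected accuracy of 0.2, so the reported 0.345 is above chance but leaves substantial room for improvement.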
Community
- The paper introduces ViBe, a comprehensive benchmark and dataset for analyzing and categorizing hallucinations in text-to-video (T2V) generation models, aiming to enhance their reliability and alignment with input prompts.
- Novel Dataset and Benchmark: ViBe is a large-scale dataset featuring 3,782 human-annotated videos from T2V models, categorized into five types of hallucinations, including physical incongruities and temporal inconsistencies.
- Evaluation Framework: The paper establishes baseline performance for detecting hallucinations using classifiers like CNNs and Transformers, with TimeSFormer embeddings achieving the best accuracy (0.345) and F1 score (0.342).
- Future Directions: ViBe provides a foundation for improving hallucination detection in T2V models, highlighting areas such as multi-hallucination detection and mitigating annotation subjectivity.
Hi @amanchadha congrats on this work!
Are you planning to release code, and the benchmark on the hub?
Happy to assist if required! We just added support for the Video feature in Datasets (along with the viewer)! https://huggingface.co/docs/datasets/main/en/video_load
Kind regards,
Niels
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison (2024)
- The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio (2024)
- EventHallusion: Diagnosing Event Hallucinations in Video LLMs (2024)
- AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models (2024)
- LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models (2024)