VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Paper • 2505.15952
Thanks for sharing! Another benchmark for evaluating long contexts: https://github.com/adobe-research/NoLiMa (paper: NoLiMa: Long-Context Evaluation Beyond Literal Matching)