Reasoning benchmarks • Collection • Various benchmarks for reasoning capabilities of LLMs • 1 item • Updated Oct 4
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos • Paper • 2410.02763 • Published Oct 3 • 7
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation • Paper • 2409.06820 • Published Sep 10 • 63
ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16 • Text Generation • Updated 20 days ago • 3.51k • 12
NLP Party • Collection • My (Denis Gordeev) collection of mostly NLP papers. You can message me at t.me/nlp_party • 14 items • Updated Mar 22
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? • Paper • 2403.14624 • Published Mar 21 • 51