-
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Paper • 2405.07990 • Published • 15 -
Large Language Models as Planning Domain Generators
Paper • 2405.06650 • Published • 8 -
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
Paper • 2404.12753 • Published • 38 -
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper • 2404.07972 • Published • 41
Collections
Discover the best community collections!
Collections including paper arxiv:2404.03543
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 23 -
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 173 -
RULER: What's the Real Context Size of Your Long-Context Language Models?
Paper • 2404.06654 • Published • 32 -
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Paper • 2404.03820 • Published • 21
-
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Paper • 2404.05961 • Published • 62 -
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 57 -
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
Paper • 2404.03543 • Published • 15 -
Are large language models superhuman chemists?
Paper • 2404.01475 • Published • 13
-
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Paper • 2403.09029 • Published • 53 -
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Paper • 2403.12968 • Published • 20 -
RAFT: Adapting Language Model to Domain Specific RAG
Paper • 2403.10131 • Published • 64 -
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 54
-
A Survey on Language Models for Code
Paper • 2311.07989 • Published • 21 -
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 3 -
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 10 -
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Paper • 2402.14261 • Published • 10
-
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Paper • 2306.00107 • Published • 2 -
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Paper • 2309.08730 • Published • 1 -
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Paper • 2402.16153 • Published • 55 -
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Paper • 2401.11944 • Published • 24