Collections including paper arxiv:2305.01210

- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 21
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 10
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10

- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 10
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
  Paper • 2305.01210 • Published • 4
- AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
  Paper • 2309.06495 • Published • 1
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35

- Magicoder: Source Code Is All You Need
  Paper • 2312.02120 • Published • 79
- StarCoder 2 and The Stack v2: The Next Generation
  Paper • 2402.19173 • Published • 134
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
  Paper • 2305.01210 • Published • 4
- NeuRI: Diversifying DNN Generation via Inductive Rule Inference
  Paper • 2302.02261 • Published • 3

- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 21
- Evaluating Large Language Models Trained on Code
  Paper • 2107.03374 • Published • 6
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  Paper • 2102.04664 • Published • 1