τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains Paper • 2406.12045 • Published Jun 17, 2024
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions Paper • 2312.12450 • Published Dec 11, 2023
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation Paper • 2208.08227 • Published Aug 17, 2022
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs Paper • 2308.09895 • Published Aug 19, 2023
Type Prediction With Program Decomposition and Fill-in-the-Type Training Paper • 2305.17145 • Published May 25, 2023
Reflexion: Language Agents with Verbal Reinforcement Learning Paper • 2303.11366 • Published Mar 20, 2023
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks Paper • 2401.17263 • Published Jan 30, 2024
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models Paper • 2310.04406 • Published Oct 6, 2023