Little Paper Reviews & AutoCodeRover
Recently, I’ve been reading a lot of papers in my research field, AI Agents and Planning, and I’d like to gather some TL;DRs and reviews of them here 💫
Writing reviews helps me dive deeper into the material, as when I’m just reading, I can sometimes gloss over things or mistakenly assume I’ve understood them. True understanding and confidence come when I share what I’ve read with others, processing the content first in my mind, then through writing, as I try to organize it and link it with what I already know. I hope that by sharing articles on Hugging Face, I can replace the need for a conversation partner, and find the motivation and time to keep up with this practice, regularly sharing the insights I’ve gained.
All notes I made (in italic) are subjective and open to discussion, so if you have any feedback, drop me a comment under this article. Additionally, I want to mention that I review these papers with the assumption that they are of high quality, which is why I chose them for reading and learning. However, many of my notes may contain criticism or suggestions for improvement. I use this approach to train myself to think critically and develop the hard skills necessary for becoming a better researcher. So, if I'm reviewing your paper, please don't take it personally 🤗
AutoCodeRover: Autonomous Program Improvement
The first work (and I hope not the last) that I will try to present and understand here is AutoCodeRover.
📄 AutoCodeRover: Autonomous Program Improvement
👩‍💻 https://github.com/nus-apr/auto-code-rover
🌐 https://autocoderover.dev
Main Idea and Result
The paper presents yet another SWE agent. By adding sophisticated code search capabilities, an AST-based representation of the project tree (instead of a flat file collection), and test execution, the authors claim to resolve 19% of SWE-bench-lite issues (57 GitHub issues), with about 4 minutes and $0.43 USD spent per issue on average.
Background
A brief overview of the current SWE-bench task and solutions. Task: given an issue and the repository content (Python language, top-12 open-source repos), you need to resolve the issue (pass all the hidden tests). In a table in the paper, they compare their solution with previously released ones, showing how the leaderboard looked at the time.
Solution
The workflow is divided into two stages: (1) a context retrieval loop, in which the agent uses search APIs to gather sufficient context, and (2) a patch generation loop, in which an issue-fixing patch is generated until an applicable one is produced. A separate LLM-based agent is invoked for each stage.
Context Retrieval
For context retrieval, the authors propose code search tools of different granularity.
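To make the idea of layered search tools concrete, here is a minimal sketch of what such an interface could look like. The class and method names below are my own illustration (inspired by the class-, method-, and snippet-level searches described in the paper), not necessarily the exact AutoCodeRover API.

```python
# Illustrative interface for layered code search over an AST-based project index.
# Names are my own, not necessarily the exact AutoCodeRover API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchResult:
    file_path: str
    signature: str            # e.g. "def resolve(self, url): ..."
    content: Optional[str]    # full body, filled only by content-level calls

class CodeSearchTools:
    """Search the project at class, method, and snippet granularity."""

    def __init__(self, ast_index):
        self.index = ast_index  # assumed pre-built index of classes and methods

    def search_class(self, class_name: str) -> list[SearchResult]:
        """Return class-level signatures (class name + method signatures), without bodies."""
        ...

    def search_method_in_class(self, method_name: str, class_name: str) -> list[SearchResult]:
        """Return the signature of a method defined inside a given class."""
        ...

    def search_code(self, snippet: str) -> list[SearchResult]:
        """Grep-like search for a code snippet across the repository."""
        ...

    def get_content(self, result: SearchResult) -> str:
        """Fetch the full body for a previously returned signature-level hit."""
        ...
```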
Note: To minimize context size, the authors separate signature-retrieving and content-retrieving APIs, assuming that the LLM will call the signature API first and ask for the content only if it considers that necessary. However, it would be interesting to check whether the trajectories confirm this assumption: if such signature API calls are always followed by content calls, we could keep just the latter and decrease the number of tool calls.
The APIs are invoked following a stratified strategy (a sketch of this loop is given below). In each stratum:
- Prompt the LLM to select the set of necessary API invocations, based on the current context (problem statement + the code searched so far)
- Prompt the LLM to analyze whether the current context is sufficient
Note: There is no appendix revealing the prompts, and it is hard to read them from the code. Also, the prompts I discovered in the repo do not contain any examples of context "necessity" and "sufficiency" (relying on a zero-shot approach), so, for me, it is hard to trust the authors and the LLM itself regarding the relevance of the resulting context.
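Here is a minimal sketch of how this stratified retrieval loop could be wired together, assuming the hypothetical CodeSearchTools above and a generic llm object; select_api_calls and is_context_sufficient stand in for the two prompts described in the bullet points.

```python
# Minimal sketch of the stratified context-retrieval loop.
# `llm.select_api_calls` and `llm.is_context_sufficient` are hypothetical
# stand-ins for the two prompts described above.
def retrieve_context(llm, tools, problem_statement, max_strata=5):
    context = [problem_statement]
    for _ in range(max_strata):
        # 1. Ask the LLM which search API calls are needed given the context so far.
        calls = llm.select_api_calls(context)        # e.g. [("search_class", {"class_name": "Foo"})]
        for name, kwargs in calls:
            for result in getattr(tools, name)(**kwargs):
                context.append(result.signature)     # add retrieved signatures/snippets to the context
        # 2. Ask the LLM whether the collected context is already sufficient.
        if llm.is_context_sufficient(context):
            break
    return context
```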
Patch Generation
Given the collected context, the agent iteratively tries to generate an "applicable" issue-solving patch. As an additional improvement, the authors apply a Python linter to control indentation. The agent is allowed to retry up to a pre-configured attempt limit (currently set to three), after which the best patch so far is returned as the output.
Note: What if the context is incomplete (e.g., the method to fix is right, but not enough lines inside it are added to the context)? In such a case, we will never get the right patch. Maybe a tool could be provided here to enlarge the retrieved code area if required? Also, the authors do not elaborate on the additional context length (+/- 3 lines), so maybe 5 is optimal?
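For illustration, a minimal sketch of this retry loop is given below. The lint check here simply verifies that the edited code still parses, and generate_patch / applies_cleanly are hypothetical stand-ins, since the paper does not spell out these interfaces.

```python
MAX_ATTEMPTS = 3  # pre-configured retry limit mentioned in the paper

def lint_ok(patched_code: str) -> bool:
    """Crude stand-in for the linter step: check that the edited code still parses."""
    try:
        compile(patched_code, "<patch>", "exec")
        return True
    except SyntaxError:  # also covers IndentationError
        return False

def generate_applicable_patch(llm, repo, context):
    best_patch = None
    for _ in range(MAX_ATTEMPTS):
        patch = llm.generate_patch(context)            # hypothetical: propose an edit from the retrieved context
        if lint_ok(patch.new_code) and repo.applies_cleanly(patch):
            return patch                               # "applicable": lint-clean and the diff applies
        best_patch = best_patch or patch               # otherwise keep an earlier attempt as a fallback
    return best_patch                                  # the best patch so far is returned as the output
```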
Evaluation
- Benchmark: the full SWE-bench, a random 25% subset of SWE-bench data points (to compare with Devin), and SWE-bench-lite
- Baselines and Evaluation Metrics: Compared with Devin and SWE-agent. All values are averaged over three runs; pass@1 and pass@3 are reported. As metrics, the authors use (1) the percentage of resolved instances, (2) average time cost, and (3) average token cost to evaluate the effectiveness of the tools.
- Implementation and Parameters: gpt-4-0125-preview, temperature=0.2, max_tokens=1024.
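As a side note, these generation settings would map onto an OpenAI API call roughly as follows; the exact client code used by the authors is an assumption on my part, and the prompt contents are placeholders.

```python
# Sketch of the reported generation settings (model, temperature, max_tokens).
# The prompt contents are placeholders, not the authors' actual prompts.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    temperature=0.2,
    max_tokens=1024,
    messages=[
        {"role": "system", "content": "You are a software engineering agent."},
        {"role": "user", "content": "<problem statement + retrieved context>"},
    ],
)
print(response.choices[0].message.content)
```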
RQ1: To what extent can AutoCodeRover automate the resolution of software issues like human developers?
The results are presented in the table below.
- AutoCodeRover and SWE-agent complement each other in different scenarios
- The main reason that AutoCodeRover failed on the 23 unique instances resolved by SWE-agent is unimplemented search APIs. Note: but I don't get the point of what they call unimplemented search APIs
- Overall, on SWE-bench lite, AutoCodeRover has a correctness rate of 65.4% (51 correct / 78 plausible). A plausible patch passes the given tests (but is not always equivalent to the developer's patch)
- The vast majority of AutoCodeRover’s overfitting patches (all but 2) modify the same methods as the developer patches, but the code modifications are wrong
- One cause is that the issue creator gives a preliminary patch in the description. This patch can be different from the final developer patch, misleading the LLM
- The other interesting cause is that the issue creator mentioned a case that needs to be handled. The LLM only fixes this mentioned case, but the developer fixed other similar cases as well
RQ2: Can existing debugging / analysis techniques assist AutoCodeRover?
As an additional experiment, the authors try to equip the agent with method-level Spectrum-based Fault Localization (SBFL). This technique identifies potentially error-prone code by analyzing test execution results. As the test suite, the developer test suite of the target task instance can be taken (it is provided in the dataset). The proposed way of utilizing SBFL is to add its output to the context before the context retrieval stage, hoping that this information extends the hints from the issue text and steers the agent toward the faulty code elements. Moreover, in the patch generation stage, they propose to run the test suite and, in case of failure, regenerate the patch up to 3 times.
Note: AFAIK, SWE-Bench contains not only bug reports but also new feature issues, so in this case SBFL might bias the agent. It may be more reasonable to first classify issues as bug-related or not, and then apply SBFL only to the bug-related ones.
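For readers unfamiliar with SBFL, here is a minimal sketch of method-level suspiciousness scoring. I use the Ochiai formula, which is one common choice; I am not claiming this is the exact formula the authors use.

```python
# Method-level SBFL sketch using the Ochiai formula (one common choice;
# not necessarily the formula used in the paper).
import math
from collections import defaultdict

def ochiai_scores(coverage, outcomes):
    """
    coverage: dict test_id -> set of method ids executed by that test
    outcomes: dict test_id -> True if the test passed, False if it failed
    Returns methods sorted by suspiciousness (most suspicious first).
    """
    failed = {t for t, ok in outcomes.items() if not ok}

    ef = defaultdict(int)  # failing tests that execute the method
    ep = defaultdict(int)  # passing tests that execute the method
    for test, methods in coverage.items():
        for m in methods:
            if test in failed:
                ef[m] += 1
            else:
                ep[m] += 1

    scores = {}
    for m in set(ef) | set(ep):
        denom = math.sqrt(len(failed) * (ef[m] + ep[m]))
        scores[m] = ef[m] / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: the failing test touches `pkg.mod.Foo.bar`, so it ranks highest.
print(ochiai_scores(
    coverage={"t1": {"pkg.mod.Foo.bar"}, "t2": {"pkg.mod.Foo.baz"}},
    outcomes={"t1": False, "t2": True},
))
```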
- Increase from 19% to 22%. Note: Surprisingly, not so dramatic
- ACR-sbfl still uniquely resolves 7 task instances that are not resolved in any of the other runs
RQ3: What are the challenges for AutoCodeRover and fully automated program improvement in the future?
The authors classify the generated patches into the following categories:
- Success: The generated patch resolves the issue
- Wrong patch: The generated patch modifies all methods that are modified in the developer patch. This means the patch content is wrong but the patch location(s) are correct
- Wrong location in correct file: The generated patch modifies the correct file but the wrong location(s) in the file
- Wrong file: The generated patch modifies the wrong file
- No patch: No patch is generated from the retrieved context
Note: What if we simply provide the LLM with the developer's test suite that the patch must pass? Such an oracle estimation could better help assess the maximum potential of test hints. Moreover, if the results turn out well, we could further explore generating a test suite based on the issue text!!! Smells like a good research direction to try?
Summary
Overall, the paper is inspiring for those looking to equip a SWE agent with SE tools like test generation and debugging. However, the lack of an ablation study of the proposed tool is a drawback, as it could have clarified certain details and supported the hypotheses mentioned in my notes. That said, the experiment setup and evaluation are conducted with good quality and clarity.