arxiv:2309.12499

CodePlan: Repository-level Coding using LLMs and Planning

Published on Sep 21, 2023
· Featured in Daily Papers on Sep 25, 2023

Abstract

Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase involve pervasively editing the entire repository of code. We formulate these activities as repository-level coding tasks. Recent tools like GitHub Copilot, which are powered by Large Language Models (LLMs), have succeeded in offering high-quality solutions to localized coding problems. Repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is inter-dependent and the entire repository may be too large to fit into the prompt. We frame repository-level coding as a planning problem and present a task-agnostic framework, called CodePlan, to solve it. CodePlan synthesizes a multi-step chain of edits (plan), where each step results in a call to an LLM on a code location with context derived from the entire repository, previous code changes and task-specific instructions. CodePlan is based on a novel combination of an incremental dependency analysis, a change may-impact analysis and an adaptive planning algorithm. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python). Each task is evaluated on multiple code repositories, each of which requires inter-dependent changes to many files (between 2 and 97 files). Coding tasks of this level of complexity have not been automated using LLMs before. Our results show that CodePlan has a better match with the ground truth compared to baselines. CodePlan is able to get 5/6 repositories to pass the validity checks (e.g., to build without errors and make correct code edits) whereas the baselines (without planning but with the same type of contextual information as CodePlan) cannot get any of the repositories to pass them.

Community

My highlights from the paper:

As software projects grow, changing code across entire repositories becomes tedious & error-prone. Tasks like migrating APIs or updating dependencies require complex edits across files. This paper proposes a way to automate these "repository-level" coding challenges with AI: CodePlan - an AI system that breaks repository tasks into incremental steps guided by planning & analysis.

Key points:

  • Uses LLMs (GPT-4 in the paper) for localized code edits
  • Maintains validity across repository via incremental analysis
  • Adaptively plans multi-step changes based on code dependencies
  • Significantly outperformed baselines on API migration & temporal edits
  • Automated tasks across a 168-file C# codebase
  • 2-3x more accurate edit locations than baselines
  • Produced final valid codebases, unlike reactive approaches

The core insight is combining LLM strengths with rigorous planning based on dependency analysis. This automates interdependent code changes that naive LLM use struggles with (I personally run into these issues all the time with GPT-4 - it lacks context about the entire repo and how the files fit together).

I think CodePlan demonstrates AI can expand beyond small coding assists into large-scale engineering tasks. Planning + LLMs > LLMs alone. This could really improve productivity and code quality... at least for me :)
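
To make the planning + LLMs loop concrete, here's a rough sketch of how such a system could be wired together. The repo/llm/oracle objects are hypothetical stand-ins of my own invention, not the authors' code:

```python
# Hypothetical sketch of a CodePlan-style loop: a seed edit is propagated
# via dependency analysis, each affected location is repaired by an LLM
# call, and an oracle (build/tests) turns failures into new seeds.
from collections import deque

def codeplan_loop(seed_edits, repo, llm, oracle, max_rounds=10):
    """Drive repository-wide edits until the oracle passes or we give up."""
    frontier = deque(seed_edits)
    for _ in range(max_rounds):
        # Adaptive planning: propagate each change to the code it may impact.
        while frontier:
            change = frontier.popleft()
            for location in repo.may_impact(change):  # incremental dependency analysis
                context = {
                    "spatial": repo.spatial_context(location),  # related code
                    "temporal": repo.temporal_context(),        # previous edits
                }
                patch = llm.propose_edit(location, context)
                repo.apply(patch)
                frontier.append(patch)  # a new change: analyze its impact too
        # Oracle check (build, tests, type checker); failures become new seeds.
        failures = oracle.check(repo)
        if not failures:
            return repo  # repository reached a valid state
        frontier.extend(failures)
    raise RuntimeError("repository did not reach a valid state")
```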

Full summary

Personal opinion, but... a paper like this is worthless without an accompanying implementation. I mean, anyone can sit here and describe an LLM system that comprehensively codes an entire, complex repository.

Showing us a viable implementation is what would actually make this impressive or noteworthy as research for the community. No disrespect to the researchers or Microsoft, but this entire piece read more like an LLM fanfic than anything worth taking seriously. If the researchers feel this criticism is unwarranted or unfair then they should prove it by using a little bit of elbow grease and getting out a working code implementation.

Introduces CodePlan: repository-level coding assistance for testing, package migration, type annotations, and other tasks that span the whole repository (fitting the entire codebase into one prompt is infeasible). It synthesizes a multi-step chain of edits, each a call to an LLM with context drawn from the entire repository, previous edits, and task instructions.

  • Problem formulation: a seed specification makes changes in one place; the system must generate the derived changes/specifications needed to bring the repository back to a valid state (validity defined through an oracle: tests, linting, a correct build, etc.).
  • Adaptive planning takes a seed change plus a change may-impact analysis (via incremental dependency analysis of the repository) and produces a plan graph used to track completed, next, pending, and new actions. The LLM receives a prompt with the next task and modifies the snippet; the patch is merged into the repository, and the oracle turns completed tasks into new (delta) seeds.
  • The dependency graph is built from abstract syntax trees (ASTs) and class hierarchy analysis. The plan graph is a directed acyclic graph whose edges record dependency/cause relationships between obligations (each containing a code block/snippet, an instruction, and a status).
  • Code fragment extraction traverses the AST and folds segments so that only the modifiable blocks remain unfolded. Spatial context comes from the dependency graph; combined with the plan graph, it also yields temporal context. The custom prompt template contains task instructions, previous changes (temporal context), causes for the change, spatial context (related code), the snippet to change, and a (prompt-engineered) note for the task.
  • Implementation: tree-sitter for AST structure, the Jedi static-analysis tool for Python, GPT-4 as the LLM making the edits, and pyright plus build tools as the oracle. Tested on external and internal (Microsoft) repositories.
  • Baselines include oracle-guided repair (give the error and the snippet to the LLM) and the Coeditor transformer model. Metrics: Levenshtein distance (number of changes between source and target repositories) and DiffBLEU.
  • Results: CodePlan makes better localized edits for repo-level tasks, and both temporal and spatial contexts matter. It demonstrates strategic planning, context extraction, propagation of behavioral changes, and maintenance of cause-effect relationships. From Microsoft.
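
The prompt assembly is easy to picture in code. A minimal sketch, with hypothetical obligation/plan-graph/dependency-graph accessors (my names, not the paper's API):

```python
# Hypothetical sketch of the prompt template described above: task
# instructions, temporal context (previous changes), causes for the change,
# spatial context (related code), the snippet to edit, and a task note.
PROMPT_TEMPLATE = """\
{instructions}

Previous changes made elsewhere in the repository:
{temporal_context}

This edit is required because:
{causes}

Related code from the repository:
{spatial_context}

Code to modify:
{snippet}

Note: {note}
"""

def build_prompt(obligation, plan_graph, dependency_graph):
    """Assemble an LLM prompt for one obligation (block, instruction, status)."""
    return PROMPT_TEMPLATE.format(
        instructions=obligation.instruction,
        temporal_context=plan_graph.completed_diffs(),   # from the plan graph
        causes=plan_graph.causes_of(obligation),
        spatial_context=dependency_graph.related_code(obligation.block),
        snippet=obligation.block.source,
        note=obligation.note,
    )
```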

Links: arxiv, GitHub
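
For the fragment-extraction step (folding everything except the modifiable block), here's a rough stdlib-only illustration; the paper uses tree-sitter, but Python's ast module (3.9+) is enough to show the idea:

```python
# Illustrative only: collapse every function body except the target one,
# keeping signatures visible, similar in spirit to the fragment extraction
# described above.
import ast

def fold_except(source: str, keep_function: str) -> str:
    """Replace all function bodies except `keep_function` with `...`,
    so only the modifiable block stays unfolded."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
                and node.name != keep_function):
            node.body = [ast.Expr(value=ast.Constant(value=...))]
    return ast.unparse(tree)

example = """
def helper(x):
    y = x * 2
    return y + 1

def target(a, b):
    return helper(a) + helper(b)
"""
print(fold_except(example, "target"))  # helper's body collapses to `...`
```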

The link that you provided to the repo is 404'd. I also took the time to search for the repo everywhere and could not find it (Google dorked for it too).

So I stand by my original comments. If the researchers don't like it, then tell them to push something out and contribute more than just a bunch of thoughts and opinions on a piece of paper. I thought Microsoft only hired people that took their craft seriously smh.

Hello! I am trying to implement a proof of concept of this paper in Java, to apply it to Java projects. I plan to publish it as open source. The project is still only halfway done, and for now it is too messy and incomplete; it doesn't seem right to publish it in this state. Before I can do that, I need to reorganize the code and get something minimally functional.

For now I can say that instead of the tree-sitter library I am using javaparser.
I have managed to obtain the dependency graph, and I more or less have the classes representing the core of the problem.
As the paper describes, I extract the Fragment and the SpatialContext. Now I am implementing an adapter to call an LLM. Initially I will test against an LLM running locally with ooba textgen.

As soon as I can invoke the LLM with the spatial context and organize the code a little, I will make the project public and put the link here in case anyone is interested.
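
In the meantime, here is a minimal sketch of what such an adapter could look like (in Python for brevity; the project itself is Java), assuming a locally running server that exposes an OpenAI-compatible chat endpoint, as text-generation-webui can. The URL, port, and payload shape are assumptions, not taken from the project:

```python
# Rough sketch, not the project's actual code: send an assembled prompt
# (snippet + spatial context) to a local LLM through an assumed
# OpenAI-compatible chat endpoint.
import requests

def propose_edit(prompt: str, base_url: str = "http://127.0.0.1:5000") -> str:
    """Ask the local LLM for an edited version of the prompted snippet."""
    response = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,  # keep edits close to deterministic
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```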

A question for the researchers - where can we find the repo now?

To my understanding, there is still no repo. The researchers at Microsoft seem more interested in publishing conjecture and hypothetical papers, premised on nothing but self-adulation, describing a tool that doesn't exist.

What they did is no different than me writing a paper describing the specs of a UFO-finder that I tout as a comprehensive solution for finding any and all UFOs, no matter where one is on planet Earth. Sure, the idea sounds great in theory, but without any concrete code or product we can evaluate, this paper should be considered bullshit and fodder.

I hate to be that jackass but I've grown exhausted with 'intellectual masturbation' as I like to call it. Because that's what this is. This is an empty idea with NOTHING to back it. And I'm not going to spend time doing these guys' job for them by using their paper as a guideline to create the repo. My assumption is that in order for them to have written such a paper, the researchers would've (hopefully) had the code on hand to test and observe its efficacy.

I see nothing in the paper indicating that the failure to publish a repo is due to an encumbered license of some sort. Even if there were an encumbrance, it would have to come from Microsoft anyway. The researchers apparently work for Microsoft, so the repo and the idea are technically Microsoft's property (assuming the researchers leveraged Microsoft resources to create it). The way around this would be open sourcing it and publishing the code to the appropriate repo. The fact that there are SIMILAR repos out there by Microsoft on GitHub makes me question why this would not be one of them.

Hello! I just made the repository public, but I warn you that the code is still a mess :( It's here: https://github.com/jmdevall/opencodeplan
Before anyone asks: I am not a researcher from India or from Microsoft or anything like that.
Personally, I don't know why they have published a paper that refers to a link that doesn't work. It's not very professional and does not make much sense. The paper says they ran tests with Python and C#; in my case it was more interesting to do it in Java.
I think two totally different things may have happened: either they had planned to publish it but, seeing that it is so good, decided to keep it as an extra in one of their products ($$$$), or the implementation they had was so bad that it was not worth publishing.

The link that you provided to the repo is 404'd. I also took the time to search for the repo everywhere and could not find it (Google dorked for it too).

@librehash You're correct. The repo still isn't public. I thought it would be up a few days after the paper was posted.
I expected the code to be there. If the work is legit, I really hope the community gets a proof of concept before this becomes part of Copilot. The paper has 59 upvotes on HuggingFace Papers alone. I really hope the codebase is made public, at least to verify the claims against custom code or to extend it to other programming languages.

What a surprise! For my part, I had set my project aside for a while.
I see that they have only published the benchmark data.

Amazing that the repo is still not published

And now (2024-04-04) the GitHub link works.
