arxiv:2310.17680

CodeFusion: A Pre-trained Diffusion Model for Code Generation

Published on Oct 26, 2023

Submitted by akhaliq on Oct 30, 2023

#1 Paper of the day


Authors:

Mukul Singh ,

Abstract

Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.


Community

MichaelBarryUK

Oct 30, 2023 (edited Oct 30, 2023)

Amazing. Also, it would be intriguing to see the code being inferred non-linearly; it would make for some interesting UI effects. Imagine the green rain effect from The Matrix morphing into real source code.

Also, according to this paper, GPT-3.5 has 20B parameters.

Drstone1

Oct 30, 2023

The code in Python for deleting unwanted cookies

TheProjectsGuy

Oct 30, 2023

Proposes CodeFusion: a code generation model based on diffusion (combined with an encoder-decoder model), conditioned on natural language. Diffusion for text: an embedding layer converts discrete tokens to continuous embeddings, the embeddings are denoised, and the closest discrete embedding is then retrieved for each position. The architecture has an encoder, a denoiser (the diffusion component), a decoder, and a classification head (for code tokens). Two-stage training: unsupervised pre-training of the denoiser and decoder, then supervised fine-tuning of the encoder, denoiser, and decoder on (utterance, code) pairs. Loss adapted from GENIE. Benchmarked on Python (CoNaLa), Bash, and conditional formatting (CF) rules in Microsoft Excel. Encoder initialized from CodeT5. Reports performance on par with or better than StarCoder, CodeT5+, and GPT-3.5 (measured by CodeBERTScore for Python, template match for Bash, and execution match for CF rules); also generates more diverse outputs. The appendix has implementation and training details, baseline details, a visualization of the diffusion process (code at each time step), and background (auto-regression and diffusion). From Microsoft.
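To make the "retrieve closest discrete embedding" step concrete, here is a minimal sketch of nearest-neighbor rounding in plain Python. It is not the paper's implementation: the vocabulary, the 3-dimensional embedding table, and the function names are all hypothetical, and a small bounded perturbation stands in for whatever residual error a learned denoiser would leave behind.

```python
import math
import random

# Hypothetical toy vocabulary and embedding table (illustrative only,
# not taken from the paper).
EMBEDDINGS = {
    "ls":   [1.0, 0.0, 0.0],
    "-l":   [0.0, 1.0, 0.0],
    "|":    [0.0, 0.0, 1.0],
    "grep": [1.0, 1.0, 0.0],
    "foo":  [0.0, 1.0, 1.0],
}


def embed(tokens):
    """Map discrete tokens to continuous vectors via table lookup."""
    return [list(EMBEDDINGS[t]) for t in tokens]


def perturb(vectors, max_error, rng):
    """Simulate imperfect denoising: shift each coordinate by a bounded
    random amount. This is an assumption for the demo, not a diffusion
    noise schedule."""
    return [[x + rng.uniform(-max_error, max_error) for x in v] for v in vectors]


def nearest_token(vector):
    """Return the token whose embedding is closest in Euclidean distance."""
    return min(EMBEDDINGS, key=lambda t: math.dist(vector, EMBEDDINGS[t]))


def round_to_tokens(vectors):
    """Convert a sequence of continuous vectors back to discrete tokens."""
    return [nearest_token(v) for v in vectors]


if __name__ == "__main__":
    program = ["ls", "-l", "|", "grep", "foo"]
    noisy = perturb(embed(program), max_error=0.2, rng=random.Random(0))
    print(round_to_tokens(noisy))
```

Because every position is rounded independently, any token in the program can change from one denoising step to the next, which is the contrast with auto-regressive decoding that the abstract draws. Note that the summary above also mentions a decoder and a classification head for code tokens, so the actual model may map denoised states to tokens differently from this plain nearest-neighbor lookup.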