CodeFusion: A Pre-trained Diffusion Model for Code Generation
Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.
Proposes CodeFusion: a diffusion-based code generation model (combined with an encoder-decoder model), conditioned on natural language. Diffusion for text: an embedding layer converts discrete tokens to continuous embeddings, the model iteratively denoises them, then each result is mapped back to the closest discrete embedding. Architecture: encoder, diffusion denoiser, decoder, and a classification head (for code tokens). Two-stage training: unsupervised pre-training of the denoiser and decoder, then supervised fine-tuning of the encoder, denoiser, and decoder on (utterance, code) pairs. Loss adapted from GENIE. Benchmarked on Python (CoNaLa), Bash, and conditional formatting rules in MS Excel. Encoder initialized from CodeT5. Better performance than StarCoder, CodeT5+, and GPT-3.5 (evaluated via CodeBERT score for Python, template match for Bash, and execution match for CF rules); also generates more diverse outputs. Appendix has implementation and training details, baseline details, a visualization of the diffusion process (code at each time step), and background (auto-regression and diffusion). From Microsoft.
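The denoise-then-round loop described above can be sketched in a few lines. This is a toy illustration only: the `denoise` step here is a stand-in heuristic, not CodeFusion's trained denoiser, and the embedding table, sizes, and conditioning argument are all made up for the sketch. The point is the shape of the pipeline: start from Gaussian noise over a full program, denoise iteratively (conditioned on the encoded utterance), then snap each continuous embedding to its nearest discrete token.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, LEN, STEPS = 50, 16, 8, 10      # toy sizes, not the paper's
embed = rng.normal(size=(VOCAB, DIM))       # stand-in token embedding table

def denoise(x_t, t, nl_encoding):
    """Stand-in for the learned denoiser.

    The real model would condition on nl_encoding (the encoded natural
    language); here we just nudge x_t toward the nearest clean embeddings.
    """
    nearest = ((x_t[:, None, :] - embed[None, :, :]) ** 2).sum(-1).argmin(1)
    return x_t + (embed[nearest] - x_t) / (t + 1)

def generate(nl_encoding=None):
    # Start from pure noise over the WHOLE program, unlike auto-regressive
    # decoding, so every position can still be revised at every step.
    x = rng.normal(size=(LEN, DIM))
    for t in reversed(range(STEPS)):
        x = denoise(x, t, nl_encoding)
    # Round each continuous embedding back to its closest discrete token.
    dists = ((x[:, None, :] - embed[None, :, :]) ** 2).sum(-1)
    return dists.argmin(1)

tokens = generate()
```

Because the final step divides by `t + 1 = 1`, the last iteration lands exactly on embedding-table rows, so the rounding step is well defined even in this toy version.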
I'm wondering if the 20B is a typo.
If it's true, it should be BIG news.
Maybe it's three open source 7B models stuck together and Microsoft is trolling them 😂
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation (2023)
- T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble (2023)
- CAT-LM: Training Language Models on Aligned Code And Tests (2023)
- InstructCoder: Empowering Language Models for Code Editing (2023)
- InstructExcel: A Benchmark for Natural Language Instruction in Excel (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space