arxiv:2407.05700

InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Published on Jul 8 · Submitted by wyt2000 on Jul 9

Abstract

Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on the data generated from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction tuning. This paper explores how to further improve an instruction-tuned code LLM by generating data from itself rather than querying closed-source LLMs. Our key observation is the misalignment between the translation of formal and informal languages: translating formal language (i.e., code) to informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse. Specifically, given an instruction tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. Then, we fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.
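The data-generation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `summarize` and `evaluate` are hypothetical callables standing in for calls to the instruction-tuned code LLM, and keeping only the best-scored candidate per snippet is an assumed selection rule (the abstract says only that self-evaluation is used).

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, code response)


def inverse_instruct(
    corpus: List[Pair],
    summarize: Callable[[str], List[str]],
    evaluate: Callable[[str, str], float],
) -> List[Pair]:
    """Return the original corpus plus self-generated (instruction, code) pairs.

    summarize(code) -> candidate instructions (hypothetical LLM call).
    evaluate(instruction, code) -> score (hypothetical self-evaluation call).
    """
    generated: List[Pair] = []
    for _, code in corpus:
        candidates = summarize(code)  # code -> natural-language instructions
        if not candidates:
            continue
        # Assumed selection rule: keep the candidate the model rates highest.
        best = max(candidates, key=lambda ins: evaluate(ins, code))
        generated.append((best, code))
    # The base model would then be fine-tuned on this combined corpus.
    return corpus + generated


# Toy stand-ins so the sketch runs without a model.
def toy_summarize(code: str) -> List[str]:
    return [f"Write code equivalent to: {code}", "Do something"]


def toy_evaluate(instruction: str, code: str) -> float:
    return 1.0 if code in instruction else 0.0


corpus = [("Add two numbers.", "def add(a, b): return a + b")]
combined = inverse_instruct(corpus, toy_summarize, toy_evaluate)
```

In practice the two callables would wrap prompts to the same instruction-tuned model, and the fine-tuning step itself is outside the scope of this sketch.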

Community

Paper author, paper submitter · edited 6 days ago

InverseCoder is a series of code LLMs that are instruction-tuned on data they generate themselves through Inverse-Instruct.
Our contributions:

  • We introduce INVERSE-INSTRUCT, a simple yet effective instruction-tuning approach that exploits the mismatch between code generation and instruction generation.
  • We thoroughly analyze INVERSE-INSTRUCT, including the composition of the generated dataset, the impact of data size, etc. We find that the self-consistency between code generation and summarization predicts the effectiveness of INVERSE-INSTRUCT prior to training.
  • Based on INVERSE-INSTRUCT, we present a series of code LLMs named InverseCoder, which achieve SOTA or comparable results on a wide range of benchmarks, including Python code generation, multilingual code completion, and data science problems.
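The comment above does not define how self-consistency is measured. One possible reading is a round-trip check: summarize a snippet into an instruction, regenerate code from that instruction, and compare the result with the original. The sketch below follows that reading under stated assumptions: `summarize` and `generate` are hypothetical stand-ins for model calls, and the character-level `difflib` similarity is a placeholder comparison, not the metric used in the paper.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def round_trip_consistency(
    codes: Iterable[str],
    summarize: Callable[[str], str],
    generate: Callable[[str], str],
) -> float:
    """Average similarity between each snippet and its round-trip regeneration."""
    scores = []
    for code in codes:
        instruction = summarize(code)        # code -> instruction
        regenerated = generate(instruction)  # instruction -> code
        # Placeholder similarity; a real study might run unit tests instead.
        scores.append(SequenceMatcher(None, code, regenerated).ratio())
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-ins: a lossless "model" that round-trips perfectly.
PREFIX = "Implement: "


def toy_summarize(code: str) -> str:
    return PREFIX + code


def toy_generate(instruction: str) -> str:
    return instruction[len(PREFIX):]


score = round_trip_consistency(
    ["def add(a, b): return a + b"], toy_summarize, toy_generate
)
```

A score computed this way needs no training run, which matches the comment's point that the signal is available before fine-tuning.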
Paper author, paper submitter · edited 5 days ago

Official code repo for Inverse-Instruct (under development).
