arxiv:2305.06156

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Published on May 9, 2023

Authors: Dung Nguyen Manh, Nam Le Hai, Nghi D. Q. Bui

Abstract

We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality (due to noisy signals), and format (only containing code function and text explanation pairings). The Vault overcomes these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels. Researchers and practitioners can utilize The Vault for training diverse code-focused LLMs or incorporate the provided data cleaning methods and scripts to improve their datasets. By employing The Vault as the training dataset for code-centric LLMs, we anticipate significant advancements in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practices.
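To make the idea of function-level and class-level code-text pairs concrete, here is a minimal sketch that pulls such pairs out of Python source with the standard-library `ast` module. It is an illustration only, not the authors' extraction pipeline: the helper name `extract_pairs`, the record fields, and the rule of skipping undocumented definitions are all assumptions made for this example.

```python
import ast


def extract_pairs(source):
    """Collect (code, docstring) pairs for documented classes and functions.

    Hypothetical helper for illustration; the field names are assumptions.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring:
            # Assumed filter: a definition without a docstring has no text side.
            continue
        pairs.append({
            "level": "class" if isinstance(node, ast.ClassDef) else "function",
            "name": node.name,
            "code": ast.get_source_segment(source, node),
            "docstring": docstring,
        })
    return pairs


SAMPLE = '''
class Stack:
    """A minimal LIFO stack."""

    def push(self, item):
        """Append item to the top of the stack."""
        self.items.append(item)

    def _helper(self):
        return None
'''

pairs = extract_pairs(SAMPLE)
for pair in pairs:
    print(pair["level"], pair["name"], "->", pair["docstring"])
```

In this sketch the undocumented `_helper` method produces no pair, which mirrors in a very simplified way the kind of filtering a cleaned code-text dataset needs.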
