arxiv:2402.13064

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Published on Feb 20

Submitted by akhaliq on Feb 21

#2 Paper of the day

Authors: Zhengyang Tang, Chaojun Wang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Dongdong Zhang, Yuxian Gu, Wei Lu, Benyou Wang, Furu Wei

Abstract

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields and, ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into our taxonomy.
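The pipeline described in the abstract (taxonomy, then subjects, then syllabus sessions, then key concepts, then instructions) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompts, the function names `generate_instructions` and `stub_llm`, the nested-dict taxonomy layout, and the example field names are all assumptions made for illustration, and the model call is replaced by a deterministic stub so the code runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to a list of short strings.
LLM = Callable[[str], List[str]]
# Assumed taxonomy layout: field -> sub-field -> list of disciplines.
Taxonomy = Dict[str, Dict[str, List[str]]]


def generate_instructions(taxonomy: Taxonomy, llm: LLM) -> List[dict]:
    """Walk discipline -> subject -> session -> key concept -> instruction."""
    records = []
    for field_name, sub_fields in taxonomy.items():
        for sub_field, disciplines in sub_fields.items():
            for discipline in disciplines:
                # Placeholder prompts; the paper's actual prompts are not given here.
                for subject in llm(f"List subjects for the discipline: {discipline}"):
                    for session in llm(f"List syllabus class sessions for: {subject}"):
                        for concept in llm(f"List key concepts for session: {session}"):
                            for instruction in llm(f"Write an instruction about: {concept}"):
                                records.append({
                                    "field": field_name,
                                    "sub_field": sub_field,
                                    "discipline": discipline,
                                    "subject": subject,
                                    "session": session,
                                    "concept": concept,
                                    "instruction": instruction,
                                })
    return records


def stub_llm(prompt: str) -> List[str]:
    """Deterministic stand-in for a real model: returns two placeholder items."""
    topic = prompt.split(": ", 1)[1]
    return [f"{topic}/a", f"{topic}/b"]


# Illustrative taxonomy with a single discipline (names are made up).
taxonomy = {"Natural Sciences": {"Formal Sciences": ["Mathematics"]}}
records = generate_instructions(taxonomy, stub_llm)

# Customization as described in the abstract: add a node, then regenerate.
taxonomy["Natural Sciences"]["Formal Sciences"].append("Logic")
extended = generate_instructions(taxonomy, stub_llm)
```

Because every level fans out from the one above, appending a single discipline node is enough to produce a new batch of instructions without touching the rest of the pipeline, which is the customization property the abstract claims.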

Community

philschmid

Feb 24

Are you planning to release the prompts you used in each step, or to release the dataset? Your approach sounds very interesting, but it is currently impossible to verify or test it.
