arxiv:2305.18752

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Published on May 30, 2023
· Featured in Daily Papers on May 31, 2023
Abstract

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. However, these models typically depend on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose GPT4Tools, based on self-instruction, to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multimodal contexts. Using Low-Rank Adaptation (LoRA) optimization, our approach enables open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, covering both zero-shot and fine-tuned settings. Extensive experiments demonstrate the effectiveness of our method on various language models: it not only significantly improves the accuracy of invoking seen tools, but also enables zero-shot use of unseen tools. The code and demo are available at https://github.com/StevenGrove/GPT4Tools.
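
As a rough illustration of the recipe the abstract describes (LoRA adapters on a frozen open-source base model, trained on teacher-generated instruction data), here is a minimal sketch using the Hugging Face peft and transformers libraries. The base model name, the gpt4tools_instructions.json file with a "text" field, and all hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch: LoRA fine-tuning of an open-source LLM on a
# GPT4Tools-style instruction dataset. Model name, dataset path, and
# hyperparameters are illustrative assumptions, not the paper's exact setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base_model = "huggyllama/llama-13b"  # hypothetical choice of base LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Base weights stay frozen; only the low-rank adapter matrices are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the total weights

# Instruction-following data generated by the teacher model (assumed to be
# JSON records with a single "text" field holding prompt + response).
dataset = load_dataset("json", data_files="gpt4tools_instructions.json")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt4tools-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=3e-4,
        fp16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("gpt4tools-lora")  # saves only the adapter weights
```

Because only the adapter matrices receive gradients, the memory and compute footprint stays far below full fine-tuning of the base model, which is what makes this practical on open-source 13B-scale LLMs.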

Community

My highlights from the paper (full summary):

  • Uses ChatGPT as a "teacher" to generate instructional data for other LLMs (see the sketch after this list)
  • Fine-tunes LLMs like Vicuna on this data using selective weight tuning with LoRA (the base model stays frozen)
  • Allows a smaller 13B LLM to match the 175B GPT-3.5 on seen tools after tuning
  • Data augmentation with negative and context samples turned out to be the secret sauce that makes this work
  • Can generalize to brand-new visual tools in a zero-shot way
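
As referenced in the first bullet, here is a hedged sketch of what the teacher-driven data generation could look like: an image caption and a set of tool descriptions are packed into a prompt, and the teacher's reply (a generated instruction plus the matching tool call) is saved as a training sample in the shape the fine-tuning sketch above consumes. The prompt wording, tool names, gpt-3.5-turbo choice, and Thought/Action/Action Input layout are assumptions for illustration; the paper's actual template and tool set differ in detail.

```python
# Sketch of teacher-driven instruction generation (first bullet above).
# Prompt wording, tool list, and model choice are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = {
    "Segment the Image": "useful when you want to segment objects in an image",
    "Count Objects": "useful when you want to count how many objects appear",
    "Generate Image From Text": "useful when you want to create an image from a description",
}

PROMPT_TEMPLATE = """Given an image whose content is described by the caption below,
and the following tools, write one user instruction that requires a tool,
then answer it in the format:
Thought: Do I need to use a tool? Yes
Action: <tool name>
Action Input: <tool arguments>

Caption: {caption}
Tools:
{tools}
"""

def make_sample(caption: str) -> dict:
    tool_text = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    prompt = PROMPT_TEMPLATE.format(caption=caption, tools=tool_text)
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # The teacher's reply contains both the synthetic user instruction and
    # the tool-calling response; store it as one training text.
    return {"text": reply.choices[0].message.content}

if __name__ == "__main__":
    samples = [make_sample(c) for c in ["two dogs playing on a beach"]]
    with open("gpt4tools_instructions.json", "w") as f:
        json.dump(samples, f, indent=2)
```

Running this over many captions (and mixing in negative samples where no tool is needed) would produce the kind of augmented instruction dataset the highlights describe.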

This is big because it shows we may not need hyper-expensive training of massive models to impart visual capabilities to LLMs. They seem to be generalizable enough that they can be taught to work with images. Some examples shown include counting objects or segmenting items in pictures using other tools.

With this approach, existing models can be made multi-modal! Pretty cool.

