{ "cells": [ { "cell_type": "markdown", "id": "cbe0f126", "metadata": { "jupyter": { "source_hidden": false } }, "source": [ "# Introducing Genstruct\n", "Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. This approach is limited in quality and diversity, and the resulting pairs lack explicit reasoning.\n", "\n", "Two previous methods aimed to improve upon this naive prompting approach:\n", "- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.\n", "- [Ada-Instruct](https://arxiv.org/abs/2310.04484) instead trains a custom model to generate instructions, rather than relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.\n", "\n", "Genstruct is a new method that combines and extends these previous approaches. Like Ada-Instruct, it uses a custom-trained model rather than relying on prompting. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations. To mitigate this, Genstruct generates instructions based upon a user-provided context, like RAG methods.\n", "\n", "Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses." ] }, { "cell_type": "markdown", "id": "bf417800", "metadata": { "jupyter": { "source_hidden": false } }, "source": [ "## Generating instruction pairs\n", "Genstruct is based on Mistral. 
Specifically, it is trained over the [MetaMath-Mistral-7B](https://huggingface.co/meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning on math-heavy topics.\n", "\n", "Like any other Mistral model, it can be imported from the Hugging Face Hub as follows:" ] }, { "cell_type": "code", "execution_count": 1, "id": "7492d81a", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false, "source_hidden": false } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/user/.conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "Loading checkpoint shards: 0%| | 0/3 [00:00
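The model-loading cell above is truncated, but the grounded-generation idea it feeds into can be sketched without the model weights: Genstruct is given a user-provided title and passage wrapped in a tagged prompt, and continues it with a generated instruction and response. A minimal sketch follows; the `[[[Title]]]`/`[[[Content]]]`/`[[[User]]]` tag format is taken from the Genstruct-7B model card, and the helper name and example text are illustrative, not part of any library API.

```python
# Sketch of grounded instruction generation in the style of Genstruct.
# Assumption: the tag format below follows the Genstruct-7B model card.

def build_genstruct_prompt(title: str, content: str) -> str:
    """Wrap a user-provided passage in the tagged format the model expects.

    The model is then asked to continue this prompt with a [[[User]]]
    instruction and a response grounded in the passage, rather than
    generating an instruction pair from nothing.
    """
    return (
        f"[[[Title]]] {title}\n\n"
        f"[[[Content]]] {content}\n\n"
        "The following is an interaction between a user and an AI assistant "
        "that is related to the above text.\n\n"
        "[[[User]]] "
    )

# Illustrative context; any passage (e.g. a Wikipedia excerpt) works.
prompt = build_genstruct_prompt(
    "Mistral 7B",
    "Mistral 7B is a 7-billion-parameter language model...",
)
print(prompt)
```

Once the model itself is loaded as in the cell above, this prompt would be tokenized and passed to the model's generation method, with the completion parsed back into an instruction/response pair.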