{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "## Getting to Main directory\n", "import os\n", "os.chdir(\"../\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# loading secret key\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "e:\\projects\\AI research assistant\\venv\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from llama_index.core import VectorStoreIndex\n", "from llama_index.core import ServiceContext\n", "from llama_index.core import StorageContext, load_index_from_storage\n", "from llama_index.embeddings.gemini import GeminiEmbedding\n", "from llama_index.llms.gemini import Gemini\n", "import google.generativeai as genai\n", "from llama_index.core import VectorStoreIndex,SimpleDirectoryReader\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "gemini_api_key=os.getenv(\"GEMINI_API_KEY\")\n", "pinecone_api_key=os.getenv(\"PINECONE_API_KEY\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data ingestion - Taking pdf documents and Cleaning and Transforming Data into vector index" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "documents=SimpleDirectoryReader(\"Data\").load_data()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "34" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(documents)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(id_='2c29fa85-a1fa-479c-8cdc-6c366889be7e', embedding=None, metadata={'page_label': '1', 'file_name': 'peft.pdf', 'file_path': 'e:\\\\projects\\\\AI research assistant\\\\Data\\\\peft.pdf', 'file_type': 'application/pdf', 'file_size': 562785, 'creation_date': '2024-03-30', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Few-Shot Parameter-Efficient Fine-Tuning is Better\\nand Cheaper than In-Context Learning\\nHaokun Liu∗Derek Tam∗Mohammed Muqeeth∗\\nJay Mohta Tenghao Huang Mohit Bansal Colin Raffel\\nDepartment of Computer Science\\nUniversity of North Carolina at Chapel Hill\\n{haokunl,dtredsox,muqeeth,craffel}@cs.unc.edu\\nAbstract\\nFew-shot in-context learning (ICL) enables pre-trained language models to per-\\nform a previously-unseen task without any gradient-based training by feeding a\\nsmall number of training examples as part of the input. ICL incurs substantial\\ncomputational, memory, and storage costs because it involves processing all of the\\ntraining examples every time a prediction is made. Parameter-efficient fine-tuning\\n(PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) 
offers\\nan alternative paradigm where a small set of parameters are trained to enable a\\nmodel to perform the new task. In this paper, we rigorously compare few-shot\\nICL and PEFT and demonstrate that the latter offers better accuracy as well as\\ndramatically lower computational costs. Along the way, we introduce a new PEFT\\nmethod called (IA)3that scales activations by learned vectors, attaining stronger\\nperformance while only introducing a relatively tiny amount of new parameters.\\nWe also propose a simple recipe based on the T0 model [ 1] called T-Few that\\ncan be applied to new tasks without task-specific tuning or modifications. We\\nvalidate the effectiveness of T-Few on completely unseen tasks by applying it to\\nthe RAFT benchmark [ 2], attaining super-human performance for the first time\\nand outperforming the state-of-the-art by 6% absolute. All of the code used in our\\nexperiments is publicly available.1\\n1 Introduction\\nPre-trained language models have become a cornerstone of natural language processing, thanks\\nto the fact that they can dramatically improve data efficiency on tasks of interest – i.e., using a\\npre-trained language model for initialization often produces better results with less labeled data. A\\nhistorically common approach has been to use the pre-trained model’s parameters for initialization\\nbefore performing gradient-based fine-tuning on a downstream task of interest. While fine-tuning\\nhas produced many state-of-the-art results [ 1], it results in a model that is specialized for a single\\ntask with an entirely new set of parameter values, which can become impractical when fine-tuning a\\nmodel on many downstream tasks.\\nAn alternative approach popularized by [ 3,4] isin-context learning (ICL), which induces a model\\nto perform a downstream task by inputting prompted examples. Few-shot prompting converts a\\nsmall collection of input-target pairs into (typically) human-understandable instructions and examples\\n[3,4], along with a single unlabeled example for which a prediction is desired. Notably, ICL requires\\nno gradient-based training and therefore allows a single model to immediately perform a wide variety\\nof tasks. Performing ICL therefore solely relies on the capabilities that a model learned during\\npre-training. These characteristics have led to a great deal of recent interest in ICL methods [5–10].\\n∗Equal contribution.\\n1https://github.com/r-three/t-few\\nPreprint. Under review.arXiv:2205.05638v2 [cs.LG] 26 Aug 2022', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n", " Document(id_='8c598b5b-d100-4d71-8273-7fb26ddac626', embedding=None, metadata={'page_label': '2', 'file_name': 'peft.pdf', 'file_path': 'e:\\\\projects\\\\AI research assistant\\\\Data\\\\peft.pdf', 'file_type': 'application/pdf', 'file_size': 562785, 'creation_date': '2024-03-30', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=\"VKQ\\nsoftmax \\nDenseNonlinearity Dense\\nT0Susie loves her grandma's \\nbanana bread. Susie called \\nher grandma and asked her to \\nsend some. Grandma lived \\nvery far away. A week passed \\nand grandma surprised Susie \\nby coming to visit. 
What is \\na possible continuation for \\nthe story? Susie was so happy. \\nSusie was upset. \\n(IA)3Losses used in T-FewFigure 1: Diagram of (IA)3and the loss terms used in the T-Few recipe. Left: (IA)3introduces the\\nlearned vectors lk,lv, andlffwhich respectively rescale (via element-wise multiplication, visualized as\\n⊙) the keys and values in attention mechanisms and the inner activations in position-wise feed-forward\\nnetworks. Right: In addition to a standard cross-entropy loss LLM, we introduce an unlikelihood loss\\nLULthat lowers the probability of incorrect outputs and a length-normalized loss LLNthat applies a\\nstandard softmax cross-entropy loss to length-normalized log-probabilities of all output choices.\\nDespite the practical benefits of ICL, it has several major drawbacks. First, processing all prompted\\ninput-target pairs every time the model makes a prediction incurs significant compute costs. Second,\\nICL typically produces inferior performance compared to fine-tuning [ 4]. Finally, the exact formatting\\nof the prompt (including the wording [ 11] and ordering of examples [ 12]) can have significant and\\nunpredictable impact on the model’s performance, far beyond inter-run variation of fine-tuning.\\nRecent work has also demonstrated that ICL can perform well even when provided with incorrect\\nlabels, raising questions as to how much learning is taking place at all [9].\\nAn additional paradigm for enabling a model to perform a new task with minimal updates is parameter-\\nefficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number\\nof added or selected parameters. Recent methods have matched the performance of fine-tuning the\\nfull model while only updating or adding a small fraction (e.g. 0.01%) of the full model’s parameters\\n[13,14]. Furthermore, certain PEFT methods allow mixed-task batches where different examples in\\na batch are processed differently [14], making both PEFT and ICL viable for multitask models.\\nWhile the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there\\nhas been relatively little focus on whether PEFT methods work well when very little labeled data\\nis available. Our primary goal in this paper is to close this gap by proposing a recipe – i.e., a model, a\\nPEFT method, and a fixed set of hyperparameters – that attains strong performance on novel, unseen\\ntasks while only updating a tiny fraction of the model’s parameters. Specifically, we base our approach\\non the T0 model [ 1], a variant of T5 [ 15] fine-tuned on a multitask mixture of prompted datasets.\\nTo improve performance on classification and multiple-choice tasks, we add unlikelihood [ 16,17]\\nand length normalization-based [ 4] loss terms. In addition, we develop (IA)3, a PEFT method\\nthat multiplies intermediate activations by learned vectors. (IA)3attains stronger performance than\\nfull-model fine-tuning while updating up to 10,000 ×fewer parameters. Finally, we demonstrate\\nthe benefits of pre-training the (IA)3parameters before fine-tuning [ 18,19]. Our overall recipe,\\nwhich we dub “ T-Few ”, performs significantly better than ICL (even against 16×larger models)\\nand outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [ 2]\\nwhile requiring dramatically less compute and allowing for mixed-task batches during inference. 
To\\nfacilitate the use of T-Few on new problems and future research on PEFT, we release our code.1\\nAfter providing background on ICL and PEFT in the following section, we discuss the design of\\nT-Few in section 3. In section 4, we present experiments comparing T-Few to strong ICL baselines.\\nFinally, we discuss related work in appendix B and conclude in section 5.\\n2 Background\\nIn this section, we provide am verview of ICL and PEFT with a focus on characterizing the com-\\nputation, memory, and on-disk storage costs of making a prediction. Real-world costs depend on\\nimplementation and hardware, so we report costs in terms of FLOPs for computation and bytes for\\nmemory and storage, respectively. Additional related work is discussed in appendix B.\\n2.1 Few-shot in-context learning (ICL)\\nICL [ 3,4] aims to induce a model to perform a task by feeding in concatenated and prompted\\ninput-target examples (called “shots”) along with an unlabeled query example. Taking the cycled\\n2\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n", " Document(id_='5a924469-c6ec-4239-b2c7-ef70f8be9c65', embedding=None, metadata={'page_label': '3', 'file_name': 'peft.pdf', 'file_path': 'e:\\\\projects\\\\AI research assistant\\\\Data\\\\peft.pdf', 'file_type': 'application/pdf', 'file_size': 562785, 'creation_date': '2024-03-30', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='letter task from Brown et al. [4]as an example, a 4-shot input or context would be “ Please\\nunscramble the letters into a word, and write that word: asinoc = casino,\\nyfrogg = froggy, plesim = simple, iggestb = biggest, astedro = ”, for which the\\ndesired output would be “ roasted ”. ICL induces an autoregressive language model to perform\\nthis task by feeding in the context and sampling from the model. For classification tasks, each\\nlabel is associated with a string (e.g. “ positive ” and “ negative ” for sentiment analysis) and\\na label is assigned by choosing the label string that the model assigns the highest probability to.\\nFor multiple-choice tasks (e.g. choosing between Npossible answers to a question), the model’s\\nprediction is similarly determined by determining which choice is assigned the highest probability.\\nThe primary advantage of ICL is that it enables a single model to perform many tasks immediately\\nwithout fine-tuning. This also enables mixed-task batches , where different examples in a batch of data\\ncorrespond to different tasks by using different contexts in the input. ICL is also typically performed\\nwith only a limited number of labeled examples – called few-shot learning – making it data-efficient.\\nDespite these advantages, ICL comes with significant practical drawbacks: First, making a prediction\\nis dramatically more expensive because the model needs to process all of the in-context labeled\\nexamples. 
Specifically, ignoring the quadratic complexity of self-attention operations in Transformer\\nlanguage models (which are typically small compared to the costs of the rest of the model [ 20]),\\nprocessing the ktraining examples for k-shot ICL increases the computational cost by approximately\\nk+ 1times compared to processing the unlabeled example alone. Memory costs similarly scale\\napproximately linearly with k, though during inference the memory costs are typically dominated by\\nstoring the model’s parameters. Separately, there is a small amount of on-disk storage required for\\nstoring the in-context examples for a given task. For example, storing 32examples for a task where\\nthe prompted input and target for each example is 512tokens long would require about 66kilobytes\\nof storage on disk ( 32examples×512tokens×32bits).\\nBeyond the aforementioned costs, ICL also exhibits unintuitive behavior. Zhao et al. [12] showed\\nthat the ordering of examples in the context heavily influences the model’s predictions. Min et al.\\n[9]showed that ICL can still perform well even if the labels of the in-context examples are swapped\\n(i.e. made incorrect), which raises questions about whether ICL is really “learning” from the labeled\\nexamples.\\nVarious approaches have been proposed to mitigate these issues. One way to decrease computational\\ncosts is to cache the key and value vectors for in-context examples. This is possible because decoder-\\nonly Transformer language models have a causal masking pattern, so the model’s activations for the\\ncontext do not do not depend on the unlabeled example. In an extreme case, 32-shot ICL with 512\\ntokens per in-context example would result in over 144 gigabytes of cached key and value vectors for\\nthe GPT-3 model ( 32examples×512tokens×96layers×12288 d model×32bitseach for the key\\nand value vectors). Separately, Min et al. [21] proposed ensemble ICL , where instead of using the\\noutput probability from concatenating the ktraining examples, the output probabilities of the model\\non each training example (i.e. 1-shot ICL for each of the kexamples) are multiplied together. This\\nlowers the non-parameter memory cost by a factor of k/2but increases the computational cost by\\na factor of 2. In terms of task performance, Min et al. [21] find that ensemble ICL outperforms the\\nstandard concatenative variant.\\n2.2 Parameter-efficient fine-tuning\\nWhile standard fine-tuning updates all parameters of the pre-trained model, it has been demonstrated\\nthat it is possible to instead update or add a relatively small number of parameters. Early methods\\nproposed adding adapters [22–24], which are small trainable feed-forward networks inserted between\\nthe layers in the fixed pre-trained model. Since then, various sophisticated PEFT methods have been\\nproposed, including methods that choose a sparse subset of parameters to train [ 25,26], produce\\nlow-rank updates [ 13], perform optimization in a lower-dimensional subspace [ 27], add low-rank\\nadapters using hypercomplex multiplication [ 28], and more. Relatedly, prompt tuning [14] and prefix\\ntuning [29] concatenate learned continuous embeddings to the model’s input or activations to induce\\nit to perform a task; this can be seen as a PEFT method [ 30]. State-of-the-art PEFT methods can\\nmatch the performance of fine-tuning all of the model’s parameters while updating only a tiny fraction\\n(e.g. 
0.01%) of the model’s parameters.\\nPEFT drastically reduces the memory and storage requirements for training and saving the model. In\\naddition, certain PEFT methods straightforwardly allow mixed-task batches – for example, prompt\\n3', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n", " Document(id_='7cb24228-2e95-4b96-ab9a-62b691aa13a9', embedding=None, metadata={'page_label': '4', 'file_name': 'peft.pdf', 'file_path': 'e:\\\\projects\\\\AI research assistant\\\\Data\\\\peft.pdf', 'file_type': 'application/pdf', 'file_size': 562785, 'creation_date': '2024-03-30', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='tuning enables a single model to perform many tasks simply by concatenating different prompt\\nembeddings to each example in the batch [ 14]. On the other hand, PEFT methods that re-parameterize\\nthe model (e.g. [ 27,13]) are costly or onerous for mixed-task batches. Separately, different PEFT\\nmethods increase the computation and memory required to perform inference by different amounts.\\nFor example, adapters effectively add additional (small) layers to the model, resulting in small but\\nnon-negligible increases in computational costs and memory. An additional cost incurred by PEFT\\nis the cost of fine-tuning itself, which must be performed once and is then amortized as the model\\nis used for inference. However, we will show that PEFT can be dramatically more computationally\\nefficient when considering both fine-tuning and inference while achieving better accuracy than ICL.\\n3 Designing the T-Few Recipe\\nGiven that PEFT allows a model to be adapted to a new task with relatively small storage requirements\\nand computational cost, we argue that PEFT presents a promising alternative to ICL. Our goal\\nis therefore to develop a recipe that allows a model to attain high accuracy on new tasks with\\nlimited labeled examples while allowing mixed-task batches during inference and incurring minimal\\ncomputational and storage costs. By recipe , we mean a specific model and hyperparameter setting\\nthat provides strong performance on any new task without manual tuning or per-task adjustments.\\nIn this way, we can ensure that our approach is a realistic option in few-shot settings where limited\\nlabeled data is available for evaluation [31, 32].\\n3.1 Model and Datasets\\nAs a first step, we must choose a pre-trained model. Ideally, the model should attain high performance\\non new tasks after fine-tuning on a limited number of labeled examples. In preliminary experiments\\napplying PEFT methods to different pre-trained models, we attained the best performance with T0\\n[1]. T0 is based on T5 [ 15], an encoder-decoder Transformer model [ 33] that was pre-trained via a\\nmasked language modeling objective [ 34] on a large corpus of unlabeled text data. T0 was created by\\nfine-tuning T5 on a multitask mixture of datasets in order to enable zero-shot generalization, i.e. the\\nability to perform tasks without any additional gradient-based training. 
Examples in the datasets used\\nto train T0 were prompted by applying the prompt templates from the Public Pool of Prompts (P3\\n[35]), which convert each example in each dataset to a prompted text-to-text format where each label\\ncorresponds to a different string. For brevity, we omit a detailed description of T0 and T5; interested\\nreaders can refer to Sanh et al. [1]and Raffel et al. [15]. T0 was released in three billion and eleven\\nbillion parameter variants, referred to as “T0-3B” and simply “T0” respectively. In this section (where\\nour goal is to design the T-Few recipe through extensive experimentation), we use T0-3B to reduce\\ncomputational costs. For all models and experiments, we use Hugging Face Transformers [36].\\nWhile T0 was designed for zero-shot generalization, we will demonstrate that it also attains strong\\nperformance after fine-tuning with only a few labeled examples. To test T0’s generalization, Sanh et al.\\n[1]chose a set of tasks (and corresponding datasets) to hold out from the multitask training mixture\\n– specifically, sentence completion (COPA [ 37], H-SWAG [ 38], and Story Cloze [ 39] datasets),\\nnatural language inference (ANLI [ 40], CB [ 41], and RTE [ 42]), coreference resolution (WSC [ 43]\\nand Winogrande [ 44]), and word sense disambiguation (WiC [ 45]). Evaluation of generalization\\ncapabilities can then be straightforwardly done by measuring performance on these held-out datasets.\\nWe also will later test T-Few ’s abilities in the RAFT benchmark [ 2] in section 4.3, a collection of\\nunseen “real-world” few-shot tasks with no validation set and a held-out test set. ANLI, WiC, WSC is\\nlicensed under a Creative Commons License. Winogrande is licnsed under an Apache license. COPA\\nis under a BSD-2 Clause license. We could not find the license of RTE and CB but they are part of\\nSuperGLUE which mentions the datasets are allowed for use in research context.\\nTo ease comparison, we use the same number of few-shot training examples for each dataset as Brown\\net al. [4], which varies from 20 to 70. Unfortunately, the few-shot dataset subsets used by Brown\\net al. [4]have not been publicly disclosed. To allow for a more robust comparison, we therefore\\nconstructed five few-shot datasets by sampling subsets with different seeds and report the median\\nand interquartile range. We prompt examples from each dataset using the prompt templates from P3\\nBach et al. [35], using a randomly-sampled prompt template for each example at each step. 
Unless\\notherwise stated, we train our model for 1K steps with a batch size of 8 and report performance at the\\nend of training.\\nFor evaluation, we use “rank classification”, where the model’s log-probabilities for all possible label\\nstrings are ranked and the model’s prediction is considered correct if the highest-ranked choice is the\\n4', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'),\n", " Document(id_='1b5f1995-0803-4905-b1bf-9dbff0e1e936', embedding=None, metadata={'page_label': '5', 'file_name': 'peft.pdf', 'file_path': 'e:\\\\projects\\\\AI research assistant\\\\Data\\\\peft.pdf', 'file_type': 'application/pdf', 'file_size': 562785, 'creation_date': '2024-03-30', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='correct answer. Rank classification evaluation is compatible with both classification and multiple-\\nchoice tasks. Since model performance can vary significantly depending on the prompt template used,\\nwe report the median accuracy across all prompt templates from P3 and across few-shot data subsets\\nfor each dataset. For all datasets, we report the accuracy on the test set or validation set when the test\\nlabels are not public (e.g. SuperGLUE datasets). In the main text, we report median accuracy across\\nthe nine datasets mentioned above. Detailed results on each dataset are provided in the appendices.\\n3.2 Unlikelihood Training and Length Normalization\\nBefore investigating PEFT methods, we first explore two additional loss terms to improve the\\nperformance of few-shot fine-tuning of language models. 
Language models are normally trained\\nwith cross-entropy loss LLM=−1\\nT∑\\ntlogp(yt|x,y str:\n", " \"\"\"\n", " Remove unwanted characters and patterns in text input.\n", "\n", " :param content: Text input.\n", " \n", " :return: Cleaned version of original text input.\n", " \"\"\"\n", "\n", " # Fix hyphenated words broken by newline\n", " content = re.sub(r'(\\w+)-\\n(\\w+)', r'\\1\\2', content)\n", "\n", " # Remove specific unwanted patterns and characters\n", " unwanted_patterns = [\n", " \"\\\\n\", \" —\", \"——————————\", \"—————————\", \"—————\",\n", " r'\\\\u[\\dA-Fa-f]{4}', r'\\uf075', r'\\uf0b7'\n", " ]\n", " for pattern in unwanted_patterns:\n", " content = re.sub(pattern, \"\", content)\n", "\n", " # Fix improperly spaced hyphenated words and normalize whitespace\n", " content = re.sub(r'(\\w)\\s*-\\s*(\\w)', r'\\1-\\2', content)\n", " content = re.sub(r'\\s+', ' ', content)\n", "\n", " return content\n", "\n", "# Call function\n", "cleaned_docs = []\n", "for d in documents: \n", " cleaned_text = clean_up_text(d.text)\n", " d.text = cleaned_text\n", " cleaned_docs.append(d)\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Few-Shot Parameter-Efficient Fine-Tuning is Betterand Cheaper than In-Context LearningHaokun Liu∗Derek Tam∗Mohammed Muqeeth∗Jay Mohta Tenghao Huang Mohit Bansal Colin RaffelDepartment of Computer ScienceUniversity of North Carolina at Chapel Hill{haokunl,dtredsox,muqeeth,craffel}@cs.unc.eduAbstractFew-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding asmall number of training examples as part of the input. ICL incurs substantialcomputational, memory, and storage costs because it involves processing all of thetraining examples every time a prediction is made. Parameter-efficient fine-tuning(PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offersan alternative paradigm where a small set of parameters are trained to enable amodel to perform the new task. In this paper, we rigorously compare few-shotICL and PEFT and demonstrate that the latter offers better accuracy as well asdramatically lower computational costs. Along the way, we introduce a new PEFTmethod called (IA)3that scales activations by learned vectors, attaining strongerperformance while only introducing a relatively tiny amount of new parameters.We also propose a simple recipe based on the T0 model [ 1] called T-Few thatcan be applied to new tasks without task-specific tuning or modifications. Wevalidate the effectiveness of T-Few on completely unseen tasks by applying it tothe RAFT benchmark [ 2], attaining super-human performance for the first timeand outperforming the state-of-the-art by 6% absolute. All of the code used in ourexperiments is publicly available.11 IntroductionPre-trained language models have become a cornerstone of natural language processing, thanksto the fact that they can dramatically improve data efficiency on tasks of interest – i.e., using apre-trained language model for initialization often produces better results with less labeled data. Ahistorically common approach has been to use the pre-trained model’s parameters for initializationbefore performing gradient-based fine-tuning on a downstream task of interest. 
While fine-tuninghas produced many state-of-the-art results [ 1], it results in a model that is specialized for a singletask with an entirely new set of parameter values, which can become impractical when fine-tuning amodel on many downstream tasks.An alternative approach popularized by [ 3,4] isin-context learning (ICL), which induces a modelto perform a downstream task by inputting prompted examples. Few-shot prompting converts asmall collection of input-target pairs into (typically) human-understandable instructions and examples[3,4], along with a single unlabeled example for which a prediction is desired. Notably, ICL requiresno gradient-based training and therefore allows a single model to immediately perform a wide varietyof tasks. Performing ICL therefore solely relies on the capabilities that a model learned duringpre-training. These characteristics have led to a great deal of recent interest in ICL methods [5–10].∗Equal contribution.1https://github.com/r-three/t-fewPreprint. Under review.arXiv:2205.05638v2 [cs.LG] 26 Aug 2022'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect output\n", "cleaned_docs[0].get_content()\n" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'page_label': '1',\n", " 'file_name': 'peft.pdf',\n", " 'file_path': 'e:\\\\projects\\\\AI research assistant\\\\Data\\\\peft.pdf',\n", " 'file_type': 'application/pdf',\n", " 'file_size': 562785,\n", " 'creation_date': '2024-03-30',\n", " 'last_modified_date': '2024-03-30'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cleaned_docs[0].metadata\n" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "34" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(documents)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Configuring the Gemini model and GeminiEmbedding" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "genai.configure(api_key=gemini_api_key)" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "gemini_embed_model = GeminiEmbedding(model_name=\"models/embedding-001\")" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Setting temperature to 0.3 to keep responses focused and low-risk\n", "\n", "model = Gemini(model_name=\"models/gemini-pro\", api_key=gemini_api_key, temperature=0.3)" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from llama_index.core.node_parser import SemanticSplitterNodeParser\n", "from llama_index.core.ingestion import IngestionPipeline\n", "\n", "# This will be the model we use both for node parsing and for vectorization\n", "embed_model = gemini_embed_model\n", "\n", "# Define the initial pipeline\n", "pipeline = IngestionPipeline(\n", " transformations=[\n", " SemanticSplitterNodeParser(\n", " buffer_size=1,\n", " breakpoint_percentile_threshold=95, \n", " embed_model=embed_model,\n", " ),\n", " embed_model,\n", " ],\n", " )\n" ] },
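{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Optional sanity check (a sketch): run the pipeline on the first two cleaned documents to inspect the semantic chunks. This calls the Gemini embedding API, so it assumes `GEMINI_API_KEY` is set; nothing is written to Pinecone yet." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged preview: split and embed only the first two cleaned documents\n", "# so we can see how SemanticSplitterNodeParser chunks the text\n", "# before upserting the full set to Pinecone.\n", "preview_nodes = pipeline.run(documents=cleaned_docs[:2])\n", "print(len(preview_nodes))\n", "print(preview_nodes[0].get_content()[:300])" ] },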
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Setting up the `Settings` module with our LLM and embedding models, along with the chunk size and overlap used to split the documents" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### As LLMPredictor is deprecated, we use Settings.llm to define our base LLM" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from llama_index.core import Settings\n", "from llama_index.core.node_parser import SentenceSplitter\n", "\n", "\n", "Settings.llm = model\n", "Settings.embed_model = gemini_embed_model\n", "Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)\n", "Settings.num_output = 512\n", "Settings.context_window = 3900" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Using Pinecone as our vector database to store the index" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from llama_index.vector_stores.pinecone import PineconeVectorStore" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from pinecone import Pinecone\n", "\n", "pc = Pinecone(api_key=pinecone_api_key)\n", "pinecone_index = pc.Index(\"ai-research-assistant\") # `ai-research-assistant` is the index name" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Now indexing the documents and upserting the embeddings to Pinecone" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "vector_store = PineconeVectorStore(pinecone_index=pinecone_index)\n", "storage_context = StorageContext.from_defaults(vector_store=vector_store)" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "%%capture \n", "# Our pipeline with the addition of our PineconeVectorStore\n", "pipeline = IngestionPipeline(\n", " transformations=[\n", " SemanticSplitterNodeParser(\n", " buffer_size=1,\n", " breakpoint_percentile_threshold=95, \n", " embed_model=embed_model,\n", " ),\n", " embed_model,\n", " ],\n", " vector_store=vector_store # Our new addition\n", " )\n", "\n", "# Now we run our pipeline!\n", "pipeline.run(documents=cleaned_docs)\n" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'dimension': 768,\n", " 'index_fullness': 0.00176,\n", " 'namespaces': {'': {'vector_count': 176}},\n", " 'total_vector_count': 176}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pinecone_index.describe_index_stats()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Simply querying from the index" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['If your candidate doesn’t know the answer to the above questions and you’re hiring for a ML intern position, then they’re obviously not a great fit.', 'correct answer. Rank classification evaluation is compatible with both classification and multiplechoice tasks. Since model performance can vary significantly depending on the prompt template used,we report the median accuracy across all prompt templates from P3 and across few-shot data subsetsfor each dataset. For all datasets, we report the accuracy on the test set or validation set when the testlabels are not public (e.g. SuperGLUE datasets). ', 'correct answer. Rank classification evaluation is compatible with both classification and multiplechoice tasks. 
Since model performance can vary significantly depending on the prompt template used,we report the median accuracy across all prompt templates from P3 and across few-shot data subsetsfor each dataset. For all datasets, we report the accuracy on the test set or validation set when the testlabels are not public (e.g. SuperGLUE datasets). ', 'Interview Questions to Ask a ML intern| Xobin [Downloaded]8 Prepared and Curated by Xobin Team', \"Interview Questions to Ask a ML intern| Xobin [Downloaded]1Interview Questions to Ask a ML intern| Xobin [Downloaded]We at Xobin reached out to over 70+ Hiring teams to curate the best interview questions. W e didn't stop there. We went ahead to understand what type of answers dif ferentiated the top candidate from the rest. \"]\n" ] } ], "source": [ "# from llama_index import VectorStoreIndex\n", "from llama_index.core.retrievers import VectorIndexRetriever\n", "\n", "# Instantiate VectorStoreIndex object from your vector_store object\n", "vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n", "\n", "# Grab 5 search results\n", "retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=5)\n", "\n", "# Query vector DB\n", "answer = retriever.retrieve('generate a summary based on the information you have')\n", "\n", "# Inspect results\n", "print([i.get_content() for i in answer])\n", "\n", "# >>> ['some relevant search result 1', 'some relevant search result 1'...]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding proper prompt templates for the query engine" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Empty Response'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from llama_index.core.query_engine import RetrieverQueryEngine\n", "from llama_index.core import PromptTemplate\n", "from llama_index.core.postprocessor import SimilarityPostprocessor\n", "\n", "\n", "# Pass in your retriever from above, which is configured to return the top 5 results\n", "query_engine = RetrieverQueryEngine(retriever=retriever)\n", "\n", "postprocessor=SimilarityPostprocessor(similarity_cutoff=0.70)\n", "\n", "query_engine=RetrieverQueryEngine(retriever=retriever,\n", " node_postprocessors=[postprocessor])\n", "\n", "# Now you query:\n", "llm_query = query_engine.query('generate a summary based on the information you have')\n", "# llm_query = query_engine.query('tell me about ML questions')\n", "\n", "llm_query.response" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def get_full_prompt_template(cur_instr: str, prompt_tmpl):\n", " tmpl_str = prompt_tmpl.get_template()\n", " new_tmpl_str = cur_instr + \"\\n\" + tmpl_str\n", " new_tmpl = PromptTemplate(new_tmpl_str)\n", " return new_tmpl" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "QA_PROMPT_KEY = \"response_synthesizer:text_qa_template\"\n", "\n", "# get the base qa prompt (without any instruction prefix)\n", "base_qa_prompt = query_engine.get_prompts()[QA_PROMPT_KEY]\n", "\n", "\n", "initial_instr = \"\"\"\\\n", "You are a QA assistant specifically designed to help in RESEARCH WORK as a RESEARCH ASSISTANT.\n", "Context information is below. Given the context information and not prior knowledge, \\\n", "answer the query. 
\\\n", "\"\"\"\n", "\n", "# this is the \"initial\" prompt template\n", "# implicitly used in the first stage of the loop during prompt optimization\n", "# here we explicitly capture it so we can use it for evaluation\n", "old_qa_prompt = get_full_prompt_template(initial_instr, base_qa_prompt)\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PromptTemplate(metadata={'prompt_type': }, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings=None, function_mappings=None, template='You are a QA assistant specifically designed to help in RESEARCH WORK as a RESEARCH ASSISTANT.\\nContext information is below. Given the context information and not prior knowledge, answer the query. \\nContext information is below.\\n---------------------\\n{context_str}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {query_str}\\nAnswer: ')" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "old_qa_prompt" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I apologize, but the provided context does not contain sufficient information to generate a meaningful summary.\n" ] } ], "source": [ "# Use the custom prompt when querying\n", "query_engine = vector_index.as_query_engine(text_qa_template=old_qa_prompt)\n", "response = query_engine.query(\"generate a summary based on the information you have\")\n", "# response = query_engine.query('tell me about Few-shot in-context learning')\n", "print(response)\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Final Response: I apologize, but the provided context does not contain\n", "sufficient information to generate a meaningful summary.\n", "______________________________________________________________________\n", "Source Node 1/2\n", "Node ID: 444a52bb-805b-4698-9e41-6817dcfe1fa1\n", "Similarity: 0.481024384\n", "Text: If your candidate doesn’t know the answer to the above questions\n", "and you’re hiring for a ML intern position, then they’re obviously not\n", "a great fit.\n", "______________________________________________________________________\n", "Source Node 2/2\n", "Node ID: 6c5664e4-5e90-491d-8b80-857852760395\n", "Similarity: 0.492449284\n", "Text: correct answer. Rank classification evaluation is compatible with\n", "both classification and multiplechoice tasks. Since model performance\n", "can vary significantly depending on the prompt template used,we report\n", "the median accuracy across all prompt templates from P3 and across\n", "few-shot data subsetsfor each dataset. 
For all datasets, we report the\n", "accur...\n", "I apologize, but the provided context does not contain sufficient information to generate a meaningful summary.\n" ] } ], "source": [ "from llama_index.core.response.pprint_utils import pprint_response\n", "pprint_response(response,show_source=True)\n", "print(response)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### The right prompt and the right question are important to get the desired response" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "##### \"generate a summary based on the information you have about peft\" is a better query in this case than 'tell me about the T-Few Recipe': in the latter case the source nodes fetch irrelevant data, such as table content, which is apparently more similar under vector search" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The provided context does not mention anything about recipes, so I cannot answer this question from the provided context.\n" ] } ], "source": [ "# Use the custom prompt when querying\n", "query_engine = vector_index.as_query_engine(text_qa_template=old_qa_prompt)\n", "# response = query_engine.query(\"generate a summary based on the information you have about peft\")\n", "response = query_engine.query('tell me about t few recipe')\n", "print(response)\n" ] },
{ "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Final Response: The provided context does not mention anything about\n", "recipes, so I cannot answer this question from the provided context.\n", "______________________________________________________________________\n", "Source Node 1/2\n", "Node ID: 401a275e-9caa-44dd-bbab-3b46c526adc7\n", "Similarity: 0.553759813\n", "Text: Interview Questions to Ask a ML intern| Xobin [Downloaded]8\n", "Prepared and Curated by Xobin Team\n", "______________________________________________________________________\n", "Source Node 2/2\n", "Node ID: dda36b9b-cba5-4017-9baf-06326c029b8e\n", "Similarity: 0.575024486\n", "Text: Interview Questions to Ask a ML intern| Xobin\n", "[Downloaded]1Interview Questions to Ask a ML intern| Xobin\n", "[Downloaded]We at Xobin reached out to over 70+ Hiring teams to curate\n", "the best interview questions. W e didn't stop there. We went ahead to\n", "understand what type of answers dif ferentiated the top candidate from\n", "the rest.\n", "The provided context does not mention anything about recipes, so I cannot answer this question from the provided context.\n" ] } ], "source": [ "from llama_index.core.response.pprint_utils import pprint_response\n", "pprint_response(response,show_source=True)\n", "print(response)" ] },
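{ "cell_type": "markdown", "metadata": {}, "source": [ "##### A quick check of the better query suggested above (a sketch: it reuses the `query_engine` and `pprint_response` from the cells above and assumes the Pinecone index and API keys are still configured)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Re-run the query engine with the more specific query recommended above.\n", "# Naming the topic (PEFT) explicitly tends to retrieve nodes from peft.pdf\n", "# rather than the unrelated interview-questions document.\n", "response = query_engine.query(\"generate a summary based on the information you have about peft\")\n", "pprint_response(response, show_source=True)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 2 }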