{ "cells": [ { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from langchain_community.document_loaders import PyPDFLoader\n", "\n", "url = 'https://arxiv.org/pdf/1907.11692v1'\n", "loader = PyPDFLoader(url)\n", "pages = loader.load()\n", "\n", "len(pages)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "page_content='arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G. Allen School of Computer Science & Engineering,' metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}\n", "page_content='University of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-' metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}\n", "{'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}: Single models on test (as of July 25, 2019)\n", "BERT LARGE 72.0 76.6 70.1\n", "XLNet LARGE 81.7 85.4 80.2\n", "RoBERTa 83.2 86.5 81.3\n", "Table 7: Results on the RACE test set. BERT LARGE and\n", "XLNet LARGE results are from Yang et al. (2019 ).\n", "nating each candidate answer with the correspond-\n", "{'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}: BERT LARGE\n", "with B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\n", "XLNet LARGE\n", "with B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\n", "+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\n", "Table 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\n", "{'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}: Results We present our results in Table 4. When\n", "controlling for training data, we observe that\n", "RoBERTa provides a large improvement over the\n", "originally reported BERT LARGE results, reaffirming\n", "the importance of the design choices we explored\n", "in Section 4.\n", "{'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}: XLNet + SG-Net Verifier 87.0†89.9†\n", "Table 6: Results on SQuAD. †indicates results that de-\n", "pend on additional external training data. RoBERTa\n", "uses only the provided SQuAD data in both dev and\n", "test settings. BERT LARGE and XLNet LARGE results are\n", "from Devlin et al. (2019 ) and Yang et al. (2019 ), re-\n", "{'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}: for longer (100K →300K→500K steps). Each row accumulates improvements from the row s above. RoBERTa\n", "matches the architecture and training objective of BERT LARGE . 
Results for BERT LARGE and XLNet LARGE are from\n" ] } ], "source": [ "from langchain_community.vectorstores import FAISS\n", "from langchain_openai import OpenAIEmbeddings\n", "import os\n", "from dotenv import load_dotenv\n", "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", "\n", "# Re-split the pages loaded above into ~300-character chunks, keeping the per-page metadata\n", "text_splitter = RecursiveCharacterTextSplitter(\n", " chunk_size=300,\n", " chunk_overlap=50,\n", " length_function=len,\n", " is_separator_regex=False,\n", ")\n", "texts = text_splitter.create_documents([page.page_content for page in pages], metadatas=[page.metadata for page in pages])\n", "print(texts[0])\n", "print(texts[1])\n", "\n", "load_dotenv()\n", "\n", "embeddings = OpenAIEmbeddings(model=\"text-embedding-ada-002\", openai_api_key=os.environ[\"OPENAI_KEY\"])\n", "faiss_index = FAISS.from_documents(texts, embeddings)\n", "retriever = faiss_index.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 5})\n", "retrieved_docs = retriever.invoke(\"RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\")\n", "\n", "for doc in retrieved_docs:\n", " print(str(doc.metadata) + \":\", doc.page_content[:300])" ] },
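{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (a minimal sketch, not part of the original pipeline): FAISS can also\n", "# return the raw scores of the top matches, which is a quick way to judge how close they are\n", "# to the query. Assumes the `faiss_index` built in the previous cell; with the default FAISS\n", "# index the score is an L2 distance, so lower means more similar.\n", "query = \"RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\"\n", "for doc, score in faiss_index.similarity_search_with_score(query, k=3):\n", " print(f\"{score:.3f}\", doc.metadata, doc.page_content[:80])" ] },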
\n", "Use three sentences maximum and keep the answer concise.\n", "\n", "Question: {question}\n", "Context: {context}\n", "Answer:\n", "\"\"\"\n", "\n", "def format_docs(docs):\n", " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", "\n", "prompt = PromptTemplate(template=template, input_variables=[\"context\", \"question\"])\n", "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", openai_api_key=os.environ[\"OPENAI_KEY\"])\n", "\n", "rag_chain = (\n", " {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", " | prompt\n", " | llm\n", ")\n", "\n", "res = rag_chain.invoke(\"What is RoBERTa?\")\n", "res" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13\n" ] } ], "source": [ "from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain\n", "from langchain.chains.combine_documents.stuff import StuffDocumentsChain\n", "from langchain_text_splitters import CharacterTextSplitter\n", "from langchain.chains.llm import LLMChain\n", "\n", "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", openai_api_key=os.environ[\"OPENAI_KEY\"])\n", "\n", "# Map\n", "map_template = \"\"\"You are an expert in technical papers and journals.\n", "You're tasked with summarizing the main points in the following text.\n", "The following is the text you need to summarize:\n", "{docs}\n", "Based on this text, provide a summary of the main points.\n", "Helpful Answer:\n", "\"\"\"\n", "map_prompt = PromptTemplate.from_template(map_template)\n", "map_chain = LLMChain(llm=llm, prompt=map_prompt)\n", "\n", "# Reduce\n", "reduce_template = \"\"\"The following is set of summaries of a technical paper:\n", "{docs}\n", "\n", "Take these and distill it into a final, consolidated summary of the main points. 
\n", "\n", "RULES:\n", "- The summary should be as if you are presenting the main points in a seminar.\n", "- Organize the points in powerpoint slide format.\n", "- Use markdown to format the text.\n", "\n", "Helpful Answer:\n", "\"\"\"\n", "reduce_prompt = PromptTemplate.from_template(reduce_template)\n", "\n", "# Run chain\n", "reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)\n", "\n", "# Takes a list of documents, combines them into a single string, and passes this to an LLMChain\n", "combine_documents_chain = StuffDocumentsChain(\n", " llm_chain=reduce_chain, document_variable_name=\"docs\",\n", " verbose=True\n", ")\n", "\n", "# Combines and iteratively reduces the mapped documents\n", "reduce_documents_chain = ReduceDocumentsChain(\n", " # This is final chain that is called.\n", " combine_documents_chain=combine_documents_chain,\n", " # If documents exceed context for `StuffDocumentsChain`\n", " collapse_documents_chain=combine_documents_chain,\n", " # The maximum number of tokens to group documents into.\n", " token_max=4000,\n", " verbose=True\n", ")\n", "\n", "# Combining documents by mapping a chain over them, then combining results\n", "map_reduce_chain = MapReduceDocumentsChain(\n", " # Map chain\n", " llm_chain=map_chain,\n", " # Reduce chain\n", " reduce_documents_chain=reduce_documents_chain,\n", " # The variable name in the llm_chain to put the documents in\n", " document_variable_name=\"docs\",\n", " # Return the results of the map steps in the output\n", " return_intermediate_steps=False,\n", " verbose=True\n", ")\n", "\n", "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n", " chunk_size=1000, chunk_overlap=100\n", ")\n", "split_docs = text_splitter.split_documents(pages)\n", "print(len(split_docs))\n", "\n", "# result = map_reduce_chain.invoke(split_docs)\n", "\n", "# print(result[\"output_text\"])\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29\n" ] } ], "source": [ "from langchain_experimental.text_splitter import SemanticChunker\n", "\n", "text_splitter = SemanticChunker(embeddings, breakpoint_threshold_type=\"gradient\")\n", "docs = text_splitter.create_documents([' '.join([page.page_content for page in pages])])\n", "print(len(docs))\n", "\n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G.\"\n", "}\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Allen School of Computer Science & Engineering,\\nUniversity of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-\\nparison between different approaches is chal-\\nlenging. 
Training is computationally expen-\\nsive, often done on private datasets of different\\nsizes, and, as we will show, hyperparameter\\nchoices have significant impact on the final re-\\nsults. We present a replication study of BERT\\npretraining ( Devlin et al. ,2019 ) that carefully\\nmeasures the impact of many key hyperparam-\\neters and training data size. We find that BERT\\nwas significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. These results highlight the impor-\\ntance of previously overlooked design choices,\\nand raise questions about the source of re-\\ncently reported improvements. We release our\\nmodels and code.1\\n1 Introduction\\nSelf-training methods such as ELMo ( Peters et al. ,\\n2018 ), GPT ( Radford et al. ,2018 ), BERT\\n(Devlin et al. ,2019 ), XLM ( Lample and Conneau ,\\n2019 ), and XLNet ( Yang et al. ,2019 ) have\\nbrought significant performance gains, but it can\\nbe challenging to determine which aspects of\\nthe methods contribute the most. Training is\\ncomputationally expensive, limiting the amount\\nof tuning that can be done, and is often done with\\nprivate training data of varying sizes, limiting\\nour ability to measure the effects of the modeling\\nadvances. ∗Equal contribution. 1Our models and code are available at:\\nhttps://github.com/pytorch/fairseqWe present a replication study of BERT pre-\\ntraining ( Devlin et al. ,2019 ), which includes a\\ncareful evaluation of the effects of hyperparmeter\\ntuning and training set size. We find that BERT\\nwas significantly undertrained and propose an im-\\nproved recipe for training BERT models, which\\nwe call RoBERTa, that can match or exceed the\\nperformance of all of the post-BERT methods. Our modifications are simple, they include: (1)\\ntraining the model longer, with bigger batches,\\nover more data; (2) removing the next sentence\\nprediction objective; (3) training on longer se-\\nquences; and (4) dynamically changing the mask-\\ning pattern applied to the training data. We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects. When controlling for training data, our im-\\nproved training procedure improves upon the pub-\\nlished BERT results on both GLUE and SQuAD. When trained for longer over additional data, our\\nmodel achieves a score of 88.5 on the public\\nGLUE leaderboard, matching the 88.4 reported\\nbyYang et al. (2019 ). Our model establishes a\\nnew state-of-the-art on 4/9 of the GLUE tasks:\\nMNLI, QNLI, RTE and STS-B. We also match\\nstate-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked lan-\\nguage model training objective is competitive\\nwith other recently proposed training objectives\\nsuch as perturbed autoregressive language model-\\ning (Yang et al. ,2019 ).2\\nIn summary, the contributions of this paper\\nare: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work. 
alternatives that lead to better downstream task\\nperformance; (2) We use a novel dataset, CC-\\nNEWS, and confirm that using more data for pre-\\ntraining further improves performance on down-\\nstream tasks; (3) Our training improvements show\\nthat masked language model pretraining, under\\nthe right design choices, is competitive with all\\nother recently published methods. We release our\\nmodel, pretraining and fine-tuning code imple-\\nmented in PyTorch ( Paszke et al. ,2017 ). 2 Background\\nIn this section, we give a brief overview of the\\nBERT ( Devlin et al. ,2019 ) pretraining approach\\nand some of the training choices that we will ex-\\namine experimentally in the following section. 2.1 Setup\\nBERT takes as input a concatenation of two\\nsegments (sequences of tokens), x1,...,x N\\nandy1,...,yM. Segments usually consist of\\nmore than one natural sentence. The two seg-\\nments are presented as a single input sequence\\nto BERT with special tokens delimiting them:\\n[CLS],x1,...,x N,[SEP],y1,...,yM,[EOS]. MandNare constrained such that M+N < T ,\\nwhereTis a parameter that controls the maximum\\nsequence length during training. The model is first pretrained on a large unla-\\nbeled text corpus and subsequently finetuned us-\\ning end-task labeled data. 2.2 Architecture\\nBERT uses the now ubiquitous transformer archi-\\ntecture ( Vaswani et al. ,2017 ), which we will not\\nreview in detail. We use a transformer architecture\\nwithLlayers. Each block uses Aself-attention\\nheads and hidden dimension H. 2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction. Masked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. The MLM objective is a cross-entropy\\nloss on predicting the masked tokens. BERT uni-\\nformly selects 15% of the input tokens for possi-\\nble replacement. Of the selected tokens, 80% are\\nreplaced with [MASK], 10% are left unchanged,and 10% are replaced by a randomly selected vo-\\ncabulary token. In the original implementation, random mask-\\ning and replacement is performed once in the be-\\nginning and saved for the duration of training, al-\\nthough in practice, data is duplicated so the mask\\nis not always the same for every training sentence\\n(see Section 4.1). Next Sentence Prediction (NSP) NSP is a bi-\\nnary classification loss for predicting whether two\\nsegments follow each other in the original text. Positive examples are created by taking consecu-\\ntive sentences from the text corpus. Negative ex-\\namples are created by pairing segments from dif-\\nferent documents. Positive and negative examples\\nare sampled with equal probability. The NSP objective was designed to improve\\nperformance on downstream tasks, such as Natural\\nLanguage Inference ( Bowman et al. ,2015 ), which\\nrequire reasoning about the relationships between\\npairs of sentences. 2.4 Optimization\\nBERT is optimized with Adam ( Kingma and Ba ,\\n2015 ) using the following parameters: β1= 0.9,\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\ncay of0.01. The learning rate is warmed up\\nover the first 10,000 steps to a peak value of\\n1e-4, and then linearly decayed. BERT trains\\nwith a dropout of 0.1 on all layers and at-\\ntention weights, and a GELU activation func-\\ntion ( Hendrycks and Gimpel ,2016 ). 
Models are\\npretrained for S=1,000,000 updates, with mini-\\nbatches containing B=256 sequences of maxi-\\nmum length T=512 tokens. 2.5 Data\\nBERT is trained on a combination of B OOK COR-\\nPUS (Zhu et al. ,2015 ) plus English W IKIPEDIA ,\\nwhich totals 16GB of uncompressed text.3\\n3 Experimental Setup\\nIn this section, we describe the experimental setup\\nfor our replication study of BERT. 3.1 Implementation\\nWe reimplement BERT in FAIRSEQ (Ott et al. ,\\n2019 ). We primarily follow the original BERT\\n3Yang et al. (2019 ) use the same dataset but report having\\nonly 13GB of text after data cleaning. This is most likely due\\nto subtle differences in cleaning of the Wikipedia data. optimization hyperparameters, given in Section 2,\\nexcept for the peak learning rate and number of\\nwarmup steps, which are tuned separately for each\\nsetting. We additionally found training to be very\\nsensitive to the Adam epsilon term, and in some\\ncases we obtained better performance or improved\\nstability after tuning it. Similarly, we found setting\\nβ2= 0.98to improve stability when training with\\nlarge batch sizes. We pretrain with sequences of at most T= 512\\ntokens. Unlike Devlin et al. (2019 ), we do not ran-\\ndomly inject short sequences, and we do not train\\nwith a reduced sequence length for the first 90% of\\nupdates. We train only with full-length sequences. We train with mixed precision floating point\\narithmetic on DGX-1 machines, each with 8 ×\\n32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ). 3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. (2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. Several efforts\\nhave trained on datasets larger and more diverse\\nthan the original BERT ( Radford et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",2019 ;\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\nnately, not all of the additional datasets can be\\npublicly released. For our study, we focus on gath-\\nering as much data as possible for experimenta-\\ntion, allowing us to match the overall quality and\\nquantity of data as appropriate for each compari-\\nson. We consider five English-language corpora of\\nvarying sizes and domains, totaling over 160GB\\nof uncompressed text. We use the following text\\ncorpora:\\n•BOOK CORPUS (Zhu et al. ,2015 ) plus English\\nWIKIPEDIA . This is the original data used to\\ntrain BERT. (16GB). •CC-N EWS, which we collected from the En-\\nglish portion of the CommonCrawl News\\ndataset ( Nagel ,2016 ). The data contains 63\\nmillion English news articles crawled between\\nSeptember 2016 and February 2019. (76GB af-\\nter filtering).4\\n•OPENWEBTEXT (Gokaslan and Cohen ,2019 ),\\nan open-source recreation of the WebText cor-\\n4We usenews-please (Hamborg et al. ,2017 ) to col-\\nlect and extract CC-N EWS. CC-N EWS is similar to the R E-\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\nis web content extracted from URLs shared on\\nReddit with at least three upvotes. (38GB).5\\n•STORIES , a dataset introduced in Trinh and Le\\n(2018 ) containing a subset of CommonCrawl\\ndata filtered to match the story-like style of\\nWinograd schemas. (31GB). 
3.3 Evaluation\\nFollowing previous work, we evaluate our pre-\\ntrained models on downstream tasks using the fol-\\nlowing three benchmarks. GLUE The General Language Understand-\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\n2019b ) is a collection of 9 datasets for evaluating\\nnatural language understanding systems.6Tasks\\nare framed as either single-sentence classification\\nor sentence-pair classification tasks. The GLUE\\norganizers provide training and development data\\nsplits as well as a submission server and leader-\\nboard that allows participants to evaluate and com-\\npare their systems on private held-out test data. For the replication study in Section 4, we report\\nresults on the development sets after finetuning\\nthe pretrained models on the corresponding single-\\ntask training data (i.e., without multi-task training\\nor ensembling). Our finetuning procedure follows\\nthe original BERT paper ( Devlin et al. ,2019 ). In Section 5we additionally report test set re-\\nsults obtained from the public leaderboard. These\\nresults depend on a several task-specific modifica-\\ntions, which we describe in Section 5.1. SQuAD The Stanford Question Answering\\nDataset (SQuAD) provides a paragraph of context\\nand a question. The task is to answer the question\\nby extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1\\nand V2.0 ( Rajpurkar et al. ,2016 ,2018 ). In V1.1\\nthe context always contains an answer, whereas in\\n5The authors and their affiliated institutions are not in any\\nway affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA ( Warstadt et al. ,2018 ),\\nStanford Sentiment Treebank (SST) ( Socher et al. ,\\n2013 ), Microsoft Research Paragraph Corpus\\n(MRPC) ( Dolan and Brockett ,2005 ), Semantic Tex-\\ntual Similarity Benchmark (STS) ( Agirre et al. ,2007 ),\\nQuora Question Pairs (QQP) ( Iyer et al. ,2016 ), Multi-\\nGenre NLI (MNLI) ( Williams et al. ,2018 ), Question NLI\\n(QNLI) ( Rajpurkar et al. ,2016 ), Recognizing Textual\\nEntailment (RTE) ( Dagan et al. ,2006 ;Bar-Haim et al. ,\\n2006 ;Giampiccolo et al. ,2007 ;Bentivogli et al. ,2009 ) and\\nWinograd NLI (WNLI) ( Levesque et al. ,2011 ). V2.0 some questions are not answered in the pro-\\nvided context, making the task more challenging. For SQuAD V1.1 we adopt the same span pre-\\ndiction method as BERT ( Devlin et al. ,2019 ). For\\nSQuAD V2.0, we add an additional binary classi-\\nfier to predict whether the question is answerable,\\nwhich we train jointly by summing the classifica-\\ntion and span loss terms. During evaluation, we\\nonly predict span indices on pairs that are classi-\\nfied as answerable. RACE The ReAding Comprehension from Ex-\\naminations (RACE) ( Lai et al. ,2017 ) task is a\\nlarge-scale reading comprehension dataset with\\nmore than 28,000 passages and nearly 100,000\\nquestions. The dataset is collected from English\\nexaminations in China, which are designed for\\nmiddle and high school students. In RACE, each\\npassage is associated with multiple questions. For\\nevery question, the task is to select one correct an-\\nswer from four options. RACE has significantly\\nlonger context than other popular reading compre-\\nhension datasets and the proportion of questions\\nthat requires reasoning is very large. 4 Training Procedure Analysis\\nThis section explores and quantifies which choices\\nare important for successfully pretraining BERT\\nmodels. 
We keep the model architecture fixed.7\\nSpecifically, we begin by training BERT models\\nwith the same configuration as BERT BASE (L=\\n12,H= 768 ,A= 12 , 110M params). 4.1 Static vs. Dynamic Masking\\nAs discussed in Section 2, BERT relies on ran-\\ndomly masking and predicting tokens. The orig-\\ninal BERT implementation performed masking\\nonce during data preprocessing, resulting in a sin-\\nglestatic mask. To avoid using the same mask for\\neach training instance in every epoch, training data\\nwas duplicated 10 times so that each sequence is\\nmasked in 10 different ways over the 40 epochs of\\ntraining. Thus, each training sequence was seen\\nwith the same mask four times during training. We compare this strategy with dynamic mask-\\ningwhere we generate the masking pattern every\\ntime we feed a sequence to the model. This be-\\ncomes crucial when pretraining for more steps or\\nwith larger datasets. 7Studying architectural changes, including larger archi-\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2\\nreference 76.3 84.3 92.8\\nOur reimplementation:\\nstatic 78.3 84.3 92.5\\ndynamic 78.7 84.0 92.9\\nTable 1: Comparison between static and dynamic\\nmasking for BERT BASE. We report F1 for SQuAD and\\naccuracy for MNLI-m and SST-2. Reported results are\\nmedians over 5 random initializations (seeds). Refer-\\nence results are from Yang et al. (2019 ). Results Table 1compares the published\\nBERT BASE results from Devlin et al. (2019 ) to our\\nreimplementation with either static or dynamic\\nmasking. We find that our reimplementation\\nwith static masking performs similar to the\\noriginal BERT model, and dynamic masking is\\ncomparable or slightly better than static masking. Given these results and the additional efficiency\\nbenefits of dynamic masking, we use dynamic\\nmasking in the remainder of the experiments. 4.2 Model Input Format and Next Sentence\\nPrediction\\nIn the original BERT pretraining procedure, the\\nmodel observes two concatenated document seg-\\nments, which are either sampled contiguously\\nfrom the same document (with p= 0.5) or from\\ndistinct documents. In addition to the masked lan-\\nguage modeling objective, the model is trained to\\npredict whether the observed document segments\\ncome from the same or distinct documents via an\\nauxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an impor-\\ntant factor in training the original BERT model. Devlin et al. (2019 ) observe that removing NSP\\nhurts performance, with significant performance\\ndegradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the\\nnecessity of the NSP loss ( Lample and Conneau ,\\n2019 ;Yang et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",2019 ;Joshi et al. ,2019 ). To better understand this discrepancy, we com-\\npare several alternative training formats:\\n•SEGMENT -PAIR +NSP: This follows the original\\ninput format used in BERT ( Devlin et al. ,2019 ),\\nwith the NSP loss. Each input has a pair of seg-\\nments, which can each contain multiple natural\\nsentences, but the total combined length must\\nbe less than 512 tokens. 
Model SQuAD 1.1/2.0 MNLI-m SST-2 RACE\\nOur reimplementation (with NSP loss):\\nSEGMENT -PAIR 90.4/78.7 84.0 92.9 64.2\\nSENTENCE -PAIR 88.7/76.2 82.9 92.1 63.0\\nOur reimplementation (without NSP loss):\\nFULL -SENTENCES 90.4/79.1 84.7 92.5 64.8\\nDOC-SENTENCES 90.6/79.7 84.7 92.7 65.6\\nBERT BASE 88.5/76.3 84.3 92.8 64.3\\nXLNet BASE (K = 7) –/81.3 85.8 92.7 66.1\\nXLNet BASE (K = 6) –/81.0 85.6 93.4 66.7\\nTable 2: Development set results for base models pretrained over B OOK CORPUS and W IKIPEDIA . All models are\\ntrained for 1M steps with a batch size of 256 sequences. We rep ort F1 for SQuAD and accuracy for MNLI-m,\\nSST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT BASEand\\nXLNet BASEare from Yang et al. (2019 ). •SENTENCE -PAIR +NSP: Each input contains a\\npair of natural sentences , either sampled from\\na contiguous portion of one document or from\\nseparate documents. Since these inputs are sig-\\nnificantly shorter than 512 tokens, we increase\\nthe batch size so that the total number of tokens\\nremains similar to SEGMENT -PAIR +NSP. We re-\\ntain the NSP loss. •FULL -SENTENCES : Each input is packed with\\nfull sentences sampled contiguously from one\\nor more documents, such that the total length is\\nat most 512 tokens. Inputs may cross document\\nboundaries. When we reach the end of one doc-\\nument, we begin sampling sentences from the\\nnext document and add an extra separator token\\nbetween documents. We remove the NSP loss. •DOC-SENTENCES : Inputs are constructed sim-\\nilarly to FULL -SENTENCES , except that they\\nmay not cross document boundaries. Inputs\\nsampled near the end of a document may be\\nshorter than 512 tokens, so we dynamically in-\\ncrease the batch size in these cases to achieve\\na similar number of total tokens as FULL -\\nSENTENCES .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"We remove the NSP loss. Results Table 2shows results for the four dif-\\nferent settings. We first compare the original\\nSEGMENT -PAIR input format from Devlin et al. (2019 ) to the SENTENCE -PAIR format; both for-\\nmats retain the NSP loss, but the latter uses sin-\\ngle sentences. We find that using individual\\nsentences hurts performance on downstream\\ntasks , which we hypothesize is because the model\\nis not able to learn long-range dependencies.We next compare training without the NSP\\nloss and training with blocks of text from a sin-\\ngle document ( DOC-SENTENCES ). We find that\\nthis setting outperforms the originally published\\nBERT BASEresults and that removing the NSP loss\\nmatches or slightly improves downstream task\\nperformance , in contrast to Devlin et al. (2019 ). It is possible that the original BERT implementa-\\ntion may only have removed the loss term while\\nstill retaining the SEGMENT -PAIR input format. Finally we find that restricting sequences to\\ncome from a single document ( DOC-SENTENCES )\\nperforms slightly better than packing sequences\\nfrom multiple documents ( FULL -SENTENCES ). However, because the DOC-SENTENCES format\\nresults in variable batch sizes, we use FULL -\\nSENTENCES in the remainder of our experiments\\nfor easier comparison with related work. 
4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased\\nappropriately ( Ott et al. ,2018 ). Recent work has\\nshown that BERT is also amenable to large batch\\ntraining ( You et al. ,2019 ). Devlin et al. (2019 ) originally trained\\nBERT BASE for 1M steps with a batch size of\\n256 sequences. This is equivalent in computa-\\ntional cost, via gradient accumulation, to training\\nfor 125K steps with a batch size of 2K sequences,\\nor for 31K steps with a batch size of 8K. In Table 3we compare perplexity and end- bsz steps lr ppl MNLI-m SST-2\\n256 1M 1e-4 3.99 84.7 92.7\\n2K 125K 7e-4 3.68 85.2 92.9\\n8K 31K 1e-3 3.77 84.6 92.8\\nTable 3: Perplexity on held-out training data ( ppl) and\\ndevelopment set accuracy for base models trained over\\nBOOK CORPUS and W IKIPEDIA with varying batch\\nsizes ( bsz).\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"We tune the learning rate ( lr) for each set-\\nting. Models make the same number of passes over the\\ndata (epochs) and have the same computational cost. task performance of BERT BASE as we increase the\\nbatch size, controlling for the number of passes\\nthrough the training data. We observe that train-\\ning with large batches improves perplexity for the\\nmasked language modeling objective, as well as\\nend-task accuracy. Large batches are also easier to\\nparallelize via distributed data parallel training,8\\nand in later experiments we train with batches of\\n8K sequences. Notably You et al. (2019 ) train BERT with even\\nlarger batche sizes, up to 32K sequences. We leave\\nfurther exploration of the limits of large batch\\ntraining to future work. 4.4 Text Encoding\\nByte-Pair Encoding (BPE) ( Sennrich et al. ,2016 )\\nis a hybrid between character- and word-level rep-\\nresentations that allows handling the large vocab-\\nularies common in natural language corpora. In-\\nstead of full words, BPE relies on subwords units,\\nwhich are extracted by performing statistical anal-\\nysis of the training corpus. BPE vocabulary sizes typically range from\\n10K-100K subword units. However, unicode char-\\nacters can account for a sizeable portion of this\\nvocabulary when modeling large and diverse cor-\\npora, such as the ones considered in this work. Radford et al. (2019 ) introduce a clever imple-\\nmentation of BPE that uses bytes instead of uni-\\ncode characters as the base subword units. Using\\nbytes makes it possible to learn a subword vocab-\\nulary of a modest size (50K units) that can still en-\\ncode any input text without introducing any “un-\\nknown” tokens. 8Large batch training can improve training efficiency even\\nwithout large scale parallel hardware through gradient ac-\\ncumulation , whereby gradients from multiple mini-batches\\nare accumulated locally before each optimization step. Thi s\\nfunctionality is supported natively in FAIRSEQ (Ott et al. ,\\n2019 ).The original BERT implementa-\\ntion ( Devlin et al. ,2019 ) uses a character-level\\nBPE vocabulary of size 30K, which is learned\\nafter preprocessing the input with heuristic tok-\\nenization rules. Following Radford et al. 
(2019 ),\\nwe instead consider training BERT with a larger\\nbyte-level BPE vocabulary containing 50K sub-\\nword units, without any additional preprocessing\\nor tokenization of the input. This adds approxi-\\nmately 15M and 20M additional parameters for\\nBERT BASEand BERT LARGE , respectively. Early experiments revealed only slight dif-\\nferences between these encodings, with the\\nRadford et al. (2019 ) BPE achieving slightly\\nworse end-task performance on some tasks. Nev-\\nertheless, we believe the advantages of a univer-\\nsal encoding scheme outweighs the minor degre-\\ndation in performance and use this encoding in\\nthe remainder of our experiments. A more de-\\ntailed comparison of these encodings is left to fu-\\nture work. 5 RoBERTa\\nIn the previous section we propose modifications\\nto the BERT pretraining procedure that improve\\nend-task performance. We now aggregate these\\nimprovements and evaluate their combined im-\\npact. We call this configuration RoBERTa for\\nRobustly optimized BERT approach. Specifi-\\ncally, RoBERTa is trained with dynamic mask-\\ning (Section 4.1),FULL -SENTENCES without NSP\\nloss (Section 4.2), large mini-batches (Section 4.3)\\nand a larger byte-level BPE (Section 4.4). Additionally, we investigate two other impor-\\ntant factors that have been under-emphasized in\\nprevious work: (1) the data used for pretraining,\\nand (2) the number of training passes through the\\ndata. For example, the recently proposed XLNet\\narchitecture ( Yang et al. ,2019 ) is pretrained us-\\ning nearly 10 times more data than the original\\nBERT ( Devlin et al. ,2019 ). It is also trained with\\na batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT. To help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa\\nfollowing the BERT LARGE architecture ( L= 24 ,\\nH= 1024 ,A= 16 , 355M parameters). We\\npretrain for 100K steps over a comparable B OOK -\\nCORPUS plus W IKIPEDIA dataset as was used in Model data bsz stepsSQuADMNLI-m SST-2(v1.1/2.0)\\nRoBERTa\\nwith B OOKS + W IKI 16GB 8K 100K 93.6/87.3 89.0 95.3\\n+ additional data ( §3.2) 160GB 8K 100K 94.0/87.7 89.3 95.6\\n+ pretrain longer 160GB 8K 300K 94.4/88.7 90.0 96.1\\n+ pretrain even longer 160GB 8K 500K 94.6/89.4 90.2 96.4\\nBERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\\nfor longer (100K →300K→500K steps).\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Each row accumulates improvements from the row s above. RoBERTa\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from\\nDevlin et al. (2019 ) and Yang et al. (2019 ), respectively. Complete results on all GLUE tasks can be fo und in the\\nAppendix.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Devlin et al. (2019 ). We pretrain our model using\\n1024 V100 GPUs for approximately one day. Results We present our results in Table 4. 
When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4. Next, we combine this data with the three ad-\\nditional datasets described in Section 3.2. We\\ntrain RoBERTa over the combined data with the\\nsame number of training steps as before (100K). In total, we pretrain over 160GB of text. We ob-\\nserve further improvements in performance across\\nall downstream tasks, validating the importance of\\ndata size and diversity in pretraining.9\\nFinally, we pretrain RoBERTa for significantly\\nlonger, increasing the number of pretraining steps\\nfrom 100K to 300K, and then further to 500K. We\\nagain observe significant gains in downstream task\\nperformance, and the 300K and 500K step mod-\\nels outperform XLNet LARGE across most tasks. We\\nnote that even our longest-trained model does not\\nappear to overfit our data and would likely benefit\\nfrom additional training. In the rest of the paper, we evaluate our best\\nRoBERTa model on the three different bench-\\nmarks: GLUE, SQuaD and RACE. Specifically\\n9Our experiments conflate increases in data size and di-\\nversity. We leave a more careful analysis of these two dimen-\\nsions to future work.we consider RoBERTa trained for 500K steps over\\nall five of the datasets introduced in Section 3.2. 5.1 GLUE Results\\nFor GLUE we consider two finetuning settings. In the first setting ( single-task, dev ) we finetune\\nRoBERTa separately for each of the GLUE tasks,\\nusing only the training data for the correspond-\\ning task. We consider a limited hyperparameter\\nsweep for each task, with batch sizes ∈ {16,32}\\nand learning rates ∈ {1e−5,2e−5,3e−5}, with a\\nlinear warmup for the first 6% of steps followed by\\na linear decay to 0. We finetune for 10 epochs and\\nperform early stopping based on each task’s eval-\\nuation metric on the dev set. The rest of the hyper-\\nparameters remain the same as during pretraining. In this setting, we report the median development\\nset results for each task over five random initial-\\nizations, without model ensembling. In the second setting ( ensembles, test ), we com-\\npare RoBERTa to other approaches on the test set\\nvia the GLUE leaderboard. While many submis-\\nsions to the GLUE leaderboard depend on multi-\\ntask finetuning, our submission depends only on\\nsingle-task finetuning . For RTE, STS and MRPC\\nwe found it helpful to finetune starting from the\\nMNLI single-task model, rather than the baseline\\npretrained RoBERTa. We explore a slightly wider\\nhyperparameter space, described in the Appendix,\\nand ensemble between 5 and 7 models per task. MNLI QNLI QQP RTE SST MRPC CoLA STS WNLI Avg\\nSingle-task single models on dev\\nBERT LARGE 86.6/- 92.3 91.3 70.4 93.2 88.0 60.6 90.0 - -\\nXLNet LARGE 89.8/- 93.9 91.8 83.8 95.6 89.2 63.6 91.8 - -\\nRoBERTa 90.2/90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4 91.3 -\\nEnsembles on test (from leaderboard as of July 25, 2019)\\nALICE 88.2/87.9 95.7 90.7 83.5 95.2 92.6 68.6 91.1 80.8 86.3\\nMT-DNN 87.9/87.4 96.0 89.9 86.3 96.5 92.7 68.4 91.1 89.0 87.6\\nXLNet 90.2/89.8 98.6 90.3 86.3 96.8 93.0 67.8 91.6 90.4 88.4\\nRoBERTa 90.8/90.2 98.9 90.2 88.2 96.7 92.3 67.8 92.2 89.0 88.5\\nTable 5: Results on GLUE. All results are based on a 24-layer a rchitecture. BERT LARGE and XLNet LARGE results\\nare from Devlin et al. (2019 ) and Yang et al. (2019 ), respectively. 
RoBERTa results on the development set are a\\nmedian over five runs. RoBERTa results on the test set are ense mbles of single-task models. For RTE, STS and\\nMRPC we finetune starting from the MNLI model instead of the ba seline pretrained model. Averages are obtained\\nfrom the GLUE leaderboard. Task-specific modifications Two of the GLUE\\ntasks require task-specific finetuning approaches\\nto achieve competitive leaderboard results. QNLI : Recent submissions on the GLUE\\nleaderboard adopt a pairwise ranking formulation\\nfor the QNLI task, in which candidate answers\\nare mined from the training set and compared to\\none another, and a single (question, candidate)\\npair is classified as positive ( Liu et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",2019b ,a;\\nYang et al. ,2019 ). This formulation significantly\\nsimplifies the task, but is not directly comparable\\nto BERT ( Devlin et al. ,2019 ). Following recent\\nwork, we adopt the ranking approach for our test\\nsubmission, but for direct comparison with BERT\\nwe report development set results based on a pure\\nclassification approach. WNLI : We found the provided NLI-format\\ndata to be challenging to work with. Instead\\nwe use the reformatted WNLI data from Super-\\nGLUE ( Wang et al. ,2019a ), which indicates the\\nspan of the query pronoun and referent. We fine-\\ntune RoBERTa using the margin ranking loss from\\nKocijan et al. (2019 ). For a given input sentence,\\nwe use spaCy ( Honnibal and Montani ,2017 ) to\\nextract additional candidate noun phrases from the\\nsentence and finetune our model so that it assigns\\nhigher scores to positive referent phrases than for\\nany of the generated negative candidate phrases. One unfortunate consequence of this formulation\\nis that we can only make use of the positive train-\\ning examples, which excludes over half of the pro-\\nvided training examples.10\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\nfirst setting ( single-task, dev ), RoBERTa achieves\\nstate-of-the-art results on all 9 of the GLUE\\ntask development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-\\ntraining objective, compared to more mundane de-\\ntails like dataset size and training time that we ex-\\nplore in this work. In the second setting ( ensembles, test ), we\\nsubmit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-\\nther improve these results by incorporating more\\nsophisticated multi-task finetuning procedures. 5.2 SQuAD Results\\nWe adopt a much simpler approach for SQuAD\\ncompared to past work. In particular, while\\nboth BERT ( Devlin et al. ,2019 ) and XL-\\nNet ( Yang et al. ,2019 ) augment their training data\\nwith additional QA datasets, we only finetune\\nRoBERTa using the provided SQuAD training\\ndata .Yang et al. 
(2019 ) also employed a custom\\nlayer-wise learning rate schedule to finetune\\nresults could potentially be improved by augmenting this wi th\\nadditional pronoun disambiguation datasets. ModelSQuAD 1.1 SQuAD 2.0\\nEM F1 EM F1\\nSingle models on dev, w/o data augmentation\\nBERT LARGE 84.1 90.9 79.0 81.8\\nXLNet LARGE 89.0 94.5 86.1 88.8\\nRoBERTa 88.9 94.6 86.5 89.4\\nSingle models on test (as of July 25, 2019)\\nXLNet LARGE 86.3†89.1†\\nRoBERTa 86.8 89.8\\nXLNet + SG-Net Verifier 87.0†89.9†\\nTable 6: Results on SQuAD. †indicates results that de-\\npend on additional external training data. RoBERTa\\nuses only the provided SQuAD data in both dev and\\ntest settings. BERT LARGE and XLNet LARGE results are\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-\\nspectively. XLNet, while we use the same learning rate for\\nall layers. For SQuAD v1.1 we follow the same finetun-\\ning procedure as Devlin et al. (2019 ). For SQuAD\\nv2.0, we additionally classify whether a given\\nquestion is answerable; we train this classifier\\njointly with the span predictor by summing the\\nclassification and span loss terms. Results We present our results in Table 6. On\\nthe SQuAD v1.1 development set, RoBERTa\\nmatches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\nadditional external training data. In contrast, our\\nsubmission does not use any additional data. Our single RoBERTa model outperforms all but\\none of the single model submissions, and is the\\ntop scoring system among those that do not rely\\non data augmentation. 5.3 RACE Results\\nIn RACE, systems are provided with a passage of\\ntext, an associated question, and four candidate an-\\nswers. Systems are required to classify which of\\nthe four candidate answers is correct. We modify RoBERTa for this task by concate-Model Accuracy Middle High\\nSingle models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ). nating each candidate answer with the correspond-\\ning question and passage. We then encode each of\\nthese four sequences and pass the resulting [CLS]\\nrepresentations through a fully-connected layer,\\nwhich is used to predict the correct answer. We\\ntruncate question-answer pairs that are longer than\\n128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens. Results on the RACE test sets are presented in\\nTable 7. RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings. 6 Related Work\\nPretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;\\nLample and Conneau ,2019 ). Many recent\\npapers have used a basic recipe of finetuning\\nmodels for each end task ( Howard and Ruder ,\\n2018 ;Radford et al. ,2018 ), and pretraining\\nwith some variant of a masked language model\\nobjective. 
However, newer methods have\\nimproved performance by multi-task fine tun-\\ning ( Dong et al. ,2019 ), incorporating entity\\nembeddings ( Sun et al. ,2019 ), span predic-\\ntion ( Joshi et al. ,2019 ), and multiple variants\\nof autoregressive pretraining ( Song et al. ,2019 ;\\nChan et al. ,2019 ;Yang et al. ,2019 ). Perfor-\\nmance is also typically improved by training\\nbigger models on more data ( Devlin et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",\\n2019 ;Baevski et al. ,2019 ;Yang et al. ,2019 ;\\nRadford et al. ,2019 ). Our goal was to replicate,\\nsimplify, and better tune the training of BERT,\\nas a reference point for better understanding the\\nrelative performance of all of these methods. 7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-\\nquences; and dynamically changing the masking\\npattern applied to the training data. Our improved\\npretraining procedure, which we call RoBERTa,\\nachieves state-of-the-art results on GLUE, RACE\\nand SQuAD, without multi-task finetuning for\\nGLUE or additional data for SQuAD. These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives. We additionally use a novel dataset,\\nCC-N EWS, and release our models and\\ncode for pretraining and finetuning at:\\nhttps://github.com/pytorch/fairseq . References\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\ntowski, editors. 2007. Proceedings of the Fourth\\nInternational Workshop on Semantic Evaluations\\n(SemEval-2007) . Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke\\nZettlemoyer, and Michael Auli. 2019. Cloze-\\ndriven pretraining of self-attention networks. arXiv\\npreprint arXiv:1903.07785 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,\\nDanilo Giampiccolo, Bernardo Magnini, and Idan\\nSzpektor. 2006. The second PASCAL recognising\\ntextual entailment challenge. In Proceedings of the\\nsecond PASCAL challenges workshop on recognis-\\ning textual entailment . Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo\\nGiampiccolo, and Bernardo Magnini. 2009. The\\nfifth PASCAL recognizing textual entailment chal-\\nlenge. Samuel R Bowman, Gabor Angeli, Christopher Potts,\\nand Christopher D Manning. 2015. A large anno-\\ntated corpus for learning natural language inference. InEmpirical Methods in Natural Language Process-\\ning (EMNLP) .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"William Chan, Nikita Kitaev, Kelvin Guu, Mitchell\\nStern, and Jakob Uszkoreit. 2019. KERMIT: Gener-\\native insertion-based modeling for sequences.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"arXiv\\npreprint arXiv:1906.01604 .Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. 
The PASCAL recognising textual entailment\\nchallenge. In Machine learning challenges. evalu-\\nating predictive uncertainty, visual object classifica-\\ntion, and recognising tectual entailment . Andrew M Dai and Quoc V Le. 2015. Semi-supervised\\nsequence learning. In Advances in Neural Informa-\\ntion Processing Systems (NIPS) . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. In North American Association for Com-\\nputational Linguistics (NAACL) . William B Dolan and Chris Brockett. 2005. Auto-\\nmatically constructing a corpus of sentential para-\\nphrases. In Proceedings of the International Work-\\nshop on Paraphrasing . Li Dong, Nan Yang, Wenhui Wang, Furu Wei,\\nXiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. The third PASCAL recog-\\nnizing textual entailment challenge. In Proceedings\\nof the ACL-PASCAL workshop on textual entailment\\nand paraphrasing . Aaron Gokaslan and Vanya Cohen. 2019. Openweb-\\ntext corpus. http://web.archive.org/\\nsave/http://Skylion007.github.io/\\nOpenWebTextCorpus .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Felix Hamborg, Norman Meuschke, Corinna Bre-\\nitinger, and Bela Gipp. 2017. news-please: A\\ngeneric news crawler and extractor. In Proceedings\\nof the 15th International Symposium of Information\\nScience .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Dan Hendrycks and Kevin Gimpel. 2016. Gaus-\\nsian error linear units (gelus). arXiv preprint\\narXiv:1606.08415 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Matthew Honnibal and Ines Montani. 2017. spaCy 2:\\nNatural language understanding with Bloom embed-\\ndings, convolutional neural networks and incremen-\\ntal parsing. To appear. Jeremy Howard and Sebastian Ruder. 2018. Universal\\nlanguage model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 . Shankar Iyer, Nikhil Dandekar, and Kornl Cser-\\nnai. 2016. First quora dataset release: Question\\npairs.https://data.quora.com/First-\\nQuora-Dataset-Release-Question-\\nPairs . Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by repre-\\nsenting and predicting spans. arXiv preprint\\narXiv:1907.10529 . Diederik Kingma and Jimmy Ba. 2015. Adam: A\\nmethod for stochastic optimization. 
In International\\nConference on Learning Representations (ICLR) .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu,\\nYordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema\\nchallenge. arXiv preprint arXiv:1905.06290 . Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,\\nand Eduard Hovy. 2017. Race: Large-scale reading\\ncomprehension dataset from examinations. arXiv\\npreprint arXiv:1704.04683 . Guillaume Lample and Alexis Conneau. 2019. Cross-\\nlingual language model pretraining. arXiv preprint\\narXiv:1901.07291 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Hector J Levesque, Ernest Davis, and Leora Morgen-\\nstern. 2011. The Winograd schema challenge. In\\nAAAI Spring Symposium: Logical Formalizations of\\nCommonsense Reasoning . Xiaodong Liu, Pengcheng He, Weizhu Chen, and\\nJianfeng Gao. 2019a. Improving multi-task deep\\nneural networks via knowledge distillation for\\nnatural language understanding. arXiv preprint\\narXiv:1904.09482 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-\\nfeng Gao. 2019b. Multi-task deep neural networks\\nfor natural language understanding. arXiv preprint\\narXiv:1901.11504 . Bryan McCann, James Bradbury, Caiming Xiong, and\\nRichard Socher. 2017. Learned in translation: Con-\\ntextualized word vectors. In Advances in Neural In-\\nformation Processing Systems (NIPS) , pages 6297–\\n6308.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Paulius Micikevicius, Sharan Narang, Jonah Alben,\\nGregory Diamos, Erich Elsen, David Garcia, Boris\\nGinsburg, Michael Houston, Oleksii Kuchaiev,\\nGanesh Venkatesh, and Hao Wu. 2018. Mixed preci-\\nsion training. In International Conference on Learn-\\ning Representations . Sebastian Nagel. 2016. Cc-news. http:\\n//web.archive.org/save/http:\\n//commoncrawl.org/2016/10/news-\\ndataset-available .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Myle Ott, Sergey Edunov, Alexei Baevski, Angela\\nFan, Sam Gross, Nathan Ng, David Grangier, and\\nMichael Auli. 2019. FAIRSEQ : A fast, exten-\\nsible toolkit for sequence modeling. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL): System Demonstrations .Myle Ott, Sergey Edunov, David Grangier, and\\nMichael Auli. 2018. Scaling neural machine trans-\\nlation. In Proceedings of the Third Conference on\\nMachine Translation (WMT) . Adam Paszke, Sam Gross, Soumith Chintala, Gre-\\ngory Chanan, Edward Yang, Zachary DeVito, Zem-\\ning Lin, Alban Desmaison, Luca Antiga, and Adam\\nLerer. 2017. Automatic differentiation in PyTorch. InNIPS Autodiff Workshop . Matthew Peters, Mark Neumann, Mohit Iyyer, Matt\\nGardner, Christopher Clark, Kenton Lee, and Luke\\nZettlemoyer. 2018. Deep contextualized word repre-\\nsentations. In North American Association for Com-\\nputational Linguistics (NAACL) . Alec Radford, Karthik Narasimhan, Time Salimans,\\nand Ilya Sutskever. 2018. 
Improving language un-\\nderstanding with unsupervised learning. Technical\\nreport, OpenAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\nDario Amodei, and Ilya Sutskever. 2019. Language\\nmodels are unsupervised multitask learners. Techni-\\ncal report, OpenAI.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable ques-\\ntions for squad. In Association for Computational\\nLinguistics (ACL) . Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and\\nPercy Liang. 2016. SQuAD: 100,000+ questions for\\nmachine comprehension of text. In Empirical Meth-\\nods in Natural Language Processing (EMNLP) . Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with\\nsubword units. In Association for Computational\\nLinguistics (ACL) , pages 1715–1725. Richard Socher, Alex Perelygin, Jean Wu, Jason\\nChuang, Christopher D Manning, Andrew Ng, and\\nChristopher Potts. 2013. Recursive deep models\\nfor semantic compositionality over a sentiment tree-\\nbank. In Empirical Methods in Natural Language\\nProcessing (EMNLP) . Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and\\nTie-Yan Liu. 2019. MASS: Masked sequence\\nto sequence pre-training for language generation. InInternational Conference on Machine Learning\\n(ICML) .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun\\nFeng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxi-\\nang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: En-\\nhanced representation through knowledge integra-\\ntion. arXiv preprint arXiv:1904.09223 . Trieu H Trinh and Quoc V Le. 2018. A simple\\nmethod for commonsense reasoning. arXiv preprint\\narXiv:1806.02847 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob\\nUszkoreit, Llion Jones, Aidan N Gomez, Łukasz\\nKaiser, and Illia Polosukhin. 2017. Attention is all\\nyou need. In Advances in neural information pro-\\ncessing systems . Alex Wang, Yada Pruksachatkun, Nikita Nangia,\\nAmanpreet Singh, Julian Michael, Felix Hill, Omer\\nLevy, and Samuel R.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Bowman. 2019a. SuperGLUE:\\nA stickier benchmark for general-purpose language\\nunderstanding systems.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"arXiv preprint 1905.00537 . Alex Wang, Amanpreet Singh, Julian Michael, Felix\\nHill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis plat-\\nform for natural language understanding. In Inter-\\nnational Conference on Learning Representations\\n(ICLR) . Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-\\nman. 2018. Neural network acceptability judg-\\nments. arXiv preprint 1805.12471 . Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen-\\ntence understanding through inference. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL) . Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-\\nbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. 
Xlnet: Generalized autoregressive pretrain-\\ning for language understanding. arXiv preprint\\narXiv:1906.08237 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,\\nJames Demmel, and Cho-Jui Hsieh. 2019. Reduc-\\ning bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962 . Rowan Zellers, Ari Holtzman, Hannah Rashkin,\\nYonatan Bisk, Ali Farhadi, Franziska Roesner, and\\nYejin Choi. 2019. Defending against neural fake\\nnews. arXiv preprint arXiv:1905.12616 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan\\nSalakhutdinov, Raquel Urtasun, Antonio Torralba,\\nand Sanja Fidler. 2015. Aligning books and movies:\\nTowards story-like visual explanations by watch-\\ning movies and reading books. In arXiv preprint\\narXiv:1506.06724 . Appendix for “RoBERTa: A Robustly\\nOptimized BERT Pretraining Approach”\\nA Full results on GLUE\\nIn Table 8we present the full set of development\\nset results for RoBERTa. We present results for\\naLARGE configuration that follows BERT LARGE ,\\nas well as a BASE configuration that follows\\nBERT BASE.B Pretraining Hyperparameters\\nTable 9describes the hyperparameters for pre-\\ntraining of RoBERTa LARGE and RoBERTa BASE\\nC Finetuning Hyperparameters\\nFinetuning hyperparameters for RACE, SQuAD\\nand GLUE are given in Table 10. We select the\\nbest hyperparameter values based on the median\\nof 5 random seeds for each task. MNLI QNLI QQP RTE SST MRPC CoLA STS\\nRoBERTa BASE\\n+ all data + 500k steps 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2\\nRoBERTa LARGE\\nwith B OOKS + W IKI 89.0 93.9 91.9 84.5 95.3 90.2 66.3 91.6\\n+ additional data ( §3.2) 89.3 94.0 92.0 82.7 95.6 91.4 66.1 92.2\\n+ pretrain longer 300k 90.0 94.5 92.2 83.3 96.1 91.1 67.4 92.3\\n+ pretrain longer 500k 90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4\\nTable 8: Development set results on GLUE tasks for various co nfigurations of RoBERTa. Hyperparam RoBERTa LARGE RoBERTa BASE\\nNumber of Layers 24 12\\nHidden size 1024 768\\nFFN inner hidden size 4096 3072\\nAttention heads 16 12\\nAttention head size 64 64\\nDropout 0.1 0.1\\nAttention Dropout 0.1 0.1\\nWarmup Steps 30k 24k\\nPeak Learning Rate 4e-4 6e-4\\nBatch Size 8k 8k\\nWeight Decay 0.01 0.01\\nMax Steps 500k 500k\\nLearning Rate Decay Linear Linear\\nAdamǫ 1e-6 1e-6\\nAdamβ1 0.9 0.9\\nAdamβ2 0.98 0.98\\nGradient Clipping 0.0 0.0\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE. 
Hyperparam RACE SQuAD GLUE\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }\\nBatch Size 16 48 {16, 32}\\nWeight Decay 0.1 0.01 0.1\\nMax Epochs 4 2 10\\nLearning Rate Decay Linear Linear Linear\\nWarmup ratio 0.06 0.06 0.06\\nTable 10: Hyperparameters for finetuning RoBERTa LARGE on RACE, SQuAD and GLUE.\"\n", "}\n", "\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Allen School of Computer Science & Engineering,\\nUniversity of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-\\nparison between different approaches is chal-\\nlenging. Training is computationally expen-\\nsive, often done on private datasets of different\\nsizes, and, as we will show, hyperparameter\\nchoices have significant impact on the final re-\\nsults. We present a replication study of BERT\\npretraining ( Devlin et al. ,2019 ) that carefully\\nmeasures the impact of many key hyperparam-\\neters and training data size. We find that BERT\\nwas significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. These results highlight the impor-\\ntance of previously overlooked design choices,\\nand raise questions about the source of re-\\ncently reported improvements. We release our\\nmodels and code.1\\n1 Introduction\\nSelf-training methods such as ELMo ( Peters et al. ,\\n2018 ), GPT ( Radford et al. ,2018 ), BERT\\n(Devlin et al. ,2019 ), XLM ( Lample and Conneau ,\\n2019 ), and XLNet ( Yang et al. ,2019 ) have\\nbrought significant performance gains, but it can\\nbe challenging to determine which aspects of\\nthe methods contribute the most. Training is\\ncomputationally expensive, limiting the amount\\nof tuning that can be done, and is often done with\\nprivate training data of varying sizes, limiting\\nour ability to measure the effects of the modeling\\nadvances. ∗Equal contribution. 1Our models and code are available at:\\nhttps://github.com/pytorch/fairseqWe present a replication study of BERT pre-\\ntraining ( Devlin et al. ,2019 ), which includes a\\ncareful evaluation of the effects of hyperparmeter\\ntuning and training set size. We find that BERT\\nwas significantly undertrained and propose an im-\\nproved recipe for training BERT models, which\\nwe call RoBERTa, that can match or exceed the\\nperformance of all of the post-BERT methods. Our modifications are simple, they include: (1)\\ntraining the model longer, with bigger batches,\\nover more data; (2) removing the next sentence\\nprediction objective; (3) training on longer se-\\nquences; and (4) dynamically changing the mask-\\ning pattern applied to the training data. 
We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects. When controlling for training data, our im-\\nproved training procedure improves upon the pub-\\nlished BERT results on both GLUE and SQuAD. When trained for longer over additional data, our\\nmodel achieves a score of 88.5 on the public\\nGLUE leaderboard, matching the 88.4 reported\\nbyYang et al. (2019 ). Our model establishes a\\nnew state-of-the-art on 4/9 of the GLUE tasks:\\nMNLI, QNLI, RTE and STS-B. We also match\\nstate-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked lan-\\nguage model training objective is competitive\\nwith other recently proposed training objectives\\nsuch as perturbed autoregressive language model-\\ning (Yang et al. ,2019 ).2\\nIn summary, the contributions of this paper\\nare: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work. alternatives that lead to better downstream task\\nperformance; (2) We use a novel dataset, CC-\\nNEWS, and confirm that using more data for pre-\\ntraining further improves performance on down-\\nstream tasks; (3) Our training improvements show\\nthat masked language model pretraining, under\\nthe right design choices, is competitive with all\\nother recently published methods. We release our\\nmodel, pretraining and fine-tuning code imple-\\nmented in PyTorch ( Paszke et al. ,2017 ). 2 Background\\nIn this section, we give a brief overview of the\\nBERT ( Devlin et al. ,2019 ) pretraining approach\\nand some of the training choices that we will ex-\\namine experimentally in the following section. 2.1 Setup\\nBERT takes as input a concatenation of two\\nsegments (sequences of tokens), x1,...,x N\\nandy1,...,yM. Segments usually consist of\\nmore than one natural sentence. The two seg-\\nments are presented as a single input sequence\\nto BERT with special tokens delimiting them:\\n[CLS],x1,...,x N,[SEP],y1,...,yM,[EOS]. MandNare constrained such that M+N < T ,\\nwhereTis a parameter that controls the maximum\\nsequence length during training. The model is first pretrained on a large unla-\\nbeled text corpus and subsequently finetuned us-\\ning end-task labeled data. 2.2 Architecture\\nBERT uses the now ubiquitous transformer archi-\\ntecture ( Vaswani et al. ,2017 ), which we will not\\nreview in detail. We use a transformer architecture\\nwithLlayers. Each block uses Aself-attention\\nheads and hidden dimension H. 2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction. Masked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. The MLM objective is a cross-entropy\\nloss on predicting the masked tokens. BERT uni-\\nformly selects 15% of the input tokens for possi-\\nble replacement. Of the selected tokens, 80% are\\nreplaced with [MASK], 10% are left unchanged,and 10% are replaced by a randomly selected vo-\\ncabulary token. In the original implementation, random mask-\\ning and replacement is performed once in the be-\\nginning and saved for the duration of training, al-\\nthough in practice, data is duplicated so the mask\\nis not always the same for every training sentence\\n(see Section 4.1). 
Next Sentence Prediction (NSP) NSP is a bi-\\nnary classification loss for predicting whether two\\nsegments follow each other in the original text. Positive examples are created by taking consecu-\\ntive sentences from the text corpus. Negative ex-\\namples are created by pairing segments from dif-\\nferent documents. Positive and negative examples\\nare sampled with equal probability. The NSP objective was designed to improve\\nperformance on downstream tasks, such as Natural\\nLanguage Inference ( Bowman et al. ,2015 ), which\\nrequire reasoning about the relationships between\\npairs of sentences. 2.4 Optimization\\nBERT is optimized with Adam ( Kingma and Ba ,\\n2015 ) using the following parameters: β1= 0.9,\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\ncay of0.01. The learning rate is warmed up\\nover the first 10,000 steps to a peak value of\\n1e-4, and then linearly decayed. BERT trains\\nwith a dropout of 0.1 on all layers and at-\\ntention weights, and a GELU activation func-\\ntion ( Hendrycks and Gimpel ,2016 ). Models are\\npretrained for S=1,000,000 updates, with mini-\\nbatches containing B=256 sequences of maxi-\\nmum length T=512 tokens. 2.5 Data\\nBERT is trained on a combination of B OOK COR-\\nPUS (Zhu et al. ,2015 ) plus English W IKIPEDIA ,\\nwhich totals 16GB of uncompressed text.3\\n3 Experimental Setup\\nIn this section, we describe the experimental setup\\nfor our replication study of BERT. 3.1 Implementation\\nWe reimplement BERT in FAIRSEQ (Ott et al. ,\\n2019 ). We primarily follow the original BERT\\n3Yang et al. (2019 ) use the same dataset but report having\\nonly 13GB of text after data cleaning. This is most likely due\\nto subtle differences in cleaning of the Wikipedia data. optimization hyperparameters, given in Section 2,\\nexcept for the peak learning rate and number of\\nwarmup steps, which are tuned separately for each\\nsetting. We additionally found training to be very\\nsensitive to the Adam epsilon term, and in some\\ncases we obtained better performance or improved\\nstability after tuning it. Similarly, we found setting\\nβ2= 0.98to improve stability when training with\\nlarge batch sizes. We pretrain with sequences of at most T= 512\\ntokens. Unlike Devlin et al. (2019 ), we do not ran-\\ndomly inject short sequences, and we do not train\\nwith a reduced sequence length for the first 90% of\\nupdates. We train only with full-length sequences. We train with mixed precision floating point\\narithmetic on DGX-1 machines, each with 8 ×\\n32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ). 3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. (2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. Several efforts\\nhave trained on datasets larger and more diverse\\nthan the original BERT ( Radford et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",2019 ;\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\nnately, not all of the additional datasets can be\\npublicly released. For our study, we focus on gath-\\nering as much data as possible for experimenta-\\ntion, allowing us to match the overall quality and\\nquantity of data as appropriate for each compari-\\nson. We consider five English-language corpora of\\nvarying sizes and domains, totaling over 160GB\\nof uncompressed text. 
We use the following text\\ncorpora:\\n•BOOK CORPUS (Zhu et al. ,2015 ) plus English\\nWIKIPEDIA . This is the original data used to\\ntrain BERT. (16GB). •CC-N EWS, which we collected from the En-\\nglish portion of the CommonCrawl News\\ndataset ( Nagel ,2016 ). The data contains 63\\nmillion English news articles crawled between\\nSeptember 2016 and February 2019. (76GB af-\\nter filtering).4\\n•OPENWEBTEXT (Gokaslan and Cohen ,2019 ),\\nan open-source recreation of the WebText cor-\\n4We usenews-please (Hamborg et al. ,2017 ) to col-\\nlect and extract CC-N EWS. CC-N EWS is similar to the R E-\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\nis web content extracted from URLs shared on\\nReddit with at least three upvotes. (38GB).5\\n•STORIES , a dataset introduced in Trinh and Le\\n(2018 ) containing a subset of CommonCrawl\\ndata filtered to match the story-like style of\\nWinograd schemas. (31GB). 3.3 Evaluation\\nFollowing previous work, we evaluate our pre-\\ntrained models on downstream tasks using the fol-\\nlowing three benchmarks. GLUE The General Language Understand-\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\n2019b ) is a collection of 9 datasets for evaluating\\nnatural language understanding systems.6Tasks\\nare framed as either single-sentence classification\\nor sentence-pair classification tasks. The GLUE\\norganizers provide training and development data\\nsplits as well as a submission server and leader-\\nboard that allows participants to evaluate and com-\\npare their systems on private held-out test data. For the replication study in Section 4, we report\\nresults on the development sets after finetuning\\nthe pretrained models on the corresponding single-\\ntask training data (i.e., without multi-task training\\nor ensembling). Our finetuning procedure follows\\nthe original BERT paper ( Devlin et al. ,2019 ). In Section 5we additionally report test set re-\\nsults obtained from the public leaderboard. These\\nresults depend on a several task-specific modifica-\\ntions, which we describe in Section 5.1. SQuAD The Stanford Question Answering\\nDataset (SQuAD) provides a paragraph of context\\nand a question. The task is to answer the question\\nby extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1\\nand V2.0 ( Rajpurkar et al. ,2016 ,2018 ). In V1.1\\nthe context always contains an answer, whereas in\\n5The authors and their affiliated institutions are not in any\\nway affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA ( Warstadt et al. ,2018 ),\\nStanford Sentiment Treebank (SST) ( Socher et al. ,\\n2013 ), Microsoft Research Paragraph Corpus\\n(MRPC) ( Dolan and Brockett ,2005 ), Semantic Tex-\\ntual Similarity Benchmark (STS) ( Agirre et al. ,2007 ),\\nQuora Question Pairs (QQP) ( Iyer et al. ,2016 ), Multi-\\nGenre NLI (MNLI) ( Williams et al. ,2018 ), Question NLI\\n(QNLI) ( Rajpurkar et al. ,2016 ), Recognizing Textual\\nEntailment (RTE) ( Dagan et al. ,2006 ;Bar-Haim et al. ,\\n2006 ;Giampiccolo et al. ,2007 ;Bentivogli et al. ,2009 ) and\\nWinograd NLI (WNLI) ( Levesque et al. ,2011 ). V2.0 some questions are not answered in the pro-\\nvided context, making the task more challenging. For SQuAD V1.1 we adopt the same span pre-\\ndiction method as BERT ( Devlin et al. ,2019 ). 
For\\nSQuAD V2.0, we add an additional binary classi-\\nfier to predict whether the question is answerable,\\nwhich we train jointly by summing the classifica-\\ntion and span loss terms. During evaluation, we\\nonly predict span indices on pairs that are classi-\\nfied as answerable. RACE The ReAding Comprehension from Ex-\\naminations (RACE) ( Lai et al. ,2017 ) task is a\\nlarge-scale reading comprehension dataset with\\nmore than 28,000 passages and nearly 100,000\\nquestions. The dataset is collected from English\\nexaminations in China, which are designed for\\nmiddle and high school students. In RACE, each\\npassage is associated with multiple questions. For\\nevery question, the task is to select one correct an-\\nswer from four options. RACE has significantly\\nlonger context than other popular reading compre-\\nhension datasets and the proportion of questions\\nthat requires reasoning is very large. 4 Training Procedure Analysis\\nThis section explores and quantifies which choices\\nare important for successfully pretraining BERT\\nmodels. We keep the model architecture fixed.7\\nSpecifically, we begin by training BERT models\\nwith the same configuration as BERT BASE (L=\\n12,H= 768 ,A= 12 , 110M params). 4.1 Static vs. Dynamic Masking\\nAs discussed in Section 2, BERT relies on ran-\\ndomly masking and predicting tokens. The orig-\\ninal BERT implementation performed masking\\nonce during data preprocessing, resulting in a sin-\\nglestatic mask. To avoid using the same mask for\\neach training instance in every epoch, training data\\nwas duplicated 10 times so that each sequence is\\nmasked in 10 different ways over the 40 epochs of\\ntraining. Thus, each training sequence was seen\\nwith the same mask four times during training. We compare this strategy with dynamic mask-\\ningwhere we generate the masking pattern every\\ntime we feed a sequence to the model. This be-\\ncomes crucial when pretraining for more steps or\\nwith larger datasets. 7Studying architectural changes, including larger archi-\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2\\nreference 76.3 84.3 92.8\\nOur reimplementation:\\nstatic 78.3 84.3 92.5\\ndynamic 78.7 84.0 92.9\\nTable 1: Comparison between static and dynamic\\nmasking for BERT BASE. We report F1 for SQuAD and\\naccuracy for MNLI-m and SST-2. Reported results are\\nmedians over 5 random initializations (seeds). Refer-\\nence results are from Yang et al. (2019 ). Results Table 1compares the published\\nBERT BASE results from Devlin et al. (2019 ) to our\\nreimplementation with either static or dynamic\\nmasking. We find that our reimplementation\\nwith static masking performs similar to the\\noriginal BERT model, and dynamic masking is\\ncomparable or slightly better than static masking. Given these results and the additional efficiency\\nbenefits of dynamic masking, we use dynamic\\nmasking in the remainder of the experiments. 4.2 Model Input Format and Next Sentence\\nPrediction\\nIn the original BERT pretraining procedure, the\\nmodel observes two concatenated document seg-\\nments, which are either sampled contiguously\\nfrom the same document (with p= 0.5) or from\\ndistinct documents. In addition to the masked lan-\\nguage modeling objective, the model is trained to\\npredict whether the observed document segments\\ncome from the same or distinct documents via an\\nauxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an impor-\\ntant factor in training the original BERT model. 
Devlin et al. (2019 ) observe that removing NSP\\nhurts performance, with significant performance\\ndegradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the\\nnecessity of the NSP loss ( Lample and Conneau ,\\n2019 ;Yang et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",2019 ;Joshi et al. ,2019 ). To better understand this discrepancy, we com-\\npare several alternative training formats:\\n•SEGMENT -PAIR +NSP: This follows the original\\ninput format used in BERT ( Devlin et al. ,2019 ),\\nwith the NSP loss. Each input has a pair of seg-\\nments, which can each contain multiple natural\\nsentences, but the total combined length must\\nbe less than 512 tokens. Model SQuAD 1.1/2.0 MNLI-m SST-2 RACE\\nOur reimplementation (with NSP loss):\\nSEGMENT -PAIR 90.4/78.7 84.0 92.9 64.2\\nSENTENCE -PAIR 88.7/76.2 82.9 92.1 63.0\\nOur reimplementation (without NSP loss):\\nFULL -SENTENCES 90.4/79.1 84.7 92.5 64.8\\nDOC-SENTENCES 90.6/79.7 84.7 92.7 65.6\\nBERT BASE 88.5/76.3 84.3 92.8 64.3\\nXLNet BASE (K = 7) –/81.3 85.8 92.7 66.1\\nXLNet BASE (K = 6) –/81.0 85.6 93.4 66.7\\nTable 2: Development set results for base models pretrained over B OOK CORPUS and W IKIPEDIA . All models are\\ntrained for 1M steps with a batch size of 256 sequences. We rep ort F1 for SQuAD and accuracy for MNLI-m,\\nSST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT BASEand\\nXLNet BASEare from Yang et al. (2019 ). •SENTENCE -PAIR +NSP: Each input contains a\\npair of natural sentences , either sampled from\\na contiguous portion of one document or from\\nseparate documents. Since these inputs are sig-\\nnificantly shorter than 512 tokens, we increase\\nthe batch size so that the total number of tokens\\nremains similar to SEGMENT -PAIR +NSP. We re-\\ntain the NSP loss. •FULL -SENTENCES : Each input is packed with\\nfull sentences sampled contiguously from one\\nor more documents, such that the total length is\\nat most 512 tokens. Inputs may cross document\\nboundaries. When we reach the end of one doc-\\nument, we begin sampling sentences from the\\nnext document and add an extra separator token\\nbetween documents. We remove the NSP loss. •DOC-SENTENCES : Inputs are constructed sim-\\nilarly to FULL -SENTENCES , except that they\\nmay not cross document boundaries. Inputs\\nsampled near the end of a document may be\\nshorter than 512 tokens, so we dynamically in-\\ncrease the batch size in these cases to achieve\\na similar number of total tokens as FULL -\\nSENTENCES .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"We remove the NSP loss. Results Table 2shows results for the four dif-\\nferent settings. We first compare the original\\nSEGMENT -PAIR input format from Devlin et al. (2019 ) to the SENTENCE -PAIR format; both for-\\nmats retain the NSP loss, but the latter uses sin-\\ngle sentences. We find that using individual\\nsentences hurts performance on downstream\\ntasks , which we hypothesize is because the model\\nis not able to learn long-range dependencies.We next compare training without the NSP\\nloss and training with blocks of text from a sin-\\ngle document ( DOC-SENTENCES ). 
We find that\\nthis setting outperforms the originally published\\nBERT BASEresults and that removing the NSP loss\\nmatches or slightly improves downstream task\\nperformance , in contrast to Devlin et al. (2019 ). It is possible that the original BERT implementa-\\ntion may only have removed the loss term while\\nstill retaining the SEGMENT -PAIR input format. Finally we find that restricting sequences to\\ncome from a single document ( DOC-SENTENCES )\\nperforms slightly better than packing sequences\\nfrom multiple documents ( FULL -SENTENCES ). However, because the DOC-SENTENCES format\\nresults in variable batch sizes, we use FULL -\\nSENTENCES in the remainder of our experiments\\nfor easier comparison with related work. 4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased\\nappropriately ( Ott et al. ,2018 ). Recent work has\\nshown that BERT is also amenable to large batch\\ntraining ( You et al. ,2019 ). Devlin et al. (2019 ) originally trained\\nBERT BASE for 1M steps with a batch size of\\n256 sequences. This is equivalent in computa-\\ntional cost, via gradient accumulation, to training\\nfor 125K steps with a batch size of 2K sequences,\\nor for 31K steps with a batch size of 8K. In Table 3we compare perplexity and end- bsz steps lr ppl MNLI-m SST-2\\n256 1M 1e-4 3.99 84.7 92.7\\n2K 125K 7e-4 3.68 85.2 92.9\\n8K 31K 1e-3 3.77 84.6 92.8\\nTable 3: Perplexity on held-out training data ( ppl) and\\ndevelopment set accuracy for base models trained over\\nBOOK CORPUS and W IKIPEDIA with varying batch\\nsizes ( bsz).\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"We tune the learning rate ( lr) for each set-\\nting. Models make the same number of passes over the\\ndata (epochs) and have the same computational cost. task performance of BERT BASE as we increase the\\nbatch size, controlling for the number of passes\\nthrough the training data. We observe that train-\\ning with large batches improves perplexity for the\\nmasked language modeling objective, as well as\\nend-task accuracy. Large batches are also easier to\\nparallelize via distributed data parallel training,8\\nand in later experiments we train with batches of\\n8K sequences. Notably You et al. (2019 ) train BERT with even\\nlarger batche sizes, up to 32K sequences. We leave\\nfurther exploration of the limits of large batch\\ntraining to future work. 4.4 Text Encoding\\nByte-Pair Encoding (BPE) ( Sennrich et al. ,2016 )\\nis a hybrid between character- and word-level rep-\\nresentations that allows handling the large vocab-\\nularies common in natural language corpora. In-\\nstead of full words, BPE relies on subwords units,\\nwhich are extracted by performing statistical anal-\\nysis of the training corpus. BPE vocabulary sizes typically range from\\n10K-100K subword units. However, unicode char-\\nacters can account for a sizeable portion of this\\nvocabulary when modeling large and diverse cor-\\npora, such as the ones considered in this work. Radford et al. (2019 ) introduce a clever imple-\\nmentation of BPE that uses bytes instead of uni-\\ncode characters as the base subword units. 
Using\\nbytes makes it possible to learn a subword vocab-\\nulary of a modest size (50K units) that can still en-\\ncode any input text without introducing any “un-\\nknown” tokens. 8Large batch training can improve training efficiency even\\nwithout large scale parallel hardware through gradient ac-\\ncumulation , whereby gradients from multiple mini-batches\\nare accumulated locally before each optimization step. Thi s\\nfunctionality is supported natively in FAIRSEQ (Ott et al. ,\\n2019 ).The original BERT implementa-\\ntion ( Devlin et al. ,2019 ) uses a character-level\\nBPE vocabulary of size 30K, which is learned\\nafter preprocessing the input with heuristic tok-\\nenization rules. Following Radford et al. (2019 ),\\nwe instead consider training BERT with a larger\\nbyte-level BPE vocabulary containing 50K sub-\\nword units, without any additional preprocessing\\nor tokenization of the input. This adds approxi-\\nmately 15M and 20M additional parameters for\\nBERT BASEand BERT LARGE , respectively. Early experiments revealed only slight dif-\\nferences between these encodings, with the\\nRadford et al. (2019 ) BPE achieving slightly\\nworse end-task performance on some tasks. Nev-\\nertheless, we believe the advantages of a univer-\\nsal encoding scheme outweighs the minor degre-\\ndation in performance and use this encoding in\\nthe remainder of our experiments. A more de-\\ntailed comparison of these encodings is left to fu-\\nture work. 5 RoBERTa\\nIn the previous section we propose modifications\\nto the BERT pretraining procedure that improve\\nend-task performance. We now aggregate these\\nimprovements and evaluate their combined im-\\npact. We call this configuration RoBERTa for\\nRobustly optimized BERT approach. Specifi-\\ncally, RoBERTa is trained with dynamic mask-\\ning (Section 4.1),FULL -SENTENCES without NSP\\nloss (Section 4.2), large mini-batches (Section 4.3)\\nand a larger byte-level BPE (Section 4.4). Additionally, we investigate two other impor-\\ntant factors that have been under-emphasized in\\nprevious work: (1) the data used for pretraining,\\nand (2) the number of training passes through the\\ndata. For example, the recently proposed XLNet\\narchitecture ( Yang et al. ,2019 ) is pretrained us-\\ning nearly 10 times more data than the original\\nBERT ( Devlin et al. ,2019 ). It is also trained with\\na batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT. To help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa\\nfollowing the BERT LARGE architecture ( L= 24 ,\\nH= 1024 ,A= 16 , 355M parameters). 
We\\npretrain for 100K steps over a comparable B OOK -\\nCORPUS plus W IKIPEDIA dataset as was used in Model data bsz stepsSQuADMNLI-m SST-2(v1.1/2.0)\\nRoBERTa\\nwith B OOKS + W IKI 16GB 8K 100K 93.6/87.3 89.0 95.3\\n+ additional data ( §3.2) 160GB 8K 100K 94.0/87.7 89.3 95.6\\n+ pretrain longer 160GB 8K 300K 94.4/88.7 90.0 96.1\\n+ pretrain even longer 160GB 8K 500K 94.6/89.4 90.2 96.4\\nBERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\\nfor longer (100K →300K→500K steps).\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Each row accumulates improvements from the row s above. RoBERTa\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from\\nDevlin et al. (2019 ) and Yang et al. (2019 ), respectively. Complete results on all GLUE tasks can be fo und in the\\nAppendix.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Devlin et al. (2019 ). We pretrain our model using\\n1024 V100 GPUs for approximately one day. Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4. Next, we combine this data with the three ad-\\nditional datasets described in Section 3.2. We\\ntrain RoBERTa over the combined data with the\\nsame number of training steps as before (100K). In total, we pretrain over 160GB of text. We ob-\\nserve further improvements in performance across\\nall downstream tasks, validating the importance of\\ndata size and diversity in pretraining.9\\nFinally, we pretrain RoBERTa for significantly\\nlonger, increasing the number of pretraining steps\\nfrom 100K to 300K, and then further to 500K. We\\nagain observe significant gains in downstream task\\nperformance, and the 300K and 500K step mod-\\nels outperform XLNet LARGE across most tasks. We\\nnote that even our longest-trained model does not\\nappear to overfit our data and would likely benefit\\nfrom additional training. In the rest of the paper, we evaluate our best\\nRoBERTa model on the three different bench-\\nmarks: GLUE, SQuaD and RACE. Specifically\\n9Our experiments conflate increases in data size and di-\\nversity. We leave a more careful analysis of these two dimen-\\nsions to future work.we consider RoBERTa trained for 500K steps over\\nall five of the datasets introduced in Section 3.2. 5.1 GLUE Results\\nFor GLUE we consider two finetuning settings. In the first setting ( single-task, dev ) we finetune\\nRoBERTa separately for each of the GLUE tasks,\\nusing only the training data for the correspond-\\ning task. We consider a limited hyperparameter\\nsweep for each task, with batch sizes ∈ {16,32}\\nand learning rates ∈ {1e−5,2e−5,3e−5}, with a\\nlinear warmup for the first 6% of steps followed by\\na linear decay to 0. We finetune for 10 epochs and\\nperform early stopping based on each task’s eval-\\nuation metric on the dev set. 
The rest of the hyper-\\nparameters remain the same as during pretraining. In this setting, we report the median development\\nset results for each task over five random initial-\\nizations, without model ensembling. In the second setting ( ensembles, test ), we com-\\npare RoBERTa to other approaches on the test set\\nvia the GLUE leaderboard. While many submis-\\nsions to the GLUE leaderboard depend on multi-\\ntask finetuning, our submission depends only on\\nsingle-task finetuning . For RTE, STS and MRPC\\nwe found it helpful to finetune starting from the\\nMNLI single-task model, rather than the baseline\\npretrained RoBERTa. We explore a slightly wider\\nhyperparameter space, described in the Appendix,\\nand ensemble between 5 and 7 models per task. MNLI QNLI QQP RTE SST MRPC CoLA STS WNLI Avg\\nSingle-task single models on dev\\nBERT LARGE 86.6/- 92.3 91.3 70.4 93.2 88.0 60.6 90.0 - -\\nXLNet LARGE 89.8/- 93.9 91.8 83.8 95.6 89.2 63.6 91.8 - -\\nRoBERTa 90.2/90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4 91.3 -\\nEnsembles on test (from leaderboard as of July 25, 2019)\\nALICE 88.2/87.9 95.7 90.7 83.5 95.2 92.6 68.6 91.1 80.8 86.3\\nMT-DNN 87.9/87.4 96.0 89.9 86.3 96.5 92.7 68.4 91.1 89.0 87.6\\nXLNet 90.2/89.8 98.6 90.3 86.3 96.8 93.0 67.8 91.6 90.4 88.4\\nRoBERTa 90.8/90.2 98.9 90.2 88.2 96.7 92.3 67.8 92.2 89.0 88.5\\nTable 5: Results on GLUE. All results are based on a 24-layer a rchitecture. BERT LARGE and XLNet LARGE results\\nare from Devlin et al. (2019 ) and Yang et al. (2019 ), respectively. RoBERTa results on the development set are a\\nmedian over five runs. RoBERTa results on the test set are ense mbles of single-task models. For RTE, STS and\\nMRPC we finetune starting from the MNLI model instead of the ba seline pretrained model. Averages are obtained\\nfrom the GLUE leaderboard. Task-specific modifications Two of the GLUE\\ntasks require task-specific finetuning approaches\\nto achieve competitive leaderboard results. QNLI : Recent submissions on the GLUE\\nleaderboard adopt a pairwise ranking formulation\\nfor the QNLI task, in which candidate answers\\nare mined from the training set and compared to\\none another, and a single (question, candidate)\\npair is classified as positive ( Liu et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",2019b ,a;\\nYang et al. ,2019 ). This formulation significantly\\nsimplifies the task, but is not directly comparable\\nto BERT ( Devlin et al. ,2019 ). Following recent\\nwork, we adopt the ranking approach for our test\\nsubmission, but for direct comparison with BERT\\nwe report development set results based on a pure\\nclassification approach. WNLI : We found the provided NLI-format\\ndata to be challenging to work with. Instead\\nwe use the reformatted WNLI data from Super-\\nGLUE ( Wang et al. ,2019a ), which indicates the\\nspan of the query pronoun and referent. We fine-\\ntune RoBERTa using the margin ranking loss from\\nKocijan et al. (2019 ). For a given input sentence,\\nwe use spaCy ( Honnibal and Montani ,2017 ) to\\nextract additional candidate noun phrases from the\\nsentence and finetune our model so that it assigns\\nhigher scores to positive referent phrases than for\\nany of the generated negative candidate phrases. 
One unfortunate consequence of this formulation\\nis that we can only make use of the positive train-\\ning examples, which excludes over half of the pro-\\nvided training examples.10\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\nfirst setting ( single-task, dev ), RoBERTa achieves\\nstate-of-the-art results on all 9 of the GLUE\\ntask development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-\\ntraining objective, compared to more mundane de-\\ntails like dataset size and training time that we ex-\\nplore in this work. In the second setting ( ensembles, test ), we\\nsubmit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-\\nther improve these results by incorporating more\\nsophisticated multi-task finetuning procedures. 5.2 SQuAD Results\\nWe adopt a much simpler approach for SQuAD\\ncompared to past work. In particular, while\\nboth BERT ( Devlin et al. ,2019 ) and XL-\\nNet ( Yang et al. ,2019 ) augment their training data\\nwith additional QA datasets, we only finetune\\nRoBERTa using the provided SQuAD training\\ndata .Yang et al. (2019 ) also employed a custom\\nlayer-wise learning rate schedule to finetune\\nresults could potentially be improved by augmenting this wi th\\nadditional pronoun disambiguation datasets. ModelSQuAD 1.1 SQuAD 2.0\\nEM F1 EM F1\\nSingle models on dev, w/o data augmentation\\nBERT LARGE 84.1 90.9 79.0 81.8\\nXLNet LARGE 89.0 94.5 86.1 88.8\\nRoBERTa 88.9 94.6 86.5 89.4\\nSingle models on test (as of July 25, 2019)\\nXLNet LARGE 86.3†89.1†\\nRoBERTa 86.8 89.8\\nXLNet + SG-Net Verifier 87.0†89.9†\\nTable 6: Results on SQuAD. †indicates results that de-\\npend on additional external training data. RoBERTa\\nuses only the provided SQuAD data in both dev and\\ntest settings. BERT LARGE and XLNet LARGE results are\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-\\nspectively. XLNet, while we use the same learning rate for\\nall layers. For SQuAD v1.1 we follow the same finetun-\\ning procedure as Devlin et al. (2019 ). For SQuAD\\nv2.0, we additionally classify whether a given\\nquestion is answerable; we train this classifier\\njointly with the span predictor by summing the\\nclassification and span loss terms. Results We present our results in Table 6. On\\nthe SQuAD v1.1 development set, RoBERTa\\nmatches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\nadditional external training data. In contrast, our\\nsubmission does not use any additional data. Our single RoBERTa model outperforms all but\\none of the single model submissions, and is the\\ntop scoring system among those that do not rely\\non data augmentation. 
5.3 RACE Results\\nIn RACE, systems are provided with a passage of\\ntext, an associated question, and four candidate an-\\nswers. Systems are required to classify which of\\nthe four candidate answers is correct. We modify RoBERTa for this task by concate-Model Accuracy Middle High\\nSingle models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ). nating each candidate answer with the correspond-\\ning question and passage. We then encode each of\\nthese four sequences and pass the resulting [CLS]\\nrepresentations through a fully-connected layer,\\nwhich is used to predict the correct answer. We\\ntruncate question-answer pairs that are longer than\\n128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens. Results on the RACE test sets are presented in\\nTable 7. RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings. 6 Related Work\\nPretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;\\nLample and Conneau ,2019 ). Many recent\\npapers have used a basic recipe of finetuning\\nmodels for each end task ( Howard and Ruder ,\\n2018 ;Radford et al. ,2018 ), and pretraining\\nwith some variant of a masked language model\\nobjective. However, newer methods have\\nimproved performance by multi-task fine tun-\\ning ( Dong et al. ,2019 ), incorporating entity\\nembeddings ( Sun et al. ,2019 ), span predic-\\ntion ( Joshi et al. ,2019 ), and multiple variants\\nof autoregressive pretraining ( Song et al. ,2019 ;\\nChan et al. ,2019 ;Yang et al. ,2019 ). Perfor-\\nmance is also typically improved by training\\nbigger models on more data ( Devlin et al.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \",\\n2019 ;Baevski et al. ,2019 ;Yang et al. ,2019 ;\\nRadford et al. ,2019 ). Our goal was to replicate,\\nsimplify, and better tune the training of BERT,\\nas a reference point for better understanding the\\nrelative performance of all of these methods. 7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-\\nquences; and dynamically changing the masking\\npattern applied to the training data. Our improved\\npretraining procedure, which we call RoBERTa,\\nachieves state-of-the-art results on GLUE, RACE\\nand SQuAD, without multi-task finetuning for\\nGLUE or additional data for SQuAD. These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives. We additionally use a novel dataset,\\nCC-N EWS, and release our models and\\ncode for pretraining and finetuning at:\\nhttps://github.com/pytorch/fairseq . References\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\ntowski, editors. 2007. 
Proceedings of the Fourth\\nInternational Workshop on Semantic Evaluations\\n(SemEval-2007) . Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke\\nZettlemoyer, and Michael Auli. 2019. Cloze-\\ndriven pretraining of self-attention networks. arXiv\\npreprint arXiv:1903.07785 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,\\nDanilo Giampiccolo, Bernardo Magnini, and Idan\\nSzpektor. 2006. The second PASCAL recognising\\ntextual entailment challenge. In Proceedings of the\\nsecond PASCAL challenges workshop on recognis-\\ning textual entailment . Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo\\nGiampiccolo, and Bernardo Magnini. 2009. The\\nfifth PASCAL recognizing textual entailment chal-\\nlenge. Samuel R Bowman, Gabor Angeli, Christopher Potts,\\nand Christopher D Manning. 2015. A large anno-\\ntated corpus for learning natural language inference. InEmpirical Methods in Natural Language Process-\\ning (EMNLP) .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"William Chan, Nikita Kitaev, Kelvin Guu, Mitchell\\nStern, and Jakob Uszkoreit. 2019. KERMIT: Gener-\\native insertion-based modeling for sequences.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"arXiv\\npreprint arXiv:1906.01604 .Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment\\nchallenge. In Machine learning challenges. evalu-\\nating predictive uncertainty, visual object classifica-\\ntion, and recognising tectual entailment . Andrew M Dai and Quoc V Le. 2015. Semi-supervised\\nsequence learning. In Advances in Neural Informa-\\ntion Processing Systems (NIPS) . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. In North American Association for Com-\\nputational Linguistics (NAACL) . William B Dolan and Chris Brockett. 2005. Auto-\\nmatically constructing a corpus of sentential para-\\nphrases. In Proceedings of the International Work-\\nshop on Paraphrasing . Li Dong, Nan Yang, Wenhui Wang, Furu Wei,\\nXiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. The third PASCAL recog-\\nnizing textual entailment challenge. In Proceedings\\nof the ACL-PASCAL workshop on textual entailment\\nand paraphrasing . Aaron Gokaslan and Vanya Cohen. 2019. Openweb-\\ntext corpus. http://web.archive.org/\\nsave/http://Skylion007.github.io/\\nOpenWebTextCorpus .\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Felix Hamborg, Norman Meuschke, Corinna Bre-\\nitinger, and Bela Gipp. 2017. 
news-please: A\ngeneric news crawler and extractor. In Proceedings\nof the 15th International Symposium of Information\nScience .\"\n",
"}\n",
"\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n",
"\u001b[0m{\n",
" \"doc\": \"Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan\\nSalakhutdinov, Raquel Urtasun, Antonio Torralba,\\nand Sanja Fidler. 2015. Aligning books and movies:\\nTowards story-like visual explanations by watch-\\ning movies and reading books. In arXiv preprint\\narXiv:1506.06724 . Appendix for “RoBERTa: A Robustly\\nOptimized BERT Pretraining Approach”\\nA Full results on GLUE\\nIn Table 8we present the full set of development\\nset results for RoBERTa. 
We present results for\\naLARGE configuration that follows BERT LARGE ,\\nas well as a BASE configuration that follows\\nBERT BASE.B Pretraining Hyperparameters\\nTable 9describes the hyperparameters for pre-\\ntraining of RoBERTa LARGE and RoBERTa BASE\\nC Finetuning Hyperparameters\\nFinetuning hyperparameters for RACE, SQuAD\\nand GLUE are given in Table 10. We select the\\nbest hyperparameter values based on the median\\nof 5 random seeds for each task. MNLI QNLI QQP RTE SST MRPC CoLA STS\\nRoBERTa BASE\\n+ all data + 500k steps 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2\\nRoBERTa LARGE\\nwith B OOKS + W IKI 89.0 93.9 91.9 84.5 95.3 90.2 66.3 91.6\\n+ additional data ( §3.2) 89.3 94.0 92.0 82.7 95.6 91.4 66.1 92.2\\n+ pretrain longer 300k 90.0 94.5 92.2 83.3 96.1 91.1 67.4 92.3\\n+ pretrain longer 500k 90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4\\nTable 8: Development set results on GLUE tasks for various co nfigurations of RoBERTa. Hyperparam RoBERTa LARGE RoBERTa BASE\\nNumber of Layers 24 12\\nHidden size 1024 768\\nFFN inner hidden size 4096 3072\\nAttention heads 16 12\\nAttention head size 64 64\\nDropout 0.1 0.1\\nAttention Dropout 0.1 0.1\\nWarmup Steps 30k 24k\\nPeak Learning Rate 4e-4 6e-4\\nBatch Size 8k 8k\\nWeight Decay 0.01 0.01\\nMax Steps 500k 500k\\nLearning Rate Decay Linear Linear\\nAdamǫ 1e-6 1e-6\\nAdamβ1 0.9 0.9\\nAdamβ2 0.98 0.98\\nGradient Clipping 0.0 0.0\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE. Hyperparam RACE SQuAD GLUE\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }\\nBatch Size 16 48 {16, 32}\\nWeight Decay 0.1 0.01 0.1\\nMax Epochs 4 2 10\\nLearning Rate Decay Linear Linear Linear\\nWarmup ratio 0.06 0.06 0.06\\nTable 10: Hyperparameters for finetuning RoBERTa LARGE on RACE, SQuAD and GLUE.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] Entering Chain run with input:\n", "\u001b[0m{\n", " \"doc\": \"Allen School of Computer Science & Engineering,\\nUniversity of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-\\nparison between different approaches is chal-\\nlenging. Training is computationally expen-\\nsive, often done on private datasets of different\\nsizes, and, as we will show, hyperparameter\\nchoices have significant impact on the final re-\\nsults. We present a replication study of BERT\\npretraining ( Devlin et al. ,2019 ) that carefully\\nmeasures the impact of many key hyperparam-\\neters and training data size. We find that BERT\\nwas significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. 
We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects. When controlling for training data, our im-\\nproved training procedure improves upon the pub-\\nlished BERT results on both GLUE and SQuAD. When trained for longer over additional data, our\\nmodel achieves a score of 88.5 on the public\\nGLUE leaderboard, matching the 88.4 reported\\nbyYang et al. (2019 ). Our model establishes a\\nnew state-of-the-art on 4/9 of the GLUE tasks:\\nMNLI, QNLI, RTE and STS-B. We also match\\nstate-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked lan-\\nguage model training objective is competitive\\nwith other recently proposed training objectives\\nsuch as perturbed autoregressive language model-\\ning (Yang et al. ,2019 ).2\\nIn summary, the contributions of this paper\\nare: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work. alternatives that lead to better downstream task\\nperformance; (2) We use a novel dataset, CC-\\nNEWS, and confirm that using more data for pre-\\ntraining further improves performance on down-\\nstream tasks; (3) Our training improvements show\\nthat masked language model pretraining, under\\nthe right design choices, is competitive with all\\nother recently published methods. We release our\\nmodel, pretraining and fine-tuning code imple-\\nmented in PyTorch ( Paszke et al. ,2017 ). 2 Background\\nIn this section, we give a brief overview of the\\nBERT ( Devlin et al. ,2019 ) pretraining approach\\nand some of the training choices that we will ex-\\namine experimentally in the following section. 2.1 Setup\\nBERT takes as input a concatenation of two\\nsegments (sequences of tokens), x1,...,x N\\nandy1,...,yM. Segments usually consist of\\nmore than one natural sentence. The two seg-\\nments are presented as a single input sequence\\nto BERT with special tokens delimiting them:\\n[CLS],x1,...,x N,[SEP],y1,...,yM,[EOS]. MandNare constrained such that M+N < T ,\\nwhereTis a parameter that controls the maximum\\nsequence length during training. The model is first pretrained on a large unla-\\nbeled text corpus and subsequently finetuned us-\\ning end-task labeled data. 2.2 Architecture\\nBERT uses the now ubiquitous transformer archi-\\ntecture ( Vaswani et al. ,2017 ), which we will not\\nreview in detail. We use a transformer architecture\\nwithLlayers. Each block uses Aself-attention\\nheads and hidden dimension H. 2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction. Masked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. The MLM objective is a cross-entropy\\nloss on predicting the masked tokens. BERT uni-\\nformly selects 15% of the input tokens for possi-\\nble replacement. Of the selected tokens, 80% are\\nreplaced with [MASK], 10% are left unchanged,and 10% are replaced by a randomly selected vo-\\ncabulary token. In the original implementation, random mask-\\ning and replacement is performed once in the be-\\nginning and saved for the duration of training, al-\\nthough in practice, data is duplicated so the mask\\nis not always the same for every training sentence\\n(see Section 4.1). 
Next Sentence Prediction (NSP) NSP is a bi-\\nnary classification loss for predicting whether two\\nsegments follow each other in the original text. Positive examples are created by taking consecu-\\ntive sentences from the text corpus. Negative ex-\\namples are created by pairing segments from dif-\\nferent documents. Positive and negative examples\\nare sampled with equal probability. The NSP objective was designed to improve\\nperformance on downstream tasks, such as Natural\\nLanguage Inference ( Bowman et al. ,2015 ), which\\nrequire reasoning about the relationships between\\npairs of sentences. 2.4 Optimization\\nBERT is optimized with Adam ( Kingma and Ba ,\\n2015 ) using the following parameters: β1= 0.9,\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\ncay of0.01. The learning rate is warmed up\\nover the first 10,000 steps to a peak value of\\n1e-4, and then linearly decayed. BERT trains\\nwith a dropout of 0.1 on all layers and at-\\ntention weights, and a GELU activation func-\\ntion ( Hendrycks and Gimpel ,2016 ). Models are\\npretrained for S=1,000,000 updates, with mini-\\nbatches containing B=256 sequences of maxi-\\nmum length T=512 tokens. 2.5 Data\\nBERT is trained on a combination of B OOK COR-\\nPUS (Zhu et al. ,2015 ) plus English W IKIPEDIA ,\\nwhich totals 16GB of uncompressed text.3\\n3 Experimental Setup\\nIn this section, we describe the experimental setup\\nfor our replication study of BERT. 3.1 Implementation\\nWe reimplement BERT in FAIRSEQ (Ott et al. ,\\n2019 ). We primarily follow the original BERT\\n3Yang et al. (2019 ) use the same dataset but report having\\nonly 13GB of text after data cleaning. This is most likely due\\nto subtle differences in cleaning of the Wikipedia data. optimization hyperparameters, given in Section 2,\\nexcept for the peak learning rate and number of\\nwarmup steps, which are tuned separately for each\\nsetting. We additionally found training to be very\\nsensitive to the Adam epsilon term, and in some\\ncases we obtained better performance or improved\\nstability after tuning it. Similarly, we found setting\\nβ2= 0.98to improve stability when training with\\nlarge batch sizes. We pretrain with sequences of at most T= 512\\ntokens. Unlike Devlin et al. (2019 ), we do not ran-\\ndomly inject short sequences, and we do not train\\nwith a reduced sequence length for the first 90% of\\nupdates. We train only with full-length sequences. We train with mixed precision floating point\\narithmetic on DGX-1 machines, each with 8 ×\\n32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ). 3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. (2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. Several efforts\\nhave trained on datasets larger and more diverse\\nthan the original BERT ( Radford et al.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \",2019 ;\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\nnately, not all of the additional datasets can be\\npublicly released. For our study, we focus on gath-\\nering as much data as possible for experimenta-\\ntion, allowing us to match the overall quality and\\nquantity of data as appropriate for each compari-\\nson. 
We consider five English-language corpora of\\nvarying sizes and domains, totaling over 160GB\\nof uncompressed text. We use the following text\\ncorpora:\\n•BOOK CORPUS (Zhu et al. ,2015 ) plus English\\nWIKIPEDIA . This is the original data used to\\ntrain BERT. (16GB). •CC-N EWS, which we collected from the En-\\nglish portion of the CommonCrawl News\\ndataset ( Nagel ,2016 ). The data contains 63\\nmillion English news articles crawled between\\nSeptember 2016 and February 2019. (76GB af-\\nter filtering).4\\n•OPENWEBTEXT (Gokaslan and Cohen ,2019 ),\\nan open-source recreation of the WebText cor-\\n4We usenews-please (Hamborg et al. ,2017 ) to col-\\nlect and extract CC-N EWS. CC-N EWS is similar to the R E-\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\nis web content extracted from URLs shared on\\nReddit with at least three upvotes. (38GB).5\\n•STORIES , a dataset introduced in Trinh and Le\\n(2018 ) containing a subset of CommonCrawl\\ndata filtered to match the story-like style of\\nWinograd schemas. (31GB). 3.3 Evaluation\\nFollowing previous work, we evaluate our pre-\\ntrained models on downstream tasks using the fol-\\nlowing three benchmarks. GLUE The General Language Understand-\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\n2019b ) is a collection of 9 datasets for evaluating\\nnatural language understanding systems.6Tasks\\nare framed as either single-sentence classification\\nor sentence-pair classification tasks. The GLUE\\norganizers provide training and development data\\nsplits as well as a submission server and leader-\\nboard that allows participants to evaluate and com-\\npare their systems on private held-out test data. For the replication study in Section 4, we report\\nresults on the development sets after finetuning\\nthe pretrained models on the corresponding single-\\ntask training data (i.e., without multi-task training\\nor ensembling). Our finetuning procedure follows\\nthe original BERT paper ( Devlin et al. ,2019 ). In Section 5we additionally report test set re-\\nsults obtained from the public leaderboard. These\\nresults depend on a several task-specific modifica-\\ntions, which we describe in Section 5.1. SQuAD The Stanford Question Answering\\nDataset (SQuAD) provides a paragraph of context\\nand a question. The task is to answer the question\\nby extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1\\nand V2.0 ( Rajpurkar et al. ,2016 ,2018 ). In V1.1\\nthe context always contains an answer, whereas in\\n5The authors and their affiliated institutions are not in any\\nway affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA ( Warstadt et al. ,2018 ),\\nStanford Sentiment Treebank (SST) ( Socher et al. ,\\n2013 ), Microsoft Research Paragraph Corpus\\n(MRPC) ( Dolan and Brockett ,2005 ), Semantic Tex-\\ntual Similarity Benchmark (STS) ( Agirre et al. ,2007 ),\\nQuora Question Pairs (QQP) ( Iyer et al. ,2016 ), Multi-\\nGenre NLI (MNLI) ( Williams et al. ,2018 ), Question NLI\\n(QNLI) ( Rajpurkar et al. ,2016 ), Recognizing Textual\\nEntailment (RTE) ( Dagan et al. ,2006 ;Bar-Haim et al. ,\\n2006 ;Giampiccolo et al. ,2007 ;Bentivogli et al. ,2009 ) and\\nWinograd NLI (WNLI) ( Levesque et al. ,2011 ). V2.0 some questions are not answered in the pro-\\nvided context, making the task more challenging. For SQuAD V1.1 we adopt the same span pre-\\ndiction method as BERT ( Devlin et al. ,2019 ). 
For\\nSQuAD V2.0, we add an additional binary classi-\\nfier to predict whether the question is answerable,\\nwhich we train jointly by summing the classifica-\\ntion and span loss terms. During evaluation, we\\nonly predict span indices on pairs that are classi-\\nfied as answerable. RACE The ReAding Comprehension from Ex-\\naminations (RACE) ( Lai et al. ,2017 ) task is a\\nlarge-scale reading comprehension dataset with\\nmore than 28,000 passages and nearly 100,000\\nquestions. The dataset is collected from English\\nexaminations in China, which are designed for\\nmiddle and high school students. In RACE, each\\npassage is associated with multiple questions. For\\nevery question, the task is to select one correct an-\\nswer from four options. RACE has significantly\\nlonger context than other popular reading compre-\\nhension datasets and the proportion of questions\\nthat requires reasoning is very large. 4 Training Procedure Analysis\\nThis section explores and quantifies which choices\\nare important for successfully pretraining BERT\\nmodels. We keep the model architecture fixed.7\\nSpecifically, we begin by training BERT models\\nwith the same configuration as BERT BASE (L=\\n12,H= 768 ,A= 12 , 110M params). 4.1 Static vs. Dynamic Masking\\nAs discussed in Section 2, BERT relies on ran-\\ndomly masking and predicting tokens. The orig-\\ninal BERT implementation performed masking\\nonce during data preprocessing, resulting in a sin-\\nglestatic mask. To avoid using the same mask for\\neach training instance in every epoch, training data\\nwas duplicated 10 times so that each sequence is\\nmasked in 10 different ways over the 40 epochs of\\ntraining. Thus, each training sequence was seen\\nwith the same mask four times during training. We compare this strategy with dynamic mask-\\ningwhere we generate the masking pattern every\\ntime we feed a sequence to the model. This be-\\ncomes crucial when pretraining for more steps or\\nwith larger datasets. 7Studying architectural changes, including larger archi-\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2\\nreference 76.3 84.3 92.8\\nOur reimplementation:\\nstatic 78.3 84.3 92.5\\ndynamic 78.7 84.0 92.9\\nTable 1: Comparison between static and dynamic\\nmasking for BERT BASE. We report F1 for SQuAD and\\naccuracy for MNLI-m and SST-2. Reported results are\\nmedians over 5 random initializations (seeds). Refer-\\nence results are from Yang et al. (2019 ). Results Table 1compares the published\\nBERT BASE results from Devlin et al. (2019 ) to our\\nreimplementation with either static or dynamic\\nmasking. We find that our reimplementation\\nwith static masking performs similar to the\\noriginal BERT model, and dynamic masking is\\ncomparable or slightly better than static masking. Given these results and the additional efficiency\\nbenefits of dynamic masking, we use dynamic\\nmasking in the remainder of the experiments. 4.2 Model Input Format and Next Sentence\\nPrediction\\nIn the original BERT pretraining procedure, the\\nmodel observes two concatenated document seg-\\nments, which are either sampled contiguously\\nfrom the same document (with p= 0.5) or from\\ndistinct documents. In addition to the masked lan-\\nguage modeling objective, the model is trained to\\npredict whether the observed document segments\\ncome from the same or distinct documents via an\\nauxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an impor-\\ntant factor in training the original BERT model. 
Devlin et al. (2019 ) observe that removing NSP\\nhurts performance, with significant performance\\ndegradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the\\nnecessity of the NSP loss ( Lample and Conneau ,\\n2019 ;Yang et al.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \",2019 ;Joshi et al. ,2019 ). To better understand this discrepancy, we com-\\npare several alternative training formats:\\n•SEGMENT -PAIR +NSP: This follows the original\\ninput format used in BERT ( Devlin et al. ,2019 ),\\nwith the NSP loss. Each input has a pair of seg-\\nments, which can each contain multiple natural\\nsentences, but the total combined length must\\nbe less than 512 tokens. Model SQuAD 1.1/2.0 MNLI-m SST-2 RACE\\nOur reimplementation (with NSP loss):\\nSEGMENT -PAIR 90.4/78.7 84.0 92.9 64.2\\nSENTENCE -PAIR 88.7/76.2 82.9 92.1 63.0\\nOur reimplementation (without NSP loss):\\nFULL -SENTENCES 90.4/79.1 84.7 92.5 64.8\\nDOC-SENTENCES 90.6/79.7 84.7 92.7 65.6\\nBERT BASE 88.5/76.3 84.3 92.8 64.3\\nXLNet BASE (K = 7) –/81.3 85.8 92.7 66.1\\nXLNet BASE (K = 6) –/81.0 85.6 93.4 66.7\\nTable 2: Development set results for base models pretrained over B OOK CORPUS and W IKIPEDIA . All models are\\ntrained for 1M steps with a batch size of 256 sequences. We rep ort F1 for SQuAD and accuracy for MNLI-m,\\nSST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT BASEand\\nXLNet BASEare from Yang et al. (2019 ). •SENTENCE -PAIR +NSP: Each input contains a\\npair of natural sentences , either sampled from\\na contiguous portion of one document or from\\nseparate documents. Since these inputs are sig-\\nnificantly shorter than 512 tokens, we increase\\nthe batch size so that the total number of tokens\\nremains similar to SEGMENT -PAIR +NSP. We re-\\ntain the NSP loss. •FULL -SENTENCES : Each input is packed with\\nfull sentences sampled contiguously from one\\nor more documents, such that the total length is\\nat most 512 tokens. Inputs may cross document\\nboundaries. When we reach the end of one doc-\\nument, we begin sampling sentences from the\\nnext document and add an extra separator token\\nbetween documents. We remove the NSP loss. •DOC-SENTENCES : Inputs are constructed sim-\\nilarly to FULL -SENTENCES , except that they\\nmay not cross document boundaries. Inputs\\nsampled near the end of a document may be\\nshorter than 512 tokens, so we dynamically in-\\ncrease the batch size in these cases to achieve\\na similar number of total tokens as FULL -\\nSENTENCES .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"We remove the NSP loss. Results Table 2shows results for the four dif-\\nferent settings. We first compare the original\\nSEGMENT -PAIR input format from Devlin et al. (2019 ) to the SENTENCE -PAIR format; both for-\\nmats retain the NSP loss, but the latter uses sin-\\ngle sentences. We find that using individual\\nsentences hurts performance on downstream\\ntasks , which we hypothesize is because the model\\nis not able to learn long-range dependencies.We next compare training without the NSP\\nloss and training with blocks of text from a sin-\\ngle document ( DOC-SENTENCES ). 
We find that\\nthis setting outperforms the originally published\\nBERT BASEresults and that removing the NSP loss\\nmatches or slightly improves downstream task\\nperformance , in contrast to Devlin et al. (2019 ). It is possible that the original BERT implementa-\\ntion may only have removed the loss term while\\nstill retaining the SEGMENT -PAIR input format. Finally we find that restricting sequences to\\ncome from a single document ( DOC-SENTENCES )\\nperforms slightly better than packing sequences\\nfrom multiple documents ( FULL -SENTENCES ). However, because the DOC-SENTENCES format\\nresults in variable batch sizes, we use FULL -\\nSENTENCES in the remainder of our experiments\\nfor easier comparison with related work. 4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased\\nappropriately ( Ott et al. ,2018 ). Recent work has\\nshown that BERT is also amenable to large batch\\ntraining ( You et al. ,2019 ). Devlin et al. (2019 ) originally trained\\nBERT BASE for 1M steps with a batch size of\\n256 sequences. This is equivalent in computa-\\ntional cost, via gradient accumulation, to training\\nfor 125K steps with a batch size of 2K sequences,\\nor for 31K steps with a batch size of 8K. In Table 3we compare perplexity and end- bsz steps lr ppl MNLI-m SST-2\\n256 1M 1e-4 3.99 84.7 92.7\\n2K 125K 7e-4 3.68 85.2 92.9\\n8K 31K 1e-3 3.77 84.6 92.8\\nTable 3: Perplexity on held-out training data ( ppl) and\\ndevelopment set accuracy for base models trained over\\nBOOK CORPUS and W IKIPEDIA with varying batch\\nsizes ( bsz).\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"We tune the learning rate ( lr) for each set-\\nting. Models make the same number of passes over the\\ndata (epochs) and have the same computational cost. task performance of BERT BASE as we increase the\\nbatch size, controlling for the number of passes\\nthrough the training data. We observe that train-\\ning with large batches improves perplexity for the\\nmasked language modeling objective, as well as\\nend-task accuracy. Large batches are also easier to\\nparallelize via distributed data parallel training,8\\nand in later experiments we train with batches of\\n8K sequences. Notably You et al. (2019 ) train BERT with even\\nlarger batche sizes, up to 32K sequences. We leave\\nfurther exploration of the limits of large batch\\ntraining to future work. 4.4 Text Encoding\\nByte-Pair Encoding (BPE) ( Sennrich et al. ,2016 )\\nis a hybrid between character- and word-level rep-\\nresentations that allows handling the large vocab-\\nularies common in natural language corpora. In-\\nstead of full words, BPE relies on subwords units,\\nwhich are extracted by performing statistical anal-\\nysis of the training corpus. BPE vocabulary sizes typically range from\\n10K-100K subword units. However, unicode char-\\nacters can account for a sizeable portion of this\\nvocabulary when modeling large and diverse cor-\\npora, such as the ones considered in this work. Radford et al. (2019 ) introduce a clever imple-\\nmentation of BPE that uses bytes instead of uni-\\ncode characters as the base subword units. 
Using\\nbytes makes it possible to learn a subword vocab-\\nulary of a modest size (50K units) that can still en-\\ncode any input text without introducing any “un-\\nknown” tokens. 8Large batch training can improve training efficiency even\\nwithout large scale parallel hardware through gradient ac-\\ncumulation , whereby gradients from multiple mini-batches\\nare accumulated locally before each optimization step. Thi s\\nfunctionality is supported natively in FAIRSEQ (Ott et al. ,\\n2019 ).The original BERT implementa-\\ntion ( Devlin et al. ,2019 ) uses a character-level\\nBPE vocabulary of size 30K, which is learned\\nafter preprocessing the input with heuristic tok-\\nenization rules. Following Radford et al. (2019 ),\\nwe instead consider training BERT with a larger\\nbyte-level BPE vocabulary containing 50K sub-\\nword units, without any additional preprocessing\\nor tokenization of the input. This adds approxi-\\nmately 15M and 20M additional parameters for\\nBERT BASEand BERT LARGE , respectively. Early experiments revealed only slight dif-\\nferences between these encodings, with the\\nRadford et al. (2019 ) BPE achieving slightly\\nworse end-task performance on some tasks. Nev-\\nertheless, we believe the advantages of a univer-\\nsal encoding scheme outweighs the minor degre-\\ndation in performance and use this encoding in\\nthe remainder of our experiments. A more de-\\ntailed comparison of these encodings is left to fu-\\nture work. 5 RoBERTa\\nIn the previous section we propose modifications\\nto the BERT pretraining procedure that improve\\nend-task performance. We now aggregate these\\nimprovements and evaluate their combined im-\\npact. We call this configuration RoBERTa for\\nRobustly optimized BERT approach. Specifi-\\ncally, RoBERTa is trained with dynamic mask-\\ning (Section 4.1),FULL -SENTENCES without NSP\\nloss (Section 4.2), large mini-batches (Section 4.3)\\nand a larger byte-level BPE (Section 4.4). Additionally, we investigate two other impor-\\ntant factors that have been under-emphasized in\\nprevious work: (1) the data used for pretraining,\\nand (2) the number of training passes through the\\ndata. For example, the recently proposed XLNet\\narchitecture ( Yang et al. ,2019 ) is pretrained us-\\ning nearly 10 times more data than the original\\nBERT ( Devlin et al. ,2019 ). It is also trained with\\na batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT. To help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa\\nfollowing the BERT LARGE architecture ( L= 24 ,\\nH= 1024 ,A= 16 , 355M parameters). 
We\\npretrain for 100K steps over a comparable B OOK -\\nCORPUS plus W IKIPEDIA dataset as was used in Model data bsz stepsSQuADMNLI-m SST-2(v1.1/2.0)\\nRoBERTa\\nwith B OOKS + W IKI 16GB 8K 100K 93.6/87.3 89.0 95.3\\n+ additional data ( §3.2) 160GB 8K 100K 94.0/87.7 89.3 95.6\\n+ pretrain longer 160GB 8K 300K 94.4/88.7 90.0 96.1\\n+ pretrain even longer 160GB 8K 500K 94.6/89.4 90.2 96.4\\nBERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\\nfor longer (100K →300K→500K steps).\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Each row accumulates improvements from the row s above. RoBERTa\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from\\nDevlin et al. (2019 ) and Yang et al. (2019 ), respectively. Complete results on all GLUE tasks can be fo und in the\\nAppendix.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Devlin et al. (2019 ). We pretrain our model using\\n1024 V100 GPUs for approximately one day. Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4. Next, we combine this data with the three ad-\\nditional datasets described in Section 3.2. We\\ntrain RoBERTa over the combined data with the\\nsame number of training steps as before (100K). In total, we pretrain over 160GB of text. We ob-\\nserve further improvements in performance across\\nall downstream tasks, validating the importance of\\ndata size and diversity in pretraining.9\\nFinally, we pretrain RoBERTa for significantly\\nlonger, increasing the number of pretraining steps\\nfrom 100K to 300K, and then further to 500K. We\\nagain observe significant gains in downstream task\\nperformance, and the 300K and 500K step mod-\\nels outperform XLNet LARGE across most tasks. We\\nnote that even our longest-trained model does not\\nappear to overfit our data and would likely benefit\\nfrom additional training. In the rest of the paper, we evaluate our best\\nRoBERTa model on the three different bench-\\nmarks: GLUE, SQuaD and RACE. Specifically\\n9Our experiments conflate increases in data size and di-\\nversity. We leave a more careful analysis of these two dimen-\\nsions to future work.we consider RoBERTa trained for 500K steps over\\nall five of the datasets introduced in Section 3.2. 5.1 GLUE Results\\nFor GLUE we consider two finetuning settings. In the first setting ( single-task, dev ) we finetune\\nRoBERTa separately for each of the GLUE tasks,\\nusing only the training data for the correspond-\\ning task. We consider a limited hyperparameter\\nsweep for each task, with batch sizes ∈ {16,32}\\nand learning rates ∈ {1e−5,2e−5,3e−5}, with a\\nlinear warmup for the first 6% of steps followed by\\na linear decay to 0. 
We finetune for 10 epochs and\\nperform early stopping based on each task’s eval-\\nuation metric on the dev set. The rest of the hyper-\\nparameters remain the same as during pretraining. In this setting, we report the median development\\nset results for each task over five random initial-\\nizations, without model ensembling. In the second setting ( ensembles, test ), we com-\\npare RoBERTa to other approaches on the test set\\nvia the GLUE leaderboard. While many submis-\\nsions to the GLUE leaderboard depend on multi-\\ntask finetuning, our submission depends only on\\nsingle-task finetuning . For RTE, STS and MRPC\\nwe found it helpful to finetune starting from the\\nMNLI single-task model, rather than the baseline\\npretrained RoBERTa. We explore a slightly wider\\nhyperparameter space, described in the Appendix,\\nand ensemble between 5 and 7 models per task. MNLI QNLI QQP RTE SST MRPC CoLA STS WNLI Avg\\nSingle-task single models on dev\\nBERT LARGE 86.6/- 92.3 91.3 70.4 93.2 88.0 60.6 90.0 - -\\nXLNet LARGE 89.8/- 93.9 91.8 83.8 95.6 89.2 63.6 91.8 - -\\nRoBERTa 90.2/90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4 91.3 -\\nEnsembles on test (from leaderboard as of July 25, 2019)\\nALICE 88.2/87.9 95.7 90.7 83.5 95.2 92.6 68.6 91.1 80.8 86.3\\nMT-DNN 87.9/87.4 96.0 89.9 86.3 96.5 92.7 68.4 91.1 89.0 87.6\\nXLNet 90.2/89.8 98.6 90.3 86.3 96.8 93.0 67.8 91.6 90.4 88.4\\nRoBERTa 90.8/90.2 98.9 90.2 88.2 96.7 92.3 67.8 92.2 89.0 88.5\\nTable 5: Results on GLUE. All results are based on a 24-layer a rchitecture. BERT LARGE and XLNet LARGE results\\nare from Devlin et al. (2019 ) and Yang et al. (2019 ), respectively. RoBERTa results on the development set are a\\nmedian over five runs. RoBERTa results on the test set are ense mbles of single-task models. For RTE, STS and\\nMRPC we finetune starting from the MNLI model instead of the ba seline pretrained model. Averages are obtained\\nfrom the GLUE leaderboard. Task-specific modifications Two of the GLUE\\ntasks require task-specific finetuning approaches\\nto achieve competitive leaderboard results. QNLI : Recent submissions on the GLUE\\nleaderboard adopt a pairwise ranking formulation\\nfor the QNLI task, in which candidate answers\\nare mined from the training set and compared to\\none another, and a single (question, candidate)\\npair is classified as positive ( Liu et al.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \",2019b ,a;\\nYang et al. ,2019 ). This formulation significantly\\nsimplifies the task, but is not directly comparable\\nto BERT ( Devlin et al. ,2019 ). Following recent\\nwork, we adopt the ranking approach for our test\\nsubmission, but for direct comparison with BERT\\nwe report development set results based on a pure\\nclassification approach. WNLI : We found the provided NLI-format\\ndata to be challenging to work with. Instead\\nwe use the reformatted WNLI data from Super-\\nGLUE ( Wang et al. ,2019a ), which indicates the\\nspan of the query pronoun and referent. We fine-\\ntune RoBERTa using the margin ranking loss from\\nKocijan et al. (2019 ). For a given input sentence,\\nwe use spaCy ( Honnibal and Montani ,2017 ) to\\nextract additional candidate noun phrases from the\\nsentence and finetune our model so that it assigns\\nhigher scores to positive referent phrases than for\\nany of the generated negative candidate phrases. 
One unfortunate consequence of this formulation\\nis that we can only make use of the positive train-\\ning examples, which excludes over half of the pro-\\nvided training examples.10\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\nfirst setting ( single-task, dev ), RoBERTa achieves\\nstate-of-the-art results on all 9 of the GLUE\\ntask development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-\\ntraining objective, compared to more mundane de-\\ntails like dataset size and training time that we ex-\\nplore in this work. In the second setting ( ensembles, test ), we\\nsubmit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-\\nther improve these results by incorporating more\\nsophisticated multi-task finetuning procedures. 5.2 SQuAD Results\\nWe adopt a much simpler approach for SQuAD\\ncompared to past work. In particular, while\\nboth BERT ( Devlin et al. ,2019 ) and XL-\\nNet ( Yang et al. ,2019 ) augment their training data\\nwith additional QA datasets, we only finetune\\nRoBERTa using the provided SQuAD training\\ndata .Yang et al. (2019 ) also employed a custom\\nlayer-wise learning rate schedule to finetune\\nresults could potentially be improved by augmenting this wi th\\nadditional pronoun disambiguation datasets. ModelSQuAD 1.1 SQuAD 2.0\\nEM F1 EM F1\\nSingle models on dev, w/o data augmentation\\nBERT LARGE 84.1 90.9 79.0 81.8\\nXLNet LARGE 89.0 94.5 86.1 88.8\\nRoBERTa 88.9 94.6 86.5 89.4\\nSingle models on test (as of July 25, 2019)\\nXLNet LARGE 86.3†89.1†\\nRoBERTa 86.8 89.8\\nXLNet + SG-Net Verifier 87.0†89.9†\\nTable 6: Results on SQuAD. †indicates results that de-\\npend on additional external training data. RoBERTa\\nuses only the provided SQuAD data in both dev and\\ntest settings. BERT LARGE and XLNet LARGE results are\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-\\nspectively. XLNet, while we use the same learning rate for\\nall layers. For SQuAD v1.1 we follow the same finetun-\\ning procedure as Devlin et al. (2019 ). For SQuAD\\nv2.0, we additionally classify whether a given\\nquestion is answerable; we train this classifier\\njointly with the span predictor by summing the\\nclassification and span loss terms. Results We present our results in Table 6. On\\nthe SQuAD v1.1 development set, RoBERTa\\nmatches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\nadditional external training data. In contrast, our\\nsubmission does not use any additional data. Our single RoBERTa model outperforms all but\\none of the single model submissions, and is the\\ntop scoring system among those that do not rely\\non data augmentation. 
5.3 RACE Results\\nIn RACE, systems are provided with a passage of\\ntext, an associated question, and four candidate an-\\nswers. Systems are required to classify which of\\nthe four candidate answers is correct. We modify RoBERTa for this task by concate-Model Accuracy Middle High\\nSingle models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ). nating each candidate answer with the correspond-\\ning question and passage. We then encode each of\\nthese four sequences and pass the resulting [CLS]\\nrepresentations through a fully-connected layer,\\nwhich is used to predict the correct answer. We\\ntruncate question-answer pairs that are longer than\\n128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens. Results on the RACE test sets are presented in\\nTable 7. RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings. 6 Related Work\\nPretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;\\nLample and Conneau ,2019 ). Many recent\\npapers have used a basic recipe of finetuning\\nmodels for each end task ( Howard and Ruder ,\\n2018 ;Radford et al. ,2018 ), and pretraining\\nwith some variant of a masked language model\\nobjective. However, newer methods have\\nimproved performance by multi-task fine tun-\\ning ( Dong et al. ,2019 ), incorporating entity\\nembeddings ( Sun et al. ,2019 ), span predic-\\ntion ( Joshi et al. ,2019 ), and multiple variants\\nof autoregressive pretraining ( Song et al. ,2019 ;\\nChan et al. ,2019 ;Yang et al. ,2019 ). Perfor-\\nmance is also typically improved by training\\nbigger models on more data ( Devlin et al.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \",\\n2019 ;Baevski et al. ,2019 ;Yang et al. ,2019 ;\\nRadford et al. ,2019 ). Our goal was to replicate,\\nsimplify, and better tune the training of BERT,\\nas a reference point for better understanding the\\nrelative performance of all of these methods. 7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-\\nquences; and dynamically changing the masking\\npattern applied to the training data. Our improved\\npretraining procedure, which we call RoBERTa,\\nachieves state-of-the-art results on GLUE, RACE\\nand SQuAD, without multi-task finetuning for\\nGLUE or additional data for SQuAD. These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives. We additionally use a novel dataset,\\nCC-N EWS, and release our models and\\ncode for pretraining and finetuning at:\\nhttps://github.com/pytorch/fairseq . References\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\ntowski, editors. 2007. 
Proceedings of the Fourth\\nInternational Workshop on Semantic Evaluations\\n(SemEval-2007) . Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke\\nZettlemoyer, and Michael Auli. 2019. Cloze-\\ndriven pretraining of self-attention networks. arXiv\\npreprint arXiv:1903.07785 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,\\nDanilo Giampiccolo, Bernardo Magnini, and Idan\\nSzpektor. 2006. The second PASCAL recognising\\ntextual entailment challenge. In Proceedings of the\\nsecond PASCAL challenges workshop on recognis-\\ning textual entailment . Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo\\nGiampiccolo, and Bernardo Magnini. 2009. The\\nfifth PASCAL recognizing textual entailment chal-\\nlenge. Samuel R Bowman, Gabor Angeli, Christopher Potts,\\nand Christopher D Manning. 2015. A large anno-\\ntated corpus for learning natural language inference. InEmpirical Methods in Natural Language Process-\\ning (EMNLP) .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"William Chan, Nikita Kitaev, Kelvin Guu, Mitchell\\nStern, and Jakob Uszkoreit. 2019. KERMIT: Gener-\\native insertion-based modeling for sequences.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"arXiv\\npreprint arXiv:1906.01604 .Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment\\nchallenge. In Machine learning challenges. evalu-\\nating predictive uncertainty, visual object classifica-\\ntion, and recognising tectual entailment . Andrew M Dai and Quoc V Le. 2015. Semi-supervised\\nsequence learning. In Advances in Neural Informa-\\ntion Processing Systems (NIPS) . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. In North American Association for Com-\\nputational Linguistics (NAACL) . William B Dolan and Chris Brockett. 2005. Auto-\\nmatically constructing a corpus of sentential para-\\nphrases. In Proceedings of the International Work-\\nshop on Paraphrasing . Li Dong, Nan Yang, Wenhui Wang, Furu Wei,\\nXiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. The third PASCAL recog-\\nnizing textual entailment challenge. In Proceedings\\nof the ACL-PASCAL workshop on textual entailment\\nand paraphrasing . Aaron Gokaslan and Vanya Cohen. 2019. Openweb-\\ntext corpus. 
http://web.archive.org/\\nsave/http://Skylion007.github.io/\\nOpenWebTextCorpus .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Felix Hamborg, Norman Meuschke, Corinna Bre-\\nitinger, and Bela Gipp. 2017. news-please: A\\ngeneric news crawler and extractor. In Proceedings\\nof the 15th International Symposium of Information\\nScience .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Dan Hendrycks and Kevin Gimpel. 2016. Gaus-\\nsian error linear units (gelus). arXiv preprint\\narXiv:1606.08415 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Matthew Honnibal and Ines Montani. 2017. spaCy 2:\\nNatural language understanding with Bloom embed-\\ndings, convolutional neural networks and incremen-\\ntal parsing. To appear. Jeremy Howard and Sebastian Ruder. 2018. Universal\\nlanguage model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 . Shankar Iyer, Nikhil Dandekar, and Kornl Cser-\\nnai. 2016. First quora dataset release: Question\\npairs.https://data.quora.com/First-\\nQuora-Dataset-Release-Question-\\nPairs . Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by repre-\\nsenting and predicting spans. arXiv preprint\\narXiv:1907.10529 . Diederik Kingma and Jimmy Ba. 2015. Adam: A\\nmethod for stochastic optimization. In International\\nConference on Learning Representations (ICLR) .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu,\\nYordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema\\nchallenge. arXiv preprint arXiv:1905.06290 . Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,\\nand Eduard Hovy. 2017. Race: Large-scale reading\\ncomprehension dataset from examinations. arXiv\\npreprint arXiv:1704.04683 . Guillaume Lample and Alexis Conneau. 2019. Cross-\\nlingual language model pretraining. arXiv preprint\\narXiv:1901.07291 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Hector J Levesque, Ernest Davis, and Leora Morgen-\\nstern. 2011. The Winograd schema challenge. In\\nAAAI Spring Symposium: Logical Formalizations of\\nCommonsense Reasoning . Xiaodong Liu, Pengcheng He, Weizhu Chen, and\\nJianfeng Gao. 2019a. Improving multi-task deep\\nneural networks via knowledge distillation for\\nnatural language understanding. 
arXiv preprint\\narXiv:1904.09482 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-\\nfeng Gao. 2019b. Multi-task deep neural networks\\nfor natural language understanding. arXiv preprint\\narXiv:1901.11504 . Bryan McCann, James Bradbury, Caiming Xiong, and\\nRichard Socher. 2017. Learned in translation: Con-\\ntextualized word vectors. In Advances in Neural In-\\nformation Processing Systems (NIPS) , pages 6297–\\n6308.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Paulius Micikevicius, Sharan Narang, Jonah Alben,\\nGregory Diamos, Erich Elsen, David Garcia, Boris\\nGinsburg, Michael Houston, Oleksii Kuchaiev,\\nGanesh Venkatesh, and Hao Wu. 2018. Mixed preci-\\nsion training. In International Conference on Learn-\\ning Representations . Sebastian Nagel. 2016. Cc-news. http:\\n//web.archive.org/save/http:\\n//commoncrawl.org/2016/10/news-\\ndataset-available .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Myle Ott, Sergey Edunov, Alexei Baevski, Angela\\nFan, Sam Gross, Nathan Ng, David Grangier, and\\nMichael Auli. 2019. FAIRSEQ : A fast, exten-\\nsible toolkit for sequence modeling. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL): System Demonstrations .Myle Ott, Sergey Edunov, David Grangier, and\\nMichael Auli. 2018. Scaling neural machine trans-\\nlation. In Proceedings of the Third Conference on\\nMachine Translation (WMT) . Adam Paszke, Sam Gross, Soumith Chintala, Gre-\\ngory Chanan, Edward Yang, Zachary DeVito, Zem-\\ning Lin, Alban Desmaison, Luca Antiga, and Adam\\nLerer. 2017. Automatic differentiation in PyTorch. InNIPS Autodiff Workshop . Matthew Peters, Mark Neumann, Mohit Iyyer, Matt\\nGardner, Christopher Clark, Kenton Lee, and Luke\\nZettlemoyer. 2018. Deep contextualized word repre-\\nsentations. In North American Association for Com-\\nputational Linguistics (NAACL) . Alec Radford, Karthik Narasimhan, Time Salimans,\\nand Ilya Sutskever. 2018. Improving language un-\\nderstanding with unsupervised learning. Technical\\nreport, OpenAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\nDario Amodei, and Ilya Sutskever. 2019. Language\\nmodels are unsupervised multitask learners. Techni-\\ncal report, OpenAI.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable ques-\\ntions for squad. In Association for Computational\\nLinguistics (ACL) . Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and\\nPercy Liang. 2016. SQuAD: 100,000+ questions for\\nmachine comprehension of text. In Empirical Meth-\\nods in Natural Language Processing (EMNLP) . Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with\\nsubword units. In Association for Computational\\nLinguistics (ACL) , pages 1715–1725. 
Richard Socher, Alex Perelygin, Jean Wu, Jason\\nChuang, Christopher D Manning, Andrew Ng, and\\nChristopher Potts. 2013. Recursive deep models\\nfor semantic compositionality over a sentiment tree-\\nbank. In Empirical Methods in Natural Language\\nProcessing (EMNLP) . Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and\\nTie-Yan Liu. 2019. MASS: Masked sequence\\nto sequence pre-training for language generation. InInternational Conference on Machine Learning\\n(ICML) .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun\\nFeng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxi-\\nang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: En-\\nhanced representation through knowledge integra-\\ntion. arXiv preprint arXiv:1904.09223 . Trieu H Trinh and Quoc V Le. 2018. A simple\\nmethod for commonsense reasoning. arXiv preprint\\narXiv:1806.02847 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob\\nUszkoreit, Llion Jones, Aidan N Gomez, Łukasz\\nKaiser, and Illia Polosukhin. 2017. Attention is all\\nyou need. In Advances in neural information pro-\\ncessing systems . Alex Wang, Yada Pruksachatkun, Nikita Nangia,\\nAmanpreet Singh, Julian Michael, Felix Hill, Omer\\nLevy, and Samuel R.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Bowman. 2019a. SuperGLUE:\\nA stickier benchmark for general-purpose language\\nunderstanding systems.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"arXiv preprint 1905.00537 . Alex Wang, Amanpreet Singh, Julian Michael, Felix\\nHill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis plat-\\nform for natural language understanding. In Inter-\\nnational Conference on Learning Representations\\n(ICLR) . Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-\\nman. 2018. Neural network acceptability judg-\\nments. arXiv preprint 1805.12471 . Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen-\\ntence understanding through inference. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL) . Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-\\nbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretrain-\\ning for language understanding. arXiv preprint\\narXiv:1906.08237 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,\\nJames Demmel, and Cho-Jui Hsieh. 2019. Reduc-\\ning bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962 . Rowan Zellers, Ari Holtzman, Hannah Rashkin,\\nYonatan Bisk, Ali Farhadi, Franziska Roesner, and\\nYejin Choi. 2019. Defending against neural fake\\nnews. 
arXiv preprint arXiv:1905.12616 .\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [4ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": \"Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan\\nSalakhutdinov, Raquel Urtasun, Antonio Torralba,\\nand Sanja Fidler. 2015. Aligning books and movies:\\nTowards story-like visual explanations by watch-\\ning movies and reading books. In arXiv preprint\\narXiv:1506.06724 . Appendix for “RoBERTa: A Robustly\\nOptimized BERT Pretraining Approach”\\nA Full results on GLUE\\nIn Table 8we present the full set of development\\nset results for RoBERTa. We present results for\\naLARGE configuration that follows BERT LARGE ,\\nas well as a BASE configuration that follows\\nBERT BASE.B Pretraining Hyperparameters\\nTable 9describes the hyperparameters for pre-\\ntraining of RoBERTa LARGE and RoBERTa BASE\\nC Finetuning Hyperparameters\\nFinetuning hyperparameters for RACE, SQuAD\\nand GLUE are given in Table 10. We select the\\nbest hyperparameter values based on the median\\nof 5 random seeds for each task. MNLI QNLI QQP RTE SST MRPC CoLA STS\\nRoBERTa BASE\\n+ all data + 500k steps 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2\\nRoBERTa LARGE\\nwith B OOKS + W IKI 89.0 93.9 91.9 84.5 95.3 90.2 66.3 91.6\\n+ additional data ( §3.2) 89.3 94.0 92.0 82.7 95.6 91.4 66.1 92.2\\n+ pretrain longer 300k 90.0 94.5 92.2 83.3 96.1 91.1 67.4 92.3\\n+ pretrain longer 500k 90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4\\nTable 8: Development set results on GLUE tasks for various co nfigurations of RoBERTa. Hyperparam RoBERTa LARGE RoBERTa BASE\\nNumber of Layers 24 12\\nHidden size 1024 768\\nFFN inner hidden size 4096 3072\\nAttention heads 16 12\\nAttention head size 64 64\\nDropout 0.1 0.1\\nAttention Dropout 0.1 0.1\\nWarmup Steps 30k 24k\\nPeak Learning Rate 4e-4 6e-4\\nBatch Size 8k 8k\\nWeight Decay 0.01 0.01\\nMax Steps 500k 500k\\nLearning Rate Decay Linear Linear\\nAdamǫ 1e-6 1e-6\\nAdamβ1 0.9 0.9\\nAdamβ2 0.98 0.98\\nGradient Clipping 0.0 0.0\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE. Hyperparam RACE SQuAD GLUE\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }\\nBatch Size 16 48 {16, 32}\\nWeight Decay 0.1 0.01 0.1\\nMax Epochs 4 2 10\\nLearning Rate Decay Linear Linear Linear\\nWarmup ratio 0.06 0.06 0.06\\nTable 10: Hyperparameters for finetuning RoBERTa LARGE on RACE, SQuAD and GLUE.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Allen School of Computer Science & Engineering,\\nUniversity of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-\\nparison between different approaches is chal-\\nlenging. 
Training is computationally expen-\\nsive, often done on private datasets of different\\nsizes, and, as we will show, hyperparameter\\nchoices have significant impact on the final re-\\nsults. We present a replication study of BERT\\npretraining ( Devlin et al. ,2019 ) that carefully\\nmeasures the impact of many key hyperparam-\\neters and training data size. We find that BERT\\nwas significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. These results highlight the impor-\\ntance of previously overlooked design choices,\\nand raise questions about the source of re-\\ncently reported improvements. We release our\\nmodels and code.1\\n1 Introduction\\nSelf-training methods such as ELMo ( Peters et al. ,\\n2018 ), GPT ( Radford et al. ,2018 ), BERT\\n(Devlin et al. ,2019 ), XLM ( Lample and Conneau ,\\n2019 ), and XLNet ( Yang et al. ,2019 ) have\\nbrought significant performance gains, but it can\\nbe challenging to determine which aspects of\\nthe methods contribute the most. Training is\\ncomputationally expensive, limiting the amount\\nof tuning that can be done, and is often done with\\nprivate training data of varying sizes, limiting\\nour ability to measure the effects of the modeling\\nadvances. ∗Equal contribution. 1Our models and code are available at:\\nhttps://github.com/pytorch/fairseqWe present a replication study of BERT pre-\\ntraining ( Devlin et al. ,2019 ), which includes a\\ncareful evaluation of the effects of hyperparmeter\\ntuning and training set size. We find that BERT\\nwas significantly undertrained and propose an im-\\nproved recipe for training BERT models, which\\nwe call RoBERTa, that can match or exceed the\\nperformance of all of the post-BERT methods. Our modifications are simple, they include: (1)\\ntraining the model longer, with bigger batches,\\nover more data; (2) removing the next sentence\\nprediction objective; (3) training on longer se-\\nquences; and (4) dynamically changing the mask-\\ning pattern applied to the training data. We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects. When controlling for training data, our im-\\nproved training procedure improves upon the pub-\\nlished BERT results on both GLUE and SQuAD. When trained for longer over additional data, our\\nmodel achieves a score of 88.5 on the public\\nGLUE leaderboard, matching the 88.4 reported\\nbyYang et al. (2019 ). Our model establishes a\\nnew state-of-the-art on 4/9 of the GLUE tasks:\\nMNLI, QNLI, RTE and STS-B. We also match\\nstate-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked lan-\\nguage model training objective is competitive\\nwith other recently proposed training objectives\\nsuch as perturbed autoregressive language model-\\ning (Yang et al. ,2019 ).2\\nIn summary, the contributions of this paper\\nare: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work. 
alternatives that lead to better downstream task\\nperformance; (2) We use a novel dataset, CC-\\nNEWS, and confirm that using more data for pre-\\ntraining further improves performance on down-\\nstream tasks; (3) Our training improvements show\\nthat masked language model pretraining, under\\nthe right design choices, is competitive with all\\nother recently published methods. We release our\\nmodel, pretraining and fine-tuning code imple-\\nmented in PyTorch ( Paszke et al. ,2017 ). 2 Background\\nIn this section, we give a brief overview of the\\nBERT ( Devlin et al. ,2019 ) pretraining approach\\nand some of the training choices that we will ex-\\namine experimentally in the following section. 2.1 Setup\\nBERT takes as input a concatenation of two\\nsegments (sequences of tokens), x1,...,x N\\nandy1,...,yM. Segments usually consist of\\nmore than one natural sentence. The two seg-\\nments are presented as a single input sequence\\nto BERT with special tokens delimiting them:\\n[CLS],x1,...,x N,[SEP],y1,...,yM,[EOS]. MandNare constrained such that M+N < T ,\\nwhereTis a parameter that controls the maximum\\nsequence length during training. The model is first pretrained on a large unla-\\nbeled text corpus and subsequently finetuned us-\\ning end-task labeled data. 2.2 Architecture\\nBERT uses the now ubiquitous transformer archi-\\ntecture ( Vaswani et al. ,2017 ), which we will not\\nreview in detail. We use a transformer architecture\\nwithLlayers. Each block uses Aself-attention\\nheads and hidden dimension H. 2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction. Masked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. The MLM objective is a cross-entropy\\nloss on predicting the masked tokens. BERT uni-\\nformly selects 15% of the input tokens for possi-\\nble replacement. Of the selected tokens, 80% are\\nreplaced with [MASK], 10% are left unchanged,and 10% are replaced by a randomly selected vo-\\ncabulary token. In the original implementation, random mask-\\ning and replacement is performed once in the be-\\nginning and saved for the duration of training, al-\\nthough in practice, data is duplicated so the mask\\nis not always the same for every training sentence\\n(see Section 4.1). Next Sentence Prediction (NSP) NSP is a bi-\\nnary classification loss for predicting whether two\\nsegments follow each other in the original text. Positive examples are created by taking consecu-\\ntive sentences from the text corpus. Negative ex-\\namples are created by pairing segments from dif-\\nferent documents. Positive and negative examples\\nare sampled with equal probability. The NSP objective was designed to improve\\nperformance on downstream tasks, such as Natural\\nLanguage Inference ( Bowman et al. ,2015 ), which\\nrequire reasoning about the relationships between\\npairs of sentences. 2.4 Optimization\\nBERT is optimized with Adam ( Kingma and Ba ,\\n2015 ) using the following parameters: β1= 0.9,\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\ncay of0.01. The learning rate is warmed up\\nover the first 10,000 steps to a peak value of\\n1e-4, and then linearly decayed. BERT trains\\nwith a dropout of 0.1 on all layers and at-\\ntention weights, and a GELU activation func-\\ntion ( Hendrycks and Gimpel ,2016 ). 
Models are\\npretrained for S=1,000,000 updates, with mini-\\nbatches containing B=256 sequences of maxi-\\nmum length T=512 tokens. 2.5 Data\\nBERT is trained on a combination of B OOK COR-\\nPUS (Zhu et al. ,2015 ) plus English W IKIPEDIA ,\\nwhich totals 16GB of uncompressed text.3\\n3 Experimental Setup\\nIn this section, we describe the experimental setup\\nfor our replication study of BERT. 3.1 Implementation\\nWe reimplement BERT in FAIRSEQ (Ott et al. ,\\n2019 ). We primarily follow the original BERT\\n3Yang et al. (2019 ) use the same dataset but report having\\nonly 13GB of text after data cleaning. This is most likely due\\nto subtle differences in cleaning of the Wikipedia data. optimization hyperparameters, given in Section 2,\\nexcept for the peak learning rate and number of\\nwarmup steps, which are tuned separately for each\\nsetting. We additionally found training to be very\\nsensitive to the Adam epsilon term, and in some\\ncases we obtained better performance or improved\\nstability after tuning it. Similarly, we found setting\\nβ2= 0.98to improve stability when training with\\nlarge batch sizes. We pretrain with sequences of at most T= 512\\ntokens. Unlike Devlin et al. (2019 ), we do not ran-\\ndomly inject short sequences, and we do not train\\nwith a reduced sequence length for the first 90% of\\nupdates. We train only with full-length sequences. We train with mixed precision floating point\\narithmetic on DGX-1 machines, each with 8 ×\\n32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ). 3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. (2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. Several efforts\\nhave trained on datasets larger and more diverse\\nthan the original BERT ( Radford et al.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",2019 ;\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\nnately, not all of the additional datasets can be\\npublicly released. For our study, we focus on gath-\\nering as much data as possible for experimenta-\\ntion, allowing us to match the overall quality and\\nquantity of data as appropriate for each compari-\\nson. We consider five English-language corpora of\\nvarying sizes and domains, totaling over 160GB\\nof uncompressed text. We use the following text\\ncorpora:\\n•BOOK CORPUS (Zhu et al. ,2015 ) plus English\\nWIKIPEDIA . This is the original data used to\\ntrain BERT. (16GB). •CC-N EWS, which we collected from the En-\\nglish portion of the CommonCrawl News\\ndataset ( Nagel ,2016 ). The data contains 63\\nmillion English news articles crawled between\\nSeptember 2016 and February 2019. (76GB af-\\nter filtering).4\\n•OPENWEBTEXT (Gokaslan and Cohen ,2019 ),\\nan open-source recreation of the WebText cor-\\n4We usenews-please (Hamborg et al. ,2017 ) to col-\\nlect and extract CC-N EWS. CC-N EWS is similar to the R E-\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\nis web content extracted from URLs shared on\\nReddit with at least three upvotes. (38GB).5\\n•STORIES , a dataset introduced in Trinh and Le\\n(2018 ) containing a subset of CommonCrawl\\ndata filtered to match the story-like style of\\nWinograd schemas. (31GB). 
3.3 Evaluation\\nFollowing previous work, we evaluate our pre-\\ntrained models on downstream tasks using the fol-\\nlowing three benchmarks. GLUE The General Language Understand-\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\n2019b ) is a collection of 9 datasets for evaluating\\nnatural language understanding systems.6Tasks\\nare framed as either single-sentence classification\\nor sentence-pair classification tasks. The GLUE\\norganizers provide training and development data\\nsplits as well as a submission server and leader-\\nboard that allows participants to evaluate and com-\\npare their systems on private held-out test data. For the replication study in Section 4, we report\\nresults on the development sets after finetuning\\nthe pretrained models on the corresponding single-\\ntask training data (i.e., without multi-task training\\nor ensembling). Our finetuning procedure follows\\nthe original BERT paper ( Devlin et al. ,2019 ). In Section 5we additionally report test set re-\\nsults obtained from the public leaderboard. These\\nresults depend on a several task-specific modifica-\\ntions, which we describe in Section 5.1. SQuAD The Stanford Question Answering\\nDataset (SQuAD) provides a paragraph of context\\nand a question. The task is to answer the question\\nby extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1\\nand V2.0 ( Rajpurkar et al. ,2016 ,2018 ). In V1.1\\nthe context always contains an answer, whereas in\\n5The authors and their affiliated institutions are not in any\\nway affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA ( Warstadt et al. ,2018 ),\\nStanford Sentiment Treebank (SST) ( Socher et al. ,\\n2013 ), Microsoft Research Paragraph Corpus\\n(MRPC) ( Dolan and Brockett ,2005 ), Semantic Tex-\\ntual Similarity Benchmark (STS) ( Agirre et al. ,2007 ),\\nQuora Question Pairs (QQP) ( Iyer et al. ,2016 ), Multi-\\nGenre NLI (MNLI) ( Williams et al. ,2018 ), Question NLI\\n(QNLI) ( Rajpurkar et al. ,2016 ), Recognizing Textual\\nEntailment (RTE) ( Dagan et al. ,2006 ;Bar-Haim et al. ,\\n2006 ;Giampiccolo et al. ,2007 ;Bentivogli et al. ,2009 ) and\\nWinograd NLI (WNLI) ( Levesque et al. ,2011 ). V2.0 some questions are not answered in the pro-\\nvided context, making the task more challenging. For SQuAD V1.1 we adopt the same span pre-\\ndiction method as BERT ( Devlin et al. ,2019 ). For\\nSQuAD V2.0, we add an additional binary classi-\\nfier to predict whether the question is answerable,\\nwhich we train jointly by summing the classifica-\\ntion and span loss terms. During evaluation, we\\nonly predict span indices on pairs that are classi-\\nfied as answerable. RACE The ReAding Comprehension from Ex-\\naminations (RACE) ( Lai et al. ,2017 ) task is a\\nlarge-scale reading comprehension dataset with\\nmore than 28,000 passages and nearly 100,000\\nquestions. The dataset is collected from English\\nexaminations in China, which are designed for\\nmiddle and high school students. In RACE, each\\npassage is associated with multiple questions. For\\nevery question, the task is to select one correct an-\\nswer from four options. RACE has significantly\\nlonger context than other popular reading compre-\\nhension datasets and the proportion of questions\\nthat requires reasoning is very large. 4 Training Procedure Analysis\\nThis section explores and quantifies which choices\\nare important for successfully pretraining BERT\\nmodels. 
We keep the model architecture fixed.7\\nSpecifically, we begin by training BERT models\\nwith the same configuration as BERT BASE (L=\\n12,H= 768 ,A= 12 , 110M params). 4.1 Static vs. Dynamic Masking\\nAs discussed in Section 2, BERT relies on ran-\\ndomly masking and predicting tokens. The orig-\\ninal BERT implementation performed masking\\nonce during data preprocessing, resulting in a sin-\\nglestatic mask. To avoid using the same mask for\\neach training instance in every epoch, training data\\nwas duplicated 10 times so that each sequence is\\nmasked in 10 different ways over the 40 epochs of\\ntraining. Thus, each training sequence was seen\\nwith the same mask four times during training. We compare this strategy with dynamic mask-\\ningwhere we generate the masking pattern every\\ntime we feed a sequence to the model. This be-\\ncomes crucial when pretraining for more steps or\\nwith larger datasets. 7Studying architectural changes, including larger archi-\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2\\nreference 76.3 84.3 92.8\\nOur reimplementation:\\nstatic 78.3 84.3 92.5\\ndynamic 78.7 84.0 92.9\\nTable 1: Comparison between static and dynamic\\nmasking for BERT BASE. We report F1 for SQuAD and\\naccuracy for MNLI-m and SST-2. Reported results are\\nmedians over 5 random initializations (seeds). Refer-\\nence results are from Yang et al. (2019 ). Results Table 1compares the published\\nBERT BASE results from Devlin et al. (2019 ) to our\\nreimplementation with either static or dynamic\\nmasking. We find that our reimplementation\\nwith static masking performs similar to the\\noriginal BERT model, and dynamic masking is\\ncomparable or slightly better than static masking. Given these results and the additional efficiency\\nbenefits of dynamic masking, we use dynamic\\nmasking in the remainder of the experiments. 4.2 Model Input Format and Next Sentence\\nPrediction\\nIn the original BERT pretraining procedure, the\\nmodel observes two concatenated document seg-\\nments, which are either sampled contiguously\\nfrom the same document (with p= 0.5) or from\\ndistinct documents. In addition to the masked lan-\\nguage modeling objective, the model is trained to\\npredict whether the observed document segments\\ncome from the same or distinct documents via an\\nauxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an impor-\\ntant factor in training the original BERT model. Devlin et al. (2019 ) observe that removing NSP\\nhurts performance, with significant performance\\ndegradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the\\nnecessity of the NSP loss ( Lample and Conneau ,\\n2019 ;Yang et al.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",2019 ;Joshi et al. ,2019 ). To better understand this discrepancy, we com-\\npare several alternative training formats:\\n•SEGMENT -PAIR +NSP: This follows the original\\ninput format used in BERT ( Devlin et al. ,2019 ),\\nwith the NSP loss. Each input has a pair of seg-\\nments, which can each contain multiple natural\\nsentences, but the total combined length must\\nbe less than 512 tokens. 
Model SQuAD 1.1/2.0 MNLI-m SST-2 RACE\\nOur reimplementation (with NSP loss):\\nSEGMENT -PAIR 90.4/78.7 84.0 92.9 64.2\\nSENTENCE -PAIR 88.7/76.2 82.9 92.1 63.0\\nOur reimplementation (without NSP loss):\\nFULL -SENTENCES 90.4/79.1 84.7 92.5 64.8\\nDOC-SENTENCES 90.6/79.7 84.7 92.7 65.6\\nBERT BASE 88.5/76.3 84.3 92.8 64.3\\nXLNet BASE (K = 7) –/81.3 85.8 92.7 66.1\\nXLNet BASE (K = 6) –/81.0 85.6 93.4 66.7\\nTable 2: Development set results for base models pretrained over B OOK CORPUS and W IKIPEDIA . All models are\\ntrained for 1M steps with a batch size of 256 sequences. We rep ort F1 for SQuAD and accuracy for MNLI-m,\\nSST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT BASEand\\nXLNet BASEare from Yang et al. (2019 ). •SENTENCE -PAIR +NSP: Each input contains a\\npair of natural sentences , either sampled from\\na contiguous portion of one document or from\\nseparate documents. Since these inputs are sig-\\nnificantly shorter than 512 tokens, we increase\\nthe batch size so that the total number of tokens\\nremains similar to SEGMENT -PAIR +NSP. We re-\\ntain the NSP loss. •FULL -SENTENCES : Each input is packed with\\nfull sentences sampled contiguously from one\\nor more documents, such that the total length is\\nat most 512 tokens. Inputs may cross document\\nboundaries. When we reach the end of one doc-\\nument, we begin sampling sentences from the\\nnext document and add an extra separator token\\nbetween documents. We remove the NSP loss. •DOC-SENTENCES : Inputs are constructed sim-\\nilarly to FULL -SENTENCES , except that they\\nmay not cross document boundaries. Inputs\\nsampled near the end of a document may be\\nshorter than 512 tokens, so we dynamically in-\\ncrease the batch size in these cases to achieve\\na similar number of total tokens as FULL -\\nSENTENCES .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"We remove the NSP loss. Results Table 2shows results for the four dif-\\nferent settings. We first compare the original\\nSEGMENT -PAIR input format from Devlin et al. (2019 ) to the SENTENCE -PAIR format; both for-\\nmats retain the NSP loss, but the latter uses sin-\\ngle sentences. We find that using individual\\nsentences hurts performance on downstream\\ntasks , which we hypothesize is because the model\\nis not able to learn long-range dependencies.We next compare training without the NSP\\nloss and training with blocks of text from a sin-\\ngle document ( DOC-SENTENCES ). We find that\\nthis setting outperforms the originally published\\nBERT BASEresults and that removing the NSP loss\\nmatches or slightly improves downstream task\\nperformance , in contrast to Devlin et al. (2019 ). It is possible that the original BERT implementa-\\ntion may only have removed the loss term while\\nstill retaining the SEGMENT -PAIR input format. Finally we find that restricting sequences to\\ncome from a single document ( DOC-SENTENCES )\\nperforms slightly better than packing sequences\\nfrom multiple documents ( FULL -SENTENCES ). However, because the DOC-SENTENCES format\\nresults in variable batch sizes, we use FULL -\\nSENTENCES in the remainder of our experiments\\nfor easier comparison with related work. 
4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased\\nappropriately ( Ott et al. ,2018 ). Recent work has\\nshown that BERT is also amenable to large batch\\ntraining ( You et al. ,2019 ). Devlin et al. (2019 ) originally trained\\nBERT BASE for 1M steps with a batch size of\\n256 sequences. This is equivalent in computa-\\ntional cost, via gradient accumulation, to training\\nfor 125K steps with a batch size of 2K sequences,\\nor for 31K steps with a batch size of 8K. In Table 3we compare perplexity and end- bsz steps lr ppl MNLI-m SST-2\\n256 1M 1e-4 3.99 84.7 92.7\\n2K 125K 7e-4 3.68 85.2 92.9\\n8K 31K 1e-3 3.77 84.6 92.8\\nTable 3: Perplexity on held-out training data ( ppl) and\\ndevelopment set accuracy for base models trained over\\nBOOK CORPUS and W IKIPEDIA with varying batch\\nsizes ( bsz).\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"We tune the learning rate ( lr) for each set-\\nting. Models make the same number of passes over the\\ndata (epochs) and have the same computational cost. task performance of BERT BASE as we increase the\\nbatch size, controlling for the number of passes\\nthrough the training data. We observe that train-\\ning with large batches improves perplexity for the\\nmasked language modeling objective, as well as\\nend-task accuracy. Large batches are also easier to\\nparallelize via distributed data parallel training,8\\nand in later experiments we train with batches of\\n8K sequences. Notably You et al. (2019 ) train BERT with even\\nlarger batche sizes, up to 32K sequences. We leave\\nfurther exploration of the limits of large batch\\ntraining to future work. 4.4 Text Encoding\\nByte-Pair Encoding (BPE) ( Sennrich et al. ,2016 )\\nis a hybrid between character- and word-level rep-\\nresentations that allows handling the large vocab-\\nularies common in natural language corpora. In-\\nstead of full words, BPE relies on subwords units,\\nwhich are extracted by performing statistical anal-\\nysis of the training corpus. BPE vocabulary sizes typically range from\\n10K-100K subword units. However, unicode char-\\nacters can account for a sizeable portion of this\\nvocabulary when modeling large and diverse cor-\\npora, such as the ones considered in this work. Radford et al. (2019 ) introduce a clever imple-\\nmentation of BPE that uses bytes instead of uni-\\ncode characters as the base subword units. Using\\nbytes makes it possible to learn a subword vocab-\\nulary of a modest size (50K units) that can still en-\\ncode any input text without introducing any “un-\\nknown” tokens. 8Large batch training can improve training efficiency even\\nwithout large scale parallel hardware through gradient ac-\\ncumulation , whereby gradients from multiple mini-batches\\nare accumulated locally before each optimization step. Thi s\\nfunctionality is supported natively in FAIRSEQ (Ott et al. ,\\n2019 ).The original BERT implementa-\\ntion ( Devlin et al. ,2019 ) uses a character-level\\nBPE vocabulary of size 30K, which is learned\\nafter preprocessing the input with heuristic tok-\\nenization rules. Following Radford et al. 
(2019 ),\\nwe instead consider training BERT with a larger\\nbyte-level BPE vocabulary containing 50K sub-\\nword units, without any additional preprocessing\\nor tokenization of the input. This adds approxi-\\nmately 15M and 20M additional parameters for\\nBERT BASEand BERT LARGE , respectively. Early experiments revealed only slight dif-\\nferences between these encodings, with the\\nRadford et al. (2019 ) BPE achieving slightly\\nworse end-task performance on some tasks. Nev-\\nertheless, we believe the advantages of a univer-\\nsal encoding scheme outweighs the minor degre-\\ndation in performance and use this encoding in\\nthe remainder of our experiments. A more de-\\ntailed comparison of these encodings is left to fu-\\nture work. 5 RoBERTa\\nIn the previous section we propose modifications\\nto the BERT pretraining procedure that improve\\nend-task performance. We now aggregate these\\nimprovements and evaluate their combined im-\\npact. We call this configuration RoBERTa for\\nRobustly optimized BERT approach. Specifi-\\ncally, RoBERTa is trained with dynamic mask-\\ning (Section 4.1),FULL -SENTENCES without NSP\\nloss (Section 4.2), large mini-batches (Section 4.3)\\nand a larger byte-level BPE (Section 4.4). Additionally, we investigate two other impor-\\ntant factors that have been under-emphasized in\\nprevious work: (1) the data used for pretraining,\\nand (2) the number of training passes through the\\ndata. For example, the recently proposed XLNet\\narchitecture ( Yang et al. ,2019 ) is pretrained us-\\ning nearly 10 times more data than the original\\nBERT ( Devlin et al. ,2019 ). It is also trained with\\na batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT. To help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa\\nfollowing the BERT LARGE architecture ( L= 24 ,\\nH= 1024 ,A= 16 , 355M parameters). We\\npretrain for 100K steps over a comparable B OOK -\\nCORPUS plus W IKIPEDIA dataset as was used in Model data bsz stepsSQuADMNLI-m SST-2(v1.1/2.0)\\nRoBERTa\\nwith B OOKS + W IKI 16GB 8K 100K 93.6/87.3 89.0 95.3\\n+ additional data ( §3.2) 160GB 8K 100K 94.0/87.7 89.3 95.6\\n+ pretrain longer 160GB 8K 300K 94.4/88.7 90.0 96.1\\n+ pretrain even longer 160GB 8K 500K 94.6/89.4 90.2 96.4\\nBERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\\nfor longer (100K →300K→500K steps).\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Each row accumulates improvements from the row s above. RoBERTa\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from\\nDevlin et al. (2019 ) and Yang et al. (2019 ), respectively. Complete results on all GLUE tasks can be fo und in the\\nAppendix.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Devlin et al. (2019 ). 
We pretrain our model using\\n1024 V100 GPUs for approximately one day. Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4. Next, we combine this data with the three ad-\\nditional datasets described in Section 3.2. We\\ntrain RoBERTa over the combined data with the\\nsame number of training steps as before (100K). In total, we pretrain over 160GB of text. We ob-\\nserve further improvements in performance across\\nall downstream tasks, validating the importance of\\ndata size and diversity in pretraining.9\\nFinally, we pretrain RoBERTa for significantly\\nlonger, increasing the number of pretraining steps\\nfrom 100K to 300K, and then further to 500K. We\\nagain observe significant gains in downstream task\\nperformance, and the 300K and 500K step mod-\\nels outperform XLNet LARGE across most tasks. We\\nnote that even our longest-trained model does not\\nappear to overfit our data and would likely benefit\\nfrom additional training. In the rest of the paper, we evaluate our best\\nRoBERTa model on the three different bench-\\nmarks: GLUE, SQuaD and RACE. Specifically\\n9Our experiments conflate increases in data size and di-\\nversity. We leave a more careful analysis of these two dimen-\\nsions to future work.we consider RoBERTa trained for 500K steps over\\nall five of the datasets introduced in Section 3.2. 5.1 GLUE Results\\nFor GLUE we consider two finetuning settings. In the first setting ( single-task, dev ) we finetune\\nRoBERTa separately for each of the GLUE tasks,\\nusing only the training data for the correspond-\\ning task. We consider a limited hyperparameter\\nsweep for each task, with batch sizes ∈ {16,32}\\nand learning rates ∈ {1e−5,2e−5,3e−5}, with a\\nlinear warmup for the first 6% of steps followed by\\na linear decay to 0. We finetune for 10 epochs and\\nperform early stopping based on each task’s eval-\\nuation metric on the dev set. The rest of the hyper-\\nparameters remain the same as during pretraining. In this setting, we report the median development\\nset results for each task over five random initial-\\nizations, without model ensembling. In the second setting ( ensembles, test ), we com-\\npare RoBERTa to other approaches on the test set\\nvia the GLUE leaderboard. While many submis-\\nsions to the GLUE leaderboard depend on multi-\\ntask finetuning, our submission depends only on\\nsingle-task finetuning . For RTE, STS and MRPC\\nwe found it helpful to finetune starting from the\\nMNLI single-task model, rather than the baseline\\npretrained RoBERTa. We explore a slightly wider\\nhyperparameter space, described in the Appendix,\\nand ensemble between 5 and 7 models per task. MNLI QNLI QQP RTE SST MRPC CoLA STS WNLI Avg\\nSingle-task single models on dev\\nBERT LARGE 86.6/- 92.3 91.3 70.4 93.2 88.0 60.6 90.0 - -\\nXLNet LARGE 89.8/- 93.9 91.8 83.8 95.6 89.2 63.6 91.8 - -\\nRoBERTa 90.2/90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4 91.3 -\\nEnsembles on test (from leaderboard as of July 25, 2019)\\nALICE 88.2/87.9 95.7 90.7 83.5 95.2 92.6 68.6 91.1 80.8 86.3\\nMT-DNN 87.9/87.4 96.0 89.9 86.3 96.5 92.7 68.4 91.1 89.0 87.6\\nXLNet 90.2/89.8 98.6 90.3 86.3 96.8 93.0 67.8 91.6 90.4 88.4\\nRoBERTa 90.8/90.2 98.9 90.2 88.2 96.7 92.3 67.8 92.2 89.0 88.5\\nTable 5: Results on GLUE. All results are based on a 24-layer a rchitecture. 
BERT LARGE and XLNet LARGE results\\nare from Devlin et al. (2019 ) and Yang et al. (2019 ), respectively. RoBERTa results on the development set are a\\nmedian over five runs. RoBERTa results on the test set are ense mbles of single-task models. For RTE, STS and\\nMRPC we finetune starting from the MNLI model instead of the ba seline pretrained model. Averages are obtained\\nfrom the GLUE leaderboard. Task-specific modifications Two of the GLUE\\ntasks require task-specific finetuning approaches\\nto achieve competitive leaderboard results. QNLI : Recent submissions on the GLUE\\nleaderboard adopt a pairwise ranking formulation\\nfor the QNLI task, in which candidate answers\\nare mined from the training set and compared to\\none another, and a single (question, candidate)\\npair is classified as positive ( Liu et al.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [14ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",2019b ,a;\\nYang et al. ,2019 ). This formulation significantly\\nsimplifies the task, but is not directly comparable\\nto BERT ( Devlin et al. ,2019 ). Following recent\\nwork, we adopt the ranking approach for our test\\nsubmission, but for direct comparison with BERT\\nwe report development set results based on a pure\\nclassification approach. WNLI : We found the provided NLI-format\\ndata to be challenging to work with. Instead\\nwe use the reformatted WNLI data from Super-\\nGLUE ( Wang et al. ,2019a ), which indicates the\\nspan of the query pronoun and referent. We fine-\\ntune RoBERTa using the margin ranking loss from\\nKocijan et al. (2019 ). For a given input sentence,\\nwe use spaCy ( Honnibal and Montani ,2017 ) to\\nextract additional candidate noun phrases from the\\nsentence and finetune our model so that it assigns\\nhigher scores to positive referent phrases than for\\nany of the generated negative candidate phrases. One unfortunate consequence of this formulation\\nis that we can only make use of the positive train-\\ning examples, which excludes over half of the pro-\\nvided training examples.10\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\nfirst setting ( single-task, dev ), RoBERTa achieves\\nstate-of-the-art results on all 9 of the GLUE\\ntask development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-\\ntraining objective, compared to more mundane de-\\ntails like dataset size and training time that we ex-\\nplore in this work. In the second setting ( ensembles, test ), we\\nsubmit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-\\nther improve these results by incorporating more\\nsophisticated multi-task finetuning procedures. 5.2 SQuAD Results\\nWe adopt a much simpler approach for SQuAD\\ncompared to past work. In particular, while\\nboth BERT ( Devlin et al. ,2019 ) and XL-\\nNet ( Yang et al. 
,2019 ) augment their training data\\nwith additional QA datasets, we only finetune\\nRoBERTa using the provided SQuAD training\\ndata .Yang et al. (2019 ) also employed a custom\\nlayer-wise learning rate schedule to finetune\\nresults could potentially be improved by augmenting this wi th\\nadditional pronoun disambiguation datasets. ModelSQuAD 1.1 SQuAD 2.0\\nEM F1 EM F1\\nSingle models on dev, w/o data augmentation\\nBERT LARGE 84.1 90.9 79.0 81.8\\nXLNet LARGE 89.0 94.5 86.1 88.8\\nRoBERTa 88.9 94.6 86.5 89.4\\nSingle models on test (as of July 25, 2019)\\nXLNet LARGE 86.3†89.1†\\nRoBERTa 86.8 89.8\\nXLNet + SG-Net Verifier 87.0†89.9†\\nTable 6: Results on SQuAD. †indicates results that de-\\npend on additional external training data. RoBERTa\\nuses only the provided SQuAD data in both dev and\\ntest settings. BERT LARGE and XLNet LARGE results are\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-\\nspectively. XLNet, while we use the same learning rate for\\nall layers. For SQuAD v1.1 we follow the same finetun-\\ning procedure as Devlin et al. (2019 ). For SQuAD\\nv2.0, we additionally classify whether a given\\nquestion is answerable; we train this classifier\\njointly with the span predictor by summing the\\nclassification and span loss terms. Results We present our results in Table 6. On\\nthe SQuAD v1.1 development set, RoBERTa\\nmatches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\nadditional external training data. In contrast, our\\nsubmission does not use any additional data. Our single RoBERTa model outperforms all but\\none of the single model submissions, and is the\\ntop scoring system among those that do not rely\\non data augmentation. 5.3 RACE Results\\nIn RACE, systems are provided with a passage of\\ntext, an associated question, and four candidate an-\\nswers. Systems are required to classify which of\\nthe four candidate answers is correct. We modify RoBERTa for this task by concate-Model Accuracy Middle High\\nSingle models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ). nating each candidate answer with the correspond-\\ning question and passage. We then encode each of\\nthese four sequences and pass the resulting [CLS]\\nrepresentations through a fully-connected layer,\\nwhich is used to predict the correct answer. We\\ntruncate question-answer pairs that are longer than\\n128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens. Results on the RACE test sets are presented in\\nTable 7. RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings. 6 Related Work\\nPretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;\\nLample and Conneau ,2019 ). 
Many recent\\npapers have used a basic recipe of finetuning\\nmodels for each end task ( Howard and Ruder ,\\n2018 ;Radford et al. ,2018 ), and pretraining\\nwith some variant of a masked language model\\nobjective. However, newer methods have\\nimproved performance by multi-task fine tun-\\ning ( Dong et al. ,2019 ), incorporating entity\\nembeddings ( Sun et al. ,2019 ), span predic-\\ntion ( Joshi et al. ,2019 ), and multiple variants\\nof autoregressive pretraining ( Song et al. ,2019 ;\\nChan et al. ,2019 ;Yang et al. ,2019 ). Perfor-\\nmance is also typically improved by training\\nbigger models on more data ( Devlin et al.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",\\n2019 ;Baevski et al. ,2019 ;Yang et al. ,2019 ;\\nRadford et al. ,2019 ). Our goal was to replicate,\\nsimplify, and better tune the training of BERT,\\nas a reference point for better understanding the\\nrelative performance of all of these methods. 7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-\\nquences; and dynamically changing the masking\\npattern applied to the training data. Our improved\\npretraining procedure, which we call RoBERTa,\\nachieves state-of-the-art results on GLUE, RACE\\nand SQuAD, without multi-task finetuning for\\nGLUE or additional data for SQuAD. These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives. We additionally use a novel dataset,\\nCC-N EWS, and release our models and\\ncode for pretraining and finetuning at:\\nhttps://github.com/pytorch/fairseq . References\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\ntowski, editors. 2007. Proceedings of the Fourth\\nInternational Workshop on Semantic Evaluations\\n(SemEval-2007) . Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke\\nZettlemoyer, and Michael Auli. 2019. Cloze-\\ndriven pretraining of self-attention networks. arXiv\\npreprint arXiv:1903.07785 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,\\nDanilo Giampiccolo, Bernardo Magnini, and Idan\\nSzpektor. 2006. The second PASCAL recognising\\ntextual entailment challenge. In Proceedings of the\\nsecond PASCAL challenges workshop on recognis-\\ning textual entailment . Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo\\nGiampiccolo, and Bernardo Magnini. 2009. The\\nfifth PASCAL recognizing textual entailment chal-\\nlenge. Samuel R Bowman, Gabor Angeli, Christopher Potts,\\nand Christopher D Manning. 2015. A large anno-\\ntated corpus for learning natural language inference. 
InEmpirical Methods in Natural Language Process-\\ning (EMNLP) .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"William Chan, Nikita Kitaev, Kelvin Guu, Mitchell\\nStern, and Jakob Uszkoreit. 2019. KERMIT: Gener-\\native insertion-based modeling for sequences.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"arXiv\\npreprint arXiv:1906.01604 .Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment\\nchallenge. In Machine learning challenges. evalu-\\nating predictive uncertainty, visual object classifica-\\ntion, and recognising tectual entailment . Andrew M Dai and Quoc V Le. 2015. Semi-supervised\\nsequence learning. In Advances in Neural Informa-\\ntion Processing Systems (NIPS) . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. In North American Association for Com-\\nputational Linguistics (NAACL) . William B Dolan and Chris Brockett. 2005. Auto-\\nmatically constructing a corpus of sentential para-\\nphrases. In Proceedings of the International Work-\\nshop on Paraphrasing . Li Dong, Nan Yang, Wenhui Wang, Furu Wei,\\nXiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. The third PASCAL recog-\\nnizing textual entailment challenge. In Proceedings\\nof the ACL-PASCAL workshop on textual entailment\\nand paraphrasing . Aaron Gokaslan and Vanya Cohen. 2019. Openweb-\\ntext corpus. http://web.archive.org/\\nsave/http://Skylion007.github.io/\\nOpenWebTextCorpus .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Felix Hamborg, Norman Meuschke, Corinna Bre-\\nitinger, and Bela Gipp. 2017. news-please: A\\ngeneric news crawler and extractor. In Proceedings\\nof the 15th International Symposium of Information\\nScience .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Dan Hendrycks and Kevin Gimpel. 2016. Gaus-\\nsian error linear units (gelus). arXiv preprint\\narXiv:1606.08415 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Matthew Honnibal and Ines Montani. 2017. spaCy 2:\\nNatural language understanding with Bloom embed-\\ndings, convolutional neural networks and incremen-\\ntal parsing. To appear. Jeremy Howard and Sebastian Ruder. 2018. Universal\\nlanguage model fine-tuning for text classification. 
arXiv preprint arXiv:1801.06146 . Shankar Iyer, Nikhil Dandekar, and Kornl Cser-\\nnai. 2016. First quora dataset release: Question\\npairs.https://data.quora.com/First-\\nQuora-Dataset-Release-Question-\\nPairs . Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by repre-\\nsenting and predicting spans. arXiv preprint\\narXiv:1907.10529 . Diederik Kingma and Jimmy Ba. 2015. Adam: A\\nmethod for stochastic optimization. In International\\nConference on Learning Representations (ICLR) .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu,\\nYordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema\\nchallenge. arXiv preprint arXiv:1905.06290 . Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,\\nand Eduard Hovy. 2017. Race: Large-scale reading\\ncomprehension dataset from examinations. arXiv\\npreprint arXiv:1704.04683 . Guillaume Lample and Alexis Conneau. 2019. Cross-\\nlingual language model pretraining. arXiv preprint\\narXiv:1901.07291 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Hector J Levesque, Ernest Davis, and Leora Morgen-\\nstern. 2011. The Winograd schema challenge. In\\nAAAI Spring Symposium: Logical Formalizations of\\nCommonsense Reasoning . Xiaodong Liu, Pengcheng He, Weizhu Chen, and\\nJianfeng Gao. 2019a. Improving multi-task deep\\nneural networks via knowledge distillation for\\nnatural language understanding. arXiv preprint\\narXiv:1904.09482 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-\\nfeng Gao. 2019b. Multi-task deep neural networks\\nfor natural language understanding. arXiv preprint\\narXiv:1901.11504 . Bryan McCann, James Bradbury, Caiming Xiong, and\\nRichard Socher. 2017. Learned in translation: Con-\\ntextualized word vectors. In Advances in Neural In-\\nformation Processing Systems (NIPS) , pages 6297–\\n6308.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Paulius Micikevicius, Sharan Narang, Jonah Alben,\\nGregory Diamos, Erich Elsen, David Garcia, Boris\\nGinsburg, Michael Houston, Oleksii Kuchaiev,\\nGanesh Venkatesh, and Hao Wu. 2018. Mixed preci-\\nsion training. In International Conference on Learn-\\ning Representations . Sebastian Nagel. 2016. Cc-news. 
http:\\n//web.archive.org/save/http:\\n//commoncrawl.org/2016/10/news-\\ndataset-available .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Myle Ott, Sergey Edunov, Alexei Baevski, Angela\\nFan, Sam Gross, Nathan Ng, David Grangier, and\\nMichael Auli. 2019. FAIRSEQ : A fast, exten-\\nsible toolkit for sequence modeling. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL): System Demonstrations .Myle Ott, Sergey Edunov, David Grangier, and\\nMichael Auli. 2018. Scaling neural machine trans-\\nlation. In Proceedings of the Third Conference on\\nMachine Translation (WMT) . Adam Paszke, Sam Gross, Soumith Chintala, Gre-\\ngory Chanan, Edward Yang, Zachary DeVito, Zem-\\ning Lin, Alban Desmaison, Luca Antiga, and Adam\\nLerer. 2017. Automatic differentiation in PyTorch. InNIPS Autodiff Workshop . Matthew Peters, Mark Neumann, Mohit Iyyer, Matt\\nGardner, Christopher Clark, Kenton Lee, and Luke\\nZettlemoyer. 2018. Deep contextualized word repre-\\nsentations. In North American Association for Com-\\nputational Linguistics (NAACL) . Alec Radford, Karthik Narasimhan, Time Salimans,\\nand Ilya Sutskever. 2018. Improving language un-\\nderstanding with unsupervised learning. Technical\\nreport, OpenAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\nDario Amodei, and Ilya Sutskever. 2019. Language\\nmodels are unsupervised multitask learners. Techni-\\ncal report, OpenAI.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable ques-\\ntions for squad. In Association for Computational\\nLinguistics (ACL) . Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and\\nPercy Liang. 2016. SQuAD: 100,000+ questions for\\nmachine comprehension of text. In Empirical Meth-\\nods in Natural Language Processing (EMNLP) . Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with\\nsubword units. In Association for Computational\\nLinguistics (ACL) , pages 1715–1725. Richard Socher, Alex Perelygin, Jean Wu, Jason\\nChuang, Christopher D Manning, Andrew Ng, and\\nChristopher Potts. 2013. Recursive deep models\\nfor semantic compositionality over a sentiment tree-\\nbank. In Empirical Methods in Natural Language\\nProcessing (EMNLP) . Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and\\nTie-Yan Liu. 2019. MASS: Masked sequence\\nto sequence pre-training for language generation. InInternational Conference on Machine Learning\\n(ICML) .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun\\nFeng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxi-\\nang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: En-\\nhanced representation through knowledge integra-\\ntion. arXiv preprint arXiv:1904.09223 . Trieu H Trinh and Quoc V Le. 2018. A simple\\nmethod for commonsense reasoning. arXiv preprint\\narXiv:1806.02847 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob\\nUszkoreit, Llion Jones, Aidan N Gomez, Łukasz\\nKaiser, and Illia Polosukhin. 2017. 
Attention is all\\nyou need. In Advances in neural information pro-\\ncessing systems . Alex Wang, Yada Pruksachatkun, Nikita Nangia,\\nAmanpreet Singh, Julian Michael, Felix Hill, Omer\\nLevy, and Samuel R.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Bowman. 2019a. SuperGLUE:\\nA stickier benchmark for general-purpose language\\nunderstanding systems.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"arXiv preprint 1905.00537 . Alex Wang, Amanpreet Singh, Julian Michael, Felix\\nHill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis plat-\\nform for natural language understanding. In Inter-\\nnational Conference on Learning Representations\\n(ICLR) . Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-\\nman. 2018. Neural network acceptability judg-\\nments. arXiv preprint 1805.12471 . Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen-\\ntence understanding through inference. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL) . Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-\\nbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretrain-\\ning for language understanding. arXiv preprint\\narXiv:1906.08237 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,\\nJames Demmel, and Cho-Jui Hsieh. 2019. Reduc-\\ning bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962 . Rowan Zellers, Ari Holtzman, Hannah Rashkin,\\nYonatan Bisk, Ali Farhadi, Franziska Roesner, and\\nYejin Choi. 2019. Defending against neural fake\\nnews. arXiv preprint arXiv:1905.12616 .\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [13ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan\\nSalakhutdinov, Raquel Urtasun, Antonio Torralba,\\nand Sanja Fidler. 2015. Aligning books and movies:\\nTowards story-like visual explanations by watch-\\ning movies and reading books. In arXiv preprint\\narXiv:1506.06724 . Appendix for “RoBERTa: A Robustly\\nOptimized BERT Pretraining Approach”\\nA Full results on GLUE\\nIn Table 8we present the full set of development\\nset results for RoBERTa. We present results for\\naLARGE configuration that follows BERT LARGE ,\\nas well as a BASE configuration that follows\\nBERT BASE.B Pretraining Hyperparameters\\nTable 9describes the hyperparameters for pre-\\ntraining of RoBERTa LARGE and RoBERTa BASE\\nC Finetuning Hyperparameters\\nFinetuning hyperparameters for RACE, SQuAD\\nand GLUE are given in Table 10. We select the\\nbest hyperparameter values based on the median\\nof 5 random seeds for each task. 
MNLI QNLI QQP RTE SST MRPC CoLA STS\\nRoBERTa BASE\\n+ all data + 500k steps 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2\\nRoBERTa LARGE\\nwith B OOKS + W IKI 89.0 93.9 91.9 84.5 95.3 90.2 66.3 91.6\\n+ additional data ( §3.2) 89.3 94.0 92.0 82.7 95.6 91.4 66.1 92.2\\n+ pretrain longer 300k 90.0 94.5 92.2 83.3 96.1 91.1 67.4 92.3\\n+ pretrain longer 500k 90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4\\nTable 8: Development set results on GLUE tasks for various co nfigurations of RoBERTa. Hyperparam RoBERTa LARGE RoBERTa BASE\\nNumber of Layers 24 12\\nHidden size 1024 768\\nFFN inner hidden size 4096 3072\\nAttention heads 16 12\\nAttention head size 64 64\\nDropout 0.1 0.1\\nAttention Dropout 0.1 0.1\\nWarmup Steps 30k 24k\\nPeak Learning Rate 4e-4 6e-4\\nBatch Size 8k 8k\\nWeight Decay 0.01 0.01\\nMax Steps 500k 500k\\nLearning Rate Decay Linear Linear\\nAdamǫ 1e-6 1e-6\\nAdamβ1 0.9 0.9\\nAdamβ2 0.98 0.98\\nGradient Clipping 0.0 0.0\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE. Hyperparam RACE SQuAD GLUE\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }\\nBatch Size 16 48 {16, 32}\\nWeight Decay 0.1 0.01 0.1\\nMax Epochs 4 2 10\\nLearning Rate Decay Linear Linear Linear\\nWarmup ratio 0.06 0.06 0.06\\nTable 10: Hyperparameters for finetuning RoBERTa LARGE on RACE, SQuAD and GLUE.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Allen School of Computer Science & Engineering,\\nUniversity of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-\\nparison between different approaches is chal-\\nlenging. Training is computationally expen-\\nsive, often done on private datasets of different\\nsizes, and, as we will show, hyperparameter\\nchoices have significant impact on the final re-\\nsults. We present a replication study of BERT\\npretraining ( Devlin et al. ,2019 ) that carefully\\nmeasures the impact of many key hyperparam-\\neters and training data size. We find that BERT\\nwas significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. These results highlight the impor-\\ntance of previously overlooked design choices,\\nand raise questions about the source of re-\\ncently reported improvements. We release our\\nmodels and code.1\\n1 Introduction\\nSelf-training methods such as ELMo ( Peters et al. ,\\n2018 ), GPT ( Radford et al. ,2018 ), BERT\\n(Devlin et al. ,2019 ), XLM ( Lample and Conneau ,\\n2019 ), and XLNet ( Yang et al. ,2019 ) have\\nbrought significant performance gains, but it can\\nbe challenging to determine which aspects of\\nthe methods contribute the most. 
Training is\\ncomputationally expensive, limiting the amount\\nof tuning that can be done, and is often done with\\nprivate training data of varying sizes, limiting\\nour ability to measure the effects of the modeling\\nadvances. ∗Equal contribution. 1Our models and code are available at:\\nhttps://github.com/pytorch/fairseqWe present a replication study of BERT pre-\\ntraining ( Devlin et al. ,2019 ), which includes a\\ncareful evaluation of the effects of hyperparmeter\\ntuning and training set size. We find that BERT\\nwas significantly undertrained and propose an im-\\nproved recipe for training BERT models, which\\nwe call RoBERTa, that can match or exceed the\\nperformance of all of the post-BERT methods. Our modifications are simple, they include: (1)\\ntraining the model longer, with bigger batches,\\nover more data; (2) removing the next sentence\\nprediction objective; (3) training on longer se-\\nquences; and (4) dynamically changing the mask-\\ning pattern applied to the training data. We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects. When controlling for training data, our im-\\nproved training procedure improves upon the pub-\\nlished BERT results on both GLUE and SQuAD. When trained for longer over additional data, our\\nmodel achieves a score of 88.5 on the public\\nGLUE leaderboard, matching the 88.4 reported\\nbyYang et al. (2019 ). Our model establishes a\\nnew state-of-the-art on 4/9 of the GLUE tasks:\\nMNLI, QNLI, RTE and STS-B. We also match\\nstate-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked lan-\\nguage model training objective is competitive\\nwith other recently proposed training objectives\\nsuch as perturbed autoregressive language model-\\ning (Yang et al. ,2019 ).2\\nIn summary, the contributions of this paper\\nare: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work. alternatives that lead to better downstream task\\nperformance; (2) We use a novel dataset, CC-\\nNEWS, and confirm that using more data for pre-\\ntraining further improves performance on down-\\nstream tasks; (3) Our training improvements show\\nthat masked language model pretraining, under\\nthe right design choices, is competitive with all\\nother recently published methods. We release our\\nmodel, pretraining and fine-tuning code imple-\\nmented in PyTorch ( Paszke et al. ,2017 ). 2 Background\\nIn this section, we give a brief overview of the\\nBERT ( Devlin et al. ,2019 ) pretraining approach\\nand some of the training choices that we will ex-\\namine experimentally in the following section. 2.1 Setup\\nBERT takes as input a concatenation of two\\nsegments (sequences of tokens), x1,...,x N\\nandy1,...,yM. Segments usually consist of\\nmore than one natural sentence. The two seg-\\nments are presented as a single input sequence\\nto BERT with special tokens delimiting them:\\n[CLS],x1,...,x N,[SEP],y1,...,yM,[EOS]. MandNare constrained such that M+N < T ,\\nwhereTis a parameter that controls the maximum\\nsequence length during training. The model is first pretrained on a large unla-\\nbeled text corpus and subsequently finetuned us-\\ning end-task labeled data. 2.2 Architecture\\nBERT uses the now ubiquitous transformer archi-\\ntecture ( Vaswani et al. 
,2017 ), which we will not\\nreview in detail. We use a transformer architecture\\nwithLlayers. Each block uses Aself-attention\\nheads and hidden dimension H. 2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction. Masked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. The MLM objective is a cross-entropy\\nloss on predicting the masked tokens. BERT uni-\\nformly selects 15% of the input tokens for possi-\\nble replacement. Of the selected tokens, 80% are\\nreplaced with [MASK], 10% are left unchanged,and 10% are replaced by a randomly selected vo-\\ncabulary token. In the original implementation, random mask-\\ning and replacement is performed once in the be-\\nginning and saved for the duration of training, al-\\nthough in practice, data is duplicated so the mask\\nis not always the same for every training sentence\\n(see Section 4.1). Next Sentence Prediction (NSP) NSP is a bi-\\nnary classification loss for predicting whether two\\nsegments follow each other in the original text. Positive examples are created by taking consecu-\\ntive sentences from the text corpus. Negative ex-\\namples are created by pairing segments from dif-\\nferent documents. Positive and negative examples\\nare sampled with equal probability. The NSP objective was designed to improve\\nperformance on downstream tasks, such as Natural\\nLanguage Inference ( Bowman et al. ,2015 ), which\\nrequire reasoning about the relationships between\\npairs of sentences. 2.4 Optimization\\nBERT is optimized with Adam ( Kingma and Ba ,\\n2015 ) using the following parameters: β1= 0.9,\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\ncay of0.01. The learning rate is warmed up\\nover the first 10,000 steps to a peak value of\\n1e-4, and then linearly decayed. BERT trains\\nwith a dropout of 0.1 on all layers and at-\\ntention weights, and a GELU activation func-\\ntion ( Hendrycks and Gimpel ,2016 ). Models are\\npretrained for S=1,000,000 updates, with mini-\\nbatches containing B=256 sequences of maxi-\\nmum length T=512 tokens. 2.5 Data\\nBERT is trained on a combination of B OOK COR-\\nPUS (Zhu et al. ,2015 ) plus English W IKIPEDIA ,\\nwhich totals 16GB of uncompressed text.3\\n3 Experimental Setup\\nIn this section, we describe the experimental setup\\nfor our replication study of BERT. 3.1 Implementation\\nWe reimplement BERT in FAIRSEQ (Ott et al. ,\\n2019 ). We primarily follow the original BERT\\n3Yang et al. (2019 ) use the same dataset but report having\\nonly 13GB of text after data cleaning. This is most likely due\\nto subtle differences in cleaning of the Wikipedia data. optimization hyperparameters, given in Section 2,\\nexcept for the peak learning rate and number of\\nwarmup steps, which are tuned separately for each\\nsetting. We additionally found training to be very\\nsensitive to the Adam epsilon term, and in some\\ncases we obtained better performance or improved\\nstability after tuning it. Similarly, we found setting\\nβ2= 0.98to improve stability when training with\\nlarge batch sizes. We pretrain with sequences of at most T= 512\\ntokens. Unlike Devlin et al. (2019 ), we do not ran-\\ndomly inject short sequences, and we do not train\\nwith a reduced sequence length for the first 90% of\\nupdates. We train only with full-length sequences. 
We train with mixed precision floating point\\narithmetic on DGX-1 machines, each with 8 ×\\n32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ). 3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. (2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. Several efforts\\nhave trained on datasets larger and more diverse\\nthan the original BERT ( Radford et al.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",2019 ;\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\nnately, not all of the additional datasets can be\\npublicly released. For our study, we focus on gath-\\nering as much data as possible for experimenta-\\ntion, allowing us to match the overall quality and\\nquantity of data as appropriate for each compari-\\nson. We consider five English-language corpora of\\nvarying sizes and domains, totaling over 160GB\\nof uncompressed text. We use the following text\\ncorpora:\\n•BOOK CORPUS (Zhu et al. ,2015 ) plus English\\nWIKIPEDIA . This is the original data used to\\ntrain BERT. (16GB). •CC-N EWS, which we collected from the En-\\nglish portion of the CommonCrawl News\\ndataset ( Nagel ,2016 ). The data contains 63\\nmillion English news articles crawled between\\nSeptember 2016 and February 2019. (76GB af-\\nter filtering).4\\n•OPENWEBTEXT (Gokaslan and Cohen ,2019 ),\\nan open-source recreation of the WebText cor-\\n4We usenews-please (Hamborg et al. ,2017 ) to col-\\nlect and extract CC-N EWS. CC-N EWS is similar to the R E-\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\nis web content extracted from URLs shared on\\nReddit with at least three upvotes. (38GB).5\\n•STORIES , a dataset introduced in Trinh and Le\\n(2018 ) containing a subset of CommonCrawl\\ndata filtered to match the story-like style of\\nWinograd schemas. (31GB). 3.3 Evaluation\\nFollowing previous work, we evaluate our pre-\\ntrained models on downstream tasks using the fol-\\nlowing three benchmarks. GLUE The General Language Understand-\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\n2019b ) is a collection of 9 datasets for evaluating\\nnatural language understanding systems.6Tasks\\nare framed as either single-sentence classification\\nor sentence-pair classification tasks. The GLUE\\norganizers provide training and development data\\nsplits as well as a submission server and leader-\\nboard that allows participants to evaluate and com-\\npare their systems on private held-out test data. For the replication study in Section 4, we report\\nresults on the development sets after finetuning\\nthe pretrained models on the corresponding single-\\ntask training data (i.e., without multi-task training\\nor ensembling). Our finetuning procedure follows\\nthe original BERT paper ( Devlin et al. ,2019 ). In Section 5we additionally report test set re-\\nsults obtained from the public leaderboard. These\\nresults depend on a several task-specific modifica-\\ntions, which we describe in Section 5.1. SQuAD The Stanford Question Answering\\nDataset (SQuAD) provides a paragraph of context\\nand a question. The task is to answer the question\\nby extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1\\nand V2.0 ( Rajpurkar et al. ,2016 ,2018 ). 
In V1.1\\nthe context always contains an answer, whereas in\\n5The authors and their affiliated institutions are not in any\\nway affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA ( Warstadt et al. ,2018 ),\\nStanford Sentiment Treebank (SST) ( Socher et al. ,\\n2013 ), Microsoft Research Paragraph Corpus\\n(MRPC) ( Dolan and Brockett ,2005 ), Semantic Tex-\\ntual Similarity Benchmark (STS) ( Agirre et al. ,2007 ),\\nQuora Question Pairs (QQP) ( Iyer et al. ,2016 ), Multi-\\nGenre NLI (MNLI) ( Williams et al. ,2018 ), Question NLI\\n(QNLI) ( Rajpurkar et al. ,2016 ), Recognizing Textual\\nEntailment (RTE) ( Dagan et al. ,2006 ;Bar-Haim et al. ,\\n2006 ;Giampiccolo et al. ,2007 ;Bentivogli et al. ,2009 ) and\\nWinograd NLI (WNLI) ( Levesque et al. ,2011 ). V2.0 some questions are not answered in the pro-\\nvided context, making the task more challenging. For SQuAD V1.1 we adopt the same span pre-\\ndiction method as BERT ( Devlin et al. ,2019 ). For\\nSQuAD V2.0, we add an additional binary classi-\\nfier to predict whether the question is answerable,\\nwhich we train jointly by summing the classifica-\\ntion and span loss terms. During evaluation, we\\nonly predict span indices on pairs that are classi-\\nfied as answerable. RACE The ReAding Comprehension from Ex-\\naminations (RACE) ( Lai et al. ,2017 ) task is a\\nlarge-scale reading comprehension dataset with\\nmore than 28,000 passages and nearly 100,000\\nquestions. The dataset is collected from English\\nexaminations in China, which are designed for\\nmiddle and high school students. In RACE, each\\npassage is associated with multiple questions. For\\nevery question, the task is to select one correct an-\\nswer from four options. RACE has significantly\\nlonger context than other popular reading compre-\\nhension datasets and the proportion of questions\\nthat requires reasoning is very large. 4 Training Procedure Analysis\\nThis section explores and quantifies which choices\\nare important for successfully pretraining BERT\\nmodels. We keep the model architecture fixed.7\\nSpecifically, we begin by training BERT models\\nwith the same configuration as BERT BASE (L=\\n12,H= 768 ,A= 12 , 110M params). 4.1 Static vs. Dynamic Masking\\nAs discussed in Section 2, BERT relies on ran-\\ndomly masking and predicting tokens. The orig-\\ninal BERT implementation performed masking\\nonce during data preprocessing, resulting in a sin-\\nglestatic mask. To avoid using the same mask for\\neach training instance in every epoch, training data\\nwas duplicated 10 times so that each sequence is\\nmasked in 10 different ways over the 40 epochs of\\ntraining. Thus, each training sequence was seen\\nwith the same mask four times during training. We compare this strategy with dynamic mask-\\ningwhere we generate the masking pattern every\\ntime we feed a sequence to the model. This be-\\ncomes crucial when pretraining for more steps or\\nwith larger datasets. 7Studying architectural changes, including larger archi-\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2\\nreference 76.3 84.3 92.8\\nOur reimplementation:\\nstatic 78.3 84.3 92.5\\ndynamic 78.7 84.0 92.9\\nTable 1: Comparison between static and dynamic\\nmasking for BERT BASE. We report F1 for SQuAD and\\naccuracy for MNLI-m and SST-2. Reported results are\\nmedians over 5 random initializations (seeds). Refer-\\nence results are from Yang et al. (2019 ). Results Table 1compares the published\\nBERT BASE results from Devlin et al. 
(2019 ) to our\\nreimplementation with either static or dynamic\\nmasking. We find that our reimplementation\\nwith static masking performs similar to the\\noriginal BERT model, and dynamic masking is\\ncomparable or slightly better than static masking. Given these results and the additional efficiency\\nbenefits of dynamic masking, we use dynamic\\nmasking in the remainder of the experiments. 4.2 Model Input Format and Next Sentence\\nPrediction\\nIn the original BERT pretraining procedure, the\\nmodel observes two concatenated document seg-\\nments, which are either sampled contiguously\\nfrom the same document (with p= 0.5) or from\\ndistinct documents. In addition to the masked lan-\\nguage modeling objective, the model is trained to\\npredict whether the observed document segments\\ncome from the same or distinct documents via an\\nauxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an impor-\\ntant factor in training the original BERT model. Devlin et al. (2019 ) observe that removing NSP\\nhurts performance, with significant performance\\ndegradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the\\nnecessity of the NSP loss ( Lample and Conneau ,\\n2019 ;Yang et al.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",2019 ;Joshi et al. ,2019 ). To better understand this discrepancy, we com-\\npare several alternative training formats:\\n•SEGMENT -PAIR +NSP: This follows the original\\ninput format used in BERT ( Devlin et al. ,2019 ),\\nwith the NSP loss. Each input has a pair of seg-\\nments, which can each contain multiple natural\\nsentences, but the total combined length must\\nbe less than 512 tokens. Model SQuAD 1.1/2.0 MNLI-m SST-2 RACE\\nOur reimplementation (with NSP loss):\\nSEGMENT -PAIR 90.4/78.7 84.0 92.9 64.2\\nSENTENCE -PAIR 88.7/76.2 82.9 92.1 63.0\\nOur reimplementation (without NSP loss):\\nFULL -SENTENCES 90.4/79.1 84.7 92.5 64.8\\nDOC-SENTENCES 90.6/79.7 84.7 92.7 65.6\\nBERT BASE 88.5/76.3 84.3 92.8 64.3\\nXLNet BASE (K = 7) –/81.3 85.8 92.7 66.1\\nXLNet BASE (K = 6) –/81.0 85.6 93.4 66.7\\nTable 2: Development set results for base models pretrained over B OOK CORPUS and W IKIPEDIA . All models are\\ntrained for 1M steps with a batch size of 256 sequences. We rep ort F1 for SQuAD and accuracy for MNLI-m,\\nSST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT BASEand\\nXLNet BASEare from Yang et al. (2019 ). •SENTENCE -PAIR +NSP: Each input contains a\\npair of natural sentences , either sampled from\\na contiguous portion of one document or from\\nseparate documents. Since these inputs are sig-\\nnificantly shorter than 512 tokens, we increase\\nthe batch size so that the total number of tokens\\nremains similar to SEGMENT -PAIR +NSP. We re-\\ntain the NSP loss. •FULL -SENTENCES : Each input is packed with\\nfull sentences sampled contiguously from one\\nor more documents, such that the total length is\\nat most 512 tokens. Inputs may cross document\\nboundaries. When we reach the end of one doc-\\nument, we begin sampling sentences from the\\nnext document and add an extra separator token\\nbetween documents. We remove the NSP loss. •DOC-SENTENCES : Inputs are constructed sim-\\nilarly to FULL -SENTENCES , except that they\\nmay not cross document boundaries. 
Inputs\\nsampled near the end of a document may be\\nshorter than 512 tokens, so we dynamically in-\\ncrease the batch size in these cases to achieve\\na similar number of total tokens as FULL -\\nSENTENCES .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"We remove the NSP loss. Results Table 2shows results for the four dif-\\nferent settings. We first compare the original\\nSEGMENT -PAIR input format from Devlin et al. (2019 ) to the SENTENCE -PAIR format; both for-\\nmats retain the NSP loss, but the latter uses sin-\\ngle sentences. We find that using individual\\nsentences hurts performance on downstream\\ntasks , which we hypothesize is because the model\\nis not able to learn long-range dependencies.We next compare training without the NSP\\nloss and training with blocks of text from a sin-\\ngle document ( DOC-SENTENCES ). We find that\\nthis setting outperforms the originally published\\nBERT BASEresults and that removing the NSP loss\\nmatches or slightly improves downstream task\\nperformance , in contrast to Devlin et al. (2019 ). It is possible that the original BERT implementa-\\ntion may only have removed the loss term while\\nstill retaining the SEGMENT -PAIR input format. Finally we find that restricting sequences to\\ncome from a single document ( DOC-SENTENCES )\\nperforms slightly better than packing sequences\\nfrom multiple documents ( FULL -SENTENCES ). However, because the DOC-SENTENCES format\\nresults in variable batch sizes, we use FULL -\\nSENTENCES in the remainder of our experiments\\nfor easier comparison with related work. 4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased\\nappropriately ( Ott et al. ,2018 ). Recent work has\\nshown that BERT is also amenable to large batch\\ntraining ( You et al. ,2019 ). Devlin et al. (2019 ) originally trained\\nBERT BASE for 1M steps with a batch size of\\n256 sequences. This is equivalent in computa-\\ntional cost, via gradient accumulation, to training\\nfor 125K steps with a batch size of 2K sequences,\\nor for 31K steps with a batch size of 8K. In Table 3we compare perplexity and end- bsz steps lr ppl MNLI-m SST-2\\n256 1M 1e-4 3.99 84.7 92.7\\n2K 125K 7e-4 3.68 85.2 92.9\\n8K 31K 1e-3 3.77 84.6 92.8\\nTable 3: Perplexity on held-out training data ( ppl) and\\ndevelopment set accuracy for base models trained over\\nBOOK CORPUS and W IKIPEDIA with varying batch\\nsizes ( bsz).\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"We tune the learning rate ( lr) for each set-\\nting. Models make the same number of passes over the\\ndata (epochs) and have the same computational cost. task performance of BERT BASE as we increase the\\nbatch size, controlling for the number of passes\\nthrough the training data. We observe that train-\\ning with large batches improves perplexity for the\\nmasked language modeling objective, as well as\\nend-task accuracy. Large batches are also easier to\\nparallelize via distributed data parallel training,8\\nand in later experiments we train with batches of\\n8K sequences. Notably You et al. 
(2019 ) train BERT with even\\nlarger batche sizes, up to 32K sequences. We leave\\nfurther exploration of the limits of large batch\\ntraining to future work. 4.4 Text Encoding\\nByte-Pair Encoding (BPE) ( Sennrich et al. ,2016 )\\nis a hybrid between character- and word-level rep-\\nresentations that allows handling the large vocab-\\nularies common in natural language corpora. In-\\nstead of full words, BPE relies on subwords units,\\nwhich are extracted by performing statistical anal-\\nysis of the training corpus. BPE vocabulary sizes typically range from\\n10K-100K subword units. However, unicode char-\\nacters can account for a sizeable portion of this\\nvocabulary when modeling large and diverse cor-\\npora, such as the ones considered in this work. Radford et al. (2019 ) introduce a clever imple-\\nmentation of BPE that uses bytes instead of uni-\\ncode characters as the base subword units. Using\\nbytes makes it possible to learn a subword vocab-\\nulary of a modest size (50K units) that can still en-\\ncode any input text without introducing any “un-\\nknown” tokens. 8Large batch training can improve training efficiency even\\nwithout large scale parallel hardware through gradient ac-\\ncumulation , whereby gradients from multiple mini-batches\\nare accumulated locally before each optimization step. Thi s\\nfunctionality is supported natively in FAIRSEQ (Ott et al. ,\\n2019 ).The original BERT implementa-\\ntion ( Devlin et al. ,2019 ) uses a character-level\\nBPE vocabulary of size 30K, which is learned\\nafter preprocessing the input with heuristic tok-\\nenization rules. Following Radford et al. (2019 ),\\nwe instead consider training BERT with a larger\\nbyte-level BPE vocabulary containing 50K sub-\\nword units, without any additional preprocessing\\nor tokenization of the input. This adds approxi-\\nmately 15M and 20M additional parameters for\\nBERT BASEand BERT LARGE , respectively. Early experiments revealed only slight dif-\\nferences between these encodings, with the\\nRadford et al. (2019 ) BPE achieving slightly\\nworse end-task performance on some tasks. Nev-\\nertheless, we believe the advantages of a univer-\\nsal encoding scheme outweighs the minor degre-\\ndation in performance and use this encoding in\\nthe remainder of our experiments. A more de-\\ntailed comparison of these encodings is left to fu-\\nture work. 5 RoBERTa\\nIn the previous section we propose modifications\\nto the BERT pretraining procedure that improve\\nend-task performance. We now aggregate these\\nimprovements and evaluate their combined im-\\npact. We call this configuration RoBERTa for\\nRobustly optimized BERT approach. Specifi-\\ncally, RoBERTa is trained with dynamic mask-\\ning (Section 4.1),FULL -SENTENCES without NSP\\nloss (Section 4.2), large mini-batches (Section 4.3)\\nand a larger byte-level BPE (Section 4.4). Additionally, we investigate two other impor-\\ntant factors that have been under-emphasized in\\nprevious work: (1) the data used for pretraining,\\nand (2) the number of training passes through the\\ndata. For example, the recently proposed XLNet\\narchitecture ( Yang et al. ,2019 ) is pretrained us-\\ning nearly 10 times more data than the original\\nBERT ( Devlin et al. ,2019 ). It is also trained with\\na batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT. 
To help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa\\nfollowing the BERT LARGE architecture ( L= 24 ,\\nH= 1024 ,A= 16 , 355M parameters). We\\npretrain for 100K steps over a comparable B OOK -\\nCORPUS plus W IKIPEDIA dataset as was used in Model data bsz stepsSQuADMNLI-m SST-2(v1.1/2.0)\\nRoBERTa\\nwith B OOKS + W IKI 16GB 8K 100K 93.6/87.3 89.0 95.3\\n+ additional data ( §3.2) 160GB 8K 100K 94.0/87.7 89.3 95.6\\n+ pretrain longer 160GB 8K 300K 94.4/88.7 90.0 96.1\\n+ pretrain even longer 160GB 8K 500K 94.6/89.4 90.2 96.4\\nBERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\\nfor longer (100K →300K→500K steps).\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Each row accumulates improvements from the row s above. RoBERTa\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from\\nDevlin et al. (2019 ) and Yang et al. (2019 ), respectively. Complete results on all GLUE tasks can be fo und in the\\nAppendix.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Devlin et al. (2019 ). We pretrain our model using\\n1024 V100 GPUs for approximately one day. Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4. Next, we combine this data with the three ad-\\nditional datasets described in Section 3.2. We\\ntrain RoBERTa over the combined data with the\\nsame number of training steps as before (100K). In total, we pretrain over 160GB of text. We ob-\\nserve further improvements in performance across\\nall downstream tasks, validating the importance of\\ndata size and diversity in pretraining.9\\nFinally, we pretrain RoBERTa for significantly\\nlonger, increasing the number of pretraining steps\\nfrom 100K to 300K, and then further to 500K. We\\nagain observe significant gains in downstream task\\nperformance, and the 300K and 500K step mod-\\nels outperform XLNet LARGE across most tasks. We\\nnote that even our longest-trained model does not\\nappear to overfit our data and would likely benefit\\nfrom additional training. In the rest of the paper, we evaluate our best\\nRoBERTa model on the three different bench-\\nmarks: GLUE, SQuaD and RACE. Specifically\\n9Our experiments conflate increases in data size and di-\\nversity. We leave a more careful analysis of these two dimen-\\nsions to future work.we consider RoBERTa trained for 500K steps over\\nall five of the datasets introduced in Section 3.2. 5.1 GLUE Results\\nFor GLUE we consider two finetuning settings. In the first setting ( single-task, dev ) we finetune\\nRoBERTa separately for each of the GLUE tasks,\\nusing only the training data for the correspond-\\ning task. 
We consider a limited hyperparameter\\nsweep for each task, with batch sizes ∈ {16,32}\\nand learning rates ∈ {1e−5,2e−5,3e−5}, with a\\nlinear warmup for the first 6% of steps followed by\\na linear decay to 0. We finetune for 10 epochs and\\nperform early stopping based on each task’s eval-\\nuation metric on the dev set. The rest of the hyper-\\nparameters remain the same as during pretraining. In this setting, we report the median development\\nset results for each task over five random initial-\\nizations, without model ensembling. In the second setting ( ensembles, test ), we com-\\npare RoBERTa to other approaches on the test set\\nvia the GLUE leaderboard. While many submis-\\nsions to the GLUE leaderboard depend on multi-\\ntask finetuning, our submission depends only on\\nsingle-task finetuning . For RTE, STS and MRPC\\nwe found it helpful to finetune starting from the\\nMNLI single-task model, rather than the baseline\\npretrained RoBERTa. We explore a slightly wider\\nhyperparameter space, described in the Appendix,\\nand ensemble between 5 and 7 models per task. MNLI QNLI QQP RTE SST MRPC CoLA STS WNLI Avg\\nSingle-task single models on dev\\nBERT LARGE 86.6/- 92.3 91.3 70.4 93.2 88.0 60.6 90.0 - -\\nXLNet LARGE 89.8/- 93.9 91.8 83.8 95.6 89.2 63.6 91.8 - -\\nRoBERTa 90.2/90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4 91.3 -\\nEnsembles on test (from leaderboard as of July 25, 2019)\\nALICE 88.2/87.9 95.7 90.7 83.5 95.2 92.6 68.6 91.1 80.8 86.3\\nMT-DNN 87.9/87.4 96.0 89.9 86.3 96.5 92.7 68.4 91.1 89.0 87.6\\nXLNet 90.2/89.8 98.6 90.3 86.3 96.8 93.0 67.8 91.6 90.4 88.4\\nRoBERTa 90.8/90.2 98.9 90.2 88.2 96.7 92.3 67.8 92.2 89.0 88.5\\nTable 5: Results on GLUE. All results are based on a 24-layer a rchitecture. BERT LARGE and XLNet LARGE results\\nare from Devlin et al. (2019 ) and Yang et al. (2019 ), respectively. RoBERTa results on the development set are a\\nmedian over five runs. RoBERTa results on the test set are ense mbles of single-task models. For RTE, STS and\\nMRPC we finetune starting from the MNLI model instead of the ba seline pretrained model. Averages are obtained\\nfrom the GLUE leaderboard. Task-specific modifications Two of the GLUE\\ntasks require task-specific finetuning approaches\\nto achieve competitive leaderboard results. QNLI : Recent submissions on the GLUE\\nleaderboard adopt a pairwise ranking formulation\\nfor the QNLI task, in which candidate answers\\nare mined from the training set and compared to\\none another, and a single (question, candidate)\\npair is classified as positive ( Liu et al.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",2019b ,a;\\nYang et al. ,2019 ). This formulation significantly\\nsimplifies the task, but is not directly comparable\\nto BERT ( Devlin et al. ,2019 ). Following recent\\nwork, we adopt the ranking approach for our test\\nsubmission, but for direct comparison with BERT\\nwe report development set results based on a pure\\nclassification approach. WNLI : We found the provided NLI-format\\ndata to be challenging to work with. Instead\\nwe use the reformatted WNLI data from Super-\\nGLUE ( Wang et al. ,2019a ), which indicates the\\nspan of the query pronoun and referent. We fine-\\ntune RoBERTa using the margin ranking loss from\\nKocijan et al. (2019 ). 
For a given input sentence,\\nwe use spaCy ( Honnibal and Montani ,2017 ) to\\nextract additional candidate noun phrases from the\\nsentence and finetune our model so that it assigns\\nhigher scores to positive referent phrases than for\\nany of the generated negative candidate phrases. One unfortunate consequence of this formulation\\nis that we can only make use of the positive train-\\ning examples, which excludes over half of the pro-\\nvided training examples.10\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\nfirst setting ( single-task, dev ), RoBERTa achieves\\nstate-of-the-art results on all 9 of the GLUE\\ntask development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-\\ntraining objective, compared to more mundane de-\\ntails like dataset size and training time that we ex-\\nplore in this work. In the second setting ( ensembles, test ), we\\nsubmit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-\\nther improve these results by incorporating more\\nsophisticated multi-task finetuning procedures. 5.2 SQuAD Results\\nWe adopt a much simpler approach for SQuAD\\ncompared to past work. In particular, while\\nboth BERT ( Devlin et al. ,2019 ) and XL-\\nNet ( Yang et al. ,2019 ) augment their training data\\nwith additional QA datasets, we only finetune\\nRoBERTa using the provided SQuAD training\\ndata .Yang et al. (2019 ) also employed a custom\\nlayer-wise learning rate schedule to finetune\\nresults could potentially be improved by augmenting this wi th\\nadditional pronoun disambiguation datasets. ModelSQuAD 1.1 SQuAD 2.0\\nEM F1 EM F1\\nSingle models on dev, w/o data augmentation\\nBERT LARGE 84.1 90.9 79.0 81.8\\nXLNet LARGE 89.0 94.5 86.1 88.8\\nRoBERTa 88.9 94.6 86.5 89.4\\nSingle models on test (as of July 25, 2019)\\nXLNet LARGE 86.3†89.1†\\nRoBERTa 86.8 89.8\\nXLNet + SG-Net Verifier 87.0†89.9†\\nTable 6: Results on SQuAD. †indicates results that de-\\npend on additional external training data. RoBERTa\\nuses only the provided SQuAD data in both dev and\\ntest settings. BERT LARGE and XLNet LARGE results are\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-\\nspectively. XLNet, while we use the same learning rate for\\nall layers. For SQuAD v1.1 we follow the same finetun-\\ning procedure as Devlin et al. (2019 ). For SQuAD\\nv2.0, we additionally classify whether a given\\nquestion is answerable; we train this classifier\\njointly with the span predictor by summing the\\nclassification and span loss terms. Results We present our results in Table 6. On\\nthe SQuAD v1.1 development set, RoBERTa\\nmatches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. 
,2019 ), both of which rely on\\nadditional external training data. In contrast, our\\nsubmission does not use any additional data. Our single RoBERTa model outperforms all but\\none of the single model submissions, and is the\\ntop scoring system among those that do not rely\\non data augmentation. 5.3 RACE Results\\nIn RACE, systems are provided with a passage of\\ntext, an associated question, and four candidate an-\\nswers. Systems are required to classify which of\\nthe four candidate answers is correct. We modify RoBERTa for this task by concate-Model Accuracy Middle High\\nSingle models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ). nating each candidate answer with the correspond-\\ning question and passage. We then encode each of\\nthese four sequences and pass the resulting [CLS]\\nrepresentations through a fully-connected layer,\\nwhich is used to predict the correct answer. We\\ntruncate question-answer pairs that are longer than\\n128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens. Results on the RACE test sets are presented in\\nTable 7. RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings. 6 Related Work\\nPretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;\\nLample and Conneau ,2019 ). Many recent\\npapers have used a basic recipe of finetuning\\nmodels for each end task ( Howard and Ruder ,\\n2018 ;Radford et al. ,2018 ), and pretraining\\nwith some variant of a masked language model\\nobjective. However, newer methods have\\nimproved performance by multi-task fine tun-\\ning ( Dong et al. ,2019 ), incorporating entity\\nembeddings ( Sun et al. ,2019 ), span predic-\\ntion ( Joshi et al. ,2019 ), and multiple variants\\nof autoregressive pretraining ( Song et al. ,2019 ;\\nChan et al. ,2019 ;Yang et al. ,2019 ). Perfor-\\nmance is also typically improved by training\\nbigger models on more data ( Devlin et al.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \",\\n2019 ;Baevski et al. ,2019 ;Yang et al. ,2019 ;\\nRadford et al. ,2019 ). Our goal was to replicate,\\nsimplify, and better tune the training of BERT,\\nas a reference point for better understanding the\\nrelative performance of all of these methods. 7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-\\nquences; and dynamically changing the masking\\npattern applied to the training data. Our improved\\npretraining procedure, which we call RoBERTa,\\nachieves state-of-the-art results on GLUE, RACE\\nand SQuAD, without multi-task finetuning for\\nGLUE or additional data for SQuAD. 
These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives. We additionally use a novel dataset,\\nCC-N EWS, and release our models and\\ncode for pretraining and finetuning at:\\nhttps://github.com/pytorch/fairseq . References\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\ntowski, editors. 2007. Proceedings of the Fourth\\nInternational Workshop on Semantic Evaluations\\n(SemEval-2007) . Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke\\nZettlemoyer, and Michael Auli. 2019. Cloze-\\ndriven pretraining of self-attention networks. arXiv\\npreprint arXiv:1903.07785 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,\\nDanilo Giampiccolo, Bernardo Magnini, and Idan\\nSzpektor. 2006. The second PASCAL recognising\\ntextual entailment challenge. In Proceedings of the\\nsecond PASCAL challenges workshop on recognis-\\ning textual entailment . Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo\\nGiampiccolo, and Bernardo Magnini. 2009. The\\nfifth PASCAL recognizing textual entailment chal-\\nlenge. Samuel R Bowman, Gabor Angeli, Christopher Potts,\\nand Christopher D Manning. 2015. A large anno-\\ntated corpus for learning natural language inference. InEmpirical Methods in Natural Language Process-\\ning (EMNLP) .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"William Chan, Nikita Kitaev, Kelvin Guu, Mitchell\\nStern, and Jakob Uszkoreit. 2019. KERMIT: Gener-\\native insertion-based modeling for sequences.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"arXiv\\npreprint arXiv:1906.01604 .Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment\\nchallenge. In Machine learning challenges. evalu-\\nating predictive uncertainty, visual object classifica-\\ntion, and recognising tectual entailment . Andrew M Dai and Quoc V Le. 2015. Semi-supervised\\nsequence learning. In Advances in Neural Informa-\\ntion Processing Systems (NIPS) . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. In North American Association for Com-\\nputational Linguistics (NAACL) . William B Dolan and Chris Brockett. 2005. Auto-\\nmatically constructing a corpus of sentential para-\\nphrases. In Proceedings of the International Work-\\nshop on Paraphrasing . Li Dong, Nan Yang, Wenhui Wang, Furu Wei,\\nXiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. 
The third PASCAL recog-\\nnizing textual entailment challenge. In Proceedings\\nof the ACL-PASCAL workshop on textual entailment\\nand paraphrasing . Aaron Gokaslan and Vanya Cohen. 2019. Openweb-\\ntext corpus. http://web.archive.org/\\nsave/http://Skylion007.github.io/\\nOpenWebTextCorpus .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Felix Hamborg, Norman Meuschke, Corinna Bre-\\nitinger, and Bela Gipp. 2017. news-please: A\\ngeneric news crawler and extractor. In Proceedings\\nof the 15th International Symposium of Information\\nScience .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Dan Hendrycks and Kevin Gimpel. 2016. Gaus-\\nsian error linear units (gelus). arXiv preprint\\narXiv:1606.08415 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Matthew Honnibal and Ines Montani. 2017. spaCy 2:\\nNatural language understanding with Bloom embed-\\ndings, convolutional neural networks and incremen-\\ntal parsing. To appear. Jeremy Howard and Sebastian Ruder. 2018. Universal\\nlanguage model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 . Shankar Iyer, Nikhil Dandekar, and Kornl Cser-\\nnai. 2016. First quora dataset release: Question\\npairs.https://data.quora.com/First-\\nQuora-Dataset-Release-Question-\\nPairs . Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by repre-\\nsenting and predicting spans. arXiv preprint\\narXiv:1907.10529 . Diederik Kingma and Jimmy Ba. 2015. Adam: A\\nmethod for stochastic optimization. In International\\nConference on Learning Representations (ICLR) .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu,\\nYordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema\\nchallenge. arXiv preprint arXiv:1905.06290 . Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,\\nand Eduard Hovy. 2017. Race: Large-scale reading\\ncomprehension dataset from examinations. arXiv\\npreprint arXiv:1704.04683 . Guillaume Lample and Alexis Conneau. 2019. Cross-\\nlingual language model pretraining. arXiv preprint\\narXiv:1901.07291 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Hector J Levesque, Ernest Davis, and Leora Morgen-\\nstern. 2011. The Winograd schema challenge. In\\nAAAI Spring Symposium: Logical Formalizations of\\nCommonsense Reasoning . Xiaodong Liu, Pengcheng He, Weizhu Chen, and\\nJianfeng Gao. 2019a. 
Improving multi-task deep\\nneural networks via knowledge distillation for\\nnatural language understanding. arXiv preprint\\narXiv:1904.09482 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-\\nfeng Gao. 2019b. Multi-task deep neural networks\\nfor natural language understanding. arXiv preprint\\narXiv:1901.11504 . Bryan McCann, James Bradbury, Caiming Xiong, and\\nRichard Socher. 2017. Learned in translation: Con-\\ntextualized word vectors. In Advances in Neural In-\\nformation Processing Systems (NIPS) , pages 6297–\\n6308.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Paulius Micikevicius, Sharan Narang, Jonah Alben,\\nGregory Diamos, Erich Elsen, David Garcia, Boris\\nGinsburg, Michael Houston, Oleksii Kuchaiev,\\nGanesh Venkatesh, and Hao Wu. 2018. Mixed preci-\\nsion training. In International Conference on Learn-\\ning Representations . Sebastian Nagel. 2016. Cc-news. http:\\n//web.archive.org/save/http:\\n//commoncrawl.org/2016/10/news-\\ndataset-available .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Myle Ott, Sergey Edunov, Alexei Baevski, Angela\\nFan, Sam Gross, Nathan Ng, David Grangier, and\\nMichael Auli. 2019. FAIRSEQ : A fast, exten-\\nsible toolkit for sequence modeling. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL): System Demonstrations .Myle Ott, Sergey Edunov, David Grangier, and\\nMichael Auli. 2018. Scaling neural machine trans-\\nlation. In Proceedings of the Third Conference on\\nMachine Translation (WMT) . Adam Paszke, Sam Gross, Soumith Chintala, Gre-\\ngory Chanan, Edward Yang, Zachary DeVito, Zem-\\ning Lin, Alban Desmaison, Luca Antiga, and Adam\\nLerer. 2017. Automatic differentiation in PyTorch. InNIPS Autodiff Workshop . Matthew Peters, Mark Neumann, Mohit Iyyer, Matt\\nGardner, Christopher Clark, Kenton Lee, and Luke\\nZettlemoyer. 2018. Deep contextualized word repre-\\nsentations. In North American Association for Com-\\nputational Linguistics (NAACL) . Alec Radford, Karthik Narasimhan, Time Salimans,\\nand Ilya Sutskever. 2018. Improving language un-\\nderstanding with unsupervised learning. Technical\\nreport, OpenAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\nDario Amodei, and Ilya Sutskever. 2019. Language\\nmodels are unsupervised multitask learners. Techni-\\ncal report, OpenAI.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable ques-\\ntions for squad. In Association for Computational\\nLinguistics (ACL) . Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and\\nPercy Liang. 2016. SQuAD: 100,000+ questions for\\nmachine comprehension of text. In Empirical Meth-\\nods in Natural Language Processing (EMNLP) . Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with\\nsubword units. 
In Association for Computational\\nLinguistics (ACL) , pages 1715–1725. Richard Socher, Alex Perelygin, Jean Wu, Jason\\nChuang, Christopher D Manning, Andrew Ng, and\\nChristopher Potts. 2013. Recursive deep models\\nfor semantic compositionality over a sentiment tree-\\nbank. In Empirical Methods in Natural Language\\nProcessing (EMNLP) . Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and\\nTie-Yan Liu. 2019. MASS: Masked sequence\\nto sequence pre-training for language generation. InInternational Conference on Machine Learning\\n(ICML) .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun\\nFeng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxi-\\nang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: En-\\nhanced representation through knowledge integra-\\ntion. arXiv preprint arXiv:1904.09223 . Trieu H Trinh and Quoc V Le. 2018. A simple\\nmethod for commonsense reasoning. arXiv preprint\\narXiv:1806.02847 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob\\nUszkoreit, Llion Jones, Aidan N Gomez, Łukasz\\nKaiser, and Illia Polosukhin. 2017. Attention is all\\nyou need. In Advances in neural information pro-\\ncessing systems . Alex Wang, Yada Pruksachatkun, Nikita Nangia,\\nAmanpreet Singh, Julian Michael, Felix Hill, Omer\\nLevy, and Samuel R.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Bowman. 2019a. SuperGLUE:\\nA stickier benchmark for general-purpose language\\nunderstanding systems.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"arXiv preprint 1905.00537 . Alex Wang, Amanpreet Singh, Julian Michael, Felix\\nHill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis plat-\\nform for natural language understanding. In Inter-\\nnational Conference on Learning Representations\\n(ICLR) . Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-\\nman. 2018. Neural network acceptability judg-\\nments. arXiv preprint 1805.12471 . Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen-\\ntence understanding through inference. In North\\nAmerican Association for Computational Linguis-\\ntics (NAACL) . Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-\\nbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretrain-\\ning for language understanding. arXiv preprint\\narXiv:1906.08237 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,\\nJames Demmel, and Cho-Jui Hsieh. 2019. Reduc-\\ning bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962 . Rowan Zellers, Ari Holtzman, Hannah Rashkin,\\nYonatan Bisk, Ali Farhadi, Franziska Roesner, and\\nYejin Choi. 2019. Defending against neural fake\\nnews. 
arXiv preprint arXiv:1905.12616 .\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"doc\": {\n", " \"doc\": \"Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan\\nSalakhutdinov, Raquel Urtasun, Antonio Torralba,\\nand Sanja Fidler. 2015. Aligning books and movies:\\nTowards story-like visual explanations by watch-\\ning movies and reading books. In arXiv preprint\\narXiv:1506.06724 . Appendix for “RoBERTa: A Robustly\\nOptimized BERT Pretraining Approach”\\nA Full results on GLUE\\nIn Table 8we present the full set of development\\nset results for RoBERTa. We present results for\\naLARGE configuration that follows BERT LARGE ,\\nas well as a BASE configuration that follows\\nBERT BASE.B Pretraining Hyperparameters\\nTable 9describes the hyperparameters for pre-\\ntraining of RoBERTa LARGE and RoBERTa BASE\\nC Finetuning Hyperparameters\\nFinetuning hyperparameters for RACE, SQuAD\\nand GLUE are given in Table 10. We select the\\nbest hyperparameter values based on the median\\nof 5 random seeds for each task. MNLI QNLI QQP RTE SST MRPC CoLA STS\\nRoBERTa BASE\\n+ all data + 500k steps 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2\\nRoBERTa LARGE\\nwith B OOKS + W IKI 89.0 93.9 91.9 84.5 95.3 90.2 66.3 91.6\\n+ additional data ( §3.2) 89.3 94.0 92.0 82.7 95.6 91.4 66.1 92.2\\n+ pretrain longer 300k 90.0 94.5 92.2 83.3 96.1 91.1 67.4 92.3\\n+ pretrain longer 500k 90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4\\nTable 8: Development set results on GLUE tasks for various co nfigurations of RoBERTa. Hyperparam RoBERTa LARGE RoBERTa BASE\\nNumber of Layers 24 12\\nHidden size 1024 768\\nFFN inner hidden size 4096 3072\\nAttention heads 16 12\\nAttention head size 64 64\\nDropout 0.1 0.1\\nAttention Dropout 0.1 0.1\\nWarmup Steps 30k 24k\\nPeak Learning Rate 4e-4 6e-4\\nBatch Size 8k 8k\\nWeight Decay 0.01 0.01\\nMax Steps 500k 500k\\nLearning Rate Decay Linear Linear\\nAdamǫ 1e-6 1e-6\\nAdamβ1 0.9 0.9\\nAdamβ2 0.98 0.98\\nGradient Clipping 0.0 0.0\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE. 
Hyperparam RACE SQuAD GLUE\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }\\nBatch Size 16 48 {16, 32}\\nWeight Decay 0.1 0.01 0.1\\nMax Epochs 4 2 10\\nLearning Rate Decay Linear Linear Linear\\nWarmup ratio 0.06 0.06 0.06\\nTable 10: Hyperparameters for finetuning RoBERTa LARGE on RACE, SQuAD and GLUE.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [6ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", 
"\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [5ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\\\n†Paul G.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Allen School of Computer Science & Engineering,\\\\nUniversity of Washington, Seattle, WA\\\\n{mandar90,lsz }@cs.washington.edu\\\\n§Facebook AI\\\\n{yinhanliu,myleott,naman,jingfeidu,\\\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\\\nAbstract\\\\nLanguage model pretraining has led to sig-\\\\nnificant performance gains but careful com-\\\\nparison between different approaches is chal-\\\\nlenging. Training is computationally expen-\\\\nsive, often done on private datasets of different\\\\nsizes, and, as we will show, hyperparameter\\\\nchoices have significant impact on the final re-\\\\nsults. We present a replication study of BERT\\\\npretraining ( Devlin et al. 
,2019 ) that carefully\\\\nmeasures the impact of many key hyperparam-\\\\neters and training data size. We find that BERT\\\\nwas significantly undertrained, and can match\\\\nor exceed the performance of every model\\\\npublished after it. Our best model achieves\\\\nstate-of-the-art results on GLUE, RACE and\\\\nSQuAD. These results highlight the impor-\\\\ntance of previously overlooked design choices,\\\\nand raise questions about the source of re-\\\\ncently reported improvements. We release our\\\\nmodels and code.1\\\\n1 Introduction\\\\nSelf-training methods such as ELMo ( Peters et al. ,\\\\n2018 ), GPT ( Radford et al. ,2018 ), BERT\\\\n(Devlin et al. ,2019 ), XLM ( Lample and Conneau ,\\\\n2019 ), and XLNet ( Yang et al. ,2019 ) have\\\\nbrought significant performance gains, but it can\\\\nbe challenging to determine which aspects of\\\\nthe methods contribute the most. Training is\\\\ncomputationally expensive, limiting the amount\\\\nof tuning that can be done, and is often done with\\\\nprivate training data of varying sizes, limiting\\\\nour ability to measure the effects of the modeling\\\\nadvances. ∗Equal contribution. 1Our models and code are available at:\\\\nhttps://github.com/pytorch/fairseqWe present a replication study of BERT pre-\\\\ntraining ( Devlin et al. ,2019 ), which includes a\\\\ncareful evaluation of the effects of hyperparmeter\\\\ntuning and training set size. We find that BERT\\\\nwas significantly undertrained and propose an im-\\\\nproved recipe for training BERT models, which\\\\nwe call RoBERTa, that can match or exceed the\\\\nperformance of all of the post-BERT methods. Our modifications are simple, they include: (1)\\\\ntraining the model longer, with bigger batches,\\\\nover more data; (2) removing the next sentence\\\\nprediction objective; (3) training on longer se-\\\\nquences; and (4) dynamically changing the mask-\\\\ning pattern applied to the training data. We also\\\\ncollect a large new dataset (CC-N EWS) of compa-\\\\nrable size to other privately used datasets, to better\\\\ncontrol for training set size effects. When controlling for training data, our im-\\\\nproved training procedure improves upon the pub-\\\\nlished BERT results on both GLUE and SQuAD. When trained for longer over additional data, our\\\\nmodel achieves a score of 88.5 on the public\\\\nGLUE leaderboard, matching the 88.4 reported\\\\nbyYang et al. (2019 ). Our model establishes a\\\\nnew state-of-the-art on 4/9 of the GLUE tasks:\\\\nMNLI, QNLI, RTE and STS-B. We also match\\\\nstate-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked lan-\\\\nguage model training objective is competitive\\\\nwith other recently proposed training objectives\\\\nsuch as perturbed autoregressive language model-\\\\ning (Yang et al. ,2019 ).2\\\\nIn summary, the contributions of this paper\\\\nare: (1) We present a set of important BERT de-\\\\nsign choices and training strategies and introduce\\\\n2It is possible that these other methods could also improve\\\\nwith more tuning. We leave this exploration to future work. alternatives that lead to better downstream task\\\\nperformance; (2) We use a novel dataset, CC-\\\\nNEWS, and confirm that using more data for pre-\\\\ntraining further improves performance on down-\\\\nstream tasks; (3) Our training improvements show\\\\nthat masked language model pretraining, under\\\\nthe right design choices, is competitive with all\\\\nother recently published methods. 
We release our\\\\nmodel, pretraining and fine-tuning code imple-\\\\nmented in PyTorch ( Paszke et al. ,2017 ). 2 Background\\\\nIn this section, we give a brief overview of the\\\\nBERT ( Devlin et al. ,2019 ) pretraining approach\\\\nand some of the training choices that we will ex-\\\\namine experimentally in the following section. 2.1 Setup\\\\nBERT takes as input a concatenation of two\\\\nsegments (sequences of tokens), x1,...,x N\\\\nandy1,...,yM. Segments usually consist of\\\\nmore than one natural sentence. The two seg-\\\\nments are presented as a single input sequence\\\\nto BERT with special tokens delimiting them:\\\\n[CLS],x1,...,x N,[SEP],y1,...,yM,[EOS]. MandNare constrained such that M+N < T ,\\\\nwhereTis a parameter that controls the maximum\\\\nsequence length during training. The model is first pretrained on a large unla-\\\\nbeled text corpus and subsequently finetuned us-\\\\ning end-task labeled data. 2.2 Architecture\\\\nBERT uses the now ubiquitous transformer archi-\\\\ntecture ( Vaswani et al. ,2017 ), which we will not\\\\nreview in detail. We use a transformer architecture\\\\nwithLlayers. Each block uses Aself-attention\\\\nheads and hidden dimension H. 2.3 Training Objectives\\\\nDuring pretraining, BERT uses two objectives:\\\\nmasked language modeling and next sentence pre-\\\\ndiction. Masked Language Model (MLM) A random\\\\nsample of the tokens in the input sequence is\\\\nselected and replaced with the special token\\\\n[MASK]. The MLM objective is a cross-entropy\\\\nloss on predicting the masked tokens. BERT uni-\\\\nformly selects 15% of the input tokens for possi-\\\\nble replacement. Of the selected tokens, 80% are\\\\nreplaced with [MASK], 10% are left unchanged,and 10% are replaced by a randomly selected vo-\\\\ncabulary token. In the original implementation, random mask-\\\\ning and replacement is performed once in the be-\\\\nginning and saved for the duration of training, al-\\\\nthough in practice, data is duplicated so the mask\\\\nis not always the same for every training sentence\\\\n(see Section 4.1). Next Sentence Prediction (NSP) NSP is a bi-\\\\nnary classification loss for predicting whether two\\\\nsegments follow each other in the original text. Positive examples are created by taking consecu-\\\\ntive sentences from the text corpus. Negative ex-\\\\namples are created by pairing segments from dif-\\\\nferent documents. Positive and negative examples\\\\nare sampled with equal probability. The NSP objective was designed to improve\\\\nperformance on downstream tasks, such as Natural\\\\nLanguage Inference ( Bowman et al. ,2015 ), which\\\\nrequire reasoning about the relationships between\\\\npairs of sentences. 2.4 Optimization\\\\nBERT is optimized with Adam ( Kingma and Ba ,\\\\n2015 ) using the following parameters: β1= 0.9,\\\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\\\ncay of0.01. The learning rate is warmed up\\\\nover the first 10,000 steps to a peak value of\\\\n1e-4, and then linearly decayed. BERT trains\\\\nwith a dropout of 0.1 on all layers and at-\\\\ntention weights, and a GELU activation func-\\\\ntion ( Hendrycks and Gimpel ,2016 ). Models are\\\\npretrained for S=1,000,000 updates, with mini-\\\\nbatches containing B=256 sequences of maxi-\\\\nmum length T=512 tokens. 2.5 Data\\\\nBERT is trained on a combination of B OOK COR-\\\\nPUS (Zhu et al. 
,2015 ) plus English W IKIPEDIA ,\\\\nwhich totals 16GB of uncompressed text.3\\\\n3 Experimental Setup\\\\nIn this section, we describe the experimental setup\\\\nfor our replication study of BERT. 3.1 Implementation\\\\nWe reimplement BERT in FAIRSEQ (Ott et al. ,\\\\n2019 ). We primarily follow the original BERT\\\\n3Yang et al. (2019 ) use the same dataset but report having\\\\nonly 13GB of text after data cleaning. This is most likely due\\\\nto subtle differences in cleaning of the Wikipedia data. optimization hyperparameters, given in Section 2,\\\\nexcept for the peak learning rate and number of\\\\nwarmup steps, which are tuned separately for each\\\\nsetting. We additionally found training to be very\\\\nsensitive to the Adam epsilon term, and in some\\\\ncases we obtained better performance or improved\\\\nstability after tuning it. Similarly, we found setting\\\\nβ2= 0.98to improve stability when training with\\\\nlarge batch sizes. We pretrain with sequences of at most T= 512\\\\ntokens. Unlike Devlin et al. (2019 ), we do not ran-\\\\ndomly inject short sequences, and we do not train\\\\nwith a reduced sequence length for the first 90% of\\\\nupdates. We train only with full-length sequences. We train with mixed precision floating point\\\\narithmetic on DGX-1 machines, each with 8 ×\\\\n32GB Nvidia V100 GPUs interconnected by In-\\\\nfiniband ( Micikevicius et al. ,2018 ). 3.2 Data\\\\nBERT-style pretraining crucially relies on large\\\\nquantities of text. Baevski et al. (2019 ) demon-\\\\nstrate that increasing data size can result in im-\\\\nproved end-task performance. Several efforts\\\\nhave trained on datasets larger and more diverse\\\\nthan the original BERT ( Radford et al.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': ',2019 ;\\\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\\\nnately, not all of the additional datasets can be\\\\npublicly released. For our study, we focus on gath-\\\\nering as much data as possible for experimenta-\\\\ntion, allowing us to match the overall quality and\\\\nquantity of data as appropriate for each compari-\\\\nson. We consider five English-language corpora of\\\\nvarying sizes and domains, totaling over 160GB\\\\nof uncompressed text. We use the following text\\\\ncorpora:\\\\n•BOOK CORPUS (Zhu et al. ,2015 ) plus English\\\\nWIKIPEDIA . This is the original data used to\\\\ntrain BERT. (16GB). •CC-N EWS, which we collected from the En-\\\\nglish portion of the CommonCrawl News\\\\ndataset ( Nagel ,2016 ). The data contains 63\\\\nmillion English news articles crawled between\\\\nSeptember 2016 and February 2019. (76GB af-\\\\nter filtering).4\\\\n•OPENWEBTEXT (Gokaslan and Cohen ,2019 ),\\\\nan open-source recreation of the WebText cor-\\\\n4We usenews-please (Hamborg et al. ,2017 ) to col-\\\\nlect and extract CC-N EWS. CC-N EWS is similar to the R E-\\\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\\\nis web content extracted from URLs shared on\\\\nReddit with at least three upvotes. 
(38GB).5\\\\n•STORIES , a dataset introduced in Trinh and Le\\\\n(2018 ) containing a subset of CommonCrawl\\\\ndata filtered to match the story-like style of\\\\nWinograd schemas. (31GB). 3.3 Evaluation\\\\nFollowing previous work, we evaluate our pre-\\\\ntrained models on downstream tasks using the fol-\\\\nlowing three benchmarks. GLUE The General Language Understand-\\\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\\\n2019b ) is a collection of 9 datasets for evaluating\\\\nnatural language understanding systems.6Tasks\\\\nare framed as either single-sentence classification\\\\nor sentence-pair classification tasks. The GLUE\\\\norganizers provide training and development data\\\\nsplits as well as a submission server and leader-\\\\nboard that allows participants to evaluate and com-\\\\npare their systems on private held-out test data. For the replication study in Section 4, we report\\\\nresults on the development sets after finetuning\\\\nthe pretrained models on the corresponding single-\\\\ntask training data (i.e., without multi-task training\\\\nor ensembling). Our finetuning procedure follows\\\\nthe original BERT paper ( Devlin et al. ,2019 ). In Section 5we additionally report test set re-\\\\nsults obtained from the public leaderboard. These\\\\nresults depend on a several task-specific modifica-\\\\ntions, which we describe in Section 5.1. SQuAD The Stanford Question Answering\\\\nDataset (SQuAD) provides a paragraph of context\\\\nand a question. The task is to answer the question\\\\nby extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1\\\\nand V2.0 ( Rajpurkar et al. ,2016 ,2018 ). In V1.1\\\\nthe context always contains an answer, whereas in\\\\n5The authors and their affiliated institutions are not in any\\\\nway affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA ( Warstadt et al. ,2018 ),\\\\nStanford Sentiment Treebank (SST) ( Socher et al. ,\\\\n2013 ), Microsoft Research Paragraph Corpus\\\\n(MRPC) ( Dolan and Brockett ,2005 ), Semantic Tex-\\\\ntual Similarity Benchmark (STS) ( Agirre et al. ,2007 ),\\\\nQuora Question Pairs (QQP) ( Iyer et al. ,2016 ), Multi-\\\\nGenre NLI (MNLI) ( Williams et al. ,2018 ), Question NLI\\\\n(QNLI) ( Rajpurkar et al. ,2016 ), Recognizing Textual\\\\nEntailment (RTE) ( Dagan et al. ,2006 ;Bar-Haim et al. ,\\\\n2006 ;Giampiccolo et al. ,2007 ;Bentivogli et al. ,2009 ) and\\\\nWinograd NLI (WNLI) ( Levesque et al. ,2011 ). V2.0 some questions are not answered in the pro-\\\\nvided context, making the task more challenging. For SQuAD V1.1 we adopt the same span pre-\\\\ndiction method as BERT ( Devlin et al. ,2019 ). For\\\\nSQuAD V2.0, we add an additional binary classi-\\\\nfier to predict whether the question is answerable,\\\\nwhich we train jointly by summing the classifica-\\\\ntion and span loss terms. During evaluation, we\\\\nonly predict span indices on pairs that are classi-\\\\nfied as answerable. RACE The ReAding Comprehension from Ex-\\\\naminations (RACE) ( Lai et al. ,2017 ) task is a\\\\nlarge-scale reading comprehension dataset with\\\\nmore than 28,000 passages and nearly 100,000\\\\nquestions. The dataset is collected from English\\\\nexaminations in China, which are designed for\\\\nmiddle and high school students. In RACE, each\\\\npassage is associated with multiple questions. For\\\\nevery question, the task is to select one correct an-\\\\nswer from four options. 
RACE has significantly\\\\nlonger context than other popular reading compre-\\\\nhension datasets and the proportion of questions\\\\nthat requires reasoning is very large. 4 Training Procedure Analysis\\\\nThis section explores and quantifies which choices\\\\nare important for successfully pretraining BERT\\\\nmodels. We keep the model architecture fixed.7\\\\nSpecifically, we begin by training BERT models\\\\nwith the same configuration as BERT BASE (L=\\\\n12,H= 768 ,A= 12 , 110M params). 4.1 Static vs. Dynamic Masking\\\\nAs discussed in Section 2, BERT relies on ran-\\\\ndomly masking and predicting tokens. The orig-\\\\ninal BERT implementation performed masking\\\\nonce during data preprocessing, resulting in a sin-\\\\nglestatic mask. To avoid using the same mask for\\\\neach training instance in every epoch, training data\\\\nwas duplicated 10 times so that each sequence is\\\\nmasked in 10 different ways over the 40 epochs of\\\\ntraining. Thus, each training sequence was seen\\\\nwith the same mask four times during training. We compare this strategy with dynamic mask-\\\\ningwhere we generate the masking pattern every\\\\ntime we feed a sequence to the model. This be-\\\\ncomes crucial when pretraining for more steps or\\\\nwith larger datasets. 7Studying architectural changes, including larger archi-\\\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2\\\\nreference 76.3 84.3 92.8\\\\nOur reimplementation:\\\\nstatic 78.3 84.3 92.5\\\\ndynamic 78.7 84.0 92.9\\\\nTable 1: Comparison between static and dynamic\\\\nmasking for BERT BASE. We report F1 for SQuAD and\\\\naccuracy for MNLI-m and SST-2. Reported results are\\\\nmedians over 5 random initializations (seeds). Refer-\\\\nence results are from Yang et al. (2019 ). Results Table 1compares the published\\\\nBERT BASE results from Devlin et al. (2019 ) to our\\\\nreimplementation with either static or dynamic\\\\nmasking. We find that our reimplementation\\\\nwith static masking performs similar to the\\\\noriginal BERT model, and dynamic masking is\\\\ncomparable or slightly better than static masking. Given these results and the additional efficiency\\\\nbenefits of dynamic masking, we use dynamic\\\\nmasking in the remainder of the experiments. 4.2 Model Input Format and Next Sentence\\\\nPrediction\\\\nIn the original BERT pretraining procedure, the\\\\nmodel observes two concatenated document seg-\\\\nments, which are either sampled contiguously\\\\nfrom the same document (with p= 0.5) or from\\\\ndistinct documents. In addition to the masked lan-\\\\nguage modeling objective, the model is trained to\\\\npredict whether the observed document segments\\\\ncome from the same or distinct documents via an\\\\nauxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an impor-\\\\ntant factor in training the original BERT model. Devlin et al. (2019 ) observe that removing NSP\\\\nhurts performance, with significant performance\\\\ndegradation on QNLI, MNLI, and SQuAD 1.1. 
However, some recent work has questioned the\\\\nnecessity of the NSP loss ( Lample and Conneau ,\\\\n2019 ;Yang et al.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': ',2019 ;Joshi et al. ,2019 ). To better understand this discrepancy, we com-\\\\npare several alternative training formats:\\\\n•SEGMENT -PAIR +NSP: This follows the original\\\\ninput format used in BERT ( Devlin et al. ,2019 ),\\\\nwith the NSP loss. Each input has a pair of seg-\\\\nments, which can each contain multiple natural\\\\nsentences, but the total combined length must\\\\nbe less than 512 tokens. Model SQuAD 1.1/2.0 MNLI-m SST-2 RACE\\\\nOur reimplementation (with NSP loss):\\\\nSEGMENT -PAIR 90.4/78.7 84.0 92.9 64.2\\\\nSENTENCE -PAIR 88.7/76.2 82.9 92.1 63.0\\\\nOur reimplementation (without NSP loss):\\\\nFULL -SENTENCES 90.4/79.1 84.7 92.5 64.8\\\\nDOC-SENTENCES 90.6/79.7 84.7 92.7 65.6\\\\nBERT BASE 88.5/76.3 84.3 92.8 64.3\\\\nXLNet BASE (K = 7) –/81.3 85.8 92.7 66.1\\\\nXLNet BASE (K = 6) –/81.0 85.6 93.4 66.7\\\\nTable 2: Development set results for base models pretrained over B OOK CORPUS and W IKIPEDIA . All models are\\\\ntrained for 1M steps with a batch size of 256 sequences. We rep ort F1 for SQuAD and accuracy for MNLI-m,\\\\nSST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT BASEand\\\\nXLNet BASEare from Yang et al. (2019 ). •SENTENCE -PAIR +NSP: Each input contains a\\\\npair of natural sentences , either sampled from\\\\na contiguous portion of one document or from\\\\nseparate documents. Since these inputs are sig-\\\\nnificantly shorter than 512 tokens, we increase\\\\nthe batch size so that the total number of tokens\\\\nremains similar to SEGMENT -PAIR +NSP. We re-\\\\ntain the NSP loss. •FULL -SENTENCES : Each input is packed with\\\\nfull sentences sampled contiguously from one\\\\nor more documents, such that the total length is\\\\nat most 512 tokens. Inputs may cross document\\\\nboundaries. When we reach the end of one doc-\\\\nument, we begin sampling sentences from the\\\\nnext document and add an extra separator token\\\\nbetween documents. We remove the NSP loss. •DOC-SENTENCES : Inputs are constructed sim-\\\\nilarly to FULL -SENTENCES , except that they\\\\nmay not cross document boundaries. Inputs\\\\nsampled near the end of a document may be\\\\nshorter than 512 tokens, so we dynamically in-\\\\ncrease the batch size in these cases to achieve\\\\na similar number of total tokens as FULL -\\\\nSENTENCES .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'We remove the NSP loss. 
Results Table 2shows results for the four dif-\\\\nferent settings. We first compare the original\\\\nSEGMENT -PAIR input format from Devlin et al. (2019 ) to the SENTENCE -PAIR format; both for-\\\\nmats retain the NSP loss, but the latter uses sin-\\\\ngle sentences. We find that using individual\\\\nsentences hurts performance on downstream\\\\ntasks , which we hypothesize is because the model\\\\nis not able to learn long-range dependencies.We next compare training without the NSP\\\\nloss and training with blocks of text from a sin-\\\\ngle document ( DOC-SENTENCES ). We find that\\\\nthis setting outperforms the originally published\\\\nBERT BASEresults and that removing the NSP loss\\\\nmatches or slightly improves downstream task\\\\nperformance , in contrast to Devlin et al. (2019 ). It is possible that the original BERT implementa-\\\\ntion may only have removed the loss term while\\\\nstill retaining the SEGMENT -PAIR input format. Finally we find that restricting sequences to\\\\ncome from a single document ( DOC-SENTENCES )\\\\nperforms slightly better than packing sequences\\\\nfrom multiple documents ( FULL -SENTENCES ). However, because the DOC-SENTENCES format\\\\nresults in variable batch sizes, we use FULL -\\\\nSENTENCES in the remainder of our experiments\\\\nfor easier comparison with related work. 4.3 Training with large batches\\\\nPast work in Neural Machine Translation has\\\\nshown that training with very large mini-batches\\\\ncan both improve optimization speed and end-task\\\\nperformance when the learning rate is increased\\\\nappropriately ( Ott et al. ,2018 ). Recent work has\\\\nshown that BERT is also amenable to large batch\\\\ntraining ( You et al. ,2019 ). Devlin et al. (2019 ) originally trained\\\\nBERT BASE for 1M steps with a batch size of\\\\n256 sequences. This is equivalent in computa-\\\\ntional cost, via gradient accumulation, to training\\\\nfor 125K steps with a batch size of 2K sequences,\\\\nor for 31K steps with a batch size of 8K. In Table 3we compare perplexity and end- bsz steps lr ppl MNLI-m SST-2\\\\n256 1M 1e-4 3.99 84.7 92.7\\\\n2K 125K 7e-4 3.68 85.2 92.9\\\\n8K 31K 1e-3 3.77 84.6 92.8\\\\nTable 3: Perplexity on held-out training data ( ppl) and\\\\ndevelopment set accuracy for base models trained over\\\\nBOOK CORPUS and W IKIPEDIA with varying batch\\\\nsizes ( bsz).'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'We tune the learning rate ( lr) for each set-\\\\nting. Models make the same number of passes over the\\\\ndata (epochs) and have the same computational cost. task performance of BERT BASE as we increase the\\\\nbatch size, controlling for the number of passes\\\\nthrough the training data. We observe that train-\\\\ning with large batches improves perplexity for the\\\\nmasked language modeling objective, as well as\\\\nend-task accuracy. Large batches are also easier to\\\\nparallelize via distributed data parallel training,8\\\\nand in later experiments we train with batches of\\\\n8K sequences. Notably You et al. 
(2019 ) train BERT with even\\\\nlarger batche sizes, up to 32K sequences. We leave\\\\nfurther exploration of the limits of large batch\\\\ntraining to future work. 4.4 Text Encoding\\\\nByte-Pair Encoding (BPE) ( Sennrich et al. ,2016 )\\\\nis a hybrid between character- and word-level rep-\\\\nresentations that allows handling the large vocab-\\\\nularies common in natural language corpora. In-\\\\nstead of full words, BPE relies on subwords units,\\\\nwhich are extracted by performing statistical anal-\\\\nysis of the training corpus. BPE vocabulary sizes typically range from\\\\n10K-100K subword units. However, unicode char-\\\\nacters can account for a sizeable portion of this\\\\nvocabulary when modeling large and diverse cor-\\\\npora, such as the ones considered in this work. Radford et al. (2019 ) introduce a clever imple-\\\\nmentation of BPE that uses bytes instead of uni-\\\\ncode characters as the base subword units. Using\\\\nbytes makes it possible to learn a subword vocab-\\\\nulary of a modest size (50K units) that can still en-\\\\ncode any input text without introducing any “un-\\\\nknown” tokens. 8Large batch training can improve training efficiency even\\\\nwithout large scale parallel hardware through gradient ac-\\\\ncumulation , whereby gradients from multiple mini-batches\\\\nare accumulated locally before each optimization step. Thi s\\\\nfunctionality is supported natively in FAIRSEQ (Ott et al. ,\\\\n2019 ).The original BERT implementa-\\\\ntion ( Devlin et al. ,2019 ) uses a character-level\\\\nBPE vocabulary of size 30K, which is learned\\\\nafter preprocessing the input with heuristic tok-\\\\nenization rules. Following Radford et al. (2019 ),\\\\nwe instead consider training BERT with a larger\\\\nbyte-level BPE vocabulary containing 50K sub-\\\\nword units, without any additional preprocessing\\\\nor tokenization of the input. This adds approxi-\\\\nmately 15M and 20M additional parameters for\\\\nBERT BASEand BERT LARGE , respectively. Early experiments revealed only slight dif-\\\\nferences between these encodings, with the\\\\nRadford et al. (2019 ) BPE achieving slightly\\\\nworse end-task performance on some tasks. Nev-\\\\nertheless, we believe the advantages of a univer-\\\\nsal encoding scheme outweighs the minor degre-\\\\ndation in performance and use this encoding in\\\\nthe remainder of our experiments. A more de-\\\\ntailed comparison of these encodings is left to fu-\\\\nture work. 5 RoBERTa\\\\nIn the previous section we propose modifications\\\\nto the BERT pretraining procedure that improve\\\\nend-task performance. We now aggregate these\\\\nimprovements and evaluate their combined im-\\\\npact. We call this configuration RoBERTa for\\\\nRobustly optimized BERT approach. Specifi-\\\\ncally, RoBERTa is trained with dynamic mask-\\\\ning (Section 4.1),FULL -SENTENCES without NSP\\\\nloss (Section 4.2), large mini-batches (Section 4.3)\\\\nand a larger byte-level BPE (Section 4.4). Additionally, we investigate two other impor-\\\\ntant factors that have been under-emphasized in\\\\nprevious work: (1) the data used for pretraining,\\\\nand (2) the number of training passes through the\\\\ndata. For example, the recently proposed XLNet\\\\narchitecture ( Yang et al. ,2019 ) is pretrained us-\\\\ning nearly 10 times more data than the original\\\\nBERT ( Devlin et al. ,2019 ). 
It is also trained with\\\\na batch size eight times larger for half as many op-\\\\ntimization steps, thus seeing four times as many\\\\nsequences in pretraining compared to BERT. To help disentangle the importance of these fac-\\\\ntors from other modeling choices (e.g., the pre-\\\\ntraining objective), we begin by training RoBERTa\\\\nfollowing the BERT LARGE architecture ( L= 24 ,\\\\nH= 1024 ,A= 16 , 355M parameters). We\\\\npretrain for 100K steps over a comparable B OOK -\\\\nCORPUS plus W IKIPEDIA dataset as was used in Model data bsz stepsSQuADMNLI-m SST-2(v1.1/2.0)\\\\nRoBERTa\\\\nwith B OOKS + W IKI 16GB 8K 100K 93.6/87.3 89.0 95.3\\\\n+ additional data ( §3.2) 160GB 8K 100K 94.0/87.7 89.3 95.6\\\\n+ pretrain longer 160GB 8K 300K 94.4/88.7 90.0 96.1\\\\n+ pretrain even longer 160GB 8K 500K 94.6/89.4 90.2 96.4\\\\nBERT LARGE\\\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\\\nXLNet LARGE\\\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain\\\\nfor longer (100K →300K→500K steps).'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Each row accumulates improvements from the row s above. RoBERTa\\\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from\\\\nDevlin et al. (2019 ) and Yang et al. (2019 ), respectively. Complete results on all GLUE tasks can be fo und in the\\\\nAppendix.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Devlin et al. (2019 ). We pretrain our model using\\\\n1024 V100 GPUs for approximately one day. Results We present our results in Table 4. When\\\\ncontrolling for training data, we observe that\\\\nRoBERTa provides a large improvement over the\\\\noriginally reported BERT LARGE results, reaffirming\\\\nthe importance of the design choices we explored\\\\nin Section 4. Next, we combine this data with the three ad-\\\\nditional datasets described in Section 3.2. We\\\\ntrain RoBERTa over the combined data with the\\\\nsame number of training steps as before (100K). In total, we pretrain over 160GB of text. We ob-\\\\nserve further improvements in performance across\\\\nall downstream tasks, validating the importance of\\\\ndata size and diversity in pretraining.9\\\\nFinally, we pretrain RoBERTa for significantly\\\\nlonger, increasing the number of pretraining steps\\\\nfrom 100K to 300K, and then further to 500K. 
We\\\\nagain observe significant gains in downstream task\\\\nperformance, and the 300K and 500K step mod-\\\\nels outperform XLNet LARGE across most tasks. We\\\\nnote that even our longest-trained model does not\\\\nappear to overfit our data and would likely benefit\\\\nfrom additional training. In the rest of the paper, we evaluate our best\\\\nRoBERTa model on the three different bench-\\\\nmarks: GLUE, SQuaD and RACE. Specifically\\\\n9Our experiments conflate increases in data size and di-\\\\nversity. We leave a more careful analysis of these two dimen-\\\\nsions to future work.we consider RoBERTa trained for 500K steps over\\\\nall five of the datasets introduced in Section 3.2. 5.1 GLUE Results\\\\nFor GLUE we consider two finetuning settings. In the first setting ( single-task, dev ) we finetune\\\\nRoBERTa separately for each of the GLUE tasks,\\\\nusing only the training data for the correspond-\\\\ning task. We consider a limited hyperparameter\\\\nsweep for each task, with batch sizes ∈ {16,32}\\\\nand learning rates ∈ {1e−5,2e−5,3e−5}, with a\\\\nlinear warmup for the first 6% of steps followed by\\\\na linear decay to 0. We finetune for 10 epochs and\\\\nperform early stopping based on each task’s eval-\\\\nuation metric on the dev set. The rest of the hyper-\\\\nparameters remain the same as during pretraining. In this setting, we report the median development\\\\nset results for each task over five random initial-\\\\nizations, without model ensembling. In the second setting ( ensembles, test ), we com-\\\\npare RoBERTa to other approaches on the test set\\\\nvia the GLUE leaderboard. While many submis-\\\\nsions to the GLUE leaderboard depend on multi-\\\\ntask finetuning, our submission depends only on\\\\nsingle-task finetuning . For RTE, STS and MRPC\\\\nwe found it helpful to finetune starting from the\\\\nMNLI single-task model, rather than the baseline\\\\npretrained RoBERTa. We explore a slightly wider\\\\nhyperparameter space, described in the Appendix,\\\\nand ensemble between 5 and 7 models per task. MNLI QNLI QQP RTE SST MRPC CoLA STS WNLI Avg\\\\nSingle-task single models on dev\\\\nBERT LARGE 86.6/- 92.3 91.3 70.4 93.2 88.0 60.6 90.0 - -\\\\nXLNet LARGE 89.8/- 93.9 91.8 83.8 95.6 89.2 63.6 91.8 - -\\\\nRoBERTa 90.2/90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4 91.3 -\\\\nEnsembles on test (from leaderboard as of July 25, 2019)\\\\nALICE 88.2/87.9 95.7 90.7 83.5 95.2 92.6 68.6 91.1 80.8 86.3\\\\nMT-DNN 87.9/87.4 96.0 89.9 86.3 96.5 92.7 68.4 91.1 89.0 87.6\\\\nXLNet 90.2/89.8 98.6 90.3 86.3 96.8 93.0 67.8 91.6 90.4 88.4\\\\nRoBERTa 90.8/90.2 98.9 90.2 88.2 96.7 92.3 67.8 92.2 89.0 88.5\\\\nTable 5: Results on GLUE. All results are based on a 24-layer a rchitecture. BERT LARGE and XLNet LARGE results\\\\nare from Devlin et al. (2019 ) and Yang et al. (2019 ), respectively. RoBERTa results on the development set are a\\\\nmedian over five runs. RoBERTa results on the test set are ense mbles of single-task models. For RTE, STS and\\\\nMRPC we finetune starting from the MNLI model instead of the ba seline pretrained model. Averages are obtained\\\\nfrom the GLUE leaderboard. Task-specific modifications Two of the GLUE\\\\ntasks require task-specific finetuning approaches\\\\nto achieve competitive leaderboard results. 
QNLI : Recent submissions on the GLUE\\\\nleaderboard adopt a pairwise ranking formulation\\\\nfor the QNLI task, in which candidate answers\\\\nare mined from the training set and compared to\\\\none another, and a single (question, candidate)\\\\npair is classified as positive ( Liu et al.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': ',2019b ,a;\\\\nYang et al. ,2019 ). This formulation significantly\\\\nsimplifies the task, but is not directly comparable\\\\nto BERT ( Devlin et al. ,2019 ). Following recent\\\\nwork, we adopt the ranking approach for our test\\\\nsubmission, but for direct comparison with BERT\\\\nwe report development set results based on a pure\\\\nclassification approach. WNLI : We found the provided NLI-format\\\\ndata to be challenging to work with. Instead\\\\nwe use the reformatted WNLI data from Super-\\\\nGLUE ( Wang et al. ,2019a ), which indicates the\\\\nspan of the query pronoun and referent. We fine-\\\\ntune RoBERTa using the margin ranking loss from\\\\nKocijan et al. (2019 ). For a given input sentence,\\\\nwe use spaCy ( Honnibal and Montani ,2017 ) to\\\\nextract additional candidate noun phrases from the\\\\nsentence and finetune our model so that it assigns\\\\nhigher scores to positive referent phrases than for\\\\nany of the generated negative candidate phrases. One unfortunate consequence of this formulation\\\\nis that we can only make use of the positive train-\\\\ning examples, which excludes over half of the pro-\\\\nvided training examples.10\\\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\\\nfirst setting ( single-task, dev ), RoBERTa achieves\\\\nstate-of-the-art results on all 9 of the GLUE\\\\ntask development sets. Crucially, RoBERTa uses\\\\nthe same masked language modeling pretrain-\\\\ning objective and architecture as BERT LARGE , yet\\\\nconsistently outperforms both BERT LARGE and\\\\nXLNet LARGE . This raises questions about the rel-\\\\native importance of model architecture and pre-\\\\ntraining objective, compared to more mundane de-\\\\ntails like dataset size and training time that we ex-\\\\nplore in this work. In the second setting ( ensembles, test ), we\\\\nsubmit RoBERTa to the GLUE leaderboard and\\\\nachieve state-of-the-art results on 4 out of 9 tasks\\\\nand the highest average score to date. This is espe-\\\\ncially exciting because RoBERTa does not depend\\\\non multi-task finetuning, unlike most of the other\\\\ntop submissions. We expect future work may fur-\\\\nther improve these results by incorporating more\\\\nsophisticated multi-task finetuning procedures. 5.2 SQuAD Results\\\\nWe adopt a much simpler approach for SQuAD\\\\ncompared to past work. In particular, while\\\\nboth BERT ( Devlin et al. ,2019 ) and XL-\\\\nNet ( Yang et al. ,2019 ) augment their training data\\\\nwith additional QA datasets, we only finetune\\\\nRoBERTa using the provided SQuAD training\\\\ndata .Yang et al. 
(2019 ) also employed a custom\\\\nlayer-wise learning rate schedule to finetune\\\\nresults could potentially be improved by augmenting this wi th\\\\nadditional pronoun disambiguation datasets. ModelSQuAD 1.1 SQuAD 2.0\\\\nEM F1 EM F1\\\\nSingle models on dev, w/o data augmentation\\\\nBERT LARGE 84.1 90.9 79.0 81.8\\\\nXLNet LARGE 89.0 94.5 86.1 88.8\\\\nRoBERTa 88.9 94.6 86.5 89.4\\\\nSingle models on test (as of July 25, 2019)\\\\nXLNet LARGE 86.3†89.1†\\\\nRoBERTa 86.8 89.8\\\\nXLNet + SG-Net Verifier 87.0†89.9†\\\\nTable 6: Results on SQuAD. †indicates results that de-\\\\npend on additional external training data. RoBERTa\\\\nuses only the provided SQuAD data in both dev and\\\\ntest settings. BERT LARGE and XLNet LARGE results are\\\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-\\\\nspectively. XLNet, while we use the same learning rate for\\\\nall layers. For SQuAD v1.1 we follow the same finetun-\\\\ning procedure as Devlin et al. (2019 ). For SQuAD\\\\nv2.0, we additionally classify whether a given\\\\nquestion is answerable; we train this classifier\\\\njointly with the span predictor by summing the\\\\nclassification and span loss terms. Results We present our results in Table 6. On\\\\nthe SQuAD v1.1 development set, RoBERTa\\\\nmatches the state-of-the-art set by XLNet. On the\\\\nSQuAD v2.0 development set, RoBERTa sets a\\\\nnew state-of-the-art, improving over XLNet by 0.4\\\\npoints (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD\\\\n2.0 leaderboard and evaluate its performance rel-\\\\native to other systems. Most of the top systems\\\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\\\nadditional external training data. In contrast, our\\\\nsubmission does not use any additional data. Our single RoBERTa model outperforms all but\\\\none of the single model submissions, and is the\\\\ntop scoring system among those that do not rely\\\\non data augmentation. 5.3 RACE Results\\\\nIn RACE, systems are provided with a passage of\\\\ntext, an associated question, and four candidate an-\\\\nswers. Systems are required to classify which of\\\\nthe four candidate answers is correct. We modify RoBERTa for this task by concate-Model Accuracy Middle High\\\\nSingle models on test (as of July 25, 2019)\\\\nBERT LARGE 72.0 76.6 70.1\\\\nXLNet LARGE 81.7 85.4 80.2\\\\nRoBERTa 83.2 86.5 81.3\\\\nTable 7: Results on the RACE test set. BERT LARGE and\\\\nXLNet LARGE results are from Yang et al. (2019 ). nating each candidate answer with the correspond-\\\\ning question and passage. We then encode each of\\\\nthese four sequences and pass the resulting [CLS]\\\\nrepresentations through a fully-connected layer,\\\\nwhich is used to predict the correct answer. We\\\\ntruncate question-answer pairs that are longer than\\\\n128 tokens and, if needed, the passage so that the\\\\ntotal length is at most 512 tokens. Results on the RACE test sets are presented in\\\\nTable 7. RoBERTa achieves state-of-the-art results\\\\non both middle-school and high-school settings. 6 Related Work\\\\nPretraining methods have been designed\\\\nwith different training objectives, includ-\\\\ning language modeling ( Dai and Le ,2015 ;\\\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\\\nmachine translation ( McCann et al. ,2017 ), and\\\\nmasked language modeling ( Devlin et al. ,2019 ;\\\\nLample and Conneau ,2019 ). 
Many recent\\\\npapers have used a basic recipe of finetuning\\\\nmodels for each end task ( Howard and Ruder ,\\\\n2018 ;Radford et al. ,2018 ), and pretraining\\\\nwith some variant of a masked language model\\\\nobjective. However, newer methods have\\\\nimproved performance by multi-task fine tun-\\\\ning ( Dong et al. ,2019 ), incorporating entity\\\\nembeddings ( Sun et al. ,2019 ), span predic-\\\\ntion ( Joshi et al. ,2019 ), and multiple variants\\\\nof autoregressive pretraining ( Song et al. ,2019 ;\\\\nChan et al. ,2019 ;Yang et al. ,2019 ). Perfor-\\\\nmance is also typically improved by training\\\\nbigger models on more data ( Devlin et al.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': ',\\\\n2019 ;Baevski et al. ,2019 ;Yang et al. ,2019 ;\\\\nRadford et al. ,2019 ). Our goal was to replicate,\\\\nsimplify, and better tune the training of BERT,\\\\nas a reference point for better understanding the\\\\nrelative performance of all of these methods. 7 Conclusion\\\\nWe carefully evaluate a number of design de-\\\\ncisions when pretraining BERT models. We\\\\nfind that performance can be substantially im-\\\\nproved by training the model longer, with bigger\\\\nbatches over more data; removing the next sen-\\\\ntence prediction objective; training on longer se-\\\\nquences; and dynamically changing the masking\\\\npattern applied to the training data. Our improved\\\\npretraining procedure, which we call RoBERTa,\\\\nachieves state-of-the-art results on GLUE, RACE\\\\nand SQuAD, without multi-task finetuning for\\\\nGLUE or additional data for SQuAD. These re-\\\\nsults illustrate the importance of these previ-\\\\nously overlooked design decisions and suggest\\\\nthat BERT’s pretraining objective remains com-\\\\npetitive with recently proposed alternatives. We additionally use a novel dataset,\\\\nCC-N EWS, and release our models and\\\\ncode for pretraining and finetuning at:\\\\nhttps://github.com/pytorch/fairseq . References\\\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\\\ntowski, editors. 2007. Proceedings of the Fourth\\\\nInternational Workshop on Semantic Evaluations\\\\n(SemEval-2007) . Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke\\\\nZettlemoyer, and Michael Auli. 2019. Cloze-\\\\ndriven pretraining of self-attention networks. arXiv\\\\npreprint arXiv:1903.07785 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,\\\\nDanilo Giampiccolo, Bernardo Magnini, and Idan\\\\nSzpektor. 2006. The second PASCAL recognising\\\\ntextual entailment challenge. 
In Proceedings of the\\\\nsecond PASCAL challenges workshop on recognis-\\\\ning textual entailment . Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo\\\\nGiampiccolo, and Bernardo Magnini. 2009. The\\\\nfifth PASCAL recognizing textual entailment chal-\\\\nlenge. Samuel R Bowman, Gabor Angeli, Christopher Potts,\\\\nand Christopher D Manning. 2015. A large anno-\\\\ntated corpus for learning natural language inference. InEmpirical Methods in Natural Language Process-\\\\ning (EMNLP) .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'William Chan, Nikita Kitaev, Kelvin Guu, Mitchell\\\\nStern, and Jakob Uszkoreit. 2019. KERMIT: Gener-\\\\native insertion-based modeling for sequences.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'arXiv\\\\npreprint arXiv:1906.01604 .Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment\\\\nchallenge. In Machine learning challenges. evalu-\\\\nating predictive uncertainty, visual object classifica-\\\\ntion, and recognising tectual entailment . Andrew M Dai and Quoc V Le. 2015. Semi-supervised\\\\nsequence learning. In Advances in Neural Informa-\\\\ntion Processing Systems (NIPS) . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and\\\\nKristina Toutanova. 2019. BERT: Pre-training of\\\\ndeep bidirectional transformers for language under-\\\\nstanding. In North American Association for Com-\\\\nputational Linguistics (NAACL) . William B Dolan and Chris Brockett. 2005. Auto-\\\\nmatically constructing a corpus of sentential para-\\\\nphrases. In Proceedings of the International Work-\\\\nshop on Paraphrasing . Li Dong, Nan Yang, Wenhui Wang, Furu Wei,\\\\nXiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\\\nlanguage model pre-training for natural language\\\\nunderstanding and generation. arXiv preprint\\\\narXiv:1905.03197 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\\\nand Bill Dolan. 2007. The third PASCAL recog-\\\\nnizing textual entailment challenge. 
In Proceedings\\\\nof the ACL-PASCAL workshop on textual entailment\\\\nand paraphrasing . Aaron Gokaslan and Vanya Cohen. 2019. Openweb-\\\\ntext corpus. http://web.archive.org/\\\\nsave/http://Skylion007.github.io/\\\\nOpenWebTextCorpus .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Felix Hamborg, Norman Meuschke, Corinna Bre-\\\\nitinger, and Bela Gipp. 2017. news-please: A\\\\ngeneric news crawler and extractor. In Proceedings\\\\nof the 15th International Symposium of Information\\\\nScience .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Dan Hendrycks and Kevin Gimpel. 2016. Gaus-\\\\nsian error linear units (gelus). arXiv preprint\\\\narXiv:1606.08415 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Matthew Honnibal and Ines Montani. 2017. spaCy 2:\\\\nNatural language understanding with Bloom embed-\\\\ndings, convolutional neural networks and incremen-\\\\ntal parsing. To appear. Jeremy Howard and Sebastian Ruder. 2018. Universal\\\\nlanguage model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 . Shankar Iyer, Nikhil Dandekar, and Kornl Cser-\\\\nnai. 2016. First quora dataset release: Question\\\\npairs.https://data.quora.com/First-\\\\nQuora-Dataset-Release-Question-\\\\nPairs . Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by repre-\\\\nsenting and predicting spans. arXiv preprint\\\\narXiv:1907.10529 . Diederik Kingma and Jimmy Ba. 2015. Adam: A\\\\nmethod for stochastic optimization. 
In International\\\\nConference on Learning Representations (ICLR) .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu,\\\\nYordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema\\\\nchallenge. arXiv preprint arXiv:1905.06290 . Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,\\\\nand Eduard Hovy. 2017. Race: Large-scale reading\\\\ncomprehension dataset from examinations. arXiv\\\\npreprint arXiv:1704.04683 . Guillaume Lample and Alexis Conneau. 2019. Cross-\\\\nlingual language model pretraining. arXiv preprint\\\\narXiv:1901.07291 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Hector J Levesque, Ernest Davis, and Leora Morgen-\\\\nstern. 2011. The Winograd schema challenge. In\\\\nAAAI Spring Symposium: Logical Formalizations of\\\\nCommonsense Reasoning . Xiaodong Liu, Pengcheng He, Weizhu Chen, and\\\\nJianfeng Gao. 2019a. Improving multi-task deep\\\\nneural networks via knowledge distillation for\\\\nnatural language understanding. arXiv preprint\\\\narXiv:1904.09482 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-\\\\nfeng Gao. 2019b. Multi-task deep neural networks\\\\nfor natural language understanding. arXiv preprint\\\\narXiv:1901.11504 . Bryan McCann, James Bradbury, Caiming Xiong, and\\\\nRichard Socher. 2017. Learned in translation: Con-\\\\ntextualized word vectors. 
In Advances in Neural In-\\\\nformation Processing Systems (NIPS) , pages 6297–\\\\n6308.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Paulius Micikevicius, Sharan Narang, Jonah Alben,\\\\nGregory Diamos, Erich Elsen, David Garcia, Boris\\\\nGinsburg, Michael Houston, Oleksii Kuchaiev,\\\\nGanesh Venkatesh, and Hao Wu. 2018. Mixed preci-\\\\nsion training. In International Conference on Learn-\\\\ning Representations . Sebastian Nagel. 2016. Cc-news. http:\\\\n//web.archive.org/save/http:\\\\n//commoncrawl.org/2016/10/news-\\\\ndataset-available .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Myle Ott, Sergey Edunov, Alexei Baevski, Angela\\\\nFan, Sam Gross, Nathan Ng, David Grangier, and\\\\nMichael Auli. 2019. FAIRSEQ : A fast, exten-\\\\nsible toolkit for sequence modeling. In North\\\\nAmerican Association for Computational Linguis-\\\\ntics (NAACL): System Demonstrations .Myle Ott, Sergey Edunov, David Grangier, and\\\\nMichael Auli. 2018. Scaling neural machine trans-\\\\nlation. In Proceedings of the Third Conference on\\\\nMachine Translation (WMT) . Adam Paszke, Sam Gross, Soumith Chintala, Gre-\\\\ngory Chanan, Edward Yang, Zachary DeVito, Zem-\\\\ning Lin, Alban Desmaison, Luca Antiga, and Adam\\\\nLerer. 2017. Automatic differentiation in PyTorch. InNIPS Autodiff Workshop . Matthew Peters, Mark Neumann, Mohit Iyyer, Matt\\\\nGardner, Christopher Clark, Kenton Lee, and Luke\\\\nZettlemoyer. 2018. Deep contextualized word repre-\\\\nsentations. In North American Association for Com-\\\\nputational Linguistics (NAACL) . Alec Radford, Karthik Narasimhan, Time Salimans,\\\\nand Ilya Sutskever. 2018. Improving language un-\\\\nderstanding with unsupervised learning. Technical\\\\nreport, OpenAI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan,\\\\nDario Amodei, and Ilya Sutskever. 2019. Language\\\\nmodels are unsupervised multitask learners. Techni-\\\\ncal report, OpenAI.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable ques-\\\\ntions for squad. 
In Association for Computational\\\\nLinguistics (ACL) . Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and\\\\nPercy Liang. 2016. SQuAD: 100,000+ questions for\\\\nmachine comprehension of text. In Empirical Meth-\\\\nods in Natural Language Processing (EMNLP) . Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with\\\\nsubword units. In Association for Computational\\\\nLinguistics (ACL) , pages 1715–1725. Richard Socher, Alex Perelygin, Jean Wu, Jason\\\\nChuang, Christopher D Manning, Andrew Ng, and\\\\nChristopher Potts. 2013. Recursive deep models\\\\nfor semantic compositionality over a sentiment tree-\\\\nbank. In Empirical Methods in Natural Language\\\\nProcessing (EMNLP) . Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and\\\\nTie-Yan Liu. 2019. MASS: Masked sequence\\\\nto sequence pre-training for language generation. InInternational Conference on Machine Learning\\\\n(ICML) .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun\\\\nFeng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxi-\\\\nang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: En-\\\\nhanced representation through knowledge integra-\\\\ntion. arXiv preprint arXiv:1904.09223 . Trieu H Trinh and Quoc V Le. 2018. A simple\\\\nmethod for commonsense reasoning. arXiv preprint\\\\narXiv:1806.02847 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob\\\\nUszkoreit, Llion Jones, Aidan N Gomez, Łukasz\\\\nKaiser, and Illia Polosukhin. 2017. Attention is all\\\\nyou need. In Advances in neural information pro-\\\\ncessing systems . Alex Wang, Yada Pruksachatkun, Nikita Nangia,\\\\nAmanpreet Singh, Julian Michael, Felix Hill, Omer\\\\nLevy, and Samuel R.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Bowman. 2019a. SuperGLUE:\\\\nA stickier benchmark for general-purpose language\\\\nunderstanding systems.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'arXiv preprint 1905.00537 . Alex Wang, Amanpreet Singh, Julian Michael, Felix\\\\nHill, Omer Levy, and Samuel R. Bowman. 2019b. 
GLUE: A multi-task benchmark and analysis plat-\\\\nform for natural language understanding. In Inter-\\\\nnational Conference on Learning Representations\\\\n(ICLR) . Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-\\\\nman. 2018. Neural network acceptability judg-\\\\nments. arXiv preprint 1805.12471 . Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen-\\\\ntence understanding through inference. In North\\\\nAmerican Association for Computational Linguis-\\\\ntics (NAACL) . Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-\\\\nbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretrain-\\\\ning for language understanding. arXiv preprint\\\\narXiv:1906.08237 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,\\\\nJames Demmel, and Cho-Jui Hsieh. 2019. Reduc-\\\\ning bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962 . Rowan Zellers, Ari Holtzman, Hannah Rashkin,\\\\nYonatan Bisk, Ali Farhadi, Franziska Roesner, and\\\\nYejin Choi. 2019. Defending against neural fake\\\\nnews. arXiv preprint arXiv:1905.12616 .'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: You are an expert in technical papers and journals.\\nYou're tasked with summarizing the main points in the following text.\\nThe following is the text you need to summarize:\\n{'doc': 'Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan\\\\nSalakhutdinov, Raquel Urtasun, Antonio Torralba,\\\\nand Sanja Fidler. 2015. Aligning books and movies:\\\\nTowards story-like visual explanations by watch-\\\\ning movies and reading books. In arXiv preprint\\\\narXiv:1506.06724 . Appendix for “RoBERTa: A Robustly\\\\nOptimized BERT Pretraining Approach”\\\\nA Full results on GLUE\\\\nIn Table 8we present the full set of development\\\\nset results for RoBERTa. We present results for\\\\naLARGE configuration that follows BERT LARGE ,\\\\nas well as a BASE configuration that follows\\\\nBERT BASE.B Pretraining Hyperparameters\\\\nTable 9describes the hyperparameters for pre-\\\\ntraining of RoBERTa LARGE and RoBERTa BASE\\\\nC Finetuning Hyperparameters\\\\nFinetuning hyperparameters for RACE, SQuAD\\\\nand GLUE are given in Table 10. We select the\\\\nbest hyperparameter values based on the median\\\\nof 5 random seeds for each task. 
MNLI QNLI QQP RTE SST MRPC CoLA STS\\\\nRoBERTa BASE\\\\n+ all data + 500k steps 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2\\\\nRoBERTa LARGE\\\\nwith B OOKS + W IKI 89.0 93.9 91.9 84.5 95.3 90.2 66.3 91.6\\\\n+ additional data ( §3.2) 89.3 94.0 92.0 82.7 95.6 91.4 66.1 92.2\\\\n+ pretrain longer 300k 90.0 94.5 92.2 83.3 96.1 91.1 67.4 92.3\\\\n+ pretrain longer 500k 90.2 94.7 92.2 86.6 96.4 90.9 68.0 92.4\\\\nTable 8: Development set results on GLUE tasks for various co nfigurations of RoBERTa. Hyperparam RoBERTa LARGE RoBERTa BASE\\\\nNumber of Layers 24 12\\\\nHidden size 1024 768\\\\nFFN inner hidden size 4096 3072\\\\nAttention heads 16 12\\\\nAttention head size 64 64\\\\nDropout 0.1 0.1\\\\nAttention Dropout 0.1 0.1\\\\nWarmup Steps 30k 24k\\\\nPeak Learning Rate 4e-4 6e-4\\\\nBatch Size 8k 8k\\\\nWeight Decay 0.01 0.01\\\\nMax Steps 500k 500k\\\\nLearning Rate Decay Linear Linear\\\\nAdamǫ 1e-6 1e-6\\\\nAdamβ1 0.9 0.9\\\\nAdamβ2 0.98 0.98\\\\nGradient Clipping 0.0 0.0\\\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE. Hyperparam RACE SQuAD GLUE\\\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }\\\\nBatch Size 16 48 {16, 32}\\\\nWeight Decay 0.1 0.01 0.1\\\\nMax Epochs 4 2 10\\\\nLearning Rate Decay Linear Linear Linear\\\\nWarmup ratio 0.06 0.06 0.06\\\\nTable 10: Hyperparameters for finetuning RoBERTa LARGE on RACE, SQuAD and GLUE.'}\\nBased on this text, provide a summary of the main points.\\n\\nRULES:\\n- Organize the points in markdown format.\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.27s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text mentions a paper on mixed precision training presented at the International Conference on Learning Representations in 2018 by a group of authors.\\n- It also references a dataset called Cc-news from 2016, which is available online.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text mentions a paper on mixed precision training presented at the International Conference on Learning Representations in 2018 by a group of authors.\\n- It also references a dataset called Cc-news from 2016, which is available online.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 49,\n", " \"prompt_tokens\": 196,\n", " \"total_tokens\": 245\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-f58b3f07-7e2b-4f02-a170-1f8d45b10643-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 196,\n", " \"output_tokens\": 49,\n", " \"total_tokens\": 245\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 49,\n", " \"prompt_tokens\": 196,\n", " \"total_tokens\": 245\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence 
> llm:ChatOpenAI] [1.43s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points:\\n- The text discusses the third PASCAL recognizing textual entailment challenge, which was held in 2007.\\n- It also mentions the OpenWebText Corpus created by Aaron Gokaslan and Vanya Cohen in 2019.\\n- The link to access the OpenWebText Corpus is provided.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points:\\n- The text discusses the third PASCAL recognizing textual entailment challenge, which was held in 2007.\\n- It also mentions the OpenWebText Corpus created by Aaron Gokaslan and Vanya Cohen in 2019.\\n- The link to access the OpenWebText Corpus is provided.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 67,\n", " \"prompt_tokens\": 181,\n", " \"total_tokens\": 248\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-d6ee0613-3e00-4f31-8763-19c21863e37b-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 181,\n", " \"output_tokens\": 67,\n", " \"total_tokens\": 248\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 67,\n", " \"prompt_tokens\": 181,\n", " \"total_tokens\": 248\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.51s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- RoBERTa accumulates improvements from BERT LARGE.\\n- RoBERTa matches the architecture and training objective of BERT LARGE.\\n- Results for BERT LARGE are from Devlin et al. (2019).\\n- Results for XLNet LARGE are from Yang et al. (2019).\\n- Complete results on all GLUE tasks can be found in the Appendix.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- RoBERTa accumulates improvements from BERT LARGE.\\n- RoBERTa matches the architecture and training objective of BERT LARGE.\\n- Results for BERT LARGE are from Devlin et al. (2019).\\n- Results for XLNet LARGE are from Yang et al. 
(2019).\\n- Complete results on all GLUE tasks can be found in the Appendix.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 75,\n", " \"prompt_tokens\": 150,\n", " \"total_tokens\": 225\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-134e3803-c9ba-46f5-8b0e-7d880ee43ced-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 150,\n", " \"output_tokens\": 75,\n", " \"total_tokens\": 225\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 75,\n", " \"prompt_tokens\": 150,\n", " \"total_tokens\": 225\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.62s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text mentions several technical papers and journals related to natural language processing.\\n- One paper discusses a robust trick for the Winograd Schema Challenge.\\n- Another paper introduces a large-scale reading comprehension dataset called RACE.\\n- A third paper focuses on cross-lingual language model pretraining.\\n- All papers are available as arXiv preprints.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text mentions several technical papers and journals related to natural language processing.\\n- One paper discusses a robust trick for the Winograd Schema Challenge.\\n- Another paper introduces a large-scale reading comprehension dataset called RACE.\\n- A third paper focuses on cross-lingual language model pretraining.\\n- All papers are available as arXiv preprints.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 70,\n", " \"prompt_tokens\": 239,\n", " \"total_tokens\": 309\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-4240f3f6-f800-4e0b-afc8-8f4b70438a9d-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 239,\n", " \"output_tokens\": 70,\n", " \"total_tokens\": 309\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 70,\n", " \"prompt_tokens\": 239,\n", " \"total_tokens\": 309\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.63s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary:\\n- The document discusses a new pretraining approach called RoBERTa, which is an optimized version of BERT.\\n- The authors of the paper are Yinhan Liu, Myle Ott, 
Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary:\\n- The document discusses a new pretraining approach called RoBERTa, which is an optimized version of BERT.\\n- The authors of the paper are Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 81,\n", " \"prompt_tokens\": 182,\n", " \"total_tokens\": 263\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-b3735f08-fdd8-4aa6-8b9e-7c3a6a224ef6-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 182,\n", " \"output_tokens\": 81,\n", " \"total_tokens\": 263\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 81,\n", " \"prompt_tokens\": 182,\n", " \"total_tokens\": 263\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.68s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text discusses a paper titled \\\"KERMIT: Generative insertion-based modeling for sequences\\\"\\n- The authors of the paper are William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit\\n- The paper likely introduces a new model or approach called KERMIT for generating sequences using insertion-based modeling\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text discusses a paper titled \\\"KERMIT: Generative insertion-based modeling for sequences\\\"\\n- The authors of the paper are William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit\\n- The paper likely introduces a new model or approach called KERMIT for generating sequences using insertion-based modeling\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 71,\n", " \"prompt_tokens\": 118,\n", " \"total_tokens\": 189\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-e294e84d-fb13-459c-b9fe-f8592b8aaa54-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 118,\n", " \"output_tokens\": 71,\n", " \"total_tokens\": 189\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", 
" \"completion_tokens\": 71,\n", " \"prompt_tokens\": 118,\n", " \"total_tokens\": 189\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.70s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The paper is authored by Dan Hendrycks and Kevin Gimpel in 2016.\\n- The paper introduces a new activation function called Gaussian Error Linear Units (GELUs).\\n- The paper is available as an arXiv preprint with the identifier arXiv:1606.08415.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The paper is authored by Dan Hendrycks and Kevin Gimpel in 2016.\\n- The paper introduces a new activation function called Gaussian Error Linear Units (GELUs).\\n- The paper is available as an arXiv preprint with the identifier arXiv:1606.08415.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 63,\n", " \"prompt_tokens\": 117,\n", " \"total_tokens\": 180\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-edf8912f-5996-4b74-86b4-5f7559161221-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 117,\n", " \"output_tokens\": 63,\n", " \"total_tokens\": 180\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 63,\n", " \"prompt_tokens\": 117,\n", " \"total_tokens\": 180\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.73s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- FAIRSEQ is a fast and extensible toolkit for sequence modeling, presented at NAACL.\\n- Scaling neural machine translation was discussed in a paper at the Third Conference on Machine Translation (WMT).\\n- Automatic differentiation in PyTorch was covered at the NIPS Autodiff Workshop.\\n- Deep contextualized word representations were presented at NAACL.\\n- OpenAI published reports on improving language understanding with unsupervised learning and language models as unsupervised multitask learners.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- FAIRSEQ is a fast and extensible toolkit for sequence modeling, presented at NAACL.\\n- Scaling neural machine translation was discussed in a paper at the Third Conference on Machine Translation (WMT).\\n- Automatic differentiation in PyTorch was covered at the NIPS Autodiff Workshop.\\n- Deep contextualized word 
representations were presented at NAACL.\\n- OpenAI published reports on improving language understanding with unsupervised learning and language models as unsupervised multitask learners.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 96,\n", " \"prompt_tokens\": 439,\n", " \"total_tokens\": 535\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-9d97ef45-64bb-442a-b296-87dfebdbb3ac-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 439,\n", " \"output_tokens\": 96,\n", " \"total_tokens\": 535\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 96,\n", " \"prompt_tokens\": 439,\n", " \"total_tokens\": 535\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.73s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The first paper discusses a method to reduce the pre-training time for the BERT model from 3 days to just 76 minutes.\\n- The second paper focuses on defending against neural fake news, with authors including Rowan Zellers, Ari Holtzman, and Franziska Roesner.\\n- Both papers are arXiv preprints, indicating they have not yet been peer-reviewed.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The first paper discusses a method to reduce the pre-training time for the BERT model from 3 days to just 76 minutes.\\n- The second paper focuses on defending against neural fake news, with authors including Rowan Zellers, Ari Holtzman, and Franziska Roesner.\\n- Both papers are arXiv preprints, indicating they have not yet been peer-reviewed.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 78,\n", " \"prompt_tokens\": 205,\n", " \"total_tokens\": 283\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-250a154f-7fed-453c-9e4e-c8e482a023c8-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 205,\n", " \"output_tokens\": 78,\n", " \"total_tokens\": 283\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 78,\n", " \"prompt_tokens\": 205,\n", " \"total_tokens\": 283\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.74s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"# Summary of Main Points:\\n\\n- **Authors**: Felix Hamborg, Norman Meuschke, Corinna Brenitinger, and Bela Gipp\\n- 
**Year**: 2017\\n- **Title**: news-please: A generic news crawler and extractor\\n- **Event**: Proceedings of the 15th International Symposium of Information Science\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"# Summary of Main Points:\\n\\n- **Authors**: Felix Hamborg, Norman Meuschke, Corinna Brenitinger, and Bela Gipp\\n- **Year**: 2017\\n- **Title**: news-please: A generic news crawler and extractor\\n- **Event**: Proceedings of the 15th International Symposium of Information Science\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 71,\n", " \"prompt_tokens\": 132,\n", " \"total_tokens\": 203\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-b3d50fe8-1206-44d1-9012-1ab7b08ca243-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 132,\n", " \"output_tokens\": 71,\n", " \"total_tokens\": 203\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 71,\n", " \"prompt_tokens\": 132,\n", " \"total_tokens\": 203\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.91s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text discusses various technical papers and preprints related to natural language understanding and neural network acceptability judgments.\\n- The GLUE benchmark is highlighted as a multi-task benchmark and analysis platform for natural language understanding.\\n- The paper by Warstadt et al. (2018) focuses on neural network acceptability judgments.\\n- Williams et al. (2018) introduce a challenge corpus for sentence understanding through inference.\\n- Yang et al. (2019) present XLNet, a generalized autoregressive pretraining method for language understanding.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text discusses various technical papers and preprints related to natural language understanding and neural network acceptability judgments.\\n- The GLUE benchmark is highlighted as a multi-task benchmark and analysis platform for natural language understanding.\\n- The paper by Warstadt et al. (2018) focuses on neural network acceptability judgments.\\n- Williams et al. (2018) introduce a challenge corpus for sentence understanding through inference.\\n- Yang et al. 
(2019) present XLNet, a generalized autoregressive pretraining method for language understanding.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 107,\n", " \"prompt_tokens\": 319,\n", " \"total_tokens\": 426\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-1b9cdb23-2bda-4e02-b7dd-46f278e814d2-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 319,\n", " \"output_tokens\": 107,\n", " \"total_tokens\": 426\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 107,\n", " \"prompt_tokens\": 319,\n", " \"total_tokens\": 426\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.97s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text discusses two papers on natural language understanding using deep neural networks.\\n- The first paper by Liu et al. focuses on multi-task deep neural networks for natural language understanding.\\n- The second paper by McCann et al. discusses contextualized word vectors learned in translation.\\n- Both papers are available as preprints on arXiv and were presented at the Advances in Neural Information Processing Systems (NIPS) conference.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text discusses two papers on natural language understanding using deep neural networks.\\n- The first paper by Liu et al. focuses on multi-task deep neural networks for natural language understanding.\\n- The second paper by McCann et al. 
discusses contextualized word vectors learned in translation.\\n- Both papers are available as preprints on arXiv and were presented at the Advances in Neural Information Processing Systems (NIPS) conference.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 84,\n", " \"prompt_tokens\": 193,\n", " \"total_tokens\": 277\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-a53f8018-4244-460e-bf36-466c695e6754-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 193,\n", " \"output_tokens\": 84,\n", " \"total_tokens\": 277\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 84,\n", " \"prompt_tokens\": 193,\n", " \"total_tokens\": 277\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.98s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- SpanBERT is a pre-training method that improves performance by representing and predicting spans.\\n- The paper was published as an arXiv preprint with the identifier arXiv:1907.10529.\\n- Another method mentioned in the text is Adam, which is a stochastic optimization technique.\\n- Adam was presented at the International Conference on Learning Representations (ICLR) in 2015 by Diederik Kingma and Jimmy Ba.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- SpanBERT is a pre-training method that improves performance by representing and predicting spans.\\n- The paper was published as an arXiv preprint with the identifier arXiv:1907.10529.\\n- Another method mentioned in the text is Adam, which is a stochastic optimization technique.\\n- Adam was presented at the International Conference on Learning Representations (ICLR) in 2015 by Diederik Kingma and Jimmy Ba.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 89,\n", " \"prompt_tokens\": 161,\n", " \"total_tokens\": 250\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-29ce0d28-5945-43dc-956c-4d3526654618-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 161,\n", " \"output_tokens\": 89,\n", " \"total_tokens\": 250\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 89,\n", " \"prompt_tokens\": 161,\n", " \"total_tokens\": 250\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [1.98s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " 
[\n", " {\n", " \"text\": \"- The text references several key papers and journals in the field of natural language processing and machine learning.\\n- These papers include topics such as unanswerable questions for machine comprehension, neural machine translation, recursive deep models for sentiment analysis, and masked sequence-to-sequence pre-training for language generation.\\n- The papers were presented at conferences such as the Association for Computational Linguistics (ACL), Empirical Methods in Natural Language Processing (EMNLP), and the International Conference on Machine Learning (ICML).\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text references several key papers and journals in the field of natural language processing and machine learning.\\n- These papers include topics such as unanswerable questions for machine comprehension, neural machine translation, recursive deep models for sentiment analysis, and masked sequence-to-sequence pre-training for language generation.\\n- The papers were presented at conferences such as the Association for Computational Linguistics (ACL), Empirical Methods in Natural Language Processing (EMNLP), and the International Conference on Machine Learning (ICML).\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 99,\n", " \"prompt_tokens\": 356,\n", " \"total_tokens\": 455\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-4452c353-9791-4747-9fd0-862a73125f50-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 356,\n", " \"output_tokens\": 99,\n", " \"total_tokens\": 455\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 99,\n", " \"prompt_tokens\": 356,\n", " \"total_tokens\": 455\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.02s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text mentions several key papers and challenges in the field of natural language processing and machine learning.\\n- It references the PASCAL Recognising Textual Entailment Challenge, Semi-supervised Sequence Learning, BERT pre-training model, and Unified Language Model pre-training.\\n- These papers and challenges are important in advancing the understanding and generation of natural language.\\n- The text also highlights the use of deep bidirectional transformers and the construction of corpora for paraphrasing in natural language processing research.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": 
\"- The text mentions several key papers and challenges in the field of natural language processing and machine learning.\\n- It references the PASCAL Recognising Textual Entailment Challenge, Semi-supervised Sequence Learning, BERT pre-training model, and Unified Language Model pre-training.\\n- These papers and challenges are important in advancing the understanding and generation of natural language.\\n- The text also highlights the use of deep bidirectional transformers and the construction of corpora for paraphrasing in natural language processing research.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 99,\n", " \"prompt_tokens\": 377,\n", " \"total_tokens\": 476\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-392ae862-d242-451b-84a2-4e9541565630-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 377,\n", " \"output_tokens\": 99,\n", " \"total_tokens\": 476\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 99,\n", " \"prompt_tokens\": 377,\n", " \"total_tokens\": 476\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.41s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"# Summary of \\\"SuperGLUE: A stickier benchmark for general-purpose language understanding systems\\\" by Bowman (2019)\\n\\n- The text is discussing a benchmark called SuperGLUE, which is designed to evaluate general-purpose language understanding systems.\\n- SuperGLUE is intended to be a more challenging benchmark compared to existing ones.\\n- The benchmark is named \\\"SuperGLUE\\\" to imply that it is stickier, meaning it is more difficult for language understanding systems to achieve high scores.\\n- The goal of SuperGLUE is to push the boundaries of what language understanding systems can achieve.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"# Summary of \\\"SuperGLUE: A stickier benchmark for general-purpose language understanding systems\\\" by Bowman (2019)\\n\\n- The text is discussing a benchmark called SuperGLUE, which is designed to evaluate general-purpose language understanding systems.\\n- SuperGLUE is intended to be a more challenging benchmark compared to existing ones.\\n- The benchmark is named \\\"SuperGLUE\\\" to imply that it is stickier, meaning it is more difficult for language understanding systems to achieve high scores.\\n- The goal of SuperGLUE is to push the boundaries of what language understanding systems can achieve.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 117,\n", " \"prompt_tokens\": 101,\n", " \"total_tokens\": 218\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " 
},\n", " \"type\": \"ai\",\n", " \"id\": \"run-27a8ce88-dcdf-48a9-af6b-5ca8bbc4d4af-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 101,\n", " \"output_tokens\": 117,\n", " \"total_tokens\": 218\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 117,\n", " \"prompt_tokens\": 101,\n", " \"total_tokens\": 218\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.46s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text discusses the Winograd schema challenge, introduced by Hector J Levesque, Ernest Davis, and Leora Morgenstern in 2011.\\n- It also mentions a more recent paper by Xiaodong Liu et al. from 2019, which focuses on improving multi-task deep neural networks through knowledge distillation for natural language understanding.\\n- The first paper addresses commonsense reasoning, while the second paper deals with enhancing neural networks for language processing.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text discusses the Winograd schema challenge, introduced by Hector J Levesque, Ernest Davis, and Leora Morgenstern in 2011.\\n- It also mentions a more recent paper by Xiaodong Liu et al. 
from 2019, which focuses on improving multi-task deep neural networks through knowledge distillation for natural language understanding.\\n- The first paper addresses commonsense reasoning, while the second paper deals with enhancing neural networks for language processing.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 94,\n", " \"prompt_tokens\": 184,\n", " \"total_tokens\": 278\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-4a2c4819-b11a-4b81-9017-85e3e56d94f4-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 184,\n", " \"output_tokens\": 94,\n", " \"total_tokens\": 278\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 94,\n", " \"prompt_tokens\": 184,\n", " \"total_tokens\": 278\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.46s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary:\\n- The paper discusses aligning books and movies to create story-like visual explanations by watching movies and reading books.\\n- The paper presents results for RoBERTa in both LARGE and BASE configurations for various tasks.\\n- Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE are provided in detail.\\n- Finetuning hyperparameters for tasks such as RACE, SQuAD, and GLUE are also discussed.\\n- Results for different configurations of RoBERTa on GLUE tasks are presented, showing improvements with additional data and longer pretraining.\\n- A comparison of hyperparameters for RoBERTa LARGE and RoBERTa BASE is provided in tables.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary:\\n- The paper discusses aligning books and movies to create story-like visual explanations by watching movies and reading books.\\n- The paper presents results for RoBERTa in both LARGE and BASE configurations for various tasks.\\n- Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE are provided in detail.\\n- Finetuning hyperparameters for tasks such as RACE, SQuAD, and GLUE are also discussed.\\n- Results for different configurations of RoBERTa on GLUE tasks are presented, showing improvements with additional data and longer pretraining.\\n- A comparison of hyperparameters for RoBERTa LARGE and RoBERTa BASE is provided in tables.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 135,\n", " \"prompt_tokens\": 914,\n", " \"total_tokens\": 1049\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-34832376-e334-4e6e-8393-11f7b85f7fae-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 914,\n", " \"output_tokens\": 135,\n", " 
\"total_tokens\": 1049\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 135,\n", " \"prompt_tokens\": 914,\n", " \"total_tokens\": 1049\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.47s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary:\\n- **Authors and Publications:** \\n - Matthew Honnibal and Ines Montani are working on spaCy 2 for natural language understanding with Bloom embeddings, convolutional neural networks, and incremental parsing. \\n - Jeremy Howard and Sebastian Ruder are focusing on universal language model fine-tuning for text classification.\\n - Shankar Iyer, Nikhil Dandekar, and Kornl Csernai released the first Quora dataset on question pairs.\\n - Mandar Joshi, Danqi Chen, Yinhan Liu, and Daniel S. are also involved in the field.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary:\\n- **Authors and Publications:** \\n - Matthew Honnibal and Ines Montani are working on spaCy 2 for natural language understanding with Bloom embeddings, convolutional neural networks, and incremental parsing. \\n - Jeremy Howard and Sebastian Ruder are focusing on universal language model fine-tuning for text classification.\\n - Shankar Iyer, Nikhil Dandekar, and Kornl Csernai released the first Quora dataset on question pairs.\\n - Mandar Joshi, Danqi Chen, Yinhan Liu, and Daniel S. 
are also involved in the field.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 125,\n", " \"prompt_tokens\": 237,\n", " \"total_tokens\": 362\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-a1def3ae-6dda-4989-beae-0094f64cfa5c-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 237,\n", " \"output_tokens\": 125,\n", " \"total_tokens\": 362\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 125,\n", " \"prompt_tokens\": 237,\n", " \"total_tokens\": 362\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.55s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text discusses multiple technical papers related to natural language processing and machine learning.\\n- One of the papers mentioned is \\\"ERNIE: Enhanced representation through knowledge integration\\\" by Yu Stephanie Sun et al.\\n- Another paper mentioned is \\\"A simple method for commonsense reasoning\\\" by Trieu H Trinh and Quoc V Le.\\n- The text also references the paper \\\"Attention is all you need\\\" by Ashish Vaswani et al.\\n- Additionally, the paper \\\"Advances in neural information processing systems\\\" is mentioned.\\n- The text provides a list of authors for each paper mentioned.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text discusses multiple technical papers related to natural language processing and machine learning.\\n- One of the papers mentioned is \\\"ERNIE: Enhanced representation through knowledge integration\\\" by Yu Stephanie Sun et al.\\n- Another paper mentioned is \\\"A simple method for commonsense reasoning\\\" by Trieu H Trinh and Quoc V Le.\\n- The text also references the paper \\\"Attention is all you need\\\" by Ashish Vaswani et al.\\n- Additionally, the paper \\\"Advances in neural information processing systems\\\" is mentioned.\\n- The text provides a list of authors for each paper mentioned.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 119,\n", " \"prompt_tokens\": 313,\n", " \"total_tokens\": 432\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-11b7a63c-f932-4b2d-99ff-46ec3bbedc82-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 313,\n", " \"output_tokens\": 119,\n", " \"total_tokens\": 432\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 119,\n", " \"prompt_tokens\": 313,\n", " \"total_tokens\": 432\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " 
\"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.81s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points:\\n\\n- The text compares different training formats for natural language processing tasks using models like BERT and XLNet.\\n- Different input formats are tested, including SEGMENT-PAIR +NSP, SENTENCE-PAIR +NSP, FULL-SENTENCES, and DOC-SENTENCES.\\n- Results are reported for models pre-trained on BOOK CORPUS and WIKIPEDIA, showing performance in tasks like SQuAD, MNLI-m, SST-2, and RACE.\\n- The study includes variations in batch size and the inclusion/exclusion of the Next Sentence Prediction (NSP) loss in the training process.\\n- The results show varying performance levels for different input formats and pre-trained models, with DOC-SENTENCES performing slightly better in some cases.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points:\\n\\n- The text compares different training formats for natural language processing tasks using models like BERT and XLNet.\\n- Different input formats are tested, including SEGMENT-PAIR +NSP, SENTENCE-PAIR +NSP, FULL-SENTENCES, and DOC-SENTENCES.\\n- Results are reported for models pre-trained on BOOK CORPUS and WIKIPEDIA, showing performance in tasks like SQuAD, MNLI-m, SST-2, and RACE.\\n- The study includes variations in batch size and the inclusion/exclusion of the Next Sentence Prediction (NSP) loss in the training process.\\n- The results show varying performance levels for different input formats and pre-trained models, with DOC-SENTENCES performing slightly better in some cases.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 158,\n", " \"prompt_tokens\": 753,\n", " \"total_tokens\": 911\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-990d2393-5dd0-4ccc-ab79-d308193dcc65-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 753,\n", " \"output_tokens\": 158,\n", " \"total_tokens\": 911\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 158,\n", " \"prompt_tokens\": 753,\n", " \"total_tokens\": 911\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.95s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The text discusses the PASCAL recognizing textual entailment challenge, which has had multiple iterations over the years.\\n- It mentions key researchers involved in the challenges, such as Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor, Luisa Bentivogli, Hoa Trang Dang, Samuel R Bowman, Gabor Angeli, Christopher Potts, 
and Christopher D Manning.\\n- The challenges focus on the task of recognizing textual entailment, which involves determining if a given piece of text entails, contradicts, or is neutral with respect to another piece of text.\\n- The text also references the creation of a large annotated corpus for learning natural language inference, which was presented at the Empirical Methods in Natural Language Processing (EMNLP) conference in 2015.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The text discusses the PASCAL recognizing textual entailment challenge, which has had multiple iterations over the years.\\n- It mentions key researchers involved in the challenges, such as Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor, Luisa Bentivogli, Hoa Trang Dang, Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning.\\n- The challenges focus on the task of recognizing textual entailment, which involves determining if a given piece of text entails, contradicts, or is neutral with respect to another piece of text.\\n- The text also references the creation of a large annotated corpus for learning natural language inference, which was presented at the Empirical Methods in Natural Language Processing (EMNLP) conference in 2015.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 182,\n", " \"prompt_tokens\": 262,\n", " \"total_tokens\": 444\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-6a7f8301-9541-42e8-afbf-4a86c6c25118-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 262,\n", " \"output_tokens\": 182,\n", " \"total_tokens\": 444\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 182,\n", " \"prompt_tokens\": 262,\n", " \"total_tokens\": 444\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.97s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points:\\n\\n- The study focuses on gathering data for experimentation from five English-language corpora, totaling over 160GB of uncompressed text.\\n- The text corpora used include BOOK CORPUS, English WIKIPEDIA, CC-NEWS, OPENWEBTEXT, and STORIES.\\n- The evaluation of pretrained models is done using benchmarks like GLUE, SQuAD, and RACE.\\n- The training procedure analysis explores the importance of choices in successfully pretraining BERT models, including comparing static vs. 
dynamic masking.\\n- The study finds that dynamic masking is comparable or slightly better than static masking for BERT BASE models.\\n- The NSP loss in the original BERT pretraining procedure is important for performance, with recent work questioning its necessity.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points:\\n\\n- The study focuses on gathering data for experimentation from five English-language corpora, totaling over 160GB of uncompressed text.\\n- The text corpora used include BOOK CORPUS, English WIKIPEDIA, CC-NEWS, OPENWEBTEXT, and STORIES.\\n- The evaluation of pretrained models is done using benchmarks like GLUE, SQuAD, and RACE.\\n- The training procedure analysis explores the importance of choices in successfully pretraining BERT models, including comparing static vs. dynamic masking.\\n- The study finds that dynamic masking is comparable or slightly better than static masking for BERT BASE models.\\n- The NSP loss in the original BERT pretraining procedure is important for performance, with recent work questioning its necessity.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 155,\n", " \"prompt_tokens\": 2097,\n", " \"total_tokens\": 2252\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-dc89b621-1f39-45ae-b016-0e18819d5165-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 2097,\n", " \"output_tokens\": 155,\n", " \"total_tokens\": 2252\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 155,\n", " \"prompt_tokens\": 2097,\n", " \"total_tokens\": 2252\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [2.98s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"- The authors aimed to replicate, simplify, and improve the training of BERT for better understanding its performance compared to other methods.\\n- They evaluated various design decisions for pretraining BERT models and found that performance can be significantly enhanced by training the model longer, with bigger batches, on more data, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- Their improved pretraining procedure, RoBERTa, achieved state-of-the-art results on GLUE, RACE, and SQuAD without multi-task finetuning for GLUE or additional data for SQuAD.\\n- The results highlight the importance of design decisions previously overlooked and suggest that BERT's pretraining objective remains competitive with other alternatives.\\n- They also introduced a novel dataset, CC-NEWS, and made their models and code available for pretraining and finetuning at a specific GitHub link.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": 
null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"- The authors aimed to replicate, simplify, and improve the training of BERT for better understanding its performance compared to other methods.\\n- They evaluated various design decisions for pretraining BERT models and found that performance can be significantly enhanced by training the model longer, with bigger batches, on more data, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- Their improved pretraining procedure, RoBERTa, achieved state-of-the-art results on GLUE, RACE, and SQuAD without multi-task finetuning for GLUE or additional data for SQuAD.\\n- The results highlight the importance of design decisions previously overlooked and suggest that BERT's pretraining objective remains competitive with other alternatives.\\n- They also introduced a novel dataset, CC-NEWS, and made their models and code available for pretraining and finetuning at a specific GitHub link.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 182,\n", " \"prompt_tokens\": 475,\n", " \"total_tokens\": 657\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-6c23125c-2c3c-482b-86ea-4cbfdf464f2c-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 475,\n", " \"output_tokens\": 182,\n", " \"total_tokens\": 657\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 182,\n", " \"prompt_tokens\": 475,\n", " \"total_tokens\": 657\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [3.18s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points:\\n\\n- Devlin et al. 
(2019) pretrain their model, RoBERTa, using 1024 V100 GPUs for approximately one day.\\n- RoBERTa shows significant improvements over the originally reported BERT LARGE results, emphasizing the importance of design choices.\\n- Pretraining RoBERTa over a combined dataset of 160GB of text results in further performance improvements across all tasks.\\n- Longer pretraining steps (300K and 500K) lead to significant gains in downstream task performance, outperforming XLNet LARGE.\\n- Even the longest-trained model does not overfit the data and could benefit from additional training.\\n- The best RoBERTa model is evaluated on GLUE, SQuaD, and RACE benchmarks.\\n- For GLUE, RoBERTa is compared to other approaches on the test set via the leaderboard, showing competitive results.\\n- Task-specific modifications are necessary for certain tasks in the GLUE benchmark to achieve competitive results, such as the QNLI task adopting a pairwise ranking formulation.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points:\\n\\n- Devlin et al. (2019) pretrain their model, RoBERTa, using 1024 V100 GPUs for approximately one day.\\n- RoBERTa shows significant improvements over the originally reported BERT LARGE results, emphasizing the importance of design choices.\\n- Pretraining RoBERTa over a combined dataset of 160GB of text results in further performance improvements across all tasks.\\n- Longer pretraining steps (300K and 500K) lead to significant gains in downstream task performance, outperforming XLNet LARGE.\\n- Even the longest-trained model does not overfit the data and could benefit from additional training.\\n- The best RoBERTa model is evaluated on GLUE, SQuaD, and RACE benchmarks.\\n- For GLUE, RoBERTa is compared to other approaches on the test set via the leaderboard, showing competitive results.\\n- Task-specific modifications are necessary for certain tasks in the GLUE benchmark to achieve competitive results, such as the QNLI task adopting a pairwise ranking formulation.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 213,\n", " \"prompt_tokens\": 1409,\n", " \"total_tokens\": 1622\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-da7e7d8e-b6a1-4b04-b1d1-4c2eeb971324-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 1409,\n", " \"output_tokens\": 213,\n", " \"total_tokens\": 1622\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 213,\n", " \"prompt_tokens\": 1409,\n", " \"total_tokens\": 1622\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [3.59s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points\\n\\n- The text discusses the importance of tuning the learning rate for each setting in 
models to improve task performance.\\n- Training with large batches improves perplexity for the masked language modeling objective and end-task accuracy.\\n- Byte-Pair Encoding (BPE) is a hybrid representation that handles large vocabularies in natural language corpora.\\n- BPE can use bytes instead of unicode characters to create a subword vocabulary of a modest size.\\n- Large batch training can improve training efficiency through gradient accumulation.\\n- RoBERTa is a modified BERT approach trained with dynamic masking, full sentences, large mini-batches, and a larger byte-level BPE.\\n- RoBERTa is trained with different amounts of data and training passes to evaluate its impact on performance.\\n- Results for RoBERTa pretraining over more data and for longer durations are provided.\\n- Comparison results with BERT LARGE and XLNet LARGE are also included in the text.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points\\n\\n- The text discusses the importance of tuning the learning rate for each setting in models to improve task performance.\\n- Training with large batches improves perplexity for the masked language modeling objective and end-task accuracy.\\n- Byte-Pair Encoding (BPE) is a hybrid representation that handles large vocabularies in natural language corpora.\\n- BPE can use bytes instead of unicode characters to create a subword vocabulary of a modest size.\\n- Large batch training can improve training efficiency through gradient accumulation.\\n- RoBERTa is a modified BERT approach trained with dynamic masking, full sentences, large mini-batches, and a larger byte-level BPE.\\n- RoBERTa is trained with different amounts of data and training passes to evaluate its impact on performance.\\n- Results for RoBERTa pretraining over more data and for longer durations are provided.\\n- Comparison results with BERT LARGE and XLNet LARGE are also included in the text.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 192,\n", " \"prompt_tokens\": 1441,\n", " \"total_tokens\": 1633\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-272f60cd-1c1c-43b3-b29b-d6d9b1bfe149-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 1441,\n", " \"output_tokens\": 192,\n", " \"total_tokens\": 1633\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 192,\n", " \"prompt_tokens\": 1441,\n", " \"total_tokens\": 1633\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [3.78s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points:\\n- The study removes the NSP loss and compares different input formats for BERT.\\n- Using individual sentences hurts performance on downstream tasks due to the model's inability to 
learn long-range dependencies.\\n- Training without the NSP loss and using blocks of text from a single document (DOC-SENTENCES) outperforms the original BERT BASE results.\\n- Removing the NSP loss matches or slightly improves downstream task performance compared to Devlin et al. (2019).\\n- Restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than using sequences from multiple documents (FULL-SENTENCES).\\n- Training with large batches has been shown to improve optimization speed and end-task performance in Neural Machine Translation and BERT models.\\n- BERT BASE was originally trained for 1M steps with a batch size of 256 sequences, equivalent to training for fewer steps with larger batch sizes.\\n- Perplexity and end-task performance were compared for base models trained over BOOK CORPUS and WIKIPEDIA using varying batch sizes.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points:\\n- The study removes the NSP loss and compares different input formats for BERT.\\n- Using individual sentences hurts performance on downstream tasks due to the model's inability to learn long-range dependencies.\\n- Training without the NSP loss and using blocks of text from a single document (DOC-SENTENCES) outperforms the original BERT BASE results.\\n- Removing the NSP loss matches or slightly improves downstream task performance compared to Devlin et al. (2019).\\n- Restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than using sequences from multiple documents (FULL-SENTENCES).\\n- Training with large batches has been shown to improve optimization speed and end-task performance in Neural Machine Translation and BERT models.\\n- BERT BASE was originally trained for 1M steps with a batch size of 256 sequences, equivalent to training for fewer steps with larger batch sizes.\\n- Perplexity and end-task performance were compared for base models trained over BOOK CORPUS and WIKIPEDIA using varying batch sizes.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 218,\n", " \"prompt_tokens\": 665,\n", " \"total_tokens\": 883\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-e5d72283-c5d4-498d-b535-00baa9fbcb08-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 665,\n", " \"output_tokens\": 218,\n", " \"total_tokens\": 883\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 218,\n", " \"prompt_tokens\": 665,\n", " \"total_tokens\": 883\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [3.80s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Summary of Main Points:\\n\\n- The paper presents the use of RoBERTa, a modified version of BERT, for various 
natural language processing tasks.\\n- RoBERTa achieves state-of-the-art results on all GLUE task development sets, outperforming BERT LARGE and XLNet LARGE.\\n- RoBERTa is submitted to the GLUE leaderboard and achieves state-of-the-art results on 4 out of 9 tasks.\\n- RoBERTa is used for the SQuAD dataset and achieves competitive results with XLNet, setting a new state-of-the-art on SQuAD v2.0.\\n- RoBERTa is also evaluated on the RACE dataset and achieves state-of-the-art results on both middle-school and high-school settings.\\n- The paper discusses the importance of model architecture, pretraining objective, dataset size, and training time in achieving high performance in natural language processing tasks.\\n- Various pretraining methods and techniques are mentioned, including language modeling, masked language modeling, and multi-task fine-tuning.\\n- The paper highlights the effectiveness of RoBERTa in achieving competitive results without relying on additional external training data.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Summary of Main Points:\\n\\n- The paper presents the use of RoBERTa, a modified version of BERT, for various natural language processing tasks.\\n- RoBERTa achieves state-of-the-art results on all GLUE task development sets, outperforming BERT LARGE and XLNet LARGE.\\n- RoBERTa is submitted to the GLUE leaderboard and achieves state-of-the-art results on 4 out of 9 tasks.\\n- RoBERTa is used for the SQuAD dataset and achieves competitive results with XLNet, setting a new state-of-the-art on SQuAD v2.0.\\n- RoBERTa is also evaluated on the RACE dataset and achieves state-of-the-art results on both middle-school and high-school settings.\\n- The paper discusses the importance of model architecture, pretraining objective, dataset size, and training time in achieving high performance in natural language processing tasks.\\n- Various pretraining methods and techniques are mentioned, including language modeling, masked language modeling, and multi-task fine-tuning.\\n- The paper highlights the effectiveness of RoBERTa in achieving competitive results without relying on additional external training data.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 228,\n", " \"prompt_tokens\": 1828,\n", " \"total_tokens\": 2056\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-0d0261f0-a681-471c-a82f-1f0430a582e3-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 1828,\n", " \"output_tokens\": 228,\n", " \"total_tokens\": 2056\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 228,\n", " \"prompt_tokens\": 1828,\n", " \"total_tokens\": 2056\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [4.80s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " 
[\n", " {\n", " \"text\": \"**Summary:**\\n\\n- Language model pretraining, such as BERT, has shown significant performance gains in natural language processing tasks.\\n- A replication study of BERT pretraining was conducted to evaluate the impact of hyperparameters and training data size.\\n- The study found that BERT was significantly undertrained and proposed an improved training approach called RoBERTa, which surpassed the performance of post-BERT models.\\n- The improvements included training the model longer with bigger batches, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- The study used a new dataset (CC-NEWS) to better control for training set size effects and achieved state-of-the-art results on various tasks.\\n- The study highlights the importance of design choices in language model pretraining and releases the models and code for further research.\\n- BERT uses a transformer architecture with self-attention heads and hidden dimensions during pretraining.\\n- BERT's pretraining objectives include masked language modeling and next sentence prediction to improve downstream task performance.\\n- BERT is optimized using the Adam optimizer with specific parameters and trained on a combination of BOOK CORPUS and English WIKIPEDIA.\\n- The experimental setup for the replication study involved re-implementing BERT in FAIRSEQ and tuning hyperparameters for optimal performance.\\n- The study also emphasizes the importance of large and diverse training data in achieving better end-task performance.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"**Summary:**\\n\\n- Language model pretraining, such as BERT, has shown significant performance gains in natural language processing tasks.\\n- A replication study of BERT pretraining was conducted to evaluate the impact of hyperparameters and training data size.\\n- The study found that BERT was significantly undertrained and proposed an improved training approach called RoBERTa, which surpassed the performance of post-BERT models.\\n- The improvements included training the model longer with bigger batches, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- The study used a new dataset (CC-NEWS) to better control for training set size effects and achieved state-of-the-art results on various tasks.\\n- The study highlights the importance of design choices in language model pretraining and releases the models and code for further research.\\n- BERT uses a transformer architecture with self-attention heads and hidden dimensions during pretraining.\\n- BERT's pretraining objectives include masked language modeling and next sentence prediction to improve downstream task performance.\\n- BERT is optimized using the Adam optimizer with specific parameters and trained on a combination of BOOK CORPUS and English WIKIPEDIA.\\n- The experimental setup for the replication study involved re-implementing BERT in FAIRSEQ and tuning hyperparameters for optimal performance.\\n- The study also emphasizes the importance of large and diverse training data in achieving better end-task performance.\",\n", " 
\"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 283,\n", " \"prompt_tokens\": 2317,\n", " \"total_tokens\": 2600\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-75cc630f-34d1-4cfa-a6d9-1945ad2037ac-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 2317,\n", " \"output_tokens\": 283,\n", " \"total_tokens\": 2600\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 283,\n", " \"prompt_tokens\": 2317,\n", " \"total_tokens\": 2600\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", 
"\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [4.85s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n" ] }, { "data": { "text/plain": [ "[AIMessage(content='### Summary:\\n- The document discusses a new pretraining approach called RoBERTa, which is an optimized version of BERT.\\n- The authors of the paper are Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.', response_metadata={'token_usage': {'completion_tokens': 81, 'prompt_tokens': 182, 'total_tokens': 263}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b3735f08-fdd8-4aa6-8b9e-7c3a6a224ef6-0', usage_metadata={'input_tokens': 182, 'output_tokens': 81, 'total_tokens': 263}),\n", " AIMessage(content=\"**Summary:**\\n\\n- Language model pretraining, such as BERT, has shown significant performance gains in natural language processing tasks.\\n- A replication study of BERT pretraining was conducted to evaluate the impact of hyperparameters and training data size.\\n- The study found that BERT was significantly undertrained and proposed an improved training approach called RoBERTa, which surpassed the performance of post-BERT models.\\n- The improvements included training the model longer with bigger batches, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- The study used a new dataset (CC-NEWS) to better control for training set size effects and achieved state-of-the-art results on various tasks.\\n- The study highlights the importance of design choices in language model pretraining and releases the models and code for further research.\\n- BERT uses a transformer architecture with self-attention heads and hidden dimensions during pretraining.\\n- BERT's pretraining objectives include masked language modeling and next sentence prediction to improve downstream task performance.\\n- BERT is optimized using the Adam optimizer with specific 
parameters and trained on a combination of BOOK CORPUS and English WIKIPEDIA.\\n- The experimental setup for the replication study involved re-implementing BERT in FAIRSEQ and tuning hyperparameters for optimal performance.\\n- The study also emphasizes the importance of large and diverse training data in achieving better end-task performance.\", response_metadata={'token_usage': {'completion_tokens': 283, 'prompt_tokens': 2317, 'total_tokens': 2600}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-75cc630f-34d1-4cfa-a6d9-1945ad2037ac-0', usage_metadata={'input_tokens': 2317, 'output_tokens': 283, 'total_tokens': 2600}),\n", " AIMessage(content='### Summary of Main Points:\\n\\n- The study focuses on gathering data for experimentation from five English-language corpora, totaling over 160GB of uncompressed text.\\n- The text corpora used include BOOK CORPUS, English WIKIPEDIA, CC-NEWS, OPENWEBTEXT, and STORIES.\\n- The evaluation of pretrained models is done using benchmarks like GLUE, SQuAD, and RACE.\\n- The training procedure analysis explores the importance of choices in successfully pretraining BERT models, including comparing static vs. dynamic masking.\\n- The study finds that dynamic masking is comparable or slightly better than static masking for BERT BASE models.\\n- The NSP loss in the original BERT pretraining procedure is important for performance, with recent work questioning its necessity.', response_metadata={'token_usage': {'completion_tokens': 155, 'prompt_tokens': 2097, 'total_tokens': 2252}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-dc89b621-1f39-45ae-b016-0e18819d5165-0', usage_metadata={'input_tokens': 2097, 'output_tokens': 155, 'total_tokens': 2252}),\n", " AIMessage(content='### Summary of Main Points:\\n\\n- The text compares different training formats for natural language processing tasks using models like BERT and XLNet.\\n- Different input formats are tested, including SEGMENT-PAIR +NSP, SENTENCE-PAIR +NSP, FULL-SENTENCES, and DOC-SENTENCES.\\n- Results are reported for models pre-trained on BOOK CORPUS and WIKIPEDIA, showing performance in tasks like SQuAD, MNLI-m, SST-2, and RACE.\\n- The study includes variations in batch size and the inclusion/exclusion of the Next Sentence Prediction (NSP) loss in the training process.\\n- The results show varying performance levels for different input formats and pre-trained models, with DOC-SENTENCES performing slightly better in some cases.', response_metadata={'token_usage': {'completion_tokens': 158, 'prompt_tokens': 753, 'total_tokens': 911}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-990d2393-5dd0-4ccc-ab79-d308193dcc65-0', usage_metadata={'input_tokens': 753, 'output_tokens': 158, 'total_tokens': 911}),\n", " AIMessage(content=\"### Summary of Main Points:\\n- The study removes the NSP loss and compares different input formats for BERT.\\n- Using individual sentences hurts performance on downstream tasks due to the model's inability to learn long-range dependencies.\\n- Training without the NSP loss and using blocks of text from a single document (DOC-SENTENCES) outperforms the original BERT BASE results.\\n- Removing the NSP loss matches or slightly improves downstream task performance compared to Devlin et al. 
(2019).\\n- Restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than using sequences from multiple documents (FULL-SENTENCES).\\n- Training with large batches has been shown to improve optimization speed and end-task performance in Neural Machine Translation and BERT models.\\n- BERT BASE was originally trained for 1M steps with a batch size of 256 sequences, equivalent to training for fewer steps with larger batch sizes.\\n- Perplexity and end-task performance were compared for base models trained over BOOK CORPUS and WIKIPEDIA using varying batch sizes.\", response_metadata={'token_usage': {'completion_tokens': 218, 'prompt_tokens': 665, 'total_tokens': 883}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-e5d72283-c5d4-498d-b535-00baa9fbcb08-0', usage_metadata={'input_tokens': 665, 'output_tokens': 218, 'total_tokens': 883}),\n", " AIMessage(content='### Summary of Main Points\\n\\n- The text discusses the importance of tuning the learning rate for each setting in models to improve task performance.\\n- Training with large batches improves perplexity for the masked language modeling objective and end-task accuracy.\\n- Byte-Pair Encoding (BPE) is a hybrid representation that handles large vocabularies in natural language corpora.\\n- BPE can use bytes instead of unicode characters to create a subword vocabulary of a modest size.\\n- Large batch training can improve training efficiency through gradient accumulation.\\n- RoBERTa is a modified BERT approach trained with dynamic masking, full sentences, large mini-batches, and a larger byte-level BPE.\\n- RoBERTa is trained with different amounts of data and training passes to evaluate its impact on performance.\\n- Results for RoBERTa pretraining over more data and for longer durations are provided.\\n- Comparison results with BERT LARGE and XLNet LARGE are also included in the text.', response_metadata={'token_usage': {'completion_tokens': 192, 'prompt_tokens': 1441, 'total_tokens': 1633}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-272f60cd-1c1c-43b3-b29b-d6d9b1bfe149-0', usage_metadata={'input_tokens': 1441, 'output_tokens': 192, 'total_tokens': 1633}),\n", " AIMessage(content='- RoBERTa accumulates improvements from BERT LARGE.\\n- RoBERTa matches the architecture and training objective of BERT LARGE.\\n- Results for BERT LARGE are from Devlin et al. (2019).\\n- Results for XLNet LARGE are from Yang et al. (2019).\\n- Complete results on all GLUE tasks can be found in the Appendix.', response_metadata={'token_usage': {'completion_tokens': 75, 'prompt_tokens': 150, 'total_tokens': 225}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-134e3803-c9ba-46f5-8b0e-7d880ee43ced-0', usage_metadata={'input_tokens': 150, 'output_tokens': 75, 'total_tokens': 225}),\n", " AIMessage(content='### Summary of Main Points:\\n\\n- Devlin et al. 
(2019) pretrain their model, RoBERTa, using 1024 V100 GPUs for approximately one day.\\n- RoBERTa shows significant improvements over the originally reported BERT LARGE results, emphasizing the importance of design choices.\\n- Pretraining RoBERTa over a combined dataset of 160GB of text results in further performance improvements across all tasks.\\n- Longer pretraining steps (300K and 500K) lead to significant gains in downstream task performance, outperforming XLNet LARGE.\\n- Even the longest-trained model does not overfit the data and could benefit from additional training.\\n- The best RoBERTa model is evaluated on GLUE, SQuaD, and RACE benchmarks.\\n- For GLUE, RoBERTa is compared to other approaches on the test set via the leaderboard, showing competitive results.\\n- Task-specific modifications are necessary for certain tasks in the GLUE benchmark to achieve competitive results, such as the QNLI task adopting a pairwise ranking formulation.', response_metadata={'token_usage': {'completion_tokens': 213, 'prompt_tokens': 1409, 'total_tokens': 1622}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-da7e7d8e-b6a1-4b04-b1d1-4c2eeb971324-0', usage_metadata={'input_tokens': 1409, 'output_tokens': 213, 'total_tokens': 1622}),\n", " AIMessage(content='### Summary of Main Points:\\n\\n- The paper presents the use of RoBERTa, a modified version of BERT, for various natural language processing tasks.\\n- RoBERTa achieves state-of-the-art results on all GLUE task development sets, outperforming BERT LARGE and XLNet LARGE.\\n- RoBERTa is submitted to the GLUE leaderboard and achieves state-of-the-art results on 4 out of 9 tasks.\\n- RoBERTa is used for the SQuAD dataset and achieves competitive results with XLNet, setting a new state-of-the-art on SQuAD v2.0.\\n- RoBERTa is also evaluated on the RACE dataset and achieves state-of-the-art results on both middle-school and high-school settings.\\n- The paper discusses the importance of model architecture, pretraining objective, dataset size, and training time in achieving high performance in natural language processing tasks.\\n- Various pretraining methods and techniques are mentioned, including language modeling, masked language modeling, and multi-task fine-tuning.\\n- The paper highlights the effectiveness of RoBERTa in achieving competitive results without relying on additional external training data.', response_metadata={'token_usage': {'completion_tokens': 228, 'prompt_tokens': 1828, 'total_tokens': 2056}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-0d0261f0-a681-471c-a82f-1f0430a582e3-0', usage_metadata={'input_tokens': 1828, 'output_tokens': 228, 'total_tokens': 2056}),\n", " AIMessage(content=\"- The authors aimed to replicate, simplify, and improve the training of BERT for better understanding its performance compared to other methods.\\n- They evaluated various design decisions for pretraining BERT models and found that performance can be significantly enhanced by training the model longer, with bigger batches, on more data, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- Their improved pretraining procedure, RoBERTa, achieved state-of-the-art results on GLUE, RACE, and SQuAD without multi-task finetuning for GLUE or additional data for SQuAD.\\n- The results highlight the importance of design decisions previously 
overlooked and suggest that BERT's pretraining objective remains competitive with other alternatives.\\n- They also introduced a novel dataset, CC-NEWS, and made their models and code available for pretraining and finetuning at a specific GitHub link.\", response_metadata={'token_usage': {'completion_tokens': 182, 'prompt_tokens': 475, 'total_tokens': 657}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-6c23125c-2c3c-482b-86ea-4cbfdf464f2c-0', usage_metadata={'input_tokens': 475, 'output_tokens': 182, 'total_tokens': 657}),\n", " AIMessage(content='- The text discusses the PASCAL recognizing textual entailment challenge, which has had multiple iterations over the years.\\n- It mentions key researchers involved in the challenges, such as Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor, Luisa Bentivogli, Hoa Trang Dang, Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning.\\n- The challenges focus on the task of recognizing textual entailment, which involves determining if a given piece of text entails, contradicts, or is neutral with respect to another piece of text.\\n- The text also references the creation of a large annotated corpus for learning natural language inference, which was presented at the Empirical Methods in Natural Language Processing (EMNLP) conference in 2015.', response_metadata={'token_usage': {'completion_tokens': 182, 'prompt_tokens': 262, 'total_tokens': 444}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-6a7f8301-9541-42e8-afbf-4a86c6c25118-0', usage_metadata={'input_tokens': 262, 'output_tokens': 182, 'total_tokens': 444}),\n", " AIMessage(content='- The text discusses a paper titled \"KERMIT: Generative insertion-based modeling for sequences\"\\n- The authors of the paper are William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit\\n- The paper likely introduces a new model or approach called KERMIT for generating sequences using insertion-based modeling', response_metadata={'token_usage': {'completion_tokens': 71, 'prompt_tokens': 118, 'total_tokens': 189}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-e294e84d-fb13-459c-b9fe-f8592b8aaa54-0', usage_metadata={'input_tokens': 118, 'output_tokens': 71, 'total_tokens': 189}),\n", " AIMessage(content='- The text mentions several key papers and challenges in the field of natural language processing and machine learning.\\n- It references the PASCAL Recognising Textual Entailment Challenge, Semi-supervised Sequence Learning, BERT pre-training model, and Unified Language Model pre-training.\\n- These papers and challenges are important in advancing the understanding and generation of natural language.\\n- The text also highlights the use of deep bidirectional transformers and the construction of corpora for paraphrasing in natural language processing research.', response_metadata={'token_usage': {'completion_tokens': 99, 'prompt_tokens': 377, 'total_tokens': 476}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-392ae862-d242-451b-84a2-4e9541565630-0', usage_metadata={'input_tokens': 377, 'output_tokens': 99, 'total_tokens': 476}),\n", " AIMessage(content='### Summary of Main Points:\\n- The text discusses the third PASCAL recognizing textual entailment 
challenge, which was held in 2007.\\n- It also mentions the OpenWebText Corpus created by Aaron Gokaslan and Vanya Cohen in 2019.\\n- The link to access the OpenWebText Corpus is provided.', response_metadata={'token_usage': {'completion_tokens': 67, 'prompt_tokens': 181, 'total_tokens': 248}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-d6ee0613-3e00-4f31-8763-19c21863e37b-0', usage_metadata={'input_tokens': 181, 'output_tokens': 67, 'total_tokens': 248}),\n", " AIMessage(content='# Summary of Main Points:\\n\\n- **Authors**: Felix Hamborg, Norman Meuschke, Corinna Brenitinger, and Bela Gipp\\n- **Year**: 2017\\n- **Title**: news-please: A generic news crawler and extractor\\n- **Event**: Proceedings of the 15th International Symposium of Information Science', response_metadata={'token_usage': {'completion_tokens': 71, 'prompt_tokens': 132, 'total_tokens': 203}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b3d50fe8-1206-44d1-9012-1ab7b08ca243-0', usage_metadata={'input_tokens': 132, 'output_tokens': 71, 'total_tokens': 203}),\n", " AIMessage(content='- The paper is authored by Dan Hendrycks and Kevin Gimpel in 2016.\\n- The paper introduces a new activation function called Gaussian Error Linear Units (GELUs).\\n- The paper is available as an arXiv preprint with the identifier arXiv:1606.08415.', response_metadata={'token_usage': {'completion_tokens': 63, 'prompt_tokens': 117, 'total_tokens': 180}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-edf8912f-5996-4b74-86b4-5f7559161221-0', usage_metadata={'input_tokens': 117, 'output_tokens': 63, 'total_tokens': 180}),\n", " AIMessage(content='### Summary:\\n- **Authors and Publications:** \\n - Matthew Honnibal and Ines Montani are working on spaCy 2 for natural language understanding with Bloom embeddings, convolutional neural networks, and incremental parsing. \\n - Jeremy Howard and Sebastian Ruder are focusing on universal language model fine-tuning for text classification.\\n - Shankar Iyer, Nikhil Dandekar, and Kornl Csernai released the first Quora dataset on question pairs.\\n - Mandar Joshi, Danqi Chen, Yinhan Liu, and Daniel S. 
are also involved in the field.', response_metadata={'token_usage': {'completion_tokens': 125, 'prompt_tokens': 237, 'total_tokens': 362}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-a1def3ae-6dda-4989-beae-0094f64cfa5c-0', usage_metadata={'input_tokens': 237, 'output_tokens': 125, 'total_tokens': 362}),\n", " AIMessage(content='- SpanBERT is a pre-training method that improves performance by representing and predicting spans.\\n- The paper was published as an arXiv preprint with the identifier arXiv:1907.10529.\\n- Another method mentioned in the text is Adam, which is a stochastic optimization technique.\\n- Adam was presented at the International Conference on Learning Representations (ICLR) in 2015 by Diederik Kingma and Jimmy Ba.', response_metadata={'token_usage': {'completion_tokens': 89, 'prompt_tokens': 161, 'total_tokens': 250}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-29ce0d28-5945-43dc-956c-4d3526654618-0', usage_metadata={'input_tokens': 161, 'output_tokens': 89, 'total_tokens': 250}),\n", " AIMessage(content='- The text mentions several technical papers and journals related to natural language processing.\\n- One paper discusses a robust trick for the Winograd Schema Challenge.\\n- Another paper introduces a large-scale reading comprehension dataset called RACE.\\n- A third paper focuses on cross-lingual language model pretraining.\\n- All papers are available as arXiv preprints.', response_metadata={'token_usage': {'completion_tokens': 70, 'prompt_tokens': 239, 'total_tokens': 309}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-4240f3f6-f800-4e0b-afc8-8f4b70438a9d-0', usage_metadata={'input_tokens': 239, 'output_tokens': 70, 'total_tokens': 309}),\n", " AIMessage(content='- The text discusses the Winograd schema challenge, introduced by Hector J Levesque, Ernest Davis, and Leora Morgenstern in 2011.\\n- It also mentions a more recent paper by Xiaodong Liu et al. from 2019, which focuses on improving multi-task deep neural networks through knowledge distillation for natural language understanding.\\n- The first paper addresses commonsense reasoning, while the second paper deals with enhancing neural networks for language processing.', response_metadata={'token_usage': {'completion_tokens': 94, 'prompt_tokens': 184, 'total_tokens': 278}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-4a2c4819-b11a-4b81-9017-85e3e56d94f4-0', usage_metadata={'input_tokens': 184, 'output_tokens': 94, 'total_tokens': 278}),\n", " AIMessage(content='- The text discusses two papers on natural language understanding using deep neural networks.\\n- The first paper by Liu et al. focuses on multi-task deep neural networks for natural language understanding.\\n- The second paper by McCann et al. 
discusses contextualized word vectors learned in translation.\\n- Both papers are available as preprints on arXiv and were presented at the Advances in Neural Information Processing Systems (NIPS) conference.', response_metadata={'token_usage': {'completion_tokens': 84, 'prompt_tokens': 193, 'total_tokens': 277}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-a53f8018-4244-460e-bf36-466c695e6754-0', usage_metadata={'input_tokens': 193, 'output_tokens': 84, 'total_tokens': 277}),\n", " AIMessage(content='- The text mentions a paper on mixed precision training presented at the International Conference on Learning Representations in 2018 by a group of authors.\\n- It also references a dataset called Cc-news from 2016, which is available online.', response_metadata={'token_usage': {'completion_tokens': 49, 'prompt_tokens': 196, 'total_tokens': 245}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-f58b3f07-7e2b-4f02-a170-1f8d45b10643-0', usage_metadata={'input_tokens': 196, 'output_tokens': 49, 'total_tokens': 245}),\n", " AIMessage(content='- FAIRSEQ is a fast and extensible toolkit for sequence modeling, presented at NAACL.\\n- Scaling neural machine translation was discussed in a paper at the Third Conference on Machine Translation (WMT).\\n- Automatic differentiation in PyTorch was covered at the NIPS Autodiff Workshop.\\n- Deep contextualized word representations were presented at NAACL.\\n- OpenAI published reports on improving language understanding with unsupervised learning and language models as unsupervised multitask learners.', response_metadata={'token_usage': {'completion_tokens': 96, 'prompt_tokens': 439, 'total_tokens': 535}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-9d97ef45-64bb-442a-b296-87dfebdbb3ac-0', usage_metadata={'input_tokens': 439, 'output_tokens': 96, 'total_tokens': 535}),\n", " AIMessage(content='- The text references several key papers and journals in the field of natural language processing and machine learning.\\n- These papers include topics such as unanswerable questions for machine comprehension, neural machine translation, recursive deep models for sentiment analysis, and masked sequence-to-sequence pre-training for language generation.\\n- The papers were presented at conferences such as the Association for Computational Linguistics (ACL), Empirical Methods in Natural Language Processing (EMNLP), and the International Conference on Machine Learning (ICML).', response_metadata={'token_usage': {'completion_tokens': 99, 'prompt_tokens': 356, 'total_tokens': 455}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-4452c353-9791-4747-9fd0-862a73125f50-0', usage_metadata={'input_tokens': 356, 'output_tokens': 99, 'total_tokens': 455}),\n", " AIMessage(content='- The text discusses multiple technical papers related to natural language processing and machine learning.\\n- One of the papers mentioned is \"ERNIE: Enhanced representation through knowledge integration\" by Yu Stephanie Sun et al.\\n- Another paper mentioned is \"A simple method for commonsense reasoning\" by Trieu H Trinh and Quoc V Le.\\n- The text also references the paper \"Attention is all you need\" by Ashish Vaswani et al.\\n- Additionally, the paper \"Advances in neural information processing systems\" is mentioned.\\n- The 
text provides a list of authors for each paper mentioned.', response_metadata={'token_usage': {'completion_tokens': 119, 'prompt_tokens': 313, 'total_tokens': 432}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-11b7a63c-f932-4b2d-99ff-46ec3bbedc82-0', usage_metadata={'input_tokens': 313, 'output_tokens': 119, 'total_tokens': 432}),\n", " AIMessage(content='# Summary of \"SuperGLUE: A stickier benchmark for general-purpose language understanding systems\" by Bowman (2019)\\n\\n- The text is discussing a benchmark called SuperGLUE, which is designed to evaluate general-purpose language understanding systems.\\n- SuperGLUE is intended to be a more challenging benchmark compared to existing ones.\\n- The benchmark is named \"SuperGLUE\" to imply that it is stickier, meaning it is more difficult for language understanding systems to achieve high scores.\\n- The goal of SuperGLUE is to push the boundaries of what language understanding systems can achieve.', response_metadata={'token_usage': {'completion_tokens': 117, 'prompt_tokens': 101, 'total_tokens': 218}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-27a8ce88-dcdf-48a9-af6b-5ca8bbc4d4af-0', usage_metadata={'input_tokens': 101, 'output_tokens': 117, 'total_tokens': 218}),\n", " AIMessage(content='- The text discusses various technical papers and preprints related to natural language understanding and neural network acceptability judgments.\\n- The GLUE benchmark is highlighted as a multi-task benchmark and analysis platform for natural language understanding.\\n- The paper by Warstadt et al. (2018) focuses on neural network acceptability judgments.\\n- Williams et al. (2018) introduce a challenge corpus for sentence understanding through inference.\\n- Yang et al. 
(2019) present XLNet, a generalized autoregressive pretraining method for language understanding.', response_metadata={'token_usage': {'completion_tokens': 107, 'prompt_tokens': 319, 'total_tokens': 426}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-1b9cdb23-2bda-4e02-b7dd-46f278e814d2-0', usage_metadata={'input_tokens': 319, 'output_tokens': 107, 'total_tokens': 426}),\n", " AIMessage(content='- The first paper discusses a method to reduce the pre-training time for the BERT model from 3 days to just 76 minutes.\\n- The second paper focuses on defending against neural fake news, with authors including Rowan Zellers, Ari Holtzman, and Franziska Roesner.\\n- Both papers are arXiv preprints, indicating they have not yet been peer-reviewed.', response_metadata={'token_usage': {'completion_tokens': 78, 'prompt_tokens': 205, 'total_tokens': 283}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-250a154f-7fed-453c-9e4e-c8e482a023c8-0', usage_metadata={'input_tokens': 205, 'output_tokens': 78, 'total_tokens': 283}),\n", " AIMessage(content='### Summary:\\n- The paper discusses aligning books and movies to create story-like visual explanations by watching movies and reading books.\\n- The paper presents results for RoBERTa in both LARGE and BASE configurations for various tasks.\\n- Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE are provided in detail.\\n- Finetuning hyperparameters for tasks such as RACE, SQuAD, and GLUE are also discussed.\\n- Results for different configurations of RoBERTa on GLUE tasks are presented, showing improvements with additional data and longer pretraining.\\n- A comparison of hyperparameters for RoBERTa LARGE and RoBERTa BASE is provided in tables.', response_metadata={'token_usage': {'completion_tokens': 135, 'prompt_tokens': 914, 'total_tokens': 1049}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-34832376-e334-4e6e-8393-11f7b85f7fae-0', usage_metadata={'input_tokens': 914, 'output_tokens': 135, 'total_tokens': 1049})]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from langchain.globals import set_debug\n", "\n", "set_debug(True)\n", "\n", "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", openai_api_key=os.environ[\"OPENAI_KEY\"])\n", "\n", "# Map\n", "map_template = \"\"\"You are an expert in technical papers and journals.\n", "You're tasked with summarizing the main points in the following text.\n", "The following is the text you need to summarize:\n", "{doc}\n", "Based on this text, provide a summary of the main points.\n", "\n", "RULES:\n", "- Organize the points in markdown format.\n", "\n", "Helpful Answer:\n", "\"\"\"\n", "\n", "map_prompt = PromptTemplate.from_template(map_template)\n", "map_chain = (\n", " {\"doc\": RunnablePassthrough()}\n", " | map_prompt\n", " | llm\n", ")\n", "map_res = await map_chain.abatch([{'doc': doc.page_content} for doc in docs], config={\"max_concurrency\": 40})\n", "map_res" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m[inputs]\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain 
run with input:\n", "\u001b[0m[inputs]\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m[inputs]\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnableSequence > chain:RunnablePassthrough] Entering Chain run with input:\n", "\u001b[0m[inputs]\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnableSequence > chain:RunnablePassthrough] [0ms] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnableSequence > chain:concat_summaries] Entering Chain run with input:\n", "\u001b[0m[inputs]\n", "{'docs': [AIMessage(content='### Summary:\\n- The document discusses a new pretraining approach called RoBERTa, which is an optimized version of BERT.\\n- The authors of the paper are Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.', response_metadata={'token_usage': {'completion_tokens': 81, 'prompt_tokens': 182, 'total_tokens': 263}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b3735f08-fdd8-4aa6-8b9e-7c3a6a224ef6-0', usage_metadata={'input_tokens': 182, 'output_tokens': 81, 'total_tokens': 263}), AIMessage(content=\"**Summary:**\\n\\n- Language model pretraining, such as BERT, has shown significant performance gains in natural language processing tasks.\\n- A replication study of BERT pretraining was conducted to evaluate the impact of hyperparameters and training data size.\\n- The study found that BERT was significantly undertrained and proposed an improved training approach called RoBERTa, which surpassed the performance of post-BERT models.\\n- The improvements included training the model longer with bigger batches, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- The study used a new dataset (CC-NEWS) to better control for training set size effects and achieved state-of-the-art results on various tasks.\\n- The study highlights the importance of design choices in language model pretraining and releases the models and code for further research.\\n- BERT uses a transformer architecture with self-attention heads and hidden dimensions during pretraining.\\n- BERT's pretraining objectives include masked language modeling and next sentence prediction to improve downstream task performance.\\n- BERT is optimized using the Adam optimizer with specific parameters and trained on a combination of BOOK CORPUS and English WIKIPEDIA.\\n- The experimental setup for the replication study involved re-implementing BERT in FAIRSEQ and tuning hyperparameters for optimal performance.\\n- The study also emphasizes the importance of large and diverse training data in achieving better end-task performance.\", response_metadata={'token_usage': {'completion_tokens': 283, 'prompt_tokens': 2317, 'total_tokens': 2600}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-75cc630f-34d1-4cfa-a6d9-1945ad2037ac-0', usage_metadata={'input_tokens': 2317, 'output_tokens': 283, 'total_tokens': 2600}), AIMessage(content='### Summary of Main 
Points: ... [duplicate map summaries truncated; identical to the map_res output above] ... Yang et al. 
(2019) present XLNet, a generalized autoregressive pretraining method for language understanding.', response_metadata={'token_usage': {'completion_tokens': 107, 'prompt_tokens': 319, 'total_tokens': 426}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-1b9cdb23-2bda-4e02-b7dd-46f278e814d2-0', usage_metadata={'input_tokens': 319, 'output_tokens': 107, 'total_tokens': 426}), AIMessage(content='- The first paper discusses a method to reduce the pre-training time for the BERT model from 3 days to just 76 minutes.\\n- The second paper focuses on defending against neural fake news, with authors including Rowan Zellers, Ari Holtzman, and Franziska Roesner.\\n- Both papers are arXiv preprints, indicating they have not yet been peer-reviewed.', response_metadata={'token_usage': {'completion_tokens': 78, 'prompt_tokens': 205, 'total_tokens': 283}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-250a154f-7fed-453c-9e4e-c8e482a023c8-0', usage_metadata={'input_tokens': 205, 'output_tokens': 78, 'total_tokens': 283}), AIMessage(content='### Summary:\\n- The paper discusses aligning books and movies to create story-like visual explanations by watching movies and reading books.\\n- The paper presents results for RoBERTa in both LARGE and BASE configurations for various tasks.\\n- Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE are provided in detail.\\n- Finetuning hyperparameters for tasks such as RACE, SQuAD, and GLUE are also discussed.\\n- Results for different configurations of RoBERTa on GLUE tasks are presented, showing improvements with additional data and longer pretraining.\\n- A comparison of hyperparameters for RoBERTa LARGE and RoBERTa BASE is provided in tables.', response_metadata={'token_usage': {'completion_tokens': 135, 'prompt_tokens': 914, 'total_tokens': 1049}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-34832376-e334-4e6e-8393-11f7b85f7fae-0', usage_metadata={'input_tokens': 914, 'output_tokens': 135, 'total_tokens': 1049})]}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnableSequence > chain:concat_summaries] [1ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"output\": \"### Summary:\\n- The document discusses a new pretraining approach called RoBERTa, which is an optimized version of BERT.\\n- The authors of the paper are Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.\\n\\n**Summary:**\\n\\n- Language model pretraining, such as BERT, has shown significant performance gains in natural language processing tasks.\\n- A replication study of BERT pretraining was conducted to evaluate the impact of hyperparameters and training data size.\\n- The study found that BERT was significantly undertrained and proposed an improved training approach called RoBERTa, which surpassed the performance of post-BERT models.\\n- The improvements included training the model longer with bigger batches, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- The study used a new dataset (CC-NEWS) to better control for training set size effects and achieved state-of-the-art results on various tasks.\\n- The study highlights the importance of design 
choices. ... [duplicate concatenated summary truncated; it is repeated in the chain output logged below] ... The text also references a paper by Ashish 
Vaswani et al.\\n- Additionally, the paper \\\"Advances in neural information processing systems\\\" is mentioned.\\n- The text provides a list of authors for each paper mentioned.\\n\\n# Summary of \\\"SuperGLUE: A stickier benchmark for general-purpose language understanding systems\\\" by Bowman (2019)\\n\\n- The text is discussing a benchmark called SuperGLUE, which is designed to evaluate general-purpose language understanding systems.\\n- SuperGLUE is intended to be a more challenging benchmark compared to existing ones.\\n- The benchmark is named \\\"SuperGLUE\\\" to imply that it is stickier, meaning it is more difficult for language understanding systems to achieve high scores.\\n- The goal of SuperGLUE is to push the boundaries of what language understanding systems can achieve.\\n\\n- The text discusses various technical papers and preprints related to natural language understanding and neural network acceptability judgments.\\n- The GLUE benchmark is highlighted as a multi-task benchmark and analysis platform for natural language understanding.\\n- The paper by Warstadt et al. (2018) focuses on neural network acceptability judgments.\\n- Williams et al. (2018) introduce a challenge corpus for sentence understanding through inference.\\n- Yang et al. (2019) present XLNet, a generalized autoregressive pretraining method for language understanding.\\n\\n- The first paper discusses a method to reduce the pre-training time for the BERT model from 3 days to just 76 minutes.\\n- The second paper focuses on defending against neural fake news, with authors including Rowan Zellers, Ari Holtzman, and Franziska Roesner.\\n- Both papers are arXiv preprints, indicating they have not yet been peer-reviewed.\\n\\n### Summary:\\n- The paper discusses aligning books and movies to create story-like visual explanations by watching movies and reading books.\\n- The paper presents results for RoBERTa in both LARGE and BASE configurations for various tasks.\\n- Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE are provided in detail.\\n- Finetuning hyperparameters for tasks such as RACE, SQuAD, and GLUE are also discussed.\\n- Results for different configurations of RoBERTa on GLUE tasks are presented, showing improvements with additional data and longer pretraining.\\n- A comparison of hyperparameters for RoBERTa LARGE and RoBERTa BASE is provided in tables.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [0ms] Exiting Prompt run with output:\n", "\u001b[0m[outputs]\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: The following is set of summaries of a technical paper:\\n### Summary:\\n- The document discusses a new pretraining approach called RoBERTa, which is an optimized version of BERT.\\n- The authors of the paper are Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.\\n\\n**Summary:**\\n\\n- Language model pretraining, such as BERT, has shown significant performance gains in natural language processing tasks.\\n- A replication study of BERT pretraining was conducted to evaluate the impact of hyperparameters and training data size.\\n- The study found that BERT was significantly undertrained and proposed an improved training approach called RoBERTa, which surpassed the performance of post-BERT 
models.\\n- The improvements included training the model longer with bigger batches, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- The study used a new dataset (CC-NEWS) to better control for training set size effects and achieved state-of-the-art results on various tasks.\\n- The study highlights the importance of design choices in language model pretraining and releases the models and code for further research.\\n- BERT uses a transformer architecture with self-attention heads and hidden dimensions during pretraining.\\n- BERT's pretraining objectives include masked language modeling and next sentence prediction to improve downstream task performance.\\n- BERT is optimized using the Adam optimizer with specific parameters and trained on a combination of BOOK CORPUS and English WIKIPEDIA.\\n- The experimental setup for the replication study involved re-implementing BERT in FAIRSEQ and tuning hyperparameters for optimal performance.\\n- The study also emphasizes the importance of large and diverse training data in achieving better end-task performance.\\n\\n### Summary of Main Points:\\n\\n- The study focuses on gathering data for experimentation from five English-language corpora, totaling over 160GB of uncompressed text.\\n- The text corpora used include BOOK CORPUS, English WIKIPEDIA, CC-NEWS, OPENWEBTEXT, and STORIES.\\n- The evaluation of pretrained models is done using benchmarks like GLUE, SQuAD, and RACE.\\n- The training procedure analysis explores the importance of choices in successfully pretraining BERT models, including comparing static vs. dynamic masking.\\n- The study finds that dynamic masking is comparable or slightly better than static masking for BERT BASE models.\\n- The NSP loss in the original BERT pretraining procedure is important for performance, with recent work questioning its necessity.\\n\\n### Summary of Main Points:\\n\\n- The text compares different training formats for natural language processing tasks using models like BERT and XLNet.\\n- Different input formats are tested, including SEGMENT-PAIR +NSP, SENTENCE-PAIR +NSP, FULL-SENTENCES, and DOC-SENTENCES.\\n- Results are reported for models pre-trained on BOOK CORPUS and WIKIPEDIA, showing performance in tasks like SQuAD, MNLI-m, SST-2, and RACE.\\n- The study includes variations in batch size and the inclusion/exclusion of the Next Sentence Prediction (NSP) loss in the training process.\\n- The results show varying performance levels for different input formats and pre-trained models, with DOC-SENTENCES performing slightly better in some cases.\\n\\n### Summary of Main Points:\\n- The study removes the NSP loss and compares different input formats for BERT.\\n- Using individual sentences hurts performance on downstream tasks due to the model's inability to learn long-range dependencies.\\n- Training without the NSP loss and using blocks of text from a single document (DOC-SENTENCES) outperforms the original BERT BASE results.\\n- Removing the NSP loss matches or slightly improves downstream task performance compared to Devlin et al. 
(2019).\\n- Restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than using sequences from multiple documents (FULL-SENTENCES).\\n- Training with large batches has been shown to improve optimization speed and end-task performance in Neural Machine Translation and BERT models.\\n- BERT BASE was originally trained for 1M steps with a batch size of 256 sequences, equivalent to training for fewer steps with larger batch sizes.\\n- Perplexity and end-task performance were compared for base models trained over BOOK CORPUS and WIKIPEDIA using varying batch sizes.\\n\\n### Summary of Main Points\\n\\n- The text discusses the importance of tuning the learning rate for each setting in models to improve task performance.\\n- Training with large batches improves perplexity for the masked language modeling objective and end-task accuracy.\\n- Byte-Pair Encoding (BPE) is a hybrid representation that handles large vocabularies in natural language corpora.\\n- BPE can use bytes instead of unicode characters to create a subword vocabulary of a modest size.\\n- Large batch training can improve training efficiency through gradient accumulation.\\n- RoBERTa is a modified BERT approach trained with dynamic masking, full sentences, large mini-batches, and a larger byte-level BPE.\\n- RoBERTa is trained with different amounts of data and training passes to evaluate its impact on performance.\\n- Results for RoBERTa pretraining over more data and for longer durations are provided.\\n- Comparison results with BERT LARGE and XLNet LARGE are also included in the text.\\n\\n- RoBERTa accumulates improvements from BERT LARGE.\\n- RoBERTa matches the architecture and training objective of BERT LARGE.\\n- Results for BERT LARGE are from Devlin et al. (2019).\\n- Results for XLNet LARGE are from Yang et al. (2019).\\n- Complete results on all GLUE tasks can be found in the Appendix.\\n\\n### Summary of Main Points:\\n\\n- Devlin et al. 
(2019) pretrain their model, RoBERTa, using 1024 V100 GPUs for approximately one day.\\n- RoBERTa shows significant improvements over the originally reported BERT LARGE results, emphasizing the importance of design choices.\\n- Pretraining RoBERTa over a combined dataset of 160GB of text results in further performance improvements across all tasks.\\n- Longer pretraining steps (300K and 500K) lead to significant gains in downstream task performance, outperforming XLNet LARGE.\\n- Even the longest-trained model does not overfit the data and could benefit from additional training.\\n- The best RoBERTa model is evaluated on GLUE, SQuaD, and RACE benchmarks.\\n- For GLUE, RoBERTa is compared to other approaches on the test set via the leaderboard, showing competitive results.\\n- Task-specific modifications are necessary for certain tasks in the GLUE benchmark to achieve competitive results, such as the QNLI task adopting a pairwise ranking formulation.\\n\\n### Summary of Main Points:\\n\\n- The paper presents the use of RoBERTa, a modified version of BERT, for various natural language processing tasks.\\n- RoBERTa achieves state-of-the-art results on all GLUE task development sets, outperforming BERT LARGE and XLNet LARGE.\\n- RoBERTa is submitted to the GLUE leaderboard and achieves state-of-the-art results on 4 out of 9 tasks.\\n- RoBERTa is used for the SQuAD dataset and achieves competitive results with XLNet, setting a new state-of-the-art on SQuAD v2.0.\\n- RoBERTa is also evaluated on the RACE dataset and achieves state-of-the-art results on both middle-school and high-school settings.\\n- The paper discusses the importance of model architecture, pretraining objective, dataset size, and training time in achieving high performance in natural language processing tasks.\\n- Various pretraining methods and techniques are mentioned, including language modeling, masked language modeling, and multi-task fine-tuning.\\n- The paper highlights the effectiveness of RoBERTa in achieving competitive results without relying on additional external training data.\\n\\n- The authors aimed to replicate, simplify, and improve the training of BERT for better understanding its performance compared to other methods.\\n- They evaluated various design decisions for pretraining BERT models and found that performance can be significantly enhanced by training the model longer, with bigger batches, on more data, removing the next sentence prediction objective, training on longer sequences, and dynamically changing the masking pattern.\\n- Their improved pretraining procedure, RoBERTa, achieved state-of-the-art results on GLUE, RACE, and SQuAD without multi-task finetuning for GLUE or additional data for SQuAD.\\n- The results highlight the importance of design decisions previously overlooked and suggest that BERT's pretraining objective remains competitive with other alternatives.\\n- They also introduced a novel dataset, CC-NEWS, and made their models and code available for pretraining and finetuning at a specific GitHub link.\\n\\n- The text discusses the PASCAL recognizing textual entailment challenge, which has had multiple iterations over the years.\\n- It mentions key researchers involved in the challenges, such as Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor, Luisa Bentivogli, Hoa Trang Dang, Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning.\\n- The challenges focus on the task of recognizing textual entailment, which 
involves determining if a given piece of text entails, contradicts, or is neutral with respect to another piece of text.\\n- The text also references the creation of a large annotated corpus for learning natural language inference, which was presented at the Empirical Methods in Natural Language Processing (EMNLP) conference in 2015.\\n\\n- The text discusses a paper titled \\\"KERMIT: Generative insertion-based modeling for sequences\\\"\\n- The authors of the paper are William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit\\n- The paper likely introduces a new model or approach called KERMIT for generating sequences using insertion-based modeling\\n\\n- The text mentions several key papers and challenges in the field of natural language processing and machine learning.\\n- It references the PASCAL Recognising Textual Entailment Challenge, Semi-supervised Sequence Learning, BERT pre-training model, and Unified Language Model pre-training.\\n- These papers and challenges are important in advancing the understanding and generation of natural language.\\n- The text also highlights the use of deep bidirectional transformers and the construction of corpora for paraphrasing in natural language processing research.\\n\\n### Summary of Main Points:\\n- The text discusses the third PASCAL recognizing textual entailment challenge, which was held in 2007.\\n- It also mentions the OpenWebText Corpus created by Aaron Gokaslan and Vanya Cohen in 2019.\\n- The link to access the OpenWebText Corpus is provided.\\n\\n# Summary of Main Points:\\n\\n- **Authors**: Felix Hamborg, Norman Meuschke, Corinna Brenitinger, and Bela Gipp\\n- **Year**: 2017\\n- **Title**: news-please: A generic news crawler and extractor\\n- **Event**: Proceedings of the 15th International Symposium of Information Science\\n\\n- The paper is authored by Dan Hendrycks and Kevin Gimpel in 2016.\\n- The paper introduces a new activation function called Gaussian Error Linear Units (GELUs).\\n- The paper is available as an arXiv preprint with the identifier arXiv:1606.08415.\\n\\n### Summary:\\n- **Authors and Publications:** \\n - Matthew Honnibal and Ines Montani are working on spaCy 2 for natural language understanding with Bloom embeddings, convolutional neural networks, and incremental parsing. \\n - Jeremy Howard and Sebastian Ruder are focusing on universal language model fine-tuning for text classification.\\n - Shankar Iyer, Nikhil Dandekar, and Kornl Csernai released the first Quora dataset on question pairs.\\n - Mandar Joshi, Danqi Chen, Yinhan Liu, and Daniel S. 
are also involved in the field.\\n\\n- SpanBERT is a pre-training method that improves performance by representing and predicting spans.\\n- The paper was published as an arXiv preprint with the identifier arXiv:1907.10529.\\n- Another method mentioned in the text is Adam, which is a stochastic optimization technique.\\n- Adam was presented at the International Conference on Learning Representations (ICLR) in 2015 by Diederik Kingma and Jimmy Ba.\\n\\n- The text mentions several technical papers and journals related to natural language processing.\\n- One paper discusses a robust trick for the Winograd Schema Challenge.\\n- Another paper introduces a large-scale reading comprehension dataset called RACE.\\n- A third paper focuses on cross-lingual language model pretraining.\\n- All papers are available as arXiv preprints.\\n\\n- The text discusses the Winograd schema challenge, introduced by Hector J Levesque, Ernest Davis, and Leora Morgenstern in 2011.\\n- It also mentions a more recent paper by Xiaodong Liu et al. from 2019, which focuses on improving multi-task deep neural networks through knowledge distillation for natural language understanding.\\n- The first paper addresses commonsense reasoning, while the second paper deals with enhancing neural networks for language processing.\\n\\n- The text discusses two papers on natural language understanding using deep neural networks.\\n- The first paper by Liu et al. focuses on multi-task deep neural networks for natural language understanding.\\n- The second paper by McCann et al. discusses contextualized word vectors learned in translation.\\n- Both papers are available as preprints on arXiv and were presented at the Advances in Neural Information Processing Systems (NIPS) conference.\\n\\n- The text mentions a paper on mixed precision training presented at the International Conference on Learning Representations in 2018 by a group of authors.\\n- It also references a dataset called Cc-news from 2016, which is available online.\\n\\n- FAIRSEQ is a fast and extensible toolkit for sequence modeling, presented at NAACL.\\n- Scaling neural machine translation was discussed in a paper at the Third Conference on Machine Translation (WMT).\\n- Automatic differentiation in PyTorch was covered at the NIPS Autodiff Workshop.\\n- Deep contextualized word representations were presented at NAACL.\\n- OpenAI published reports on improving language understanding with unsupervised learning and language models as unsupervised multitask learners.\\n\\n- The text references several key papers and journals in the field of natural language processing and machine learning.\\n- These papers include topics such as unanswerable questions for machine comprehension, neural machine translation, recursive deep models for sentiment analysis, and masked sequence-to-sequence pre-training for language generation.\\n- The papers were presented at conferences such as the Association for Computational Linguistics (ACL), Empirical Methods in Natural Language Processing (EMNLP), and the International Conference on Machine Learning (ICML).\\n\\n- The text discusses multiple technical papers related to natural language processing and machine learning.\\n- One of the papers mentioned is \\\"ERNIE: Enhanced representation through knowledge integration\\\" by Yu Stephanie Sun et al.\\n- Another paper mentioned is \\\"A simple method for commonsense reasoning\\\" by Trieu H Trinh and Quoc V Le.\\n- The text also references the paper \\\"Attention is all you need\\\" by Ashish 
Vaswani et al.\\n- Additionally, the paper \\\"Advances in neural information processing systems\\\" is mentioned.\\n- The text provides a list of authors for each paper mentioned.\\n\\n# Summary of \\\"SuperGLUE: A stickier benchmark for general-purpose language understanding systems\\\" by Bowman (2019)\\n\\n- The text is discussing a benchmark called SuperGLUE, which is designed to evaluate general-purpose language understanding systems.\\n- SuperGLUE is intended to be a more challenging benchmark compared to existing ones.\\n- The benchmark is named \\\"SuperGLUE\\\" to imply that it is stickier, meaning it is more difficult for language understanding systems to achieve high scores.\\n- The goal of SuperGLUE is to push the boundaries of what language understanding systems can achieve.\\n\\n- The text discusses various technical papers and preprints related to natural language understanding and neural network acceptability judgments.\\n- The GLUE benchmark is highlighted as a multi-task benchmark and analysis platform for natural language understanding.\\n- The paper by Warstadt et al. (2018) focuses on neural network acceptability judgments.\\n- Williams et al. (2018) introduce a challenge corpus for sentence understanding through inference.\\n- Yang et al. (2019) present XLNet, a generalized autoregressive pretraining method for language understanding.\\n\\n- The first paper discusses a method to reduce the pre-training time for the BERT model from 3 days to just 76 minutes.\\n- The second paper focuses on defending against neural fake news, with authors including Rowan Zellers, Ari Holtzman, and Franziska Roesner.\\n- Both papers are arXiv preprints, indicating they have not yet been peer-reviewed.\\n\\n### Summary:\\n- The paper discusses aligning books and movies to create story-like visual explanations by watching movies and reading books.\\n- The paper presents results for RoBERTa in both LARGE and BASE configurations for various tasks.\\n- Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE are provided in detail.\\n- Finetuning hyperparameters for tasks such as RACE, SQuAD, and GLUE are also discussed.\\n- Results for different configurations of RoBERTa on GLUE tasks are presented, showing improvements with additional data and longer pretraining.\\n- A comparison of hyperparameters for RoBERTa LARGE and RoBERTa BASE is provided in tables.\\n\\nTake these and distill it into a final, consolidated summary of the main points. 
\\n\\nRULES:\\n- The summary should be as if you are presenting the main points in a seminar.\\n- The outline should include common sections of a technical seminar.\\n- Organize the points in powerpoint slide format.\\n- Use markdown to format the text.\\n- Each point may be technical.\\n- You may have as many points as you need.\\n\\nEach slide should follow the following format:\\n### Slide 2: Slide title\\n- point 1\\n- point 2\\n\\nHelpful Answer:\"\n", " ]\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [6.67s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"### Slide 1: Introduction\\n- Discusses advancements in language model pretraining, particularly the RoBERTa approach.\\n- Authors of the paper include Yinhan Liu, Myle Ott, Naman Goyal, and others.\\n\\n### Slide 2: BERT Replication Study\\n- Evaluation of hyperparameters and training data size in BERT pretraining.\\n- Proposed improvements in training approach with RoBERTa surpassing post-BERT models.\\n\\n### Slide 3: RoBERTa Training Details\\n- Training model longer with bigger batches in RoBERTa.\\n- Removal of next sentence prediction objective and training on longer sequences.\\n- Dynamic masking pattern implementation.\\n\\n### Slide 4: Experimental Setup\\n- Use of CC-NEWS dataset for better control in training set size effects.\\n- Achieving state-of-the-art results on various tasks with RoBERTa.\\n- Importance of design choices in language model pretraining.\\n\\n### Slide 5: Key Model Features\\n- Transformer architecture with self-attention heads and hidden dimensions in BERT.\\n- Pretraining objectives in BERT: masked language modeling and next sentence prediction.\\n- Optimization using the Adam optimizer with specific parameters.\\n\\n### Slide 6: Pretraining Comparison\\n- Importance of large and diverse training data in achieving better end-task performance.\\n- Training format comparisons for natural language processing tasks using models like BERT and XLNet.\\n- Results reported for different input formats and pre-trained models.\\n\\n### Slide 7: RoBERTa Performance\\n- Outperforming BERT LARGE and XLNet LARGE with RoBERTa.\\n- Evaluation on GLUE, SQuAD, and RACE benchmarks.\\n- Competitive results without relying on additional external training data.\\n\\n### Slide 8: Conclusion\\n- Significance of RoBERTa as an optimized version of BERT in achieving state-of-the-art results.\\n- Highlighting the impact of design choices in language model pretraining.\\n- Availability of models and code for further research.\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"### Slide 1: Introduction\\n- Discusses advancements in language model pretraining, particularly the RoBERTa approach.\\n- Authors of the paper include Yinhan Liu, Myle Ott, Naman Goyal, and others.\\n\\n### Slide 2: BERT Replication Study\\n- Evaluation of hyperparameters and training data size in BERT pretraining.\\n- Proposed improvements in training approach with RoBERTa surpassing post-BERT models.\\n\\n### Slide 3: RoBERTa Training Details\\n- Training model longer with bigger batches in RoBERTa.\\n- Removal of next sentence prediction 
objective and training on longer sequences.\\n- Dynamic masking pattern implementation.\\n\\n### Slide 4: Experimental Setup\\n- Use of CC-NEWS dataset for better control in training set size effects.\\n- Achieving state-of-the-art results on various tasks with RoBERTa.\\n- Importance of design choices in language model pretraining.\\n\\n### Slide 5: Key Model Features\\n- Transformer architecture with self-attention heads and hidden dimensions in BERT.\\n- Pretraining objectives in BERT: masked language modeling and next sentence prediction.\\n- Optimization using the Adam optimizer with specific parameters.\\n\\n### Slide 6: Pretraining Comparison\\n- Importance of large and diverse training data in achieving better end-task performance.\\n- Training format comparisons for natural language processing tasks using models like BERT and XLNet.\\n- Results reported for different input formats and pre-trained models.\\n\\n### Slide 7: RoBERTa Performance\\n- Outperforming BERT LARGE and XLNet LARGE with RoBERTa.\\n- Evaluation on GLUE, SQuAD, and RACE benchmarks.\\n- Competitive results without relying on additional external training data.\\n\\n### Slide 8: Conclusion\\n- Significance of RoBERTa as an optimized version of BERT in achieving state-of-the-art results.\\n- Highlighting the impact of design choices in language model pretraining.\\n- Availability of models and code for further research.\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 391,\n", " \"prompt_tokens\": 3737,\n", " \"total_tokens\": 4128\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-65e0abce-e9e6-4164-9fb3-4701e15ef7df-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 3737,\n", " \"output_tokens\": 391,\n", " \"total_tokens\": 4128\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 391,\n", " \"prompt_tokens\": 3737,\n", " \"total_tokens\": 4128\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [6.68s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "### Slide 1: Introduction\n", "- Discusses advancements in language model pretraining, particularly the RoBERTa approach.\n", "- Authors of the paper include Yinhan Liu, Myle Ott, Naman Goyal, and others.\n", "\n", "### Slide 2: BERT Replication Study\n", "- Evaluation of hyperparameters and training data size in BERT pretraining.\n", "- Proposed improvements in training approach with RoBERTa surpassing post-BERT models.\n", "\n", "### Slide 3: RoBERTa Training Details\n", "- Training model longer with bigger batches in RoBERTa.\n", "- Removal of next sentence prediction objective and training on longer sequences.\n", "- Dynamic masking pattern implementation.\n", "\n", "### Slide 4: Experimental Setup\n", "- Use of CC-NEWS dataset for better control in training set size effects.\n", "- Achieving state-of-the-art results on various tasks with RoBERTa.\n", "- Importance of design choices in language model pretraining.\n", "\n", "### Slide 5: Key Model Features\n", "- Transformer architecture with self-attention heads and hidden dimensions in BERT.\n", "- Pretraining objectives 
in BERT: masked language modeling and next sentence prediction.\n", "- Optimization using the Adam optimizer with specific parameters.\n", "\n", "### Slide 6: Pretraining Comparison\n", "- Importance of large and diverse training data in achieving better end-task performance.\n", "- Training format comparisons for natural language processing tasks using models like BERT and XLNet.\n", "- Results reported for different input formats and pre-trained models.\n", "\n", "### Slide 7: RoBERTa Performance\n", "- Outperforming BERT LARGE and XLNet LARGE with RoBERTa.\n", "- Evaluation on GLUE, SQuAD, and RACE benchmarks.\n", "- Competitive results without relying on additional external training data.\n", "\n", "### Slide 8: Conclusion\n", "- Significance of RoBERTa as an optimized version of BERT in achieving state-of-the-art results.\n", "- Highlighting the impact of design choices in language model pretraining.\n", "- Availability of models and code for further research.\n" ] } ], "source": [ "def concat_summaries(docs):\n", " # Join the map-step summaries into one text block for the reduce prompt\n", " print(docs) # debug: inspect the raw map output\n", " return \"\\n\\n\".join(doc.content for doc in docs['docs'])\n", "\n", "# Reduce\n", "reduce_template = \"\"\"The following is a set of summaries of a technical paper:\n", "{docs}\n", "\n", "Take these and distill them into a final, consolidated summary of the main points.\n", "\n", "RULES:\n", "- The summary should be as if you are presenting the paper in a seminar.\n", "- The outline should include common sections of a technical seminar.\n", "- Organize the points in powerpoint slide format.\n", "- Use markdown to format the text.\n", "- Each point may be technical.\n", "- You may have as many points as you need.\n", "\n", "Each slide should use the following format:\n", "### Slide 2: Slide title\n", "- point 1\n", "- point 2\n", "\n", "Helpful Answer:\n", "\"\"\"\n", "\n", "reduce_prompt = PromptTemplate.from_template(reduce_template)\n", "\n", "# Run chain\n", "reduce_chain = (\n", " {\"docs\": RunnablePassthrough() | concat_summaries}\n", " | reduce_prompt\n", " | llm\n", ")\n", "\n", "reduce_res = reduce_chain.invoke({\"docs\": map_res})\n", "print(reduce_res.content)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence] Entering Chain run with input:\n", "\u001b[0m{\n", " \"content\": \"## Main Points Summary:\\n\\n### Slide 1: Introduction\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\n### Slide 2: RoBERTa Improvements\\n- Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\n### Slide 3: BERT Architecture & Training\\n- Transformer architecture, self-attention heads, hidden dimensions\\n- Pretraining objectives: masked language modeling, next sentence prediction\\n- Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\n### Slide 4: RoBERTa Pretraining Details\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n- Longer training durations and more data lead to performance gains\\n\\n### Slide 5: Results & Comparisons\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, 
achieving state-of-the-art results\\n\\n### Slide 6: Conclusion & Impact\\n- RoBERTa's design choices and training approach significantly improve performance\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\n### Slide 7: Future Directions\\n- Further exploration of RoBERTa in different NLP tasks\\n- Potential enhancements to training methodology for even better performance\\n\\n### Slide 8: References\\n- List of key papers and authors mentioned in the presentation\\n- Additional resources for diving deeper into natural language processing and machine learning.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] Entering Chain run with input:\n", "\u001b[0m{\n", " \"content\": \"## Main Points Summary:\\n\\n### Slide 1: Introduction\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\n### Slide 2: RoBERTa Improvements\\n- Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\n### Slide 3: BERT Architecture & Training\\n- Transformer architecture, self-attention heads, hidden dimensions\\n- Pretraining objectives: masked language modeling, next sentence prediction\\n- Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\n### Slide 4: RoBERTa Pretraining Details\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n- Longer training durations and more data lead to performance gains\\n\\n### Slide 5: Results & Comparisons\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\n### Slide 6: Conclusion & Impact\\n- RoBERTa's design choices and training approach significantly improve performance\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\n### Slide 7: Future Directions\\n- Further exploration of RoBERTa in different NLP tasks\\n- Potential enhancements to training methodology for even better performance\\n\\n### Slide 8: References\\n- List of key papers and authors mentioned in the presentation\\n- Additional resources for diving deeper into natural language processing and machine learning.\"\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] Entering Chain run with input:\n", "\u001b[0m{\n", " \"content\": \"## Main Points Summary:\\n\\n### Slide 1: Introduction\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\n### Slide 2: RoBERTa Improvements\\n- Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\n### Slide 3: BERT Architecture & Training\\n- Transformer architecture, self-attention heads, hidden dimensions\\n- Pretraining objectives: masked language modeling, next sentence prediction\\n- Optimized with Adam on BOOK CORPUS and 
English WIKIPEDIA\\n\\n### Slide 4: RoBERTa Pretraining Details\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n- Longer training durations and more data lead to performance gains\\n\\n### Slide 5: Results & Comparisons\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\n### Slide 6: Conclusion & Impact\\n- RoBERTa's design choices and training approach significantly improve performance\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\n### Slide 7: Future Directions\\n- Further exploration of RoBERTa in different NLP tasks\\n- Potential enhancements to training methodology for even better performance\\n\\n### Slide 8: References\\n- List of key papers and authors mentioned in the presentation\\n- Additional resources for diving deeper into natural language processing and machine learning.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel > chain:RunnablePassthrough] [1ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"content\": \"## Main Points Summary:\\n\\n### Slide 1: Introduction\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\n### Slide 2: RoBERTa Improvements\\n- Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\n### Slide 3: BERT Architecture & Training\\n- Transformer architecture, self-attention heads, hidden dimensions\\n- Pretraining objectives: masked language modeling, next sentence prediction\\n- Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\n### Slide 4: RoBERTa Pretraining Details\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n- Longer training durations and more data lead to performance gains\\n\\n### Slide 5: Results & Comparisons\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\n### Slide 6: Conclusion & Impact\\n- RoBERTa's design choices and training approach significantly improve performance\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\n### Slide 7: Future Directions\\n- Further exploration of RoBERTa in different NLP tasks\\n- Potential enhancements to training methodology for even better performance\\n\\n### Slide 8: References\\n- List of key papers and authors mentioned in the presentation\\n- Additional resources for diving deeper into natural language processing and machine learning.\"\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > chain:RunnableParallel] [5ms] Exiting Chain run with output:\n", "\u001b[0m{\n", " \"content\": {\n", " \"content\": \"## Main Points Summary:\\n\\n### Slide 1: Introduction\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\n### Slide 2: RoBERTa Improvements\\n- Longer training 
with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\n### Slide 3: BERT Architecture & Training\\n- Transformer architecture, self-attention heads, hidden dimensions\\n- Pretraining objectives: masked language modeling, next sentence prediction\\n- Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\n### Slide 4: RoBERTa Pretraining Details\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n- Longer training durations and more data lead to performance gains\\n\\n### Slide 5: Results & Comparisons\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\n### Slide 6: Conclusion & Impact\\n- RoBERTa's design choices and training approach significantly improve performance\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\n### Slide 7: Future Directions\\n- Further exploration of RoBERTa in different NLP tasks\\n- Potential enhancements to training methodology for even better performance\\n\\n### Slide 8: References\\n- List of key papers and authors mentioned in the presentation\\n- Additional resources for diving deeper into natural language processing and machine learning.\"\n", " }\n", "}\n", "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] Entering Prompt run with input:\n", "\u001b[0m{\n", " \"content\": {\n", " \"content\": \"## Main Points Summary:\\n\\n### Slide 1: Introduction\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\n### Slide 2: RoBERTa Improvements\\n- Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\n### Slide 3: BERT Architecture & Training\\n- Transformer architecture, self-attention heads, hidden dimensions\\n- Pretraining objectives: masked language modeling, next sentence prediction\\n- Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\n### Slide 4: RoBERTa Pretraining Details\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n- Longer training durations and more data lead to performance gains\\n\\n### Slide 5: Results & Comparisons\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\n### Slide 6: Conclusion & Impact\\n- RoBERTa's design choices and training approach significantly improve performance\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\n### Slide 7: Future Directions\\n- Further exploration of RoBERTa in different NLP tasks\\n- Potential enhancements to training methodology for even better performance\\n\\n### Slide 8: References\\n- List of key papers and authors mentioned in the presentation\\n- Additional resources for diving deeper into natural language processing and machine learning.\"\n", " }\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence > prompt:PromptTemplate] [1ms] Exiting Prompt run with output:\n", 
"\u001b[0m[outputs]\n", "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] Entering LLM run with input:\n", "\u001b[0m{\n", " \"prompts\": [\n", " \"Human: \\nYou are a technical writer for a seminar.\\nYou have been tasked with creating a presentation slide for a seminar.\\nThe following is the content you need to consider:\\n{'content': \\\"## Main Points Summary:\\\\n\\\\n### Slide 1: Introduction\\\\n- **Authors**: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\\\n- Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\\\n\\\\n### Slide 2: RoBERTa Improvements\\\\n- Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\\\n- Used CC-NEWS dataset for better control, achieved state-of-the-art results\\\\n\\\\n### Slide 3: BERT Architecture & Training\\\\n- Transformer architecture, self-attention heads, hidden dimensions\\\\n- Pretraining objectives: masked language modeling, next sentence prediction\\\\n- Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\\\n\\\\n### Slide 4: RoBERTa Pretraining Details\\\\n- Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\\\n- Longer training durations and more data lead to performance gains\\\\n\\\\n### Slide 5: Results & Comparisons\\\\n- RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\\\n- Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\\\n\\\\n### Slide 6: Conclusion & Impact\\\\n- RoBERTa's design choices and training approach significantly improve performance\\\\n- Released models and code for further research, highlighting the importance of design decisions in pretraining\\\\n\\\\n### Slide 7: Future Directions\\\\n- Further exploration of RoBERTa in different NLP tasks\\\\n- Potential enhancements to training methodology for even better performance\\\\n\\\\n### Slide 8: References\\\\n- List of key papers and authors mentioned in the presentation\\\\n- Additional resources for diving deeper into natural language processing and machine learning.\\\"}\\n\\nBased on this content, create a slide for a presentation.\\n\\nRULES:\\n- Use Beamer LaTeX format.\\n- Each slide should have a title and bullet points.\"\n", " ]\n", "}\n", "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[chain:RunnableSequence > llm:ChatOpenAI] [8.14s] Exiting LLM run with output:\n", "\u001b[0m{\n", " \"generations\": [\n", " [\n", " {\n", " \"text\": \"\\\\documentclass{beamer}\\n\\n\\\\begin{document}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Introduction}\\n\\\\begin{itemize}\\n \\\\item Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n \\\\item Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{RoBERTa Improvements}\\n\\\\begin{itemize}\\n \\\\item Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n \\\\item Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{BERT Architecture \\\\& Training}\\n\\\\begin{itemize}\\n \\\\item Transformer architecture, self-attention heads, hidden dimensions\\n \\\\item Pretraining objectives: masked 
language modeling, next sentence prediction\\n \\\\item Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{RoBERTa Pretraining Details}\\n\\\\begin{itemize}\\n \\\\item Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n \\\\item Longer training durations and more data lead to performance gains\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Results \\\\& Comparisons}\\n\\\\begin{itemize}\\n \\\\item RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n \\\\item Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Conclusion \\\\& Impact}\\n\\\\begin{itemize}\\n \\\\item RoBERTa's design choices and training approach significantly improve performance\\n \\\\item Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Future Directions}\\n\\\\begin{itemize}\\n \\\\item Further exploration of RoBERTa in different NLP tasks\\n \\\\item Potential enhancements to training methodology for even better performance\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{References}\\n\\\\begin{itemize}\\n \\\\item List of key papers and authors mentioned in the presentation\\n \\\\item Additional resources for diving deeper into natural language processing and machine learning\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\end{document}\",\n", " \"generation_info\": {\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ChatGeneration\",\n", " \"message\": {\n", " \"lc\": 1,\n", " \"type\": \"constructor\",\n", " \"id\": [\n", " \"langchain\",\n", " \"schema\",\n", " \"messages\",\n", " \"AIMessage\"\n", " ],\n", " \"kwargs\": {\n", " \"content\": \"\\\\documentclass{beamer}\\n\\n\\\\begin{document}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Introduction}\\n\\\\begin{itemize}\\n \\\\item Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\\n \\\\item Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{RoBERTa Improvements}\\n\\\\begin{itemize}\\n \\\\item Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\\n \\\\item Used CC-NEWS dataset for better control, achieved state-of-the-art results\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{BERT Architecture \\\\& Training}\\n\\\\begin{itemize}\\n \\\\item Transformer architecture, self-attention heads, hidden dimensions\\n \\\\item Pretraining objectives: masked language modeling, next sentence prediction\\n \\\\item Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{RoBERTa Pretraining Details}\\n\\\\begin{itemize}\\n \\\\item Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\\n \\\\item Longer training durations and more data lead to performance gains\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Results \\\\& Comparisons}\\n\\\\begin{itemize}\\n \\\\item RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\\n \\\\item 
Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Conclusion \\\\& Impact}\\n\\\\begin{itemize}\\n \\\\item RoBERTa's design choices and training approach significantly improve performance\\n \\\\item Released models and code for further research, highlighting the importance of design decisions in pretraining\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{Future Directions}\\n\\\\begin{itemize}\\n \\\\item Further exploration of RoBERTa in different NLP tasks\\n \\\\item Potential enhancements to training methodology for even better performance\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\begin{frame}\\n\\\\frametitle{References}\\n\\\\begin{itemize}\\n \\\\item List of key papers and authors mentioned in the presentation\\n \\\\item Additional resources for diving deeper into natural language processing and machine learning\\n\\\\end{itemize}\\n\\\\end{frame}\\n\\n\\\\end{document}\",\n", " \"response_metadata\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 603,\n", " \"prompt_tokens\": 467,\n", " \"total_tokens\": 1070\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null,\n", " \"finish_reason\": \"stop\",\n", " \"logprobs\": null\n", " },\n", " \"type\": \"ai\",\n", " \"id\": \"run-ad197e4e-b328-4da0-91ba-9967b039c6ee-0\",\n", " \"usage_metadata\": {\n", " \"input_tokens\": 467,\n", " \"output_tokens\": 603,\n", " \"total_tokens\": 1070\n", " },\n", " \"tool_calls\": [],\n", " \"invalid_tool_calls\": []\n", " }\n", " }\n", " }\n", " ]\n", " ],\n", " \"llm_output\": {\n", " \"token_usage\": {\n", " \"completion_tokens\": 603,\n", " \"prompt_tokens\": 467,\n", " \"total_tokens\": 1070\n", " },\n", " \"model_name\": \"gpt-3.5-turbo-0125\",\n", " \"system_fingerprint\": null\n", " },\n", " \"run\": null\n", "}\n", "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[chain:RunnableSequence] [8.15s] Exiting Chain run with output:\n", "\u001b[0m[outputs]\n", "\\documentclass{beamer}\n", "\n", "\\begin{document}\n", "\n", "\\begin{frame}\n", "\\frametitle{Introduction}\n", "\\begin{itemize}\n", " \\item Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov\n", " \\item Discusses RoBERTa, an optimized version of BERT, for natural language processing tasks\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{RoBERTa Improvements}\n", "\\begin{itemize}\n", " \\item Longer training with bigger batches, no next sentence prediction, longer sequences, dynamic masking\n", " \\item Used CC-NEWS dataset for better control, achieved state-of-the-art results\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{BERT Architecture \\& Training}\n", "\\begin{itemize}\n", " \\item Transformer architecture, self-attention heads, hidden dimensions\n", " \\item Pretraining objectives: masked language modeling, next sentence prediction\n", " \\item Optimized with Adam on BOOK CORPUS and English WIKIPEDIA\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{RoBERTa Pretraining Details}\n", "\\begin{itemize}\n", " \\item Used dynamic masking, full sentences, large mini-batches, larger byte-level BPE\n", " \\item Longer training durations and more data lead to performance gains\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{Results \\& 
Comparisons}\n", "\\begin{itemize}\n", " \\item RoBERTa surpasses BERT LARGE and XLNet LARGE in performance\n", " \\item Evaluated on GLUE, SQuAD, RACE benchmarks, achieving state-of-the-art results\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{Conclusion \\& Impact}\n", "\\begin{itemize}\n", " \\item RoBERTa's design choices and training approach significantly improve performance\n", " \\item Released models and code for further research, highlighting the importance of design decisions in pretraining\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{Future Directions}\n", "\\begin{itemize}\n", " \\item Further exploration of RoBERTa in different NLP tasks\n", " \\item Potential enhancements to training methodology for even better performance\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\begin{frame}\n", "\\frametitle{References}\n", "\\begin{itemize}\n", " \\item List of key papers and authors mentioned in the presentation\n", " \\item Additional resources for diving deeper into natural language processing and machine learning\n", "\\end{itemize}\n", "\\end{frame}\n", "\n", "\\end{document}\n" ] } ], "source": [ "beamer_template = \"\"\"\n", "You are a technical writer for a seminar.\n", "You have been tasked with creating a presentation slide for a seminar.\n", "The following is the content you need to consider:\n", "{content}\n", "\n", "Based on this content, create a slide for a presentation.\n", "\n", "RULES:\n", "- Use Beamer LaTeX format.\n", "- Each slide should have a title and bullet points.\n", "\"\"\"\n", "\n", "beamer_prompt = PromptTemplate(template=beamer_template)\n", "beamer_chain = (\n", " {\"content\": RunnablePassthrough()}\n", " | beamer_prompt\n", " | llm\n", ")\n", "\n", "beamer_res = beamer_chain.invoke({\"content\": reduce_res.content})\n", "print(beamer_res.content)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'slide_number': 1,\n", " 'slide_title': 'Introduction',\n", " 'points': [{'content': '- Discusses advancements in language model pretraining, particularly the RoBERTa approach.',\n", " 'sources': [Document(page_content='arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G. Allen School of Computer Science & Engineering,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='task development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 7}),\n", " Document(page_content='128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens.\\nResults on the RACE test sets are presented in\\nTable 7. RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings.\\n6 Related Work\\nPretraining methods have been designed', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='tion Processing Systems (NIPS) .\\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. 
BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. In North American Association for Com-\\nputational Linguistics (NAACL) .', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='Pretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8})]},\n", " {'content': '- Authors of the paper include Yinhan Liu, Myle Ott, Naman Goyal, and others.',\n", " 'sources': [Document(page_content='Alex Wang, Amanpreet Singh, Julian Michael, Felix\\nHill, Omer Levy, and Samuel R. Bowman. 2019b.\\nGLUE: A multi-task benchmark and analysis plat-\\nform for natural language understanding. In Inter-\\nnational Conference on Learning Representations\\n(ICLR) .', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 11}),\n", " Document(page_content='Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob\\nUszkoreit, Llion Jones, Aidan N Gomez, Łukasz\\nKaiser, and Illia Polosukhin. 2017. Attention is all\\nyou need. In Advances in neural information pro-\\ncessing systems .\\nAlex Wang, Yada Pruksachatkun, Nikita Nangia,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 11}),\n", " Document(page_content='Chuang, Christopher D Manning, Andrew Ng, and\\nChristopher Potts. 2013. Recursive deep models\\nfor semantic compositionality over a sentiment tree-\\nbank. In Empirical Methods in Natural Language', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 10}),\n", " Document(page_content='Ganesh Venkatesh, and Hao Wu. 2018. Mixed preci-\\nsion training. In International Conference on Learn-\\ning Representations .\\nSebastian Nagel. 2016. Cc-news. http:\\n//web.archive.org/save/http:\\n//commoncrawl.org/2016/10/news-\\ndataset-available .\\nMyle Ott, Sergey Edunov, Alexei Baevski, Angela', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 10}),\n", " Document(page_content='arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G. Allen School of Computer Science & Engineering,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0})]}]},\n", " {'slide_number': 2,\n", " 'slide_title': 'BERT Replication Study',\n", " 'points': [{'content': '- Evaluation of hyperparameters and training data size in BERT pretraining.',\n", " 'sources': [Document(page_content='7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ).\\n3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. (2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. 
Several efforts', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 2}),\n", " Document(page_content='choices have significant impact on the final re-\\nsults. We present a replication study of BERT\\npretraining ( Devlin et al. ,2019 ) that carefully\\nmeasures the impact of many key hyperparam-\\neters and training data size. We find that BERT\\nwas significantly undertrained, and can match', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='mum length T=512 tokens.\\n2.5 Data\\nBERT is trained on a combination of B OOK COR-\\nPUS (Zhu et al. ,2015 ) plus English W IKIPEDIA ,\\nwhich totals 16GB of uncompressed text.3\\n3 Experimental Setup\\nIn this section, we describe the experimental setup\\nfor our replication study of BERT.\\n3.1 Implementation', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='BERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6})]},\n", " {'content': '- Proposed improvements in training approach with RoBERTa surpassing post-BERT models.',\n", " 'sources': [Document(page_content='tuning and training set size. We find that BERT\\nwas significantly undertrained and propose an im-\\nproved recipe for training BERT models, which\\nwe call RoBERTa, that can match or exceed the\\nperformance of all of the post-BERT methods.\\nOur modifications are simple, they include: (1)', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='to the BERT pretraining procedure that improve\\nend-task performance. We now aggregate these\\nimprovements and evaluate their combined im-\\npact. We call this configuration RoBERTa for\\nRobustly optimized BERT approach. Specifi-\\ncally, RoBERTa is trained with dynamic mask-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 5}),\n", " Document(page_content='proved training procedure improves upon the pub-\\nlished BERT results on both GLUE and SQuAD.\\nWhen trained for longer over additional data, our\\nmodel achieves a score of 88.5 on the public\\nGLUE leaderboard, matching the 88.4 reported\\nbyYang et al. (2019 ). Our model establishes a', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G. 
Allen School of Computer Science & Engineering,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0})]}]},\n", " {'slide_number': 3,\n", " 'slide_title': 'RoBERTa Training Details',\n", " 'points': [{'content': '- Training model longer with bigger batches in RoBERTa.',\n", " 'sources': [Document(page_content='a batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT.\\nTo help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 5}),\n", " Document(page_content='serve further improvements in performance across\\nall downstream tasks, validating the importance of\\ndata size and diversity in pretraining.9\\nFinally, we pretrain RoBERTa for significantly\\nlonger, increasing the number of pretraining steps\\nfrom 100K to 300K, and then further to 500K. We', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='for longer (100K →300K→500K steps). Each row accumulates improvements from the row s above. RoBERTa\\nmatches the architecture and training objective of BERT LARGE . Results for BERT LARGE and XLNet LARGE are from', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='Our modifications are simple, they include: (1)\\ntraining the model longer, with bigger batches,\\nover more data; (2) removing the next sentence\\nprediction objective; (3) training on longer se-\\nquences; and (4) dynamically changing the mask-\\ning pattern applied to the training data. We also', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0})]},\n", " {'content': '- Removal of next sentence prediction objective and training on longer sequences.',\n", " 'sources': [Document(page_content='Our modifications are simple, they include: (1)\\ntraining the model longer, with bigger batches,\\nover more data; (2) removing the next sentence\\nprediction objective; (3) training on longer se-\\nquences; and (4) dynamically changing the mask-\\ning pattern applied to the training data. We also', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction.\\nMasked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. 
The MLM objective is a cross-entropy', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='auxiliary Next Sentence Prediction (NSP) loss.\\nThe NSP loss was hypothesized to be an impor-\\ntant factor in training the original BERT model.\\nDevlin et al. (2019 ) observe that removing NSP\\nhurts performance, with significant performance\\ndegradation on QNLI, MNLI, and SQuAD 1.1.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 3}),\n", " Document(page_content='loss and training with blocks of text from a sin-\\ngle document ( DOC-SENTENCES ). We find that\\nthis setting outperforms the originally published\\nBERT BASEresults and that removing the NSP loss\\nmatches or slightly improves downstream task\\nperformance , in contrast to Devlin et al. (2019 ).', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 4})]},\n", " {'content': '- Dynamic masking pattern implementation.',\n", " 'sources': [Document(page_content='with static masking performs similar to the\\noriginal BERT model, and dynamic masking is\\ncomparable or slightly better than static masking.\\nGiven these results and the additional efficiency\\nbenefits of dynamic masking, we use dynamic\\nmasking in the remainder of the experiments.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 3}),\n", " Document(page_content='was duplicated 10 times so that each sequence is\\nmasked in 10 different ways over the 40 epochs of\\ntraining. Thus, each training sequence was seen\\nwith the same mask four times during training.\\nWe compare this strategy with dynamic mask-\\ningwhere we generate the masking pattern every', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 3}),\n", " Document(page_content='ingwhere we generate the masking pattern every\\ntime we feed a sequence to the model. This be-\\ncomes crucial when pretraining for more steps or\\nwith larger datasets.\\n7Studying architectural changes, including larger archi-\\ntectures, is an important area for future work.Masking SQuAD 2.0 MNLI-m SST-2', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 3}),\n", " Document(page_content='domly masking and predicting tokens. The orig-\\ninal BERT implementation performed masking\\nonce during data preprocessing, resulting in a sin-\\nglestatic mask. To avoid using the same mask for\\neach training instance in every epoch, training data\\nwas duplicated 10 times so that each sequence is', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 3}),\n", " Document(page_content='ence results are from Yang et al. (2019 ).\\nResults Table 1compares the published\\nBERT BASE results from Devlin et al. (2019 ) to our\\nreimplementation with either static or dynamic\\nmasking. We find that our reimplementation\\nwith static masking performs similar to the', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 3})]}]},\n", " {'slide_number': 4,\n", " 'slide_title': 'Experimental Setup',\n", " 'points': [{'content': '- Use of CC-NEWS dataset for better control in training set size effects.',\n", " 'sources': [Document(page_content='ing pattern applied to the training data. 
We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects.\\nWhen controlling for training data, our im-\\nproved training procedure improves upon the pub-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='We additionally use a novel dataset,\\nCC-N EWS, and release our models and\\ncode for pretraining and finetuning at:\\nhttps://github.com/pytorch/fairseq .\\nReferences\\nEneko Agirre, Llu’is M‘arquez, and Richard Wicen-\\ntowski, editors. 2007. Proceedings of the Fourth', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='alternatives that lead to better downstream task\\nperformance; (2) We use a novel dataset, CC-\\nNEWS, and confirm that using more data for pre-\\ntraining further improves performance on down-\\nstream tasks; (3) Our training improvements show\\nthat masked language model pretraining, under', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='for easier comparison with related work.\\n4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 4}),\n", " Document(page_content='lect and extract CC-N EWS. CC-N EWS is similar to the R E-\\nALNEWS dataset described in Zellers et al. (2019 ).pus described in Radford et al. (2019 ). The text\\nis web content extracted from URLs shared on\\nReddit with at least three upvotes. (38GB).5\\n•STORIES , a dataset introduced in Trinh and Le', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 2})]},\n", " {'content': '- Achieving state-of-the-art results on various tasks with RoBERTa.',\n", " 'sources': [Document(page_content='submit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 7}),\n", " Document(page_content='Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='vided training examples.10\\n10While we only use the provided WNLI training data, ourResults We present our results in Table 5. In the\\nfirst setting ( single-task, dev ), RoBERTa achieves\\nstate-of-the-art results on all 9 of the GLUE\\ntask development sets. Crucially, RoBERTa uses', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 7}),\n", " Document(page_content='128 tokens and, if needed, the passage so that the\\ntotal length is at most 512 tokens.\\nResults on the RACE test sets are presented in\\nTable 7. 
RoBERTa achieves state-of-the-art results\\non both middle-school and high-school settings.\\n6 Related Work\\nPretraining methods have been designed', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='matches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1).\\nWe also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8})]},\n", " {'content': '- Importance of design choices in language model pretraining.',\n", " 'sources': [Document(page_content='Pretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='University of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='are: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\\nDanilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. The third PASCAL recog-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9})]}]},\n", " {'slide_number': 5,\n", " 'slide_title': 'Key Model Features',\n", " 'points': [{'content': '- Transformer architecture with self-attention heads and hidden dimensions in BERT.',\n", " 'sources': [Document(page_content='ing end-task labeled data.\\n2.2 Architecture\\nBERT uses the now ubiquitous transformer archi-\\ntecture ( Vaswani et al. ,2017 ), which we will not\\nreview in detail. We use a transformer architecture\\nwithLlayers. Each block uses Aself-attention\\nheads and hidden dimension H.\\n2.3 Training Objectives', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='tion Processing Systems (NIPS) .\\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and\\nKristina Toutanova. 2019. BERT: Pre-training of\\ndeep bidirectional transformers for language under-\\nstanding. 
In North American Association for Com-\\nputational Linguistics (NAACL) .', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='a batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT.\\nTo help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 5}),\n", " Document(page_content='tance of previously overlooked design choices,\\nand raise questions about the source of re-\\ncently reported improvements. We release our\\nmodels and code.1\\n1 Introduction\\nSelf-training methods such as ELMo ( Peters et al. ,\\n2018 ), GPT ( Radford et al. ,2018 ), BERT', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='are: (1) We present a set of important BERT de-\\nsign choices and training strategies and introduce\\n2It is possible that these other methods could also improve\\nwith more tuning. We leave this exploration to future work.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0})]},\n", " {'content': '- Pretraining objectives in BERT: masked language modeling and next sentence prediction.',\n", " 'sources': [Document(page_content='2.3 Training Objectives\\nDuring pretraining, BERT uses two objectives:\\nmasked language modeling and next sentence pre-\\ndiction.\\nMasked Language Model (MLM) A random\\nsample of the tokens in the input sequence is\\nselected and replaced with the special token\\n[MASK]. The MLM objective is a cross-entropy', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='Pretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='task development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 7}),\n", " Document(page_content='7 Conclusion\\nWe carefully evaluate a number of design de-\\ncisions when pretraining BERT models. We\\nfind that performance can be substantially im-\\nproved by training the model longer, with bigger\\nbatches over more data; removing the next sen-\\ntence prediction objective; training on longer se-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='masked language modeling ( Devlin et al. ,2019 ;\\nLample and Conneau ,2019 ). Many recent\\npapers have used a basic recipe of finetuning\\nmodels for each end task ( Howard and Ruder ,\\n2018 ;Radford et al. 
,2018 ), and pretraining\\nwith some variant of a masked language model', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8})]},\n", " {'content': '- Optimization using the Adam optimizer with specific parameters.',\n", " 'sources': [Document(page_content='optimization hyperparameters, given in Section 2,\\nexcept for the peak learning rate and number of\\nwarmup steps, which are tuned separately for each\\nsetting. We additionally found training to be very\\nsensitive to the Adam epsilon term, and in some\\ncases we obtained better performance or improved', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 2}),\n", " Document(page_content='performance on downstream tasks, such as Natural\\nLanguage Inference ( Bowman et al. ,2015 ), which\\nrequire reasoning about the relationships between\\npairs of sentences.\\n2.4 Optimization\\nBERT is optimized with Adam ( Kingma and Ba ,\\n2015 ) using the following parameters: β1= 0.9,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='2015 ) using the following parameters: β1= 0.9,\\nβ2= 0.999,ǫ=1e-6 and L2weight de-\\ncay of0.01. The learning rate is warmed up\\nover the first 10,000 steps to a peak value of\\n1e-4, and then linearly decayed. BERT trains\\nwith a dropout of 0.1 on all layers and at-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1}),\n", " Document(page_content='Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.\\nWeld, Luke Zettlemoyer, and Omer Levy. 2019.\\nSpanBERT: Improving pre-training by repre-\\nsenting and predicting spans. arXiv preprint\\narXiv:1907.10529 .\\nDiederik Kingma and Jimmy Ba. 2015. Adam: A\\nmethod for stochastic optimization. In International', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 10}),\n", " Document(page_content='Weight Decay 0.01 0.01\\nMax Steps 500k 500k\\nLearning Rate Decay Linear Linear\\nAdamǫ 1e-6 1e-6\\nAdamβ1 0.9 0.9\\nAdamβ2 0.98 0.98\\nGradient Clipping 0.0 0.0\\nTable 9: Hyperparameters for pretraining RoBERTa LARGE and RoBERTa BASE.\\nHyperparam RACE SQuAD GLUE\\nLearning Rate 1e-5 1.5e-5 {1e-5, 2e-5, 3e-5 }', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 12})]}]},\n", " {'slide_number': 6,\n", " 'slide_title': 'Pretraining Comparison',\n", " 'points': [{'content': '- Importance of large and diverse training data in achieving better end-task performance.',\n", " 'sources': [Document(page_content='for easier comparison with related work.\\n4.3 Training with large batches\\nPast work in Neural Machine Translation has\\nshown that training with very large mini-batches\\ncan both improve optimization speed and end-task\\nperformance when the learning rate is increased', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 4}),\n", " Document(page_content='proved end-task performance. Several efforts\\nhave trained on datasets larger and more diverse\\nthan the original BERT ( Radford et al. ,2019 ;\\nYang et al. ,2019 ;Zellers et al. ,2019 ). Unfortu-\\nnately, not all of the additional datasets can be\\npublicly released. For our study, we focus on gath-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 2}),\n", " Document(page_content='32GB Nvidia V100 GPUs interconnected by In-\\nfiniband ( Micikevicius et al. ,2018 ).\\n3.2 Data\\nBERT-style pretraining crucially relies on large\\nquantities of text. Baevski et al. 
(2019 ) demon-\\nstrate that increasing data size can result in im-\\nproved end-task performance. Several efforts', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 2}),\n", " Document(page_content='nificant performance gains but careful com-\\nparison between different approaches is chal-\\nlenging. Training is computationally expen-\\nsive, often done on private datasets of different\\nsizes, and, as we will show, hyperparameter\\nchoices have significant impact on the final re-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='the methods contribute the most. Training is\\ncomputationally expensive, limiting the amount\\nof tuning that can be done, and is often done with\\nprivate training data of varying sizes, limiting\\nour ability to measure the effects of the modeling\\nadvances.\\n∗Equal contribution.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0})]},\n", " {'content': '- Training format comparisons for natural language processing tasks using models like BERT and XLNet.',\n", " 'sources': [Document(page_content='2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\nadditional external training data. In contrast, our\\nsubmission does not use any additional data.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='Single models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ).\\nnating each candidate answer with the correspond-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='2018 ), GPT ( Radford et al. ,2018 ), BERT\\n(Devlin et al. ,2019 ), XLM ( Lample and Conneau ,\\n2019 ), and XLNet ( Yang et al. ,2019 ) have\\nbrought significant performance gains, but it can\\nbe challenging to determine which aspects of\\nthe methods contribute the most. Training is', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='simplify, and better tune the training of BERT,\\nas a reference point for better understanding the\\nrelative performance of all of these methods.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='task development sets. Crucially, RoBERTa uses\\nthe same masked language modeling pretrain-\\ning objective and architecture as BERT LARGE , yet\\nconsistently outperforms both BERT LARGE and\\nXLNet LARGE . This raises questions about the rel-\\native importance of model architecture and pre-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 7})]},\n", " {'content': '- Results reported for different input formats and pre-trained models.',\n", " 'sources': [Document(page_content='Results We present our results in Table 4. 
When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='ModelSQuAD 1.1 SQuAD 2.0\\nEM F1 EM F1\\nSingle models on dev, w/o data augmentation\\nBERT LARGE 84.1 90.9 79.0 81.8\\nXLNet LARGE 89.0 94.5 86.1 88.8\\nRoBERTa 88.9 94.6 86.5 89.4\\nSingle models on test (as of July 25, 2019)\\nXLNet LARGE 86.3†89.1†\\nRoBERTa 86.8 89.8\\nXLNet + SG-Net Verifier 87.0†89.9†', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='Results Table 2shows results for the four dif-\\nferent settings. We first compare the original\\nSEGMENT -PAIR input format from Devlin et al.\\n(2019 ) to the SENTENCE -PAIR format; both for-\\nmats retain the NSP loss, but the latter uses sin-\\ngle sentences. We find that using individual', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 4}),\n", " Document(page_content='Pretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='Single models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ).\\nnating each candidate answer with the correspond-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8})]}]},\n", " {'slide_number': 7,\n", " 'slide_title': 'RoBERTa Performance',\n", " 'points': [{'content': '- Outperforming BERT LARGE and XLNet LARGE with RoBERTa.',\n", " 'sources': [Document(page_content='Single models on test (as of July 25, 2019)\\nBERT LARGE 72.0 76.6 70.1\\nXLNet LARGE 81.7 85.4 80.2\\nRoBERTa 83.2 86.5 81.3\\nTable 7: Results on the RACE test set. BERT LARGE and\\nXLNet LARGE results are from Yang et al. (2019 ).\\nnating each candidate answer with the correspond-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='XLNet + SG-Net Verifier 87.0†89.9†\\nTable 6: Results on SQuAD. †indicates results that de-\\npend on additional external training data. RoBERTa\\nuses only the provided SQuAD data in both dev and\\ntest settings. BERT LARGE and XLNet LARGE results are\\nfrom Devlin et al. (2019 ) and Yang et al. (2019 ), re-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='Results We present our results in Table 4. 
When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='BERT LARGE\\nwith B OOKS + W IKI 13GB 256 1M 90.9/81.8 86.6 93.7\\nXLNet LARGE\\nwith B OOKS + W IKI 13GB 256 1M 94.0/87.8 88.4 94.4\\n+ additional data 126GB 2K 500K 94.5/88.8 89.8 95.6\\nTable 4: Development set results for RoBERTa as we pretrain o ver more data (16GB →160GB of text) and pretrain', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='matches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1).\\nWe also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8})]},\n", " {'content': '- Evaluation on GLUE, SQuAD, and RACE benchmarks.',\n", " 'sources': [Document(page_content='lowing three benchmarks.\\nGLUE The General Language Understand-\\ning Evaluation (GLUE) benchmark ( Wang et al. ,\\n2019b ) is a collection of 9 datasets for evaluating\\nnatural language understanding systems.6Tasks\\nare framed as either single-sentence classification', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 2}),\n", " Document(page_content='appear to overfit our data and would likely benefit\\nfrom additional training.\\nIn the rest of the paper, we evaluate our best\\nRoBERTa model on the three different bench-\\nmarks: GLUE, SQuaD and RACE. Specifically\\n9Our experiments conflate increases in data size and di-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='was significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. These results highlight the impor-\\ntance of previously overlooked design choices,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='Alex Wang, Amanpreet Singh, Julian Michael, Felix\\nHill, Omer Levy, and Samuel R. Bowman. 2019b.\\nGLUE: A multi-task benchmark and analysis plat-\\nform for natural language understanding. In Inter-\\nnational Conference on Learning Representations\\n(ICLR) .', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 11}),\n", " Document(page_content='GLUE or additional data for SQuAD. These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives.\\nWe additionally use a novel dataset,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9})]},\n", " {'content': '- Competitive results without relying on additional external training data.',\n", " 'sources': [Document(page_content='2.0 leaderboard and evaluate its performance rel-\\native to other systems. Most of the top systems\\nbuild upon either BERT ( Devlin et al. ,2019 ) or\\nXLNet ( Yang et al. ,2019 ), both of which rely on\\nadditional external training data. 
In contrast, our\\nsubmission does not use any additional data.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='was significantly undertrained, and can match\\nor exceed the performance of every model\\npublished after it. Our best model achieves\\nstate-of-the-art results on GLUE, RACE and\\nSQuAD. These results highlight the impor-\\ntance of previously overlooked design choices,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='GLUE or additional data for SQuAD. These re-\\nsults illustrate the importance of these previ-\\nously overlooked design decisions and suggest\\nthat BERT’s pretraining objective remains com-\\npetitive with recently proposed alternatives.\\nWe additionally use a novel dataset,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='the methods contribute the most. Training is\\ncomputationally expensive, limiting the amount\\nof tuning that can be done, and is often done with\\nprivate training data of varying sizes, limiting\\nour ability to measure the effects of the modeling\\nadvances.\\n∗Equal contribution.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='ing pattern applied to the training data. We also\\ncollect a large new dataset (CC-N EWS) of compa-\\nrable size to other privately used datasets, to better\\ncontrol for training set size effects.\\nWhen controlling for training data, our im-\\nproved training procedure improves upon the pub-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0})]}]},\n", " {'slide_number': 8,\n", " 'slide_title': 'Conclusion',\n", " 'points': [{'content': '- Significance of RoBERTa as an optimized version of BERT in achieving state-of-the-art results.',\n", " 'sources': [Document(page_content='Results We present our results in Table 4. When\\ncontrolling for training data, we observe that\\nRoBERTa provides a large improvement over the\\noriginally reported BERT LARGE results, reaffirming\\nthe importance of the design choices we explored\\nin Section 4.', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 6}),\n", " Document(page_content='a batch size eight times larger for half as many op-\\ntimization steps, thus seeing four times as many\\nsequences in pretraining compared to BERT.\\nTo help disentangle the importance of these fac-\\ntors from other modeling choices (e.g., the pre-\\ntraining objective), we begin by training RoBERTa', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 5}),\n", " Document(page_content='matches the state-of-the-art set by XLNet. On the\\nSQuAD v2.0 development set, RoBERTa sets a\\nnew state-of-the-art, improving over XLNet by 0.4\\npoints (EM) and 0.6 points (F1).\\nWe also submit RoBERTa to the public SQuAD\\n2.0 leaderboard and evaluate its performance rel-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='arXiv:1907.11692v1 [cs.CL] 26 Jul 2019RoBERTa: A Robustly Optimized BERT Pretraining Approach\\nYinhan Liu∗§Myle Ott∗§Naman Goyal∗§Jingfei Du∗§Mandar Joshi†\\nDanqi Chen§Omer Levy§Mike Lewis§Luke Zettlemoyer†§Veselin Stoyanov§\\n†Paul G. 
Allen School of Computer Science & Engineering,', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='submit RoBERTa to the GLUE leaderboard and\\nachieve state-of-the-art results on 4 out of 9 tasks\\nand the highest average score to date. This is espe-\\ncially exciting because RoBERTa does not depend\\non multi-task finetuning, unlike most of the other\\ntop submissions. We expect future work may fur-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 7})]},\n", " {'content': '- Highlighting the impact of design choices in language model pretraining.',\n", " 'sources': [Document(page_content='Pretraining methods have been designed\\nwith different training objectives, includ-\\ning language modeling ( Dai and Le ,2015 ;\\nPeters et al. ,2018 ;Howard and Ruder ,2018 ),\\nmachine translation ( McCann et al. ,2017 ), and\\nmasked language modeling ( Devlin et al. ,2019 ;', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 8}),\n", " Document(page_content='University of Washington, Seattle, WA\\n{mandar90,lsz }@cs.washington.edu\\n§Facebook AI\\n{yinhanliu,myleott,naman,jingfeidu,\\ndanqi,omerlevy,mikelewis,lsz,ves }@fb.com\\nAbstract\\nLanguage model pretraining has led to sig-\\nnificant performance gains but careful com-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='tance of previously overlooked design choices,\\nand raise questions about the source of re-\\ncently reported improvements. We release our\\nmodels and code.1\\n1 Introduction\\nSelf-training methods such as ELMo ( Peters et al. ,\\n2018 ), GPT ( Radford et al. ,2018 ), BERT', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 0}),\n", " Document(page_content='Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming\\nZhou, and Hsiao-Wuen Hon. 2019. Unified\\nlanguage model pre-training for natural language\\nunderstanding and generation. arXiv preprint\\narXiv:1905.03197 .\\nDanilo Giampiccolo, Bernardo Magnini, Ido Dagan,\\nand Bill Dolan. 2007. The third PASCAL recog-', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 9}),\n", " Document(page_content='that masked language model pretraining, under\\nthe right design choices, is competitive with all\\nother recently published methods. We release our\\nmodel, pretraining and fine-tuning code imple-\\nmented in PyTorch ( Paszke et al. 
,2017 ).\n2 Background\nIn this section, we give a brief overview of the', metadata={'source': 'https://arxiv.org/pdf/1907.11692v1', 'page': 1})]}]}]" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "\n", "def markdown_to_json(markdown):\n", "    # Parse the generated markdown outline into a list of slide dicts.\n", "    # Each slide is expected to look like \"### Slide <n>: <title>\" followed by one or more \"- <point>\" lines.\n", "    regex_pattern = r\"### Slide (\\d+): (.+)\\n((- .+\\n)+)\"\n", "    matches = re.findall(regex_pattern, markdown)\n", "    slides = []\n", "    for match in matches:\n", "        slide = {\n", "            \"slide_number\": int(match[0]),\n", "            \"slide_title\": match[1],\n", "            # Retrieve supporting chunks for every bullet point so each claim can be traced back to the paper.\n", "            \"points\": [{'content': point.strip(), 'sources': retriever.invoke(point.strip())} for point in match[2].split(\"\\n\") if point.strip()]\n", "        }\n", "        slides.append(slide)\n", "    return slides\n", "\n", "slides = markdown_to_json(reduce_res.content)\n", "slides\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 2 }