{ "title": "StarCoder 2 and The Stack v2: The Next Generation", "summary": "This paper introduces StarCoder2, a large language model for code developed by the BigCode project in partnership with Software Heritage. StarCoder2 is trained on a dataset called The Stack v2, which is 4\u00d7 larger than the first StarCoder dataset and includes data from Software Heritage, GitHub, Kaggle, and other sources. The authors evaluate StarCoder2 on a comprehensive set of code LLM benchmarks and find that it outperforms other models of similar size and even outperforms CodeLlama-34B, a model more than twice its size.", "abstract": "The BigCode project, an open-scientific collaboration focused on the\nresponsible development of Large Language Models for Code (Code LLMs),\nintroduces StarCoder2. In partnership with Software Heritage (SWH), we build\nThe Stack v2 on top of the digital commons of their source code archive.\nAlongside the SWH repositories spanning 619 programming languages, we carefully\nselect other high-quality data sources, such as GitHub pull requests, Kaggle\nnotebooks, and code documentation. This results in a training set that is 4x\nlarger than the first StarCoder dataset. We train StarCoder2 models with 3B,\n7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate\nthem on a comprehensive set of Code LLM benchmarks. We find that our small\nmodel, StarCoder2-3B, outperforms other Code LLMs of similar size on most\nbenchmarks, and also outperforms StarCoderBase-15B. Our large model,\nStarCoder2- 15B, significantly outperforms other models of comparable size. In\naddition, it matches or outperforms CodeLlama-34B, a model more than twice its\nsize. Although DeepSeekCoder- 33B is the best-performing model at code\ncompletion for high-resource languages, we find that StarCoder2-15B outperforms\nit on math and code reasoning benchmarks, as well as several low-resource\nlanguages. 
We make the model weights available under an OpenRAIL license and\nensure full transparency regarding the training data by releasing the SoftWare\nHeritage persistent IDentifiers (SWHIDs) of the source code data.", "authors": "Anton Lozhkov,Raymond Li,Loubna Ben Allal,Federico Cassano,Joel Lamy-Poirier,Nouamane Tazi,Ao Tang,Dmytro Pykhtar,Jiawei Liu,Yuxiang Wei,Tianyang Liu,Max Tian,Denis Kocetkov,Arthur Zucker,Younes Belkada,Zijian Wang,Qian Liu,Dmitry Abulkhanov,Indraneil paul,Zhuang Li,Wen-Ding Li,Meg Risdal,Jia Li,Jian Zhu,Terry Yue Zhuo,Evgenii Zheltonozhskii,Nii Osae,Wenhao Yu,Lucas Krau\u00df,Naman Jain,Yixuan Su,Xuanli He,Manan Dey,Edoardo Abati,Yekun Chai,Niklas Muennighoff,Xiangru Tang,Muhtasham Oblokulov,Christopher Akiki,Marc Marone,Chenghao Mou,Mayank Mishra,Alex Gu,Binyuan Hui,Tri Dao,Armel Zebaze,Olivier Dehaene,Nicolas Patry,Canwen Xu,Julian McAuley,Han Hu,Torsten Scholak,Sebastien Paquet,Jennifer Robinson,Carolyn Anderson,Nicolas Chapados,Mostofa Patwary,Nima Tajbakhsh,Yacine Jernite,Carlos Mu\u00f1oz Ferrandis,Lingming Zhang,Sean Hughes,Thomas Wolf,Arjun Guha,Leandro von Werra,Harm de Vries", "arxiv_id": "2402.19173", "target_date": "2024-03-01", "full_text": "Under review as submission to TMLR\nStarCoder 2 and The Stack v2: The Next Generation\nAnton Lozhkov1\nRaymond Li2\nLoubna Ben Allal1\nFederico Cassano4\nJoel Lamy-Poirier2\nNouamane Tazi1\nAo Tang3\nDmytro Pykhtar3\nJiawei Liu7\nYuxiang Wei7\nTianyang Liu25\nMax Tian2\nDenis Kocetkov2\nArthur Zucker1\nYounes Belkada1\nZijian Wang5\nQian Liu12\nDmitry Abulkhanov5\nIndraneil Paul32\nZhuang Li14\nWen-Ding Li26\nMegan Risdal24\nJia\nLi5\nJian Zhu16\nTerry Yue Zhuo14,15\nEvgenii Zheltonozhskii13\nNii Osae Osae Dade28\nWenhao Yu20\nLucas Krau\u00df5\nNaman Jain27\nYixuan Su30\nXuanli He23\nManan Dey31\nEdoardo Abati5\nYekun Chai33\nNiklas Muennighoff29\nXiangru Tang34\nMuhtasham\nOblokulov18\nChristopher Akiki9,10\nMarc Marone8\nChenghao Mou5\nMayank Mishra19\nAlex Gu17\nBinyuan Hui5\nTri Dao21\nArmel Zebaze1\nOlivier Dehaene1\nNicolas Patry1\nCanwen Xu25\nJulian McAuley25\nHan Hu14\nTorsten Scholak2\nSebastien Paquet2\nJennifer\nRobinson6\nCarolyn Jane Anderson22\nNicolas Chapados2\nMostofa Patwary3\nNima\nTajbakhsh3\nYacine Jernite1\nCarlos Mu\u00f1oz Ferrandis1\nLingming Zhang7\nSean Hughes6\nThomas Wolf 1\nArjun Guha4,11\nLeandro von Werra1,\u22c6\nHarm de Vries2,\u22c6\n1Hugging Face\n2ServiceNow Research\n3Nvidia\n4Northeastern University\n5Independent\n6ServiceNow\n7University of Illinois Urbana-Champaign\n8Johns Hopkins University\n9Leipzig University\n10ScaDS.AI\n11Roblox\n12Sea AI Lab\n13Technion \u2013 Israel Institute of Technology\n14Monash University\n15CSIRO\u2019s\nData61\n16University of British Columbia\n17MIT\n18Technical University of Munich\n19IBM Research\n20University of Notre Dame\n21Princeton University\n22Wellesley College\n23University College London\n24Kaggle\n25UC San Diego\n26Cornell University\n27UC Berkeley\n28Mazzuma\n29Contextual AI\n30Cohere\n31Salesforce\n32Technical University of Darmstadt\n33Baidu\n34Yale University\nCorresponding authors (\u22c6) can be contacted at contact@bigcode-project.org\nAbstract\nThe BigCode project,1 an open-scientific collaboration focused on the responsible development\nof Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership\nwith Software Heritage (SWH),2 we build The Stack v2 on top of the digital commons of their\nsource code archive. 
Alongside the SWH repositories spanning 619 programming languages,\nwe carefully select other high-quality data sources, such as GitHub pull requests, Kaggle\nnotebooks, and code documentation. This results in a training set that is 4\u00d7 larger than the\nfirst StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on\n3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM\nbenchmarks.\nWe find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size\non most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-\n15B, significantly outperforms other models of comparable size. In addition, it matches or\noutperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-\n33B is the best-performing model at code completion for high-resource languages, we find\nthat StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as\nseveral low-resource languages. We make the model weights available under an OpenRAIL\nlicense and ensure full transparency regarding the training data by releasing the SoftWare\nHeritage persistent IDentifiers (SWHIDs) of the source code data.\n1https://www.bigcode-project.org\n2https://www.softwareheritage.org/\n1\narXiv:2402.19173v1 [cs.SE] 29 Feb 2024\n Under review as submission to TMLR\n1\nIntroduction\nLarge Language Models for Code (Code LLMs; Chen et al., 2021; Nijkamp et al., 2023; Rozi\u00e8re et al., 2023;\nGuo et al., 2024) have rapidly emerged as powerful assistants for writing and editing code. As of January 30,\n2024, GitHub CoPilot has garnered over 1.3 million paying subscribers, with over 50,000 organisations opting\nfor the enterprise version (MSFT Q2 Earning Call, 2024), estimated to increase developer productivity by up\nto 56% as well as developer satisfaction (Peng et al., 2023; Ziegler et al., 2024). ServiceNow recently disclosed\nthat their \u201ctext-to-code\u201d solution, built from fine-tuning StarCoderBase models (Li et al., 2023), results in\na 52% increase in developer productivity (Yahoo Finance, 2024). Despite the initial focus on generating\ncode snippets from natural language instructions or other code snippets, Code LLMs exhibit the potential\nto enhance all phases of the software development cycle (Hou et al., 2023; Fan et al., 2023; Wang et al.,\n2024; Zhuo et al., 2023b; Chai et al., 2023). This includes speeding up the implementation of new projects,\nimproving quality assurance for developed software, helping detect and fix bugs, simplifying maintenance\ntasks, and easing migration to newer software.\nThe development process of LLMs can exhibit different levels of openness (Solaiman, 2023; Ding et al.,\n2022; Akiki et al., 2022). Proprietary models like OpenAI\u2019s GPT-4 (OpenAI et al., 2023) and Google\u2019s\nGemini (Gemini Team et al., 2023) provide access to the model through a paid API but do not disclose\ndevelopment details. On the other hand, open-weight models like Code LLaMa (Rozi\u00e8re et al., 2023),\nMistral (Jiang et al., 2023), and DeepSeekCoder (Guo et al., 2024) have released the model weights. This\nenables the open-source community to run these models locally, inspect the model representations, and fine-\ntune them on their tasks. However, the model developers have not disclosed their training data. 
Consequently,\ncontent creators do not know if their data was used for training, social scientists cannot scrutinize the dataset\nfor bias and toxicity, and LLM developers lack information as to what extent the training set is contaminated\nwith test benchmarks. More broadly, this practice hinders scientific progress as other research teams cannot\nreadily reuse each other\u2019s training data. Other LLM development projects, like Allen AI\u2019s OLMo (Groeneveld\net al., 2024), Eleuther AI\u2019s Pythia (Biderman et al., 2023), and BigScience\u2019s BLOOM (BigScience Workshop,\n2022; Scao et al., 2022a), have adopted a fully open development approach by releasing training data, training\nframeworks, and evaluation suites.\nThe BigCode project was established in September 2022 as an open scientific collaboration focused on the\nopen and responsible development of Code LLMs. BigCode is stewarded by ServiceNow and Hugging Face in\nthe spirit of open governance (BigCode collaboration et al., 2023) and has brought together more than 1,100\nmembers from diverse academic institutes and industry labs. The community previously released The Stack\nv1 (Kocetkov et al., 2023), a 6.4 TB dataset of permissively licensed source code in 384 programming languages.\nThe Stack v1 includes a governance tool called \u201cAm I in The Stack,\u201d designed for developers to verify if their\nsource code is included in the dataset. It also provides an opt-out process for those who prefer to exclude their\ncode from the dataset. In December 2022, the BigCode community released SantaCoder (Ben Allal et al.,\n2023), a strong-performing 1.1B parameter model trained on Java, JavaScript, and Python code from The\nStack v1. Building upon this success, the community further scaled up its effort and released StarCoder on\nMay 4th, 2023 (Li et al., 2023). At its release, the 15B parameter StarCoder model was the best open-access\nLLM for code.\nThis technical report describes the development process of The Stack v2 and StarCoder2. The Stack v2 builds\nupon the foundation of Software Heritage\u2019s vast source code archive, which spans over 600 programming\nlanguages. In addition to code repositories, we curate other high-quality open data sources, including Github\nissues, pull requests, Kaggle and Jupyter notebooks, code documentation, and other natural language datasets\nrelated to math, coding, and reasoning. To prepare the data for training, we perform deduplication, create\nfilters to eliminate low-quality code, redact Personally Identifiable Information (PII), remove malicious code,\nand handle opt-outs from developers who requested to have their code removed from the dataset. With this\nnew training set of 900B+ unique tokens, 4\u00d7 larger than the first StarCoder dataset, we develop the next\ngeneration of StarCoder models. We train Code LLMs with 3B, 7B, and 15B parameters using a two-stage\ntraining process (Rozi\u00e8re et al., 2023; Guo et al., 2024). We start base model training with a 4k context\nwindow and subsequently fine-tune the model with a 16k context window. We ensure that the training\nprocess does not exceed more than 5 epochs over the dataset (Muennighoff et al., 2023). However, we push\n2\n Under review as submission to TMLR\nthe number of training tokens far beyond the compute-optimal number suggested by Chinchilla (Harm\u2019s law;\nde Vries, 2023) and train relatively small models within the range of 3.3 to 4.3 trillion tokens. 
We thoroughly\nassess and compare the performance of these models on a suite of code LLM benchmarks (Cassano et al.,\n2023b; Austin et al., 2021; Chen et al., 2021; Liu et al., 2023a; Lai et al., 2023; Muennighoff et al., 2024a;\nCassano et al., 2024; Liu et al., 2023b; Ding et al., 2023; Gu et al., 2024; Cobbe et al., 2021; Pearce et al.,\n2022; Dhamala et al., 2021; Nozza et al., 2021; Gehman et al., 2020), finding that:\n\u2022 The StarCoder2-3B model outperforms other Code LLMs of similar size (StableCode-3B and\nDeepSeekCoder-1.3B) on most benchmarks. Moreover, it matches or surpasses the performance of\nStarCoderBase-15B.\n\u2022 The StarCoder2-15B model significantly outperforms other models of comparable size (CodeLlama-\n13B), and matches or outperforms CodeLlama-34B. DeepSeekCoder-33B is the best model at\ncode completion benchmarks for high-resource languages. However, StarCoder2-15B matches or\noutperforms DeepSeekCoder-33B on low-resource programming languages (e.g., D, Julia, Lua,\nand Perl). Moreover, when we consider benchmarks that require models to reason about code\nexecution (Gu et al., 2024) or mathematics (Cobbe et al., 2021), we find that StarCoder2-15B\noutperforms DeepSeekCoder-33B.\n\u2022 The StarCoder2-7B model outperforms CodeLlama-7B but is behind DeepSeekCoder-6.7B. It is not\nclear to this report\u2019s authors why StarCoder2-7B does not perform as well as StarCoder2-3B and\nStarCoder2-15B for their size.\n2\nData Sources\nIn this section, we elaborate on the process of obtaining training data, encompassing not just the data\nsourced from Software Heritage (\u00a72.1) but also GitHub issues (\u00a72.2), pull requests (\u00a72.3), Jupyter and Kaggle\nnotebooks (\u00a72.4), documentation (\u00a72.5), intermediate representations (\u00a72.6), small math and coding datasets\n(\u00a72.7), and other natural language datasets (\u00a72.8).\n2.1\nSource Code\nSoftware Heritage\nWe build the Stack v2 on top of the Software Heritage (SH) archive (Abramatic et al.,\n2018), maintained by the non-profit organization of the same name. The mission of Software Heritage is to\ncollect and preserve all knowledge taking the form of source code. We work with the SH graph dataset (Pietri\net al., 2020), a fully deduplicated Merkle DAG (Merkle, 1987) representation of the full archive. The SH\ngraph dataset links together file identifiers, source code directories, and git commits, up to the entire states\nof repositories, as observed during periodic crawls by Software Heritage.\nExtracting repositories\nWe leverage the 2023-09-06 version of the SH graph dataset as the primary\nsource. We start by extracting the most recently crawled versions of all GitHub repositories and filtering\nthem to retain only the main branch. The branch is considered main if the repository metadata in GHArchive\nlists it as the default branch or if its name is main or master. We only extract the latest revision (commit)\nfrom the main branch and deduplicate the repositories based on the unique hashes of their contents (column\ndirectory_id of the SH dataset). The repositories\u2019 directory structure is reconstructed by recursively\njoining the directory_entry table of the dataset to itself using the directory_id and target columns and\nconcatenating the directory and file names (column name) into full paths. We only traverse the directory tree\nup to level 64. 
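To make the path-reconstruction step above concrete, the following Python sketch expands a directory tree into full file paths with the same depth cap of 64. It is an illustrative in-memory re-implementation of the recursive directory_entry join, not the actual query run on the SH graph dataset; the dictionary layout and identifiers are assumptions.

from collections import deque

# Illustrative sketch: reconstruct full file paths from Software Heritage-style
# directory entries. Each directory id maps to a list of (name, target, is_dir)
# tuples, where `target` is a sub-directory id or a file (content) id. This
# mirrors the recursive self-join on directory_entry described above, with the
# traversal capped at level 64.
MAX_DEPTH = 64

def reconstruct_paths(root_directory_id, directory_entries):
    """directory_entries: dict mapping directory_id -> list of (name, target, is_dir)."""
    paths = []  # list of (full_path, file_id) pairs
    queue = deque([(root_directory_id, "", 0)])
    while queue:
        dir_id, prefix, depth = queue.popleft()
        if depth >= MAX_DEPTH:
            continue  # do not traverse deeper than level 64
        for name, target, is_dir in directory_entries.get(dir_id, []):
            full_path = f"{prefix}/{name}" if prefix else name
            if is_dir:
                queue.append((target, full_path, depth + 1))
            else:
                paths.append((full_path, target))
    return paths

# Toy usage with hypothetical identifiers:
entries = {
    "dir_root": [("src", "dir_src", True), ("README.md", "file_1", False)],
    "dir_src": [("main.py", "file_2", False)],
}
print(reconstruct_paths("dir_root", entries))
# [('README.md', 'file_1'), ('src/main.py', 'file_2')]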
The individual file contents are downloaded from the SH content S3 bucket if the compressed file size is less than 10MB.
License detection
We extract repository-level license information from GHArchive (Github Archive, 2024) for all repositories with matching names in the SWH dataset. When the repo-level license is not available, i.e., for 96.93% of repositories, we use the ScanCode Toolkit (ScanCode, 2024) to detect file-level licenses as follows:
\u2022 Find all files that could contain a license using a regular expression in Appendix A.3. This allows us to gather files that either explicitly contain a license (e.g., LICENSE, MIT.txt, Apache2.0) or contain a reference to the license (e.g., README.md, GUIDELINES);
\u2022 Apply ScanCode\u2019s license detection to the matching files and gather the SPDX3 IDs of the detected licenses;
\u2022 Propagate the detected licenses to all files that have the same base path within the repository as the license file.
Once the file-level license information is gathered, we decide whether the file is permissively licensed, non-permissively licensed, or unlicensed, following the algorithm described in Figure 1.
Figure 1: File-level license assignment logic. (Flowchart: if the GitHub repository license is present, it directly determines permissive vs. non-permissive; if it is empty, the ScanCode detections decide between permissive, non-permissive, and no license, with a file considered permissive only when all detected licenses are permissive.)
The licenses we consider permissive are listed in Appendix A.4. This list was compiled from the licenses approved by the Blue Oak Council (Blue Oak Council, 2024), as well as licenses categorized as \u201cPermissive\u201d or \u201cPublic Domain\u201d by ScanCode (ScanCode License Categories, 2024).
Data licenses
We consider three types of files: permissively licensed, non-permissively licensed (e.g., copyleft), and unlicensed files. The main difference between the Stack v2 and the Stack v1 is that we include both permissively licensed and unlicensed files. We exclude commercial licenses since their creators do not intend their code to be used for commercial purposes. We also exclude copyleft-licensed code due to uncertainty regarding the community\u2019s stance on using such data for LLM training and its relatively low volume.
Language detection
While the Stack v1 (Kocetkov et al., 2023) detects programming languages by their file extension, we instead rely on a language classifier. Specifically, we use go-enry, based on GitHub\u2019s linguist library (go-enry, 2024), to detect the programming language for each file. We detect 658 unique languages in TheStackV2-dedup, some of which get removed at the data inspection stage (see next paragraph).
3System Package Data Exchange, https://spdx.dev.
Table 1: A comparison of The Stack v1 and v2 on 32 popular programming languages. 
We show the size\nand number of files for different data splits: The Stack v1 deduped, The Stack v2 deduped, and the training\ndata used for StarCoder2-15B.\nThe-stack-v1-dedup\nThe-stack-v2-dedup\nThe-stack-v2-swh-full\nLanguage\nSize (GB)\nFiles (M)\nSize (GB)\nFiles (M)\nSize (GB)\nFiles (M)\nAssembly\n1.58\n0.25\n13.02\n0.77\n7.74\n0.70\nBatchfile\n0.29\n0.25\n2.11\n1.13\n1.02\n0.99\nC\n57.43\n8.53\n202.05\n20.78\n114.92\n19.18\nC#\n46.29\n10.84\n239.89\n51.23\n169.75\n48.49\nC++\n50.89\n6.37\n353.89\n43.18\n211.33\n42.23\nCMake\n0.45\n0.19\n2.58\n1.74\n2.27\n1.70\nCSS\n22.61\n2.99\n161.68\n23.87\n8.00\n1.88\nDockerfile\n0.572\n0.42\n1.27\n1.90\n1.21\n1.88\nFortran\n0.17\n1.84\n4.66\n0.27\n3.61\n0.26\nGo\n25.74\n4.73\n54.60\n9.30\n25.83\n8.62\nHaskell\n2.36\n0.54\n5.11\n1.25\n4.17\n1.23\nHTML\n146.76\n9.53\n2,419.87\n90.23\n99.09\n5.23\nJava\n89.30\n20.15\n548.00\n154.28\n199.68\n62.27\nJavaScript\n141.65\n21.11\n1,115.42\n108.87\n199.99\n66.91\nJulia\n1.54\n0.30\n6.12\n0.45\n1.83\n0.43\nLua\n3.28\n0.56\n33.91\n2.35\n15.22\n2.24\nMakefile\n1.49\n0.66\n21.30\n4.22\n5.19\n2.78\nMarkdown\n75.25\n21.0\n281.04\n82.78\n244.17\n81.42\nPerl\n2.63\n0.39\n7.82\n1.15\n5.66\n1.06\nPHP\n66.84\n15.90\n224.59\n46.03\n183.70\n45.14\nPowerShell\n1.25\n0.27\n3.97\n0.68\n2.46\n0.66\nPython\n64.30\n12.96\n233.29\n56.93\n191.61\n56.19\nR\n0.30\n0.04\n22.39\n5.15\n19.05\n4.29\nRuby\n7.14\n3.41\n31.70\n17.79\n23.38\n17.51\nRust\n9.53\n1.38\n15.60\n2.22\n12.43\n2.19\nScala\n4.86\n1.36\n12.73\n4.45\n11.30\n4.32\nShell\n3.38\n22.69\n19.82\n10.68\n13.51\n10.01\nSQL\n12.22\n0.99\n281.45\n5.29\n35.75\n4.52\nSwift\n0\n0\n23.76\n7.23\n22.32\n7.16\nTeX\n5.44\n0.55\n35.86\n3.19\n30.01\n2.86\nTypeScript\n28.82\n10.64\n61.01\n23.85\n49.14\n23.28\nVisual Basic\n1.49\n0.16\n16.63\n1.06\n7.48\n0.81\nTotal\n875.85\n181.00\n6,457.14\n784.30\n1,922.82\n528.44\nVisual data inspection\nSimilar to the first StarCoder, we involve the BigCode community in a data\ninspection sprint to remove extensions with low-quality training data. We start from the annotations of the\nprevious iteration that eliminated 36 out of the 300 extensions (of the 86 included programming languages).\nFor StarCoder2, we only ran the data inspection for the not-yet-annotated programming languages (i.e.,\nexcluding the 86 languages of StarCoderBase). To streamline this process, we limited our inspection to\nextensions that include over 1,000 files and represent over 0.5% of the files in their respective languages. The\nremaining extensions were retained without further inspection, as they only make up a small volume. With\nthe help of 15 annotators from the BigCode community, we visually inspected around 1000 extensions and\nexcluded 130 (see appendix A.1 for the complete list). Our data inspection step excluded 39 programming\nlanguages from the dataset (appendix A.2), resulting in a final count of 619 programming languages.\nBasic filters\nWe apply a set of basic filters to the dataset to remove autogenerated files, data files, or other\nlow-quality training data.\n5\n Under review as submission to TMLR\n\u2022 Long line filters: we first remove all files with more than 100k lines as those files are likely to be data\nor generated code. We also remove files with an average line length of more than 100 characters or\na maximum line length of more than 1000 characters for all languages, excluding HTML, JSON,\nMarkdown, Roff, Roff Manpage, SMT, TeX, Text, and XML. 
For the mentioned languages, we\nremove files where the longest line exceeds 100k characters.\n\u2022 Autogenerated filter: we remove files classified as auto-generated by the is_generated function\nof go-enry (go-enry, 2024). Additionally, we exclude files containing one of {\u201cauto-generated\u201d,\n\u201cautogenerated\u201d, \u201cautomatically generated\u201d, \u201cgenerated automatically\u201d, \u201cthis file is generated\u201d} in\nthe first 5 lines of the file.\n\u2022 Alpha filter: we remove files with less than 25% of alphabetic characters for all languages except\nMotorola 68K Assembly and WebAssembly, where we only remove files with less than 25% of\nalpha-numeric characters due to the syntax of those languages.\n\u2022 Encoded data filter: we detect files with inline encoded data using the following regular expressions:\n\u2013 Base64 strings: [a-zA-Z0-9+/\\n=]{64,}\n\u2013 Hexadecimal sequences: (?:\\b(?:0x|\\\\x)?[0-9a-fA-F]{2}(?:,|\\b\\s*)){8,}\n\u2013 Unicode strings: (?:\\\\u[0-9a-fA-F]{4}){8,}\nWe remove the file if any of the substrings matching these expressions is longer than 1024 characters\nor if the fraction of matched characters is more than 50% of the file.\nLanguage-specific filters\nIn addition to the basic filters, we apply the following set of language-specific\nfilters.\n\u2022 For Text, JSON, YAML, Web Ontology Language, and Graphviz (DOT), we remove files with more\nthan 512 lines to minimize the impact of repeated tokens in data files.\n\u2022 For HTML, we keep only the files where visible text is at least 100 characters long and makes up at\nleast 20% of the code, similar to the processing pipeline of StarCoder (Li et al., 2023).\n\u2022 For Text, we keep only files with \u201crequirement\u201d in the lowercased filename, or if the filename without\nthe extension is one of {\u201creadme\u201d, \u201cnotes\u201d, \u201ctodo\u201d, \u201cdescription\u201d, \u201ccmakelists\u201d}.\n2.2\nGithub Issues\nWe incorporate GitHub issues collected from GHArchive (Github Archive, 2024). We exclude pull requests\nhere as we process them separately in \u00a72.3.\nA Github issue consists of a series of events with actions, such as opening the issue, creating a comment, or\nclosing the issue. Each event includes the author\u2019s username, a message, an action, and a creation date. We\nfollow the processing pipeline of StarCoder (Li et al., 2023), which we recap below:\n\u2022 First, we removed auto-generated text when users replied to issues via email (for more information,\nsee Li et al., 2023, Appendix A). We also deleted issues with a short message (less than 200 characters)\nand truncated long comments in the middle to a maximum of 100 lines while retaining the last\n20 lines. This removed 17% of the volume \u2014 a similar percentage as in StarCoderBase.\n\u2022 Next, we excluded comments from bots. To do so, we searched for keywords in the username of\nthe comment\u2019s author (for more information, see Li et al., 2023, Appendix A). This step eliminated\n3% of the issues, much less than the 17% reported in StarCoder (Li et al., 2023). This discrepancy\nis primarily because our dataset does not include pull requests, which are often the source of a\nsignificant proportion of bot-generated content.\n6\n Under review as submission to TMLR\n\u2022 We used the number of users engaged in the conversation as an indicator of quality. Our criterion was\nto include conversations that have two or more users. 
However, we also preserved conversations that\ninvolved a single user if the total text within comments was less than 7,000 characters (96th percentile).\nAdditionally, we excluded issues authored by a single user if they contained more than ten events, as\nthey tended to be of poor quality or originate from overlooked bots. By implementing these filters,\nwe removed 38% of the remaining issues. Lastly, we anonymized the usernames in the conversations\nby replacing them with a participant counter within the conversation (following the process of\nStarCoder).\n2.3\nPull Requests\nWe include code reviews by gathering pull request events from GHArchive (Github Archive, 2024) and the\ncorresponding source code from Software Heritage (Software Heritage, 2024b). Pull requests are requests to\nmerge particular code changes from one branch into another on GitHub. Typically, they involve multiple\nrounds of code review discussions and additional cycles of code changes before they get merged into the\ntarget branch.\nData collection\nSpecifically, for each pull request, we aggregate the PullRequestEvent, PullRequestReview-\nEvent, PullRequestReviewCommentEvent, IssueCommentEvent, and IssuesEvent events found on GHArchive.\nMore details about the differences between these events can be found in the Github documentation. Next,\nwe extract all base and head commit IDs from these events and retrieve the corresponding code files from\nSoftware Heritage. As we do not have access to the commit diffs, we generate them by identifying changes\nbetween files at the same path. We consider files present in the base but absent in the head as deletions, while\nwe consider files absent in the base but present in the head as additions. This process yields approximately\n300M PRs, accompanied by a volume of 15 TB of base code. Among these, there are 215M closed PRs\noriginating from around 24M repositories.\nPR filters\nWe remove PRs that 1) have been opened by bots, 2) consist only of comments by bots, 3) have\na non-permissive license, 4) have been opted out, 5) changes the base during the PR, 6) are not approved or\nmerged, or 7) lack initial diffs (either due to absent data from Software Heritage or because all data have\nbeen filtered in other steps).\nFile filters\nWe remove files from the base commit if they satisfy one of the following conditions: 1) the\nfile is a deletion or addition, 2) the file length exceeds 1 million characters, 3) the fraction of alphanumeric\ncharacters is less than 0.25, 4) the fraction of hexadecimal characters is greater than 0.25, 5) the max number\nof lines surpasses 100,000, 6) the average line length exceeds 100, 7) the max line length surpasses 1,000, or\n8) the presence of non-English text in Markdown\nTitle and description filtering\nWe apply the following heuristic filters to clean up the PRs further. We\nexclude PRs with changes to the base, those not approved or merged, and those lacking initial diffs (either\ndue to absent data from Software Heritage or being filtered out in previous steps). We also exclude PRs\nwhen the title is less than 10 characters or contains the words \u2019dependencies\u2019, \u2019dependency\u2019, \u2019depend\u2019, or\n\u2019release\u2019. We exclude PRs when the description is less than 20 characters or contains \u2019Qwiet\u2019.\nTruncating inputs\nWe shorten lengthy input fields in the PRs as follows. We truncate titles to 500\ncharacters and descriptions to 80 lines, only displaying the first 60 and the last 20 lines. 
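To summarize the title/description heuristics and truncation rules above, the following Python sketch applies them to a single pull request. The thresholds and keywords are taken from the text; the dictionary layout, the substring-based keyword matching, and the helper names are simplifying assumptions, and the further character-level cap on descriptions is applied afterwards, as described next.

# Sketch of the PR title/description filters and truncation described above.
BANNED_TITLE_WORDS = ("dependencies", "dependency", "depend", "release")

def keep_pr(pr: dict) -> bool:
    title = pr.get("title", "").strip()
    description = pr.get("description", "").strip()
    if len(title) < 10 or any(w in title.lower() for w in BANNED_TITLE_WORDS):
        return False
    if len(description) < 20 or "Qwiet" in description:
        return False
    return True

def truncate_fields(pr: dict) -> dict:
    pr = dict(pr)
    pr["title"] = pr.get("title", "")[:500]  # titles truncated to 500 characters
    lines = pr.get("description", "").splitlines()
    if len(lines) > 80:  # keep only the first 60 and the last 20 lines
        lines = lines[:60] + lines[-20:]
    pr["description"] = "\n".join(lines)
    return pr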
If the description\nlength still exceeds 1000 characters, we truncate it.\nProcessing comments\nFollowing the processing of GitHub issues (\u00a72.2), we remove comments from bots\nand strip auto-generated text when users post via email reply. We anonymize the usernames of authors as\ndescribed in \u00a73.2. We remove comments from PRs with less than 20 characters unless they are PR review\ncomments. For code review comments, we remove the full diff hunk if it exceeds 10,000 characters while\nkeeping the filename and comment.\n7\n Under review as submission to TMLR\nSubsampling PRs\nTo increase the diversity in the PRs, we sub-sample them on a per-repository basis.\nFor repositories with 1 PR (after filtering), we retain it with a probability of 0.8. We linearly decrease this\nretention probability to 0.1 for repositories with 1,000 PRs. For repositories with more than 1,000 PRs, we\nset the retention probability such that we retain only 100 PRs. Finally, we sub-sample YAML and JSON files\nwith 10% retention probability when their file size exceeds 50% of the total base files size or when the file\npath contains one of the keywords: \u2019pack\u2019, \u2019lock\u2019, \u2019yarn\u2019, \u2019output\u2019, \u2019swagger\u2019, \u2019openapi\u2019, or \u2019output\u2019.\nMax sequence length\nWe determine the maximum sequence length of PRs by first investigating the\ndata distribution after the processing steps mentioned above. We find 3.7M PRs with up to 1M characters,\nresulting in 194 GB of data. This reduces to 3.3M PRs when we set a limit of 100K characters, resulting in a\ndataset size of 67.3 GB. (appendix A.5 has more details about sequence length statistics.) For the StarCoder2\nmodels, we opt to include PRs with up to 100K characters (translating to roughly 25k tokens). Since we\nare pre-training with a limited context of 4K tokens, not all PRs fit into the context window. However, as\ndescribed in \u00a75.2, we format the PRs so that the diffs are local and do not require long context.\n2.4\nNotebooks\nWe include notebooks from two separate sources: Jupyter notebooks extracted from the Software Heritage\narchive and notebooks released by the Kaggle platform.\n2.4.1\nJupyter Notebooks\nWe transform Jupyter Notebooks into scripts and structured notebooks following the same pipeline as\nStarCoder (Li et al., 2023). One key difference is that we keep the markdown structure of the text blocks\nwhile it is removed in StarCoder. For completeness, we recap these preprocessing steps below.\nJupyter \u2013 scripts\nWe utilize Jupytext4 to convert notebooks to scripts. To initiate the conversion process,\nJupytext requires the identification of the specific programming languages within each notebook. This\ninformation is typically available in the metadata of most notebooks. In cases where it is not, we use the\nGuesslang library5 to identify the programming language, using a probability threshold of 0.5 or higher. Our\ninitial dataset comprised 11 million notebooks, of which 3 million were excluded due to parsing errors. After\nnear-deduplication, the dataset was reduced to 4 million notebooks converted to scripts.\nJupyter \u2013 structured\nTo create this dataset, we first filtered out notebooks that did not contain any\nPython code or Markdown text using the metadata information of each notebook. Only notebooks explicitly\nmarked as \u2018Python\u2019 in the metadata were kept. 
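As an illustration of the script-conversion pipeline for Jupyter notebooks described above, the sketch below converts one notebook with Jupytext, falling back to Guesslang when the metadata does not state the language. It is a simplified sketch: error handling is omitted, only Python notebooks are emitted here, and the 0.5 probability threshold mentioned above is not applied.

import jupytext
from guesslang import Guess

# Sketch of the notebook-to-script conversion: read the .ipynb JSON with
# Jupytext, take the language from the notebook metadata when available, and
# otherwise guess it from the concatenated code cells with Guesslang.
def notebook_to_script(ipynb_text: str):
    notebook = jupytext.reads(ipynb_text, fmt="ipynb")
    language = notebook.metadata.get("kernelspec", {}).get("language")
    if not language:
        code = "\n".join(c.source for c in notebook.cells if c.cell_type == "code")
        language = Guess().language_name(code)
    if not language or language.lower() != "python":
        return None  # non-Python notebooks are handled analogously with other script formats
    return jupytext.writes(notebook, fmt="py:percent")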
Then, for each notebook, consecutive Markdown blocks\nor code blocks were merged into a single Markdown or code block, respectively. Eventually, we ended up\nwith consecutive code-text pairs in temporal order grouped by each notebook. Each Jupyter code-text pair\ncontained the Markdown text immediately preceding the code block and the Python code, forming a natural\ninstruction pair. We also included the formatted output of a code block if the output cell was non-empty;\notherwise, it was marked by a special token. If consecutive code blocks have multiple output\ncells before merging, we only retain the output of the last code block. After these preprocessing steps and\nnear-deduplication, we ended up with 4.6M structured Jupyter notebooks.\n2.4.2\nKaggle Notebooks\nWe include Python notebooks released by the Kaggle platform6 under an Apache 2.0 license, starting with an\ninitial dataset of 3.6M notebooks. Note that this Kaggle dataset does not include the output cells, only the\nmarkdown and code cells.\nCleaning\nWe start the data cleaning process by dropping notebooks with less than 100 characters and\nthose with syntax errors. We also remove the templated text at the beginning of notebooks (see appendix A.7\n4https://jupytext.readthedocs.io/\n5https://guesslang.readthedocs.io/\n6https://www.kaggle.com/datasets/kaggle/meta-kaggle-code\n8\n Under review as submission to TMLR\nfor the templates). These steps remove 18% of the notebooks. Next, we convert the notebooks to the\nstructured and script format, following the processing of the Jupyter notebooks in \u00a72.4.1. Finally, we remove\nnear-duplicates using the pipeline described in \u00a73.1, eliminating 78% of the notebooks and leaving us with\n580k notebooks.\nDataset description\nTo provide the model with more context regarding the content and objectives of the\nnotebook, we include metadata about the Kaggle dataset whenever this information is available. We find\nthat 42% of the notebooks are associated with a Kaggle dataset and include its title and description at the\nbeginning of each notebook.\nDataset schema\nIn addition to these high-level dataset descriptions, we scanned the code inside the\nnotebooks for instances of read_csv. We found that 25% of the samples were loading CSV datasets. We\nextracted and incorporated detailed information about these datasets as follows. First, we used the Kaggle\nAPI to download the datasets and successfully retrieved 8.6% of the notebooks.\nThe remaining cases\nwere attributed to either the dataset being unavailable or encountering challenges downloading it within a\nreasonable time frame. For the downloaded datasets, we prefix the output of df.info() to the notebook,\nwhich displays the column names and their dtypes, the non-null values count, and the memory usage. We\nalso include four sample rows from the dataset.\n2.5\nDocumentation\nDocumentation from package managers\nWe crawl documentation from several package manager\nplatforms, including npm, PyPI, Go Packages, Packagist, Rubygems, Cargo, CocoaPods, Bower, CPAN,\nClojars, Conda, Hex and Julia. We first retrieve the names of the most popular libraries across various\nplatforms from libraries.io. These library names are then used to search through individual package managers,\nenabling us to obtain the respective homepages for each library. We systematically crawled the documentation\nfiles from the obtained homepage links or, alternatively, extracted information from the provided README\nor documentation files on the platform. 
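Returning briefly to the Kaggle dataset schema of Section 2.4.2, the sketch below shows one way to produce the df.info() output and the four sample rows that are prefixed to a notebook. The exact rendering used in the pipeline may differ; the function name and formatting are illustrative.

import io
import pandas as pd

# Sketch of the Kaggle dataset-schema augmentation: capture the output of
# df.info() (column names, dtypes, non-null counts, memory usage) together with
# four sample rows, so that the snippet can be prefixed to the notebook.
def dataset_schema_snippet(csv_path: str) -> str:
    df = pd.read_csv(csv_path)
    buffer = io.StringIO()
    df.info(buf=buffer)
    examples = "\n\n".join(str(row) for _, row in df.head(4).iterrows())
    return buffer.getvalue() + "\nExamples:\n" + examples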
For documents obtained through homepage links, we adhere to the same processing strategy outlined below in the paragraph titled \u201cDocumentation from websites\u201d. When extracting documents from the README or documentation files on the platform, we employ distinct heuristics to extract the text using markdown formats whenever feasible, aiming to maintain a simple and effective format. It is worth noting that many libraries available on PyPI and Conda have their associated documentation hosted on Read the Docs, which typically offers more comprehensive documentation. Consequently, we prioritize utilizing Read the Docs as the primary source of documentation for these libraries. For these documents hosted on Read the Docs, we follow the same processing procedure outlined in the paragraph titled \u201cDocumentation from websites\u201d.
PDFs from package managers
For documents related to the R language, we extracted text from all PDF files hosted on CRAN using the pdftotext library.7 This library is particularly effective in preserving the formatting, including spaces within code snippets. For LaTeX-related documentation, we extracted the documentation, tutorial, and usage guide PDFs of LaTeX packages from CTAN, filtered out image-heavy PDFs, and converted the rest into markdown using the Nougat neural OCR tool.
Documentation from websites
We collect code documentation from a carefully curated list of websites as detailed in Table 2. We start by systematically exploring each website from its initial URL listed in Table 2, using a queue to store URLs within the same domain. This queue expands dynamically as we discover new links during the crawl. Given that most documents comprise HTML pages, we focus our processing pipeline on (1) content extraction and (2) content concatenation. To extract the content, we utilize the trafilatura library8 to convert each HTML page into XML format, simultaneously eliminating redundant navigation and index bars, elements that often recur in documentation. Next, we converted the XML format to markdown using our XML-to-Markdown conversion script. In the second stage, to compile these documents into a single text, we first do a near-deduplication of the content extracted from different HTML pages. This step was essential since we have observed that certain document pages only comprise website layouts (e.g., navigation bars) instead of fruitful information for documents, resulting in a substantial amount of duplicated content. To accomplish this, we treat each HTML page from a single website as a cluster and apply the minhash locality-sensitive hashing technique to identify and eliminate similar pages, using a threshold of 0.7. Finally, we assemble the gathered content from different pages of the same website in the order of web page crawling, ensuring a cohesive narrative. This parallels the \u201cbreadth-first search\u201d approach, where all nodes at the current depth are explored before proceeding to the next depth level.
7https://github.com/jalan/pdftotext
8https://github.com/adbar/trafilatura
Figure 2: The distribution of the top 20 programming languages in our crawled documentation collection. (Bar chart over number of occurrences; the top languages include CSS, Haskell, HTML, Perl, PHP, Julia, JSON, SQL, Objective-C, YAML, Markdown, TeX, Ruby, Python, Erlang, an Unknown category, Rust, JavaScript, Go, and R.)
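To make the crawling and concatenation pipeline above concrete, the following Python sketch performs a breadth-first, same-domain crawl and extracts page content in crawl order. It is a simplified illustration: politeness (robots.txt, rate limiting), retries, the XML-to-Markdown conversion, and the minhash-based page deduplication are omitted, and the use of requests, BeautifulSoup, and the trafilatura output format reflects a plausible reading of those libraries rather than the exact production setup.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
import trafilatura
from bs4 import BeautifulSoup

def crawl_documentation(start_url: str, max_pages: int = 500):
    """Breadth-first crawl restricted to the start URL's domain; returns the
    extracted content of each page in crawl order."""
    domain = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        extracted = trafilatura.extract(html, output_format="xml")
        if extracted:
            pages.append(extracted)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            next_url = urljoin(url, link["href"]).split("#")[0]
            if urlparse(next_url).netloc == domain and next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return pages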
Also, we collected code-relevant\ndata from existing web crawls such as RefinedWeb (Penedo et al., 2023), OSCAR (Ortiz Su\u00e1rez et al.,\n2019), and esCorpius (Guti\u00e9rrez-Fandi\u00f1o et al., 2022). We use regular expressions to identify programming\nlanguage-specific constructs within the documents and to detect the \u201cdocs.\u201d substring in the page URLs.\nThe resulting dataset primarily comprises content sourced from programming blogs, coding tutorials, and\nplatforms like Read the Docs, with the exclusion of the documents gathered above.\nFree textbooks\nWe scraped free programming books compiled in the Free Programming Books project,\nwhich aims at promoting the distribution of free programming e-books. First, we extract all links and identify\nthose with a PDF extension. Subsequently, we downloaded all available PDF files and utilized the pdf2text\nlibrary to extract text from these PDF files. Finally, we parsed 3,541 books whose languages span across\ndifferent regions, including English, Chinese, Japanese, Spanish, and others.\nLanguage identification\nFinally, we have employed a dual approach to identify the main programming\nlanguage used by each document. We leverage predefined rules when the source of the document unequivocally\ncorresponds to a specific programming language and resort to the guesslang9 library in cases where such\ncorrespondence is not explicit. The resultant programming language distribution is graphically represented in\nFigure 2.\n2.6\nIntermediate Representations\nWe augment source code by pairing its intermediate representations (IR) to enhance the model\u2019s understanding\nof low-resource programming languages. The key rationale behind this approach is that a shared intermediate\n9https://github.com/yoeo/guesslang\n10\n Under review as submission to TMLR\nTable 2: The websites scraped for the code documentation dataset.\nWebsite Name\nURL\nDevDocs API Documentation\nhttps://devdocs.io\nMDN Web Docs\nhttps://developer.mozilla.org\nTensorFlow Docs\nhttps://www.tensorflow.org\nLinux Docs\nhttps://www.kernel.org/doc/Documentation\nSwift Programming Language\nhttps://docs.swift.org/swift-book/documentation/the-swift-programming-language\nFlutter API Reference\nhttps://api.flutter.dev\nTypeScript\nhttps://www.typescriptlang.org/docs/handbook\nJson.NET Documentation\nhttps://www.newtonsoft.com/json/help/html\nNVIDIA Documentation Hub\nhttps://docs.nvidia.com\nOracle Java Tutorial\nhttps://docs.oracle.com/javase/tutorial/java\nQiskit Documentation\nhttps://qiskit.org/documentation\nQ# Quantum Programming\nhttps://learn.microsoft.com/en-us/azure/quantum/user-guide\nPony Tutorial\nhttps://tutorial.ponylang.io\nZephir Documentation\nhttps://docs.zephir-lang.com/0.12/en/introduction\nQemu Documentation\nhttps://www.qemu.org/documentation\nC# Documentation\nhttps://learn.microsoft.com/en-us/dotnet/csharp\nHugging Face Documentation\nhttps://huggingface.co/docs\nLLVM Doc\nhttps://llvm.org/docs\nGCC Online Documentation\nhttps://gcc.gnu.org/onlinedocs\nMatlab Documentation\nhttps://www.mathworks.com/help/matlab\nBoost C++ Libraries\nhttps://www.boost.org/doc\nMaxima Manual\nhttps://maxima.sourceforge.io/docs/manual/maxima_singlepage.html\nQt Documentation\nhttps://doc.qt.io\nrepresentation might help to anchor low-resource constructs to similar ones in high-resource languages (Zhuo\net al., 2023b).\nLLVM\nWe select LLVM (Lattner & Adve, 2004) as the intermediate representation due to its widespread\navailability on GitHub, increasing the probability that 
there is sufficient training data to learn the semantics\nof the language. In addition, LLVM is widely adopted as an IR and is the target representation of many\ncompiler frontends across several programming languages.10\nData collection\nExisting attempts to extract IR from free-form source code either suffer from low\ncompilation success rates (Szafraniec et al., 2023) or use bespoke language-specific mechanisms to track\ndependency code to compile successfully (Grossman et al., 2023). We sidestep this by sourcing self-contained\ncompilation units from accepted solutions to programming word problems (Rosetta Code, 2023; Mirzayanov,\n2020; Puri et al., 2021; Caballero et al., 2016). We compile \u22484M sources in total across C++, C, Objective-C,\nPython, Rust, Go, Haskell, D, Fortran, Swift, and Nim in size optimized (-OZ equivalent) and performance\noptimized (-O3 equivalent) mode. We opt to use the size-optimized IR in most of the pairs due to context\nlength considerations. However, for 20% of the pairs, we use the performance-optimized IR. This is done to\nmaximize transfer from the pre-training stage, where the model sees LLVM code in the wild, which is more\nlikely to be in this form. We use clang11 for compiling C++, C and Objective-C, codon12 for compiling\nPython, rustc13 for compiling Rust, gollvm14 for compiling Go, ghc15 for compiling Haskell, ldc16 for\ncompiling D, flang17 for compiling Fortran, and nlvm18 for compiling Nim. We clean headers along with\nsuperfluous platform, vendor, and memory layout-specific information from the IR before pairing it with its\nsource.\n10https://llvm.org/ProjectsWithLLVM/\n11https://clang.llvm.org/\n12https://docs.exaloop.io/codon\n13https://www.rust-lang.org/\n14https://go.googlesource.com/gollvm/\n15https://www.haskell.org/ghc/\n16https://wiki.dlang.org/LDC\n17https://flang.llvm.org/docs/\n18https://github.com/arnetheduck/nlvm\n11\n Under review as submission to TMLR\n2.7\nLHQ19\nWe include several small high-quality datasets for math and coding:\n\u2022 APPS (train) (Hendrycks et al., 2021) is a popular text2code benchmark in Python with a train\nset of 5,000 examples. We include one solution per programming problem.\n\u2022 Code Contest (Li et al., 2022) is similar to APPS but includes solutions in several programming\nlanguages, namely Python 2/3, C++, and Java. We include one solution per problem and language\nand arrive at a dataset of 13k+ examples.\n\u2022 GSM8K (train) (Cobbe et al., 2021) is the train split of GSM8K, a popular evaluation benchmark\nfor testing the math reasoning capabilities of LLMs. The dataset consists of 7k+ examples.\n\u2022 GSM8K (SciRel) (Yuan et al., 2023) is an augmented version of GSM8K that includes alternative\nreasoning paths for the questions in GSM8K. The extended version contains 110k examples.\n\u2022 Deepmind Mathematics (Saxton et al., 2019) is a synthetic dataset of math questions and\nanswers across various domains (algebra, arithmetic, calculus, comparison, measurement, numbers,\npolynomials, probability) and varying difficulty (easy-medium-hard). The dataset consists of 110M+\n(short) examples.\n\u2022 Rosetta Code (Rosetta Code, 2023; Nanz & Furia, 2015) is a dataset with over 1100 everyday\nprogramming tasks with solutions in as many different programming languages as possible.\n\u2022 MultiPL-T (Cassano et al., 2023a) is high-quality data in Lua, Racket, and OCaml based on\nautomatically translating extracted Python functions and validating them with unit tests. 
The total\ndataset comprises over 200k examples.\n\u2022 Proofsteps is part of the AlgebraicStack (Azerbayev et al., 2024), a dataset used to train the Lemma\nfamily of models. We also include proofsteps-lean, which was extracted from mathlib 4 (mathlib\nCommunity, 2020), and proofsteps-isabelle, which was built on top of the PISA dataset (Jiang\net al., 2021). Proofsteps-lean contains over 3k examples, while proofsteps-isabelle contains over 250k\nexamples.\n2.8\nOther Natural Language Datasets\nStackOverflow\nWe include 11 million questions and their corresponding multiple responses from the Stack\nOverflow dump dated 2023-09-14 (StackExchange Archive, 2024). We filtered out questions with fewer than\nthree answers. Upon inspecting the dataset, we found many mismatches between questions and answers\ndue to inherent format errors in the Stack Overflow dump. We leveraged Llama-2-70b-chat-hf (Touvron\net al., 2023) to increase the quality of the dataset as follows. We selected 20,000 examples and asked\nLlama-2-70b-chat-hf to rate the question-answer pairs. See Appendix A.6 for the exact prompt. Next,\nwe pick the 10,000 highest-scoring pairs as positive examples and use the remaining 10,000 answers to\ncreate negative examples by randomly pairing them with other questions. We use this dataset to train a\nbinary classifier by embedding the question and answer with a well-performing sentence embedding model\n(sentence-transformers/all-MiniLM-L12-v220 (Reimers & Gurevych, 2019; Muennighoff et al., 2022a))\nand minimizing the cosine distance between them. Next, we plot the embedding scores for a subset of the\nquestion-answer pairs and manually determine the threshold to 0.1. As a question can have multiple answers,\nwe average the scores of question-answer pairs and remove all questions with an average score below 0.1. We\nend up with 11.4 million questions and over 10B tokens.\nArXiv\nWe include the ArXiv subset of the RedPajama dataset (Together Computer, 2023). This dataset is\ndownloaded from the publicly available Amazon S3 bucket (Arxiv, 2024). We further processed the dataset\nonly to retain latex source files and remove preambles, comments, macros, and bibliographies from these files.\nThe final dataset is roughly 30B tokens.\n19Leandro\u2019s High-Quality dataset\n20https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2\n12\n Under review as submission to TMLR\nWikipedia\nWe include the English subset of Wikipedia. Specifically, we use the version collected by\nRedPajama (RedPajama Wiki, 2024), which is derived from the 2023-03-20 dump. We follow RedPajama\u2019s\nprocessing steps and eliminate hyperlinks and templates from the Wikipedia pages. The full dataset comprises\naround 6 billion tokens.\nOpenWebMath\nWe include OpenWebMath (Paster et al., 2023), an open dataset of high-quality mathe-\nmatical text extracted from CommonCrawl. The full dataset comprises almost 15B tokens.\n3\nPreprocessing Pipeline\nWe apply several preprocessing steps, such as deduplication (\u00a73.1), PII redaction (\u00a73.2), benchmark decon-\ntamination (\u00a73.3), malware removal (\u00a73.4), and opt-out deletion requests (\u00a73.5), to the data sources described\nin the previous section. Since not all steps are applied to each data source, we summarize the preprocessing\npipeline per data source in Table 3.\n3.1\nRemoving Near-Duplicates\nWe deduplicate the source code, pull requests, notebooks, issues, and documentation. 
We do not deduplicate\nthe already preprocessed natural language datasets, such as Arxiv, StackExchange, OpenWebMath, Wikipedia,\nand the small high-quality math and reasoning datasets.\nWe followed the deduplication pipeline of SantaCoder (Ben Allal et al., 2023). This process first calculates\nthe MinHashes (Broder, 2000) of all code files and then utilizes Locally Sensitive Hashing (LSH) to group\nfiles based on their MinHash fingerprints. During the LSH stage, \u201csimilar\u201d files are assigned to the same\nbuckets, identifying them as duplicates. Only one file from each duplicate group is chosen. In addition to the\nSantaCoder approach, to preserve repository context, we prioritize files from repositories with higher star\nand fork counts or from the latest commit date as a tiebreaker. We used 5-grams and a Jaccard similarity of\n0.7. We refer to this blogpost for more background information regarding the deduplication pipeline.\n3.2\nPII Redaction\nTo reduce the likelihood of re-distributing Personally Identifiable Information (PII) present in the training data,\nwe make diligent efforts to redact PII from the training set. We largely follow the steps from StarCoder (Li\net al., 2023) and leverage the StarPII model to redact various PII entities. Below, we provide more details on\nhow we apply it to each data source.\nRedacting PII entities\nWe use StarPII to redact names, emails, keys, passwords, IP addresses, and\nusernames from source code, pull requests, issues, and StackOverflow. We do not make any modifications\nto the model or redaction logic described in the StarCoder paper (Li et al., 2023). For OpenWebMath and\ndocumentation, we only redact names, keys, and emails, while we only redact emails for arXiv using the regex\ndescribed in Ben Allal et al. (2023).\nRedacting usernames\nThe conversations in issues, pull requests, and StackOverflow often contain\nusernames in the message thread.\nWe anonymize the author usernames by substituting them with a\nparticipant counter specific to the conversation, like username_1 to represent the second participant. These\npseudonyms are added at the start of each comment to maintain the speaker\u2019s identity. Moreover, any\nreferences to these usernames in the messages are removed. Only the usernames of actively participating\nindividuals in the conversation are masked, and mentions of non-participating users remain unaffected.\n3.3\nDecontamination\nTo ensure the performance of StarCoder is not artificially inflated on our test benchmarks, we decontaminate\nthe training set from our test sets. Specifically, we remove files that contain docstrings or solutions from\nHumanEval and MBPP, docstrings from APPS, questions from GSM8K, or prompts from DS1000. 
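To make the near-deduplication of Section 3.1 concrete, the sketch below groups files with MinHash and locality-sensitive hashing, using 5-gram shingles and the 0.7 Jaccard threshold quoted above. The datasketch library is used purely for illustration; the actual pipeline follows SantaCoder's implementation, and the star/fork/commit-date tie-breaking between duplicates is not shown.

from datasketch import MinHash, MinHashLSH

# Sketch of MinHash-LSH near-duplicate grouping with 5-gram shingles and a
# Jaccard threshold of 0.7. One representative per duplicate group is kept.
NUM_PERM = 128  # number of permutations; an assumption, not the pipeline's setting

def minhash_of(content: str) -> MinHash:
    tokens = content.split()
    shingles = {" ".join(tokens[i:i + 5]) for i in range(max(1, len(tokens) - 4))}
    m = MinHash(num_perm=NUM_PERM)
    for shingle in shingles:
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(files: dict) -> list:
    """files: mapping file_id -> content; returns the ids of the files kept."""
    lsh = MinHashLSH(threshold=0.7, num_perm=NUM_PERM)
    kept = []
    for file_id, content in files.items():
        m = minhash_of(content)
        if lsh.query(m):           # a near-duplicate has already been kept
            continue
        lsh.insert(file_id, m)
        kept.append(file_id)
    return kept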
In contrast\n13\n Under review as submission to TMLR\nTable 3: Overview of the data processing steps applied to each data source.\nDataset\nDedup\nMalicious Code\nDecontaminate\nOpt-out\nPII\nSource Code\nYes\nYes\nYes\nYes\nStarPII\nPull Requests\nYes\nYes\nYes\nYes\nStarPII + Usernames\nJupyter/Kaggle Notebooks\nYes\nYes\nYes\nYes/No\nStarPII\nIssues\nYes\nYes\nYes\nYes\nStarPII + Usernames\nDocs\nYes\nNo\nNo\nNo\nStarPII: Names, Keys, Emails\nLHQ\nNo\nNo\nNo\nNo\nNo\nArxiv\nNo\nNo\nNo\nNo\nEmail\nOpenWebMath\nNo\nNo\nYes\nNo\nStarPII: Names, Keys, Emails\nWikipedia\nNo\nNo\nNo\nNo\nNo\nStackExchange\nNo\nNo\nYes\nNo\nStarPII + Usernames\nto the first iteration of StarCoder (Li et al., 2023), we further enhance the recall of the decontamination\nprocess by removing whitespace during string matching. Note that we exclude docs, LHQ, arXiv, and\nWikipedia from this decontamination step.\n3.4\nMalware Removal\nWe scan our training set to identify possible instances of malware in the source code, pull requests, notebooks,\nand issues. To this end, we use ClamAV 1.2 (ClamAV, 2024) with additional unofficial malware signatures\npublished by SaneSecurity (Sane Security, 2024) as of 2023-11-16. Signatures with a high risk of False\nPositives (as determined by SaneSecurity) were not used. See Table 26 for the most frequently detected\nmalware signatures in the unfiltered code dataset. In summary, this step eliminates 59,442 files from the\ndataset, constituting only 0.009% of the 654M files.\n3.5\nRemoving Opt-outs\nWe announced the upcoming training run of StarCoder2 on X21 and updated the \"Am I in the stack\"\ngovernance tool with the new repositories from The Stack v2. Developers were granted until November 20,\n2023, to submit their opt-out requests. After the cut-off date, we eliminated 1,561 repositories associated\nwith 91 users and organizations. A total of 22,066 files were removed from the source code dataset (excluding\nissues and PRs).\n4\nData Composition\nModel capacity\nWith a much larger training set available, we decided to tailor our data composition to\neach model size. We reason that smaller models, having limited capacity, should be exposed to a less diverse\ndataset. This intuition is supported by research in multi-lingual NLP showing that languages compete for\nmodel capacity (Arivazhagan et al., 2019; Conneau et al., 2020; Scao et al., 2022b). Hence, we first create a\nsmaller version of the SWH code dataset, selecting a subset of 17 widely-used programming languages. We\nuse this variant to train the 3B and 7B models, whereas we use the full version with all 619 programming\nlanguages for the 15B model. To further limit the diversity in the training set for the 3B model, we also\nexclude some natural language datasets (see \u201cData composition per model size\u201d).\nDownsampling languages\nSimilar to StarCoderBase, we adhere to the natural distribution of the data as\nmuch as possible. Before constructing the source code datasets, we examined the data distribution among\nthe programming languages. Compared to StarCoderBase, we found slightly larger variations among the\nhigh-resource languages. The observed data volume (in GB) is as follows: Java (479.68), JavaScript (277.25),\nC++ (204.49), Python (190.99), PHP (171.57), C# (166.22), and C (114.49). We decided to downsample both\nJava and Javascript to 200GB to put these high-resource languages on a more equal footing. 
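As a concrete reading of this downsampling, the sketch below retains each file of an over-represented language with probability target_size / current_size, so that the expected retained volume matches the target (e.g., 200GB for Java and JavaScript). The per-file Bernoulli sampling and the helper names are an illustration of the stated targets, not a description of the exact procedure; the targets for further formats are listed below.

import random

# Illustrative volume-based downsampling: keep a file with probability
# target_gb / total_gb for languages whose volume exceeds their target.
TARGET_GB = {"Java": 200, "JavaScript": 200}  # further targets are given below

def keep_file(language: str, language_total_gb: float, rng: random.Random) -> bool:
    target = TARGET_GB.get(language)
    if target is None or language_total_gb <= target:
        return True  # language is not downsampled
    return rng.random() < target / language_total_gb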
Furthermore, we\n21https://x.com/BigCodeProject/status/1721583097580249254?s=20\n14\n Under review as submission to TMLR\nTable 4: Overview of the data composition of StarCoder2 models. We refer to the training set of the 3B\nmodel as the-stack-v2-train-3B.\nDataset\nTokens (B)\n3B\n7B\n15B\nthe-stack-v2-train-smol\n525.5\n\u2713\n\u2713\n\u2717\nthe-stack-v2-train-full\n775.48\n\u2717\n\u2717\n\u2713\nPull requests\n19.54\n\u2713\n\u2713\n\u2713\nthe-stack-v2-train-extras\nIssues\n11.06\n\u2713\n\u2713\n\u2713\nJupyter structured\n14.74\n\u2713\n\u2713\n\u2713\nJupyter scripts\n16.29\n\u2713\n\u2713\n\u2713\nKaggle scripts\n1.68\n\u2713\n\u2713\n\u2713\nDocumentation\n1.6\n\u2713\n\u2713\n\u2713\nOpenWebMath\n14.42\n\u2717\n\u2713\n\u2713\nWikipedia\n6.12\n\u2717\n\u2713\n\u2713\nStackOverflow\n10.26\n\u2713\n\u2713\n\u2713\nArxiv\n30.26\n\u2717\n\u2713\n\u2713\nLHQ\n5.78\n\u2713\n\u2713\n\u2713\nIntermediate Repr.\n6\n\u2713\n\u2713\n\u2713\nUnique tokens (B)\n622.09\n658.58\n913.23\npreserved 254GB of markdown data while reducing the size of HTML to 100 GB. This decision was driven by\nthe anticipation that markdown would likely contain more code documentation, whereas HTML is commonly\nassociated with webpages. Lastly, we subsampled data files like JSON, XML, and YAML to 8GB and a few\nother data formats to 1 GB. See Table 28 in Appendix C.2 for the full list of subsampled languages.\nRepository-context\nAfter subsampling some programming languages, we compile the source code from\nSoftware Heritage into repository-context-aware datasets. Each example in the dataset is a full repository\nwith files arranged in a random order. As previously noted, we create two versions of the SWH dataset,\nthe-stack-v2-train-smol and the-stack-v2-train-full, as further detailed in the subsequent paragraphs.\nThe-stack-v2-train-smol\nFor the small variant, we select 17 widely used programming languages and\ninclude a curated set of documentation and configuration languages.\n\u2022 Specifically, we include the following programming languages:\n\u2013 C\n\u2013 C#\n\u2013 C++\n\u2013 Go\n\u2013 Java\n\u2013 JavaScript\n\u2013 Kotlin\n\u2013 Lua\n\u2013 PHP\n\u2013 Python\n\u2013 R\n\u2013 Ruby\n\u2013 Rust\n\u2013 SQL\n\u2013 Shell\n\u2013 Swift\n\u2013 TypeScript\n\u2022 And incorporate the following languages associated with code documentation:\n\u2013 AsciiDoc\n\u2013 HTML\n\u2013 Markdown\n\u2013 RDoc\n\u2013 RMarkdown\n\u2013 Text\n\u2013 reStructuredText\n\u2022 We also include several configuration languages and files, which we list in Appendix C.1.\n\u2022 Despite limiting the languages to this subset, we obtain a dataset of 525B+ unique tokens.\nThe-stack-v2-train-full\nFor the full variant, we include all 619 programming languages. Although this\nsubset significantly enhances language diversity (adding 600+ programming languages), it contributes only\naround 250B tokens to the dataset, culminating in 775B+ tokens.\n15\n Under review as submission to TMLR\nData composition per model size\nIn Table 4, we summarize the data composition for the 3B, 7B,\nand 15B models. We use the-stack-v2-train-extras to denote all supplementary sources gathered for\nStarCoder2, excluding the source code obtained from SWH. For the 3B, we use the-stack-v2-train-smol\nand exclude OpenWebMath, Wikipedia, and Arxiv from the extra data sources in \u00a72. This leads to a dataset\nof 622B+ unique tokens. 
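As a brief illustration of the repository-context datasets described above, the sketch below groups files by repository and shuffles their order within each repository, so that every training example is a full repository with files arranged randomly. The record layout is an assumption, and token counting, subsampling, and serialization are omitted.

import random

# Sketch of repository-context grouping: one training example per repository,
# with the repository's files in a random order.
def build_repo_examples(files: list, seed: int = 0) -> list:
    """files: list of {"repo": ..., "path": ..., "content": ...} records."""
    rng = random.Random(seed)
    by_repo = {}
    for record in files:
        by_repo.setdefault(record["repo"], []).append(record)
    examples = []
    for repo, repo_files in by_repo.items():
        rng.shuffle(repo_files)  # random file order within the repository
        examples.append({"repo": repo, "files": repo_files})
    return examples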
For the 7B, we include OpenWebMath, Wikipedia, and Arxiv, leading to a slightly\nlarger dataset of 658B+ unique tokens. For the 15B, we include the-stack-v2-train-full dataset and all\nextra data sources listed in \u00a72, resulting in a dataset with 913B+ unique tokens. The size of this dataset is\n4\u00d7 the size of the training dataset for StarCoderBase.\n5\nData Formatting\nWe present the formatting guidelines for each of the data sources below. We provide the templates below\nin which \u27e8token\u27e9 refers to a sentinel token, and metadata and data refer to placeholders for data fields,\nrespectively.\n5.1\nSource Code\nWe prepend the repository name and file paths to the context of the code file. We only add this metadata\nwith a 50% probability to enable the model to operate without this information. We use the following format\nwhen adding the repository name and file paths:\nreponamefilepath1\\ncode1filepath2\\ncode2 ... <|endoftext|>.\nWe use the following format when we do not include this meta-data:\ncode1code2 ... <|endoftext|>.\nRepository-context\nStarcoder1 was trained with file-context, i.e., the setting where random files are\njoined into the context window. In this work, we explore training with repository-context, wherein files from\nthe same repository are grouped together. While we considered various methods for grouping files within the\nrepository, we ultimately arranged them in a random order within the same repository.\nFIM\nTo enable the model to perform code infilling tasks, we apply the fill-in-the-middle transformation (FIM;\nBavarian et al., 2022) to the source code. While we explored several FIM variants in preliminary experiments,\nwe opted for repo-context file-level FIM in the StarCoder2 models. In this FIM variant, repositories are\nselected with a 50% chance of being candidates for FIM. The selected repository examples are split by\n<|endoftext|> and tokens. Next, we apply the FIM transformation to each chunk with a 50%\nprobability. We do not apply FIM to the repository metadata (reponame). Below, we provide\nan example of the FIM format when it\u2019s only applied to the second source file:\nreponamefilepath0\\ncode0filepath1\\n\ncode1_precode1_sufcode1_mid ...<|endoftext|>\n5.2\nPull Requests\nFormatting pull requests is challenging as we aim to create a compact representation of a potentially long\nsequence of code changes and comments. We refer to \u00a72.3 for details on how we removed and truncated long\ninput fields of the pull request. Here, we focus on how to render the PR into a structured format that can be\nconsumed by the LLM.\nFor files part of the base commit, we include the entire file with 0.2 probability; otherwise, we display a range\nof changes in the base files across all commit heads of the PR.22 We randomly add up to 32 lines before and\nafter the changes.\n22We take the union of file line changes in all commits\n16\n Under review as submission to TMLR\nWe use diff hunks to display modifications between the before and after state of the file, ensuring that changes\nare reasonably localized. Additionally, within the diff hunks, we incorporate 3-10 randomly selected context\nlines both before and after the specific change.\nWe structure the PR format as follows. The first block presents the title, description, and complete base files\nor modifications made to them. 
Subsequently, we outline the first set of head diff hunks:\nTitle: title\\nusername_0: description\nopened\nreponame\n\nfilepath_1\nfile_content/changes_1\n...\nfilepath_N\nfile_content/changes_N\n\nfilepath_1\ndiff_hunk_1\n...\ndiff_hunk_K\n...\nfilepath_M\ndiff_hunk_1\n...\ndiff_hunk_J\nThe second block is repeated for each new head commit in the PR, covering general comments, review\ncomments, and code review comments. The block concludes with the diff hunks between the pull request\nbase and the new head, reflecting the outcome of discussions and comments. Note that it\u2019s also possible\nfor users to close and reopen the pull request. As in Github issues, we refer to authors by their participant\ncounter within the conversation, e.g., username_1, to refer to the second participant in the issue.\nusername_id: comment\ncomment_id\n...\n...\n...\nusername_id: review_comment\\n\nreview_id\n[approved, rejected, commented, changes_required]\n...\n...\n...\n\ncomment_id\nreview_id (opt)\ncomment_id (opt)\nfilepath\nline_number\ndiff_hunk_content\nusername_id: comment\n17\n Under review as submission to TMLR\n...\n...\n...\nusername_id\nclosed\nFalse\n...\nTitle: title\\nusername_id: description\n[opened, reopened, edited]\n...\n...\n...\nfilepath_1\ndiff_hunk_1\n...\ndiff_hunk_K\n...\nfilepath_M\ndiff_hunk_1\n...\ndiff_hunk_J\nWe only add the following final block when the PR is closed.\nusername_id\nclosed\nTrue\n<|endoftext|>\n5.3\nGitHub Issues\nWe use sentinel tokens to mark the opening of an issue and subsequently include its title. We separate the\nsequence of comments by a token and include an anonymized speaker identifier before\nthe comment. Specifically, we refer to authors by their participant counter within the conversation, e.g.,\nusername_1, to refer to the second participant in the issue. To distinguish between the different turns,\nwe use comment_1, id1 to refer to the second comment and its anonymized speaker id, respectively. The\n token is added if the issue is closed.\nTitle: title\\nusername_id0: comment_0username_id1: comment_1\n... username_idn: comment_n<|endoftext|>\n5.4\nNotebooks\nJupyter \u2013 scripts\nWe format Jupyter scripts as a single code block, starting with a \ntoken.\ncode<|endoftext|>\nJupyter \u2013 structured\nParsed Jupyter notebooks are chains of text, code, and outputs. We separate the\ncells with sentinel tokens. Note that we use text2, code2, output2 to refer to the 3rd triplet in the notebook.\ntext0code0\noutput0 ... <|endoftext|>\n18\n Under review as submission to TMLR\nKaggle \u2013 scripts\nWhen available, we prepend the associated dataset title and description to Kaggle\nnotebooks (42% of the samples). For 8.6% of the notebooks, we add granular information on the dataset\u2019s\nschema. Below is the format we use:\ntitle\\ndescription\\nKaggle dataset identifier: data_identifier\nimport pandas as pd\\n\\ndf = pd.read_csv(data_path1)\\ndf.info()\ndf_info_output1\nExamples:\\nexample1_1\\n..example1_4\n...\ncode<|endoftext|>\nSome notebooks might load more than one csv file, so we repeat the blocks of data information content for\nall files.\nNote that we introduce a new special token to append the final script of the converted\nKaggle notebook. 
This token helps differentiate the script, which is usually long, from code that follows\n token, typically shorter.\nKaggle \u2013 structured\nStructured Kaggle notebooks are similar to structured Jupyter notebooks, except\nthat they don\u2019t have an output cell, so we only include text and code blocks and keep the tokens used in\nJupyter Notebooks:\ntext0code0 ... <|endoftext|>\n5.5\nStackExchange\nWe concatenate questions and answers in the StackOverflow dataset using a format similar to the GitHub\nissues. We start with the question and then add answers in random order. We include the upvote score\nalongside the answer and, if applicable, denote it as the selected answer. Note that we do not have the title\nof the conversations for the StackExchange dataset.\nusername_id0: question\nusername_id1: answer_1\\nUpvotes: score [selected answer](Optional)\n...\nusername_idn: answer_n\\nUpvotes: score [selected answer](Optional)<|endoftext|>\n5.6\nIntermediate Representations\nWe split 50/50 between translating from source code to intermediate representation (code->intermediate)\nand vice-versa (intermediate->code). Regarding the intermediate representation, we use the size-optimized\nversion 80% of the time and the performance-optimized version 20% of the time. We use separate sentinel\ntokens to indicate the direction of the translation.\ncodeintermediate_representation\nintermediate_representationcode\n6\nModel architecture and training details\nIn this section, we provide all details regarding the model architecture (\u00a76.1), tokenizer (\u00a76.2), training details\n(\u00a76.3), and CO2 emissions during training (\u00a76.4).\n23Estimated with 6ND, where N is the number of parameters and D is the number of training tokens. Includes base and\nlong-context training.\n19\n Under review as submission to TMLR\nTable 5: Overview of the sentinel tokens.\nToken\nDescription\n<|endoftext|>\nend of text/sequence\n\nFIM prefix\n\nFIM middle\n\nFIM suffix\n\nFIM pad\n\nrepository name\n\nfile separator\n\nstart of GitHub issue\n\nstart of GitHub issue comment\n\nGitHub issue closed event\n\nstart of Jupyter notebook\n\nstart of Jupyter text cell\n\nstart of Jupyter code cell\n\nstart of Jupyter output cell\n\nstart of Jupyter script (converted kaggle notebook)\n\noutput cell without content\n\ntranslate source code to intermediate representation\n\ntranslate intermediate representation to source code\n\nstart of pull request\n\nstatus of pull request\n\nwhether pr is merged\n\nstart of list of base files\n\npath of pull request file\n\ncode that is part of the base commit in the PR\n\nstart of a diff\n\ndiff hunk\n\ngeneral comment\n\nGitHub id of review comment or code review comment\n\nstart of review\n\nreview state (e.g. approved, rejected)\n\ncode review comment\n\nGitHub event id of review\n\nGitHub event id of comment\n\nline number of code review comment\n6.1\nModel Architecture\nWe introduce a few architectural changes compared to StarCoderBase. First, we replace learned positional\nembeddings with Rotary Positional Encodings (RoPE; Su et al., 2021), as we confirmed significant performance\ngains in a preliminary ablation study. Following DeepseekCoder (Guo et al., 2024) and Code LLaMA (Rozi\u00e8re\net al., 2023), we use a base period \u03b8 = 1e5. The second architectural modification we make is replacing\nMulti-Query Attention (MQA; Shazeer, 2019) with Grouped Query Attention (Ainslie et al., 2023, GQA;\n). 
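To make the difference concrete, the following NumPy sketch implements grouped-query attention for a single sequence, with each key/value head shared by a group of query heads; shapes and names are illustrative, and the causal mask and RoPE are omitted:

import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim).
    # Each group of n_heads // n_kv_heads query heads attends to the same
    # key/value head. Illustrative sketch: no causal mask, no RoPE.
    n_heads, seq, head_dim = q.shape
    n_kv_heads = k.shape[0]
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)             # (n_heads, seq, head_dim)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                          # (n_heads, seq, head_dim)

# e.g., 48 query heads sharing 4 key/value heads, as in StarCoder2-15B
out = grouped_query_attention(np.random.randn(48, 16, 128),
                              np.random.randn(4, 16, 128),
                              np.random.randn(4, 16, 128))

Setting the number of key/value heads equal to the number of query heads recovers standard multi-head attention, while a single key/value head recovers MQA.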
However, we keep the number of key-value heads relatively low\u20142 for the 3B, 4 for the 7B and 15B\u2014to\nprevent significantly slowing down inference.\nWe summarize all other hyperparameters, such as the number of layers and hidden dimension, in Table 6.\n20\n Under review as submission to TMLR\nTable 6: Model architecture details of the StarCoder2 models.\nParameter\nStarCoder2-3B\nStarCoder2-7B\nStarCoder2-15B\nhidden_dim\n3072\n4608\n6144\nn_heads\n24\n36\n48\nn_kv_heads\n2\n4\n4\nn_layers\n30\n32\n40\nvocab size\n49152\n49152\n49152\nseq_len\nbase-4k/long-16k\nbase-4k/long-16k\nbase-4k/long-16k\npositional encodings\nRoPE\nRoPE\nRoPE\nFLOPs23\n5.94e+22\n1.55e+23\n3.87e+23\nTable 7: Training details of StarCoder2 base models.\nModel\nlearning rate\nRoPE \u03b8\nbatch size\nn iterations\nn tokens\nn epochs\nStarCoder2-3B\n3 \u00d7 10\u22124\n1e5\n2.6M\n1.2M\n3.1T\n4.98\nStarCoder2-7B\n3 \u00d7 10\u22124\n1e5\n3.5M\n1M\n3.5T\n5.31\nStarCoder2-15B\n3 \u00d7 10\u22124\n1e4\n4.1M\n1M\n4.1T\n4.49\n6.2\nTokenizer\nWe follow the procedure of StarCoderBase and train a byte-level Byte-Pair-Encoding tokenizer on a small\nsubset of The Stack v1.24 In our preliminary experiments, we observed that increasing the vocabulary size\nto 100K did not improve performance. Hence, we decided to maintain a vocabulary size of 49,152 tokens,\nincluding the sentinel tokens from Table 5. The pre-tokenization step includes a digit-splitter and the regex\nsplitter from the GPT-2 pre-tokenizer.\n6.3\nTraining Details\nBase models\nThe models were trained with a sequence length of 4,096 using Adam (Kingma & Ba, 2015)\nwith \u03b21 = 0.9, \u03b22 = 0.95, \u03f5 = 10\u22128 and a weight decay of 0.1, without dropout. The learning rate followed a\ncosine decay after a linear warmup of 1,000 iterations. Table 7 details the training hyper-parameters for each\nmodel. RoPE \u03b8 values are different for StarCoder2-15B due to a bug in parsing the training configuration.\nMoreover, StarCoder2-15B was scheduled to train for 1.1M iterations but was early stopped after 1M iterations.\nFollowing Muennighoff et al. (2023), we repeat data for around four to five epochs.\nLong context\nWe further pre-trained each model for long-context on 200B tokens from the same pre-\ntraining corpus, using a 16,384 context length with a sliding window of 4,096, with FlashAttention-2 (Dao\net al., 2022; Dao, 2024). We increase RoPE \u03b8 and use the same configuration for the optimizer. The other\ntraining hyperparameters are provided in Table 8.\n6.4\nCO2 Emissions\nWe provide estimations of the CO2 emission of the StarCoder2 training using the Machine Learning Impact\ncalculator presented in Lacoste et al. (2019). Note that we calculate the CO2 emissions by considering the\ntotal GPU hours of the base-model training. We then extrapolate this number to the long-context fine-tuning\nbased on the number of tokens.\n3B\nThe compute infrastructure provided by ServiceNow had a carbon efficiency of 0.386 kgCO2eq/kWh. 
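The per-model estimates below amount to a simple product of GPU hours, per-GPU power draw, and grid carbon intensity; the following is a minimal sketch of that arithmetic, using the 3B figures reported just below:

def training_emissions_kg(gpu_hours, tdp_watts, kg_co2_per_kwh):
    # Rough estimate: energy drawn at full TDP times grid carbon intensity.
    # Sketch of the arithmetic only; the reported figures come from the
    # Machine Learning Impact calculator (Lacoste et al., 2019).
    energy_kwh = gpu_hours * tdp_watts / 1000.0
    return energy_kwh * kg_co2_per_kwh

# StarCoder2-3B base training: 97,120 A100 hours at 400 W and 0.386 kgCO2eq/kWh
print(training_emissions_kg(97_120, 400, 0.386))   # ~14,995 kgCO2eq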
A\ncumulative of 97,120 hours of computation was performed on hardware of type A100 SXM4 80 GB (TDP of\n24https://huggingface.co/datasets/bigcode/the-stack-march-sample-special-tokens-stripped\n21\n Under review as submission to TMLR\nTable 8: Training details for the long context training of StarCoder2 models.\nModel\nlearning rate\nRoPE \u03b8\nbatch size\nn iterations\nn tokens\nStarCoder2-3B\n3 \u00d7 10\u22125\n1e6\n2.6M\n80k\n200B\nStarCoder2-7B\n2 \u00d7 10\u22125\n1e6\n3.5M\n56k\n200B\nStarCoder2-15B\n3 \u00d7 10\u22125\n1e5\n4.1M\n50k\n200B\nTable 9: Pass@1 on HumanEval(+) and MBPP(+). These results were generated using greedy decoding.\nModel\nHumanEval\nHumanEval+\nMBPP\nMBPP+\nStarCoderBase-3B\n21.3\n17.1\n42.6\n35.8\nDeepSeekCoder-1.3B\n28.7\n23.8\n55.4\n46.9\nStableCode-3B\n28.7\n24.4\n53.1\n43.1\nStarCoder2-3B\n31.7\n27.4\n57.4\n47.4\nStarCoderBase-7B\n30.5\n25.0\n47.4\n39.6\nCodeLlama-7B\n33.5\n25.6\n52.1\n41.6\nDeepSeekCoder-6.7B\n47.6\n39.6\n70.2\n56.6\nStarCoder2-7B\n35.4\n29.9\n54.4\n45.6\nStarCoderBase-15B\n29.3\n25.6\n50.6\n43.6\nCodeLlama-13B\n37.8\n32.3\n62.4\n52.4\nStarCoder2-15B\n46.3\n37.8\n66.2\n53.1\nCodeLlama-34B\n48.2\n44.3\n65.4\n52.4\nDeepSeekCoder-33B\n54.3\n46.3\n73.2\n59.1\n400W). Total emissions are estimated to be 14,995.33 kgCO2eq. The long-context fine-tuning stage adds\n1,111.68 kgCO2eq, resulting in a total of 16,107.01 kgCO2eq.\n7B\nThe compute infrastructure provided by Hugging Face had a carbon efficiency of 0.2925 kgCO2eq/kWh.\nA cumulative of 145,152 hours of computation was performed on hardware of type H100 (TDP of 660W).\nTotal emissions are estimated to be 28,021.6 kgCO2eq. The long-context fine-tuning stage adds 1601.23,\nresulting in a total of 29,622.83 kgCO2eq.\n15B\nThe paper will soon be updated with estimates for the 15B model.\n7\nEvaluation\nWe evaluate the StarCoder2 models on a variety of benchmarks from the literature and compare them to\nrecent state-of-the-art open Code LLMs: StableCode (Pinnaparaju et al., 2024), Code Llama (Rozi\u00e8re et al.,\n2023), DeepSeekCoder (Guo et al., 2024), and original StarCoder (Li et al., 2023). Since StarCoder2 is a base\nmodel, we only compare it with the base models of the model families mentioned above.\nWe group all our comparisons by model sizes. The small models have 3B or fewer parameters, the medium\nmodels have 7B or fewer parameters, and the large models have 15B or fewer parameters. Finally, we include\ntwo extra large models: CodeLlama-34B and DeepSeekCoder-33B. These models are more than twice the\nsize of the large StarCoder2 model. But, as we shall see below, StarCoder2-15B comes close to or even\noutperforms the extra-large models in several benchmarks.\n7.1\nCode Completion\nWe first evaluate the StarCoder2 models on code completion tasks, which have been widely studied in Code\nLLM work.\n22\n Under review as submission to TMLR\n7.1.1\nHumanEval, MBPP, and EvalPlus\nAbout the benchmarks\nHumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are two of the\nmost widely studied benchmarks for Code LLMs. Each benchmark has a few hundred programming problems.\nEach HumanEval problem has a prompt\u2014a function signature and docstring\u2014and a set of hidden unit tests.\nThe prompt for each MBPP problem includes a natural language description followed by a few tests. The\nmodel under evaluation will complete the function given the prompt, and we test that function with the\nhidden unit tests. 
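A minimal sketch of this check, assuming the prompt, the model's completion, and the hidden tests are available as strings (evaluation harnesses typically run this in a sandboxed subprocess with resource limits):

def passes_hidden_tests(prompt, completion, tests):
    # Concatenate the function signature/docstring, the generated body, and
    # the assert-based hidden tests, then execute the result. Illustrative
    # sketch only; real harnesses isolate execution and enforce timeouts.
    program = prompt + completion + "\n\n" + tests
    try:
        exec(program, {"__name__": "__main__"})
    except Exception:
        return False   # a failing assert or any runtime error counts as failure
    return True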
The result is considered a success only if all hidden tests pass.\nRecently, Liu et al. (2023a) identified several issues with both benchmarks. (1) Most problems have inadequate\nhidden tests that cannot detect subtle bugs in solutions (See Listings 1 and 2); and (2) Several problems\nhave wrong test cases and ambiguous descriptions, which unfairly penalize the models that interpret the\nstatements in other reasonable ways (See Listings 2). They introduce the EvalPlus framework to address\nthese problems. The resulting benchmarks (HumanEval+ and MBPP+) have 80\u00d7 and 35\u00d7 more tests than\nthe original benchmarks. For rigorous evaluation, we adopt the EvalPlus framework in this study.\nListing 1: A HumanEval task with insufficient tests\ndef common(l1: list, l2: list) -> list:\n\"\"\"Return sorted unique common elements for 2 lists\"\"\"\ncommon_elems = list(set(l1).intersection(set(l2)))\ncommon_elems.sort()\nreturn list(set(common_elems))\nassert common([4,3,2,8], []) == []\nassert common([5,3,2,8], [3,2]) == [2,3]\n...\n# [Explanation] This solution is wrong as applying set\n# to the sorted common_elems does not preserve the\n# order. Base HumanEval test inputs are too short to\n# easily manifest the flakiness.\nListing 2: An MBPP task with problematic tests\n\"\"\"Write a function to check whether all dictionaries\n,\u2192 in a list are empty or not.\"\"\"\ndef empty_dit(list1): return all(not d for d in list1)\nassert empty_dit([{},{},{}]) == True\nassert empty_dit([{1,2},{},{}]) == True # Wrong test!\nassert empty_dit([{}]) == True\n# [Explanation] First, the second base test is wrong,\n# falsifying any correct solutions. Second, the tests\n# are weak, passing the wrong solution above. The wrong\n# solution mistakingly yields False given [{}, {}, [1]]\n# where we expect True as all dictionaries are empty\n# and the non-empty is an array, not a dictionary.\nHyperparameters\nFollowing recent work on Code LLMs (Rozi\u00e8re et al., 2023; Guo et al., 2024), we use\ngreedy decoding and report the mean pass@1 (mean success rate) for all problems in the benchmark.\nResults\nThe results for HumanEval, MBPP, and their EvalPlus variants are presented in Table 9.25 From\nthe table, we can make the following observations:\n1. StarCoder2-3B is the best-performing small model on all the datasets (HumanEval, MBPP, Hu-\nmanEval+, and MBPP+). The model is significantly better than its predecessor, StarCoderBase-3B,\nexhibiting improvements of 60.2% on HumanEval+ and 32.4% on MBPP+, respectively.\n2. StarCoder2-7B comes in second place of the medium models. DeepSeekCoder-6.7B is stronger,\noutperforming StarCoder2-7B by 32.4% and 24.1% on HumanEval+ and MBPP+, respectively.\nHowever, StarCoder2-7B consistently outperforms all the other medium models, including both\nStarCoderBase-7B and CodeLlama-7B. StarCoder2-7B outperforms StarCoderBase-7B by 19.6%\nand 15.2% on HumanEval+ and MBPP+, respectively. Additionally, it surpasses CodeLlama-7B by\n16.8% and 9.6% on these benchmarks.\n3. StarCoder2-15B is the best-performing large model by a significant margin. For example, it scores\n46.3, whereas CodeLlama-13B scores 37.8 on HumanEval. The results on EvalPlus are also consistent.\nFor example, on HumanEval+, it significantly improves over StarCoderBase-15B and CodeLlama-13B\nby 47.7% and 17.0%, respectively.\n25Note that EvalPlus omits a few ill-formed and noisy problems from the MBPP dataset. 
It uses 399 out of the 427 problems\nfrom the MBPP subset that was sanitized by the original authors (Austin et al., 2021). For HumanEval, we kept all 164 problems\nfrom the original dataset.\n23\n Under review as submission to TMLR\nTable 10: Pass@1 results on MultiPL-E averaged over 50 samples for each problem. All models are evaluated\nat temperature 0.2 and top-p 0.95.\nModel\nC++\nC#\nD\nGo\nJava\nJulia\nJavaScript\nLua\nPHP\nStableCode-3B\n28.4\n14.4\n13.4\n19.3\n27.8\n20.6\n32.0\n17.1\n23.7\nDeepSeekCoder-1.3B 28.3\n21.3\n10.4\n19.1\n29.2\n15.0\n28.3\n19.2\n23.2\nStarCoderBase-3B\n19.4\n13.3\n5.0\n13.3\n19.2\n16.1\n21.3\n18.0\n18.6\nStarCoder2-3B\n27.2\n20.5\n12.6\n23.6\n27.4\n19.9\n35.4\n28.0\n27.6\nCodeLlama-7B\n26.4\n21.0\n11.6\n20.9\n28.2\n25.9\n31.6\n30.4\n25.1\nDeepSeekCoder-6.7B 46.7\n32.9\n18.4\n31.0\n39.7\n31.4\n46.6\n34.2\n32.6\nStarCoderBase-7B\n23.3\n19.3\n8.1\n19.6\n24.4\n21.8\n27.4\n23.4\n22.1\nStarCoder2-7B\n33.6\n20.7\n15.1\n20.2\n29.4\n20.4\n35.4\n30.7\n30.6\nCodeLlama-13B\n37.4\n24.8\n15.5\n26.6\n37.5\n27.9\n39.3\n31.6\n33.9\nStarCoderBase-15B\n30.6\n20.6\n10.0\n21.5\n28.5\n21.1\n31.7\n26.6\n26.8\nStarCoder2-15B\n41.4\n29.2\n23.6\n26.2\n33.9\n33.2\n44.2\n43.8\n39.5\nCodeLlama-34B\n41.4\n30.7\n15.3\n28.7\n40.2\n31.4\n41.7\n37.5\n40.4\nDeepSeekCoder-33B 51.2\n35.3\n17.4\n34.2\n43.8\n32.8\n51.3\n36.5\n41.8\nModel\nPerl\nR\nRuby\nRacket\nRust\nScala\nBash\nSwift\nTypeScript\nStableCode-3B\n9.4\n11.5\n0.8\n7.0\n22.9\n5.9\n8.6\n13.2\n29.6\nDeepSeekCoder-1.3B 12.5\n9.8\n24.6\n9.1\n18.6\n19.6\n9.7\n11.0\n27.4\nStarCoderBase-3B\n11.3\n10.1\n4.2\n7.9\n16.3\n16.8\n3.8\n10.0\n22.8\nStarCoder2-3B\n13.6\n14.2\n31.3\n7.8\n24.5\n18.9\n12.3\n25.1\n34.4\nCodeLlama-7B\n16.9\n14.9\n29.5\n11.4\n25.5\n22.8\n9.6\n24.9\n33.4\nDeepSeekCoder-6.7B 30.4\n20.5\n46.2\n17.4\n37.7\n35.2\n22.2\n30.3\n39.5\nStarCoderBase-7B\n15.2\n14.5\n19.6\n11.1\n22.6\n20.9\n7.3\n15.1\n27.5\nStarCoder2-7B\n16.6\n16.7\n28.3\n11.6\n29.6\n19.5\n12.2\n26.1\n36.3\nCodeLlama-13B\n23.4\n14.1\n31.9\n13.0\n31.0\n29.7\n13.3\n30.1\n40.1\nStarCoderBase-15B\n16.3\n10.2\n17.2\n11.8\n24.5\n28.8\n11.0\n16.7\n32.1\nStarCoder2-15B\n37.2\n19.8\n41.5\n22.4\n38.0\n37.4\n18.9\n34.2\n43.8\nCodeLlama-34B\n28.5\n22.7\n37.8\n16.9\n38.7\n36.7\n16.4\n35.3\n42.1\nDeepSeekCoder-33B 31.0\n20.5\n44.0\n23.4\n43.8\n43.9\n28.7\n35.8\n48.4\n4. StarCoder2-15B is even competitive with models that are more than twice its size. For example,\nStarCoder2-15B outperforms CodeLlama-34B on both MBPP and MBPP+.\nAlthough EvalPlus makes HumanEval and MBPP far more robust, the problems in these benchmarks only\nexercise basic Python built-ins. They do not test them on other programming languages and do not test\nmodels\u2019 knowledge of other Python libraries. We address these limitations in the rest of this subsection with\nmore comprehensive evaluations on code completion.\n7.1.2\nMultiPL-E: Multilingual Code Completion\nAbout the benchmark\nMultiPL-E (Cassano et al., 2023b) uses a suite of lightweight, rule-based compilers\nto translate HumanEval from Python to 18 other programming languages. Thus MultiPL-E is a multi-language\nbenchmark with the same problems translated to different languages.26\nHyperparameters\nWe sample 50 completions per prompt at temperature 0.2 with top-p 0.95. This is how\nMultiPL-E results are reported on the BigCode Models Leaderboard (Ben Allal, 2023).\nResults\nThe results on MultiPL-E appear in Table 10. 
We make the following observations:\n26MultiPL-E makes some small changes to the HumanEval prompts, and a few prompts fail to translate to certain languages.\nWe refer the reader to Cassano et al. (2023b) for more information.\n24\n Under review as submission to TMLR\n1. Across all size classes, there is no single model that is best at every language. Nevertheless, the\nStarCoder2 models perform well as described below.\n2. Of the small models, StarCoder2-3B performs the best on 11/18 programming languages.\n3. Of the medium models, DeepSeekCoder-6.7B performs best.\nStarCoder2-7B does better than\nCodeLlama-7B on most languages.\n4. Of the large models, StarCoder2-15B does the best on 16/18 programming languages. CodeLlama-13B\noutperforms StarCoder2-15B on Go and Java.\n5. StarCoder2-15B meets or exceeds the performance of CodeLlama-34B on 10/18 programming\nlanguages and DeepSeekCoder-33B on four lower-resource languages (D, Julia, Lua, and Perl).\n7.1.3\nDS-1000: Data Science Tasks in Python\nAbout the benchmark\nDS-1000 (Lai et al., 2023) is a widely studied benchmark with 1,000 data science\ntasks in Python. Unlike the HumanEval and MBPP problems that only use the Python standard library,\nDS-1000 exercises seven widely used libraries, from Matplotlib to TensorFlow. Therefore, here we further\nadopt DS-1000 to evaluate the performance of Code LLMs in completing data science tasks with popular\nlibraries.\nHyperparameters\nFollowing Lai et al. (2023), we use temperature 0.2 and top-p 0.95 to generate 40\nsamples per problem, and report mean pass@1.\nResults\nTable 11 reports the results on DS-1000. We make the following observations:\n1. StarCoder2-3B overall is the best-performing small model on DS-1000. Except for PyTorch and\nTensorFlow (where it is slightly worse than StableCode-3B), StarCoder2-3B achieves the best\nperformance on all the other popular libraries.\n2. StarCoder2-7B comes in second place out of the medium models, with a performance similar to\nDeepSeekCoder-6.7B.\n3. StarCoder2-15B is the best-performing large model on DS-1000. It substantially outperforms both\nStarCoderBase-15B and CodeLlama-13B by large margins, and approaches the overall performance\nof CodeLlama-34B.\n7.2\nCode Fixing and Editing\nWhile the above subsection has studied various code completion tasks, Code LLMs can be used in various\nother ways. In this subsection, we focus on studying their capabilities for fixing bugs or editing existing code.\n7.2.1\nHumanEvalFix: Fixing Bugs in Six Programming Languages\nAbout the benchmark\nHumanEvalFix (Muennighoff et al., 2024a) is a benchmark that tests a model\u2019s\nability to identify and fix bugs in code. The benchmark supports six programming languages shown in\nFigure 12. Since it is not a code completion benchmark, most base models do poorly on HumanEvalFix\nwhereas instruction-tuned (Wei et al., 2022; Sanh et al., 2022; Muennighoff et al., 2022b; 2024b) models\nperform better. Thus, we consider the instruction-tuned variants of DeepSeekCoder and CodeLlama in\nour comparison (Guo et al., 2024; Rozi\u00e8re et al., 2023). We also compare with OctoCoder, which is an\ninstruction-tuned version of the initial StarCoder using the CommitPackFT dataset (Muennighoff et al.,\n2024a; Zhuo et al., 2024; Longpre et al., 2023). We benchmarked the default HumanEvalFixTests subvariant;\nhence, there were no docstrings present to guide the model.\n25\n Under review as submission to TMLR\nTable 11: Performance of open-access models on DS-1000. 
Benchmarks are as follows. All models were\nevaluated at temperature 0.2 and top-p 0.95. Scores reflect mean pass@1 accuracy averaged over 40 samples.\nFormat\nModel\nMatplotlib\nNumPy\nPandas\nPyTorch\nSciPy\nScikit-\nLearn\nTensorFlow\nOverall\n# problems:\n155\n220\n291\n68\n106\n115\n45\n1,000\nCompletion\nStarCoderBase-3B\n32.1\n16.8\n5.3\n9.2\n13.2\n10.5\n17.2\n14.2\nCompletion\nStableCode-3B\n42.5\n24.5\n16.2\n15.4\n13.5\n20.2\n27.7\n22.7\nCompletion\nDeepSeekCoder-1.3B\n36.2\n18.8\n9.1\n10.7\n7.9\n13.9\n13.3\n16.2\nCompletion\nStarCoder2-3B\n45.5\n27.7\n16.2\n12.9\n15.8\n30.8\n22.8\n25.0\nCompletion\nStarCoderBase-7B\n38.0\n23.0\n8.2\n13.1\n13.7\n24.5\n14.6\n19.1\nCompletion\nDeepSeekCoder-6.7B\n52.4\n33.0\n20.0\n13.9\n19.8\n29.7\n27.4\n28.9\nCompletion\nCodeLlama-7B\n46.3\n21.6\n13.9\n12.2\n17.5\n16.7\n20.6\n21.5\nCompletion\nStarCoder2-7B\n53.6\n33.3\n16.9\n16.2\n20.6\n22.2\n31.9\n27.8\nCompletion\nStarCoderBase-15B\n47.0\n27.1\n10.1\n19.5\n21.7\n27.0\n20.5\n23.8\nCompletion\nCodeLlama-13B\n49.0\n27.2\n17.4\n12.9\n15.6\n24.0\n24.8\n25.1\nCompletion\nStarCoder2-15B\n60.3\n43.3\n23.2\n11.0\n26.4\n26.0\n36.0\n33.8\nCompletion\nDeepSeekCoder-33B\n56.1\n49.6\n25.8\n36.8\n36.8\n40.0\n46.7\n40.2\nCompletion\nCodeLlama-34B\n50.3\n42.7\n23.0\n25.0\n28.3\n33.9\n40.0\n34.3\nTable 12: Pass@1 performance on HumanEvalFix. StarCoder2 and StarCoderBase are not instruction-tuned\nthus they are at a disadvantage compared to the other models which are all instruction-tuned.\nModel\nPrompt\nPython\nJavaScript\nJava\nGo\nC++\nRust\nAvg.\nStarCoderBase-15B\nInstruct\n12.6\n16.8\n18.9\n12.5\n11.2\n0.6\n12.1\nStarCoderBase-15B\nCommit\n25.6\n29.4\n28.8\n28.7\n28.2\n19.7\n26.7\nCodeLlama-13B-Instruct\nInstruct\n19.4\n18.9\n24.1\n21.6\n10.1\n0.4\n15.8\nCodeLlama-34B-Instruct\nInstruct\n36.5\n28.1\n36.4\n25.7\n25.2\n18.5\n28.4\nDeepSeekCoder-6.7B-Instruct\nInstruct\n44.9\n55.3\n52.2\n42.9\n37.9\n19.5\n42.1\nDeepSeekCoder-33B-Instruct\nInstruct\n47.5\n47.6\n46.5\n52.0\n48.0\n10.2\n42.1\nOctoCoder-15B\nInstruct\n30.4\n28.4\n30.6\n30.2\n26.1\n16.5\n27.0\nStarCoder2-15B\nInstruct\n9.7\n20.7\n24.1\n36.3\n25.6\n15.4\n22.0\nStarCoder2-15B\nIssue\n48.6\n41.6\n48.4\n48.5\n20.7\n24.2\n38.7\nStarCoder2 issues format\nAlthough StarCoder2 is a base model, it is pretrained on GitHub issues and\nStackOverflow discussions using a special format (\u00a75.3). We experiment with prompting the model to fix\ncode bugs in the style of a discussion as follows:\nusername_0: instruction\\n\\n\u2018\u2018\u2018buggy function\u2018\u2018\u2018\\nUpvotes: 100\nusername_1: Sure, here is the fixed code.\\n\\n\u2018\u2018\u2018function start\nIn this template, \u201cinstruction\u201d is the HumanEvalFix instruction telling the model to fix the bug in the code,\n\u201cbuggy function\u201d is the function with a subtle bug, and \u201cfunction start\u201d is the function header including\nimports. The generation of the model is stopped as soon as \u2018\u2018\u2018 is generated. The evaluation code is available\nvia Ben Allal et al. (2022), and we denote this as the \u201cIssue\u201d prompt. We also benchmark StarCoder2 with\nthe same basic \u201cInstruct\u201d prompt used in Muennighoff et al. (2024a).\nHyperparameters: Following (Muennighoff et al., 2024a), we use a temperature of 0.2 to estimate pass@1\nwith 20 samples.\nResults\nUnlike the previous sections, we only evaluate StarCoder2-15B and primarily compare it to\ninstruction-tuned models. 
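Because pass@1 here (as in the other sampling-based evaluations) is estimated from multiple samples rather than greedy decoding, we note the standard unbiased estimator of Chen et al. (2021) for reference; the snippet below is a sketch of that formula, not the exact evaluation harness used in this work:

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021):
    # 1 - C(n - c, k) / C(n, k), computed in a numerically stable form.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g., 20 samples for one problem, 7 of which pass the hidden tests
print(pass_at_k(n=20, c=7, k=1))   # 0.35, i.e. the per-problem success rate

For k = 1 this reduces to the fraction of passing samples, averaged over all problems.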
The results are in Table 12 (with best-performing models highlighted in bold and\nsecond-best underscored), and we make the following conclusions:\n26\n Under review as submission to TMLR\n1. The base models (StarCoder2-15B and StarCoderBase-15B) perform very poorly when given an\ninstruction prompt, which motivates using a different prompt format.\n2. Using the Issue prompt described above, StarCoder2-15B performs remarkable well as a base model.\nIt outperforms the instruction-tuned CodeLlama models by a significant margin and nearly reaches\nthe performance of the instruction-tuned DeepSeekCoder models.\n3. Using the Issue prompt for StarCoder2-15B leads to a larger increase in performance than using the\nCommit prompt for StarCoderBase-15B. This indicates that pre-training on pull requests (StarCoder2)\nis a viable alternative to pre-training on commits (StarCoderBase).\n4. Using the Issue prompt, StarCoder2-15B also outperforms all other open models presented in\nMuennighoff et al. (2024a).\n5. StarCoder2-15B underperforms on C++ when using the Issue prompt, which hurts its overall\nperformance. Our investigation shows that this is mainly because one-third of the code generated\nis incomplete, e.g., having an unexpected break immediately after the beginning of a for loop.\nAdditional prompt engineering may be necessary to fix this. Thus, we still see value in instruction\ntuning StarCoder2 to further improve its usability in handling similar scenarios more effectively\nwithout prompt engineering. We leave the instruction tuning or even preference alignment (Christiano\net al., 2017; Ethayarajh et al., 2024) of StarCoder2 to future work.\n7.2.2\nCode Editing\nAbout the benchmark\nCanItEdit (Cassano et al., 2024) is a hand-crafted benchmark designed to evaluate\nmodel performance in Python code editing tasks. Each problem consists of a code snippet accompanied by\nan instruction of two types: descriptive or lazy. Descriptive instructions are systematic and provide detailed\ninformation, whereas lazy instructions are brief, direct, and mimic the typical instructions humans provide\nto code completion models. The goal is to modify the code according to the instruction; both lazy and\ndescriptive instructions should lead to the same edit. The accuracy of each modification is assessed using a\nhidden test suite, and pass@1 is reported. The benchmark encompasses a variety of problems, from simple\nsingle-function, single-line edits to intricate multi-class problems requiring multiple-line edits in separate\nlocations. 
Some tasks demand domain-specific knowledge like mathematics, and successful completion of a\nproblem often requires the model to understand the connections between the components of the program.\nListing 3 shows an abbreviated27 sample problem from CanItEdit with its lazy instruction.\nListing 3: Abbreviated sample problem from CanItEdit\n-class C4(nn.Module):\n+class C8(nn.Module):\n-\n\"\"\"Represents the C4 class of group theory,\n+\n\"\"\"Represents the C8 class of group theory,\nwhere each element represents a discrete rotation.\"\"\"\ndef __init__(self):\nsuper().__init__()\ndef elements(self):\n\"\"\"Returns all the elements of this group\"\"\"\n-\nreturn torch.tensor([0., np.pi/2, np.pi, 3*np.pi/2])\n+\nd = np.pi / 4\n+\nreturn torch.tensor([0., d, d*2, d*3, d*4, d*5, d*6, d*7])\nCode Editing Instruction: Edit the C4 class and its methods\nto represent the C8 group.\n27The original problem includes additional methods to edit in the C4 class and a descriptive instruction.\n27\n Under review as submission to TMLR\nTable 13: Performance of instructional code editing on the CanItEdit benchmark (Cassano et al., 2024).\nThe results for non-StarCoder2 models are from the benchmark paper.\nModel\nFormat\nDescriptive Instructions\nLazy Instructions\nPass@1\nStarCoderBase-3B\nCommit\n19.62\n12.78\nStarCoder2-3B\nIssue\n21.68\n15.91\nDeepSeekCoder-Instruct-1.3B\nInstruct\n25.83\n18.33\nStarCoder2-7B\nIssue\n35.23\n18.55\nCodeLlama-Instruct-7B\nInstruct\n33.89\n27.04\nStarCoderBase-7B\nCommit\n40.64\n25.83\nDeepSeekCoder-Instruct-6.7B\nInstruct\n33.89\n33.61\nCodeLlama-Instruct-13B\nInstruct\n28.33\n20.19\nOctoCoder-15B\nInstruct\n31.46\n25.69\nStarCoderBase-15B\nCommit\n38.24\n26.38\nStarCoder2-15B\nIssue\n43.08\n38.45\nCodeLlama-Instruct-34B\nInstruct\n35.0\n26.76\nDeepSeekCoder-Instruct-33B\nInstruct\n53.06\n43.89\nHyperparameters\nWe evaluate all sizes of StarCoder2 on the CanItEdit benchmark using the Issue prompt\nformat (introduced in \u00a77.2.1) and compare its performance with other models previously assessed on this\nbenchmark. Following Cassano et al. (2024), we employ random sampling with a temperature of 0.2 and a\ntop-p of 0.95, with 100 completions per problem.\nResults\nThe results appear in Table 13. As described in \u00a77.2.1, we use an \u201cIssue\u201d prompt and \u201cCommit\u201d\nprompt for the StarCoder2 and StarCoderBase models since they are not instruction-tuned. For all the other\nmodels, we use instruction-tuned versions. From the table, we make the following observations:\n1. Of the small models, StarCoder2-3B comes in second place behind DeepSeekCoder-Instruct-1.3B.\n2. Of the medium models, StarCoder2-7B and DeepSeekCoder-Instruct-6.7B each performs best at\ndescriptive and lazy instructions respectively.\n3. StarCoder2-15B is the best-performing large model by a significant margin.\n4. StarCoder2-15B outperforms CodeLlama-Instruct-34B as well.\nThese results give further evidence that the StarCoder2 \u201cIssue\u201d format is a viable alternative to the\nStarCoderBase \u201cCommit\u201d format.\n7.3\nMath Reasoning\nAbout the benchmark\nWe use the widely studied GSM8K benchmark (Cobbe et al., 2021), a set of\nmiddle-school math problems, to evaluate the mathematical reasoning capabilities of the models. We use the\nPAL approach proposed by Gao et al. 
(2023): the model is prompted to generate a Python program, which is\nexecuted to produce the answer to the problem.\nHyperparameters\nWe evaluate models with greedy decoding in an 8-shot setting following Chowdhery\net al. (2023).\nResults\nThe results on GSM8K with PAL appear in Table 14 and we make the following observations:\n1. StableCode-3B is the best-performing small model. StarCoder2-3B is in second place.\n28\n Under review as submission to TMLR\nTable 14: 8-shot accuracy on the GSM8K math-reasoning benchmark.\nModel\nGSM8K (PAL)\nStarCoderBase-3B\n8.0\nDeepSeekCoder-1.3B\n12.6\nStableCode-3B\n39.7\nStarCoder2-3B\n27.7\nStarCoderBase-7B\n14.9\nDeepSeekCoder-6.7B\n41.9\nCodeLlama-7B\n27.0\nStarCoder2-7B\n40.4\nStarCoderBase-15B\n21.5\nCodeLlama-13B\n38.1\nStarCoder2-15B\n65.1\nCodeLlama-34B\n54.2\nDeepSeekCoder-33B\n58.7\n2. StarCoder2-7B comes second place. Its performance is very close to the first-place model, which is\nDeepSeekCoder-6.7B, while substantially outperforming both CodeLlama-7B and StarCoderBase-7B.\n3. StarCoder2-15B significantly outperforms all large models, including both CodeLlama-13B and\nStarCoderBase-15B.\n4. In fact, StarCoder2-15B even outperforms CodeLlama-34B and DeepSeekCoder-33B which are more\nthan twice its size.\n7.4\nCRUXEval: Code Reasoning, Understanding, and Execution\nAbout the benchmark CRUXEval (Gu et al., 2024) is a two-part benchmark consisting of 800 samples\ndesigned to evaluate code reasoning, understanding, and execution. In the first task, CRUXEval-I, the model\nis asked to predict any input such that executing a given Python function on that input produces a given\noutput. In the second task, CRUXEval-O, the model is asked to simulate the execution of a given function on\nan input and predict an output. Two samples are shown below in Listings 4 and 5. The functions and inputs\nof the benchmark were generated by CodeLlama-34B and then filtered to remove complicated functions such\nas those requiring complex arithmetic or a large number of execution steps.\nListing 4: Sample CRUXEval Problem 1\ndef f(string):\nstring_x = string.rstrip(\"a\")\nstring = string_x.rstrip(\"e\")\nreturn string\n# output prediction, CRUXEval-O\nassert f(\"xxxxaaee\") == ??\n# input prediction, CRUXEval-I\nassert f(??) == \"xxxxaa\"\nListing 5: Sample CRUXEval Problem 2\ndef f(nums):\ncount = len(nums)\nfor i in range(-count+1, 0):\nnums.append(nums[i])\nreturn nums\n# output prediction, CRUXEval-O\nassert f([2, 6, 1, 3, 1]) == ??\n# input prediction, CRUXEval-I\nassert f(??) == [2, 6, 1, 3, 1, 6, 3, 6, 6]\nHyperparameters Following (Gu et al., 2024), we use temperature 0.2 to report pass@1 and temperature\n0.8 to report pass@5, both using 10 samples.\nResults We show the pass@1 and pass@5 scores for both tasks in our benchmark in Table 15. In terms of\nerror and standard deviation, the original paper reports two sources of noise. First, the noise due to sampling\nfrom the language model for the given set of 800 candidates is around 0.2% for 10 samples. 
Second, the\n29\n Under review as submission to TMLR\nTable 15: Accuracy on the CRUXEval benchmark.\nModel\nCRUXEval-I\nCRUXEval-O\nPass@1\nPass@5\nPass@1\nPass@5\nStarCoderBase-3B\n27.1\n43.7\n27.4\n40.9\nDeepSeekCoder-1.3B\n27.8\n44.7\n31.0\n43.4\nStableCode-3B\n33.5\n53.3\n26.7\n43.5\nStarCoder2-3B\n32.7\n50.1\n34.2\n48.4\nStarCoderBase-7B\n29.7\n47.3\n32.2\n44.9\nCodeLlama-7B\n35.9\n52.9\n34.2\n48.4\nDeepSeekCoder-6.7B\n41.9\n62.7\n43.5\n54.8\nStarCoder2-7B\n34.6\n53.5\n36.0\n52.0\nStarCoderBase-15B\n31.3\n49.2\n34.2\n47.1\nCodeLlama-13B\n42.5\n62.0\n39.7\n53.9\nStarCoder2-15B\n48.1\n66.9\n47.1\n59.5\nCodeLlama-34B\n47.2\n66.6\n42.4\n55.9\nDeepSeekCoder-33B\n46.5\n64.9\n48.6\n61.6\nprecise samples in the benchmark were chosen from a larger set of samples, and the noise from choosing\nwhich samples to include in the benchmark when using 800 samples is about 1.5%. We make the following\nobservations:\n1. StarCoder2-3B performs competitively with other small models. It slightly underperforms StableCode-\n3B on CRUXEval-I (but within the noise margin of error) but beats all other small models on\nCRUXEval-O.\n2. For both tasks, StarCoder2-7B performs on par with CodeLlama-7B but lags significantly behind\nDeepSeekCoder-6.7B.\n3. StarCoder2-15B is the best-performing large model. It surpasses CodeLlama-13B and drastically\nimproves upon StarCoderBase-15B on both CRUXEval-I and CRUXEval-O.\n4. StarCoder2-15B performs on par with the extra-large models. On CRUXEval-I, it outperforms\nboth CodeLlama-34B and DeepSeekCoder-33B but within standard deviation. On CRUXEval-O, it\nsignificantly outperforms CodeLlama-34B and slightly underperforms DeepSeekCoder-33B.\n7.5\nFill-in-the-Middle\nAbout the benchmark\nStarCoder2 supports fill-in-the-middle (FIM), which is the ability to complete an\narbitrary span of code conditioned on both text before and after the insertion point. We use the benchmark\nfrom Ben Allal et al. (2023), which tests the ability of models to fill in a single line of code in Python,\nJavaScript, and Java solutions to HumanEval.\nHyperparameters\nFollowing Ben Allal et al. (2023), we sample 20 completions per example at temperature\n0.2 and top-p 0.95 and report the mean exact match, as done\nResults\nThe results appear in Table 16. We observe that StarCoder2-3B performs as well as StarCoderBase-\n15B on this FIM benchmark. Unfortunately, StarCoder2-15B underperforms on FIM. Due to an implementa-\ntion bug, the FIM-rate was smaller than intended for most of the training.\n30\n Under review as submission to TMLR\nTable 16: Exact-match on FIM-task (Ben Allal et al., 2023). Due to an implementation bug, FIM was\nincorrect for most of the training of StarCoder2-15B. CodeLlama results are from Rozi\u00e8re et al. (2023).\nModel\nJava\nJavaScript\nPython\nStableCode-3B\n63.7\n73.3\n59.1\nStarCoder2-3B\n75.0\n73.0\n59.1\nStarCoder2-7B\n81.1\n77.5\n61.1\nCodeLlama-13B\n80.0\n85.0\n74.5\nStarCoderBase-15B\n73\n74\n62\nStarCoder2-15B\n60.5\n54.7\n48.4\n7.6\nRepository-Level Code Completion Evaluation\nCode completion in practice often occurs within the context of a repository rather than in isolated files.\nLeveraging repository-level context for code completion is thus essential for models to perform well in real-\nworld scenarios. 
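StarCoder2 already sees such context during pre-training, where each repository is rendered as a single example (§5.1). The sketch below shows one plausible rendering; the sentinel token strings are assumed from the released tokenizer (see Table 5), and the helper is illustrative rather than the exact training pipeline:

import random

def repo_example(repo_name, files, seed=0):
    # files: dict mapping file paths to file contents for one repository.
    # Files are concatenated in random order; per Section 5.1, the repository
    # name and file paths are only prepended with 50% probability in training.
    # Sentinel token strings (<repo_name>, <file_sep>, <|endoftext|>) are
    # assumed from the released tokenizer.
    rng = random.Random(seed)
    paths = list(files)
    rng.shuffle(paths)
    body = "".join("<file_sep>" + p + "\n" + files[p] for p in paths)
    return "<repo_name>" + repo_name + body + "<|endoftext|>"

example = repo_example("octo-org/hello", {"hello.py": "print('hi')\n",
                                          "README.md": "# hello\n"})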
We evaluate models on repository-level code completion with two benchmarks: RepoBench\n(Liu et al., 2023b) and CrossCodeEval (Ding et al., 2023).\n7.6.1\nRepoBench\nAbout the benchmark\nRepoBench (Liu et al., 2023b) is a live benchmark designed for evaluating code\ncompletion at the repository level, with a focus on next-line prediction. In this work, we use the latest\nversion (v1.1) of RepoBench28,29, which sources its data from GitHub repositories created from October 6th\nto December 31st, 2023, and takes steps to avoid data leakage by removing duplicates against The Stack\nv2. Our evaluation includes five levels\u20142k, 4k, 8k, 12k, and 16k\u2014across three settings: cross-file-first,\ncross-file-random, and in-file, with each setting comprising 5,000 data points (1,000 per level). We\nreport the average edit similarity, exact match, and CodeBLEU (Ren et al., 2020) scores for the three settings.\nHyperparameters\nFollowing prior work on Code LLMs (Chen et al., 2021), we set the generation\ntemperature to 0.2 and the top-p sampling parameter to 0.95 for all models under evaluation. We constrained\nthe models to generate a maximum of 128 new tokens per prompt, and the first non-empty and non-comment\nline of the output was selected as the prediction. While StarCoder2 uses special tokens for repository-\nlevel training, we ensured uniformity in prompt construction across all models by following the official\nimplementation in line with Liu et al. (2023b). The maximum token count for prompts was set to 15,800 by\ntruncating excess cross-file context, except for StarCoderBase, which was constrained to 7,800 tokens due to\nits maximum sequence length limit of 8k.\nResults\nTable 17 presents the performance of open-access models on RepoBench v1.1. We observe that:\n1. StarCoder2, with repository-level training, consistently outperforms StarCoderBase, across all\nevaluated model sizes.\n2. StarCoder2-3B demonstrates notable performance among the smaller models, ranking as the second-\nbest one following StableCode-3B.\n3. StarCoder2-7B achieves competitive performance closely matching that of CodeLlama-7B among the\nmedium models, with DeepSeekCoder-6.7B achieving the leading performance metrics.\n4. 
StarCoder2-15B not only largely outperforms CodeLlama-13B but also showcases comparable, and\nin some metrics superior, performance against the significantly larger CodeLlama-34B model.\n28https://huggingface.co/datasets/tianyang/repobench_python_v1.1\n29https://huggingface.co/datasets/tianyang/repobench_java_v1.1\n31\n Under review as submission to TMLR\nTable 17: Average exact match (EM), edit similarity (ES), and CodeBLEU (CB) scores for open-access base\nmodels on RepoBench v1.1 (Liu et al., 2023b).\nModel\nPython\nJava\nEM\nES\nCB\nEM\nES\nCB\nStarCoderBase-3B\n29.99\n69.37\n36.77\n36.01\n74.18\n45.30\nDeepSeekCoder-1.3B\n31.02\n70.07\n37.88\n37.75\n75.66\n46.69\nStableCode-3B\n34.48\n71.79\n40.43\n40.13\n76.56\n49.00\nStarCoder2-3B\n32.47\n71.19\n39.25\n38.46\n76.53\n47.96\nStarCoderBase-7B\n32.70\n71.08\n39.48\n37.97\n75.66\n47.47\nCodeLlama-7B\n33.85\n71.79\n40.47\n39.61\n76.71\n48.92\nDeepSeekCoder-6.7B\n36.79\n73.85\n42.65\n42.87\n78.93\n51.69\nStarCoder2-7B\n33.72\n72.07\n40.34\n39.84\n77.23\n48.96\nStarCoderBase-15B\n33.51\n71.64\n40.39\n39.34\n76.24\n48.36\nCodeLlama-13B\n35.50\n72.98\n42.02\n41.27\n77.57\n50.26\nStarCoder2-15B\n36.99\n74.08\n43.25\n42.57\n79.05\n51.45\nCodeLlama-34B\n37.22\n73.77\n43.38\n42.35\n78.22\n50.99\nDeepSeekCoder-33B\n39.25\n75.20\n45.21\n44.59\n79.92\n52.70\n7.6.2\nCrossCodeEval\nAbout the benchmark\nCrossCodeEval (Ding et al., 2023) is a diverse and multilingual benchmark\ndesigned for repository-level code completion. It was constructed from a wide range of real-world, open-\nsourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript,\nand C#. Through careful static analysis methods, CrossCodeEval strictly requires cross-file context for\naccurate code completion. We report results in both Code Match (Edit Similarity) and Identifier Match (F1\nScore) following the definitions in Ding et al. (2023) in all four languages.\nHyperparameters\nWe use a max sequence length of 16k for all models except for StarCoderBase, which\nonly supports 8k. In line with Ding et al. (2023), we use the retrieve-and-generate (RG) method with\nOpenAI\u2019s ada embedding, which was found to perform well in their study. To optimize the usage of the\nextended 16k context, we retrieve a maximum of 100 code segments, each comprising its file path and 10\nlines of code. The maximum cross-file context was set to 12,800 tokens and the max generation token is 50\ntokens following. Consistent with Ding et al. (2023), we use the uniform prompt formatting in the original\nimplementation, with a temperature of 0.2 and top-p of 0.95 for all model generations.\nResults\nTable 18 presents the evaluation results. We found that:\n1. Across almost all dimensions, including model sizes, programming languages, and metrics, StarCoder2\nconsistently outperforms StarCoderBase. This enhancement could likely be attributed to better\npre-training with increased context length and repository-level objectives (Section 5.1).\n2. StarCoder2-15B achieves the state-of-the-art performance compared to models of similar sizes. For\ncertain languages like Java and C#, the performance is better even than models with 2x capacity.\n3. The analysis also reveals significant performance variances in different languages for the same model,\nsimilar to the findings in MultiPL-E (\u00a77.1.2). 
While a model can be strong overall, achieving uniformly\nhigh performance across all programming languages remains challenging, e.g., StarCoder2-15B is\nbehind on TypeScript while StableCode-3B in C# and DeepSeekCoder-34B in Java. The disparity\ncalls for future research on building models that can achieve high performance across diverse range\nof languages in different settings.\n32\n Under review as submission to TMLR\nTable 18: CrossCodeEval (Ding et al., 2023) evaluation results. We report Code Match (Edit Similarity)\nand Identifier Match (F1) results for four languages.\nModel\nPython\nJava\nTypeScript\nC#\nCode ES\nID F1\nCode ES\nID F1\nCode ES\nID F1\nCode ES\nID F1\nStarCoderBase-3B\n69.47\n62.56\n66.43\n59.77\n41.42\n35.26\n70.11\n53.15\nDeepSeekCoder-1.3B\n72.41\n66.76\n65.92\n59.93\n63.59\n56.41\n70.98\n54.84\nStableCode-3B\n76.00\n70.75\n73.19\n67.93\n65.61\n59.61\n61.70\n48.98\nStarCoder2-3B\n73.01\n67.85\n66.31\n61.06\n38.79\n35.17\n70.86\n55.42\nStarCoderBase-7B\n72.24\n65.40\n69.91\n64.12\n44.21\n39.77\n71.93\n55.98\nDeepSeekCoder-6.7B\n77.43\n73.16\n70.60\n66.28\n69.08\n63.61\n74.84\n62.29\nCodeLlama-7B\n74.52\n69.11\n71.49\n65.99\n65.96\n59.46\n71.41\n56.66\nStarCoder2-7B\n74.52\n68.81\n70.75\n65.27\n43.19\n38.84\n72.73\n57.69\nStarCoderBase-15B\n73.43\n66.74\n70.58\n64.66\n45.24\n40.47\n71.77\n55.71\nCodeLlama-13B\n75.88\n70.97\n73.08\n68.29\n67.88\n61.46\n72.73\n59.62\nStarCoder2-15B\n78.72\n74.27\n74.92\n70.45\n48.63\n43.78\n75.38\n62.14\nCodeLlama-34B\n76.34\n71.36\n74.30\n69.45\n68.98\n63.19\n73.96\n60.07\nDeepSeekCoder-33B\n78.78\n74.51\n73.41\n69.02\n70.31\n65.14\n75.04\n63.03\nTable 19: Performance on the \u201cAsleep at the Keyboard\u201d benchmark.\nModel\nValid (\u2191)\nInsecure (\u2193)\nStarCoderBase-3B\n910/1000 (91.0%)\n224/910 (24.6%)\nDeepSeekCoder-1.3B\n893/1000 (89.3%)\n290/893 (32.5%)\nStarCoder2-3B\n925/1000 (92.5%)\n113/900 (12.2%)\nStarCoderBase-7B\n916/1000 (91.6%)\n243/916 (26.5%)\nCodeLlama-7B\n900/1000 (90.0%)\n195/900 (21.7%)\nDeepSeekCoder-6.7B\n921/1000 (92.1%)\n315/921 (34.2%)\nStarCoder2-7B\n912/1000 (91.2%)\n363/926 (39.8%)\nStarCoderBase-15B\n933/1000 (93.3%)\n332/933 (35.6%)\nCodeLlama-13B\n903/1000 (90.3%)\n273/903 (30.2%)\nStarCoder2-15B\n898/1000 (89.8%)\n352/898 (39.2%)\n7.7\n\u201cAsleep at the Keyboard\u201d Security Benchmark\nAbout the benchmark\n\u201cAsleep at the Keyboard\u201d is a benchmark designed for assessing security vulnera-\nbilities in code generation (Pearce et al., 2022). Similar to Li et al. (2023), we focus on the subset of tasks\namenable to automated evaluation, which is the Diversity of Weakness problems. These cover 18 diverse\nvulnerability classes from the MITRE Common Weakness Enumeration (CWE) taxonomy, with scenarios\ndrawn from the 2021 CWE Top 25 Most Dangerous Software Weaknesses list published by MITRE. The\nproblems have 23 scenarios in C and 17 scenarios in Python.\nHyperparameters\nFollowing Li et al. (2023), we set the temperature to 0.2 and top-p to 0.95. Each model\ngenerates 25 samples per scenario, resulting in a total of 1,000 completions.\nResults\nWe report results of selected models in Table 19. Column Valid gives the percentage of solutions\nthat were syntactically valid, and Column Insecure shows the percentage of valid solutions that include the\nvulnerability the scenario tests for. From the table, we draw the following conclusions:\n1. StarCoder2 generates comparable numbers of valid programs to StarCoderBase, CodeLlama, and\nDeepSeekCoder. 
Both StarCoderBase and StarCoder2 models achieve around 90% valid program\n33\n Under review as submission to TMLR\nrate. However, after some manual inspection, we notice that StarCoder2 tends to generate more\nfunctionally correct code than StarCoderBase. The observation is aligned with the evaluation in\nprevious sections.\n2. Except for StarCoder2-3B, StarCoder2-7B and StarCoder2-15B have the highest insecure program\nrate among the models having similar parameters. The high insecure rate might be a side-effect of\nits higher rate of valid and functionally correct completions. These generated programs are more\nlikely to be exposed to potential vulnerabilities, as suggested by Bhatt et al. (2023). Notably, Li\net al. (2023) find that code-cushman-001, the initial model used in commercialized Copilot, has an\ninsecure rate beyond 40%.\n7.8\nMeasuring Harmful Generations\nAbout the benchmarks\nBias in Open-ended Language Generation Dataset (BOLD) (Dhamala et al.,\n2021) is a dataset of 23,679 prompts that measure fairness across Religious Ideology, Procession, Race, Gender,\nand Political Ideology. We use the Regard classifier by Sheng et al. (2019) and average the classifier scores\nacross each category.\nWinoBias (Zhao et al., 2018) measures bias towards gender stereotypes across professions. When given a\nsentence regarding a profession, the gender is masked, and a language model is used to generate the replaced\nmasked token and the perplexity is used to calculate the bias towards male and female pronouns.\nHurtful Sentence Completion in English Language Models (HONEST) (Nozza et al., 2021) is a template-based\ncorpus to assess the hurtfulness of sentence completions generated by the language models. To calculate\nthe HONEST score, we check whether each generated completion has any terms contained in each of the\ncategories within Hurtlex30.\nRealToxicityPrompts (Gehman et al., 2020) consists of 100,000 naturally occurring, sentence-level prompts,\nwhich are extracted from the large web corpus of English text. They can be used to evaluate the risk of\nneural toxic degeneration in the language models. We use a 10,000 subset to perform the evaluation. We use\nthe classifier by Vidgen et al. (2021) to detect toxicity and report the average probability of the detected\ntoxic output as our toxicity score.\nHyperparameters\nFor each prompt in BOLD and RealToxicityPrompts, we generate one completion with\nup to 50 additional tokens. On HONEST, we generate 5 completions for each sample with up to 50 additional\ntokens.\nResults\nThe results for BOLD, WinoBias, HONEST, and RealToxicityPrompts are presented in Tables 20,\n21, 22, and 23, respectively. The tables suggest that our models LLMs that we consider produce roughly the\nsame amount of harmful content, and based on Li et al. (2023), LLMs trained primarily on code produce less\nharmful content than LLMs trained on general web text.\n8\nSearch Index and Attribution Tools\nFollowing the standard set by Li et al. (2023) we build another suite of data inspection, attribution, and\nsearch tools. The NLP community has recognized the need for data inspection and has begun producing\ncomputational documentation artifacts to complement static data descriptions (Piktus et al., 2023b; Marone\n& Van Durme, 2023; Piktus et al., 2023a; Akiki et al., 2023, among others). 
Open science and open data go\nbeyond releasing dumps of datasets.\nMembership checking tools\nThis work collects and constructs a dataset 4 times larger than that used in\nStarCoderBase. Compared to the initial version of The Stack, the version here contains many additional\nnon-code sources (see Table 4). As data sizes increase, it becomes even more important to construct tools that\nallow for accessible and efficient data inspection. We update the \u201cAm I in the Stack\u201d tool with repositories in\n30https://github.com/valeriobasile/hurtlex\n34\n Under review as submission to TMLR\nTable 20: BOLD evaluations of open source code models.\nModel\nCategory\nNegative Score\nNeutral Score\nOther Score\nPositive Score\nReligious Ideology\n0.16\n0.33\n0.13\n0.38\nProfession\n0.07\n0.6\n0.06\n0.27\nStarCoder2-3B\nRace\n0.05\n0.5\n0.05\n0.5\nGender\n0.05\n0.48\n0.05\n0.43\nPolitical Ideology\n0.3\n0.29\n0.18\n0.23\nReligious Ideology\n0.12\n0.32\n0.12\n0.45\nProfession\n0.07\n0.58\n0.06\n0.3\nStarCoderBase-3B\nRace\n0.04\n0.44\n0.05\n0.47\nGender\n0.04\n0.35\n0.05\n0.55\nPolitical Ideology\n0.3\n0.27\n0.18\n0.25\nReligious Ideology\n0.18\n0.25\n0.16\n0.41\nProfession\n0.08\n0.57\n0.06\n0.28\nStableCode-3B\nRace\n0.07\n0.4\n0.06\n0.46\nGender\n0.05\n0.36\n0.06\n0.53\nPolitical Ideology\n0.32\n0.27\n0.18\n0.25\nReligious Ideology\n0.19\n0.81\n0.03\n0.13\nProfession\n0.08\n0.52\n0.07\n0.33\nStarCoder2-7B\nRace\n0.06\n0.4\n0.07\n0.47\nGender\n0.06\n0.37\n0.07\n0.5\nPolitical Ideology\n0.33\n0.22\n0.21\n0.24\nReligious Ideology\n0.16\n0.28\n0.13\n0.43\nProfession\n0.07\n0.56\n0.06\n0.31\nStarCoderBase-7B\nRace\n0.05\n0.41\n0.06\n0.48\nGender\n0.04\n0.33\n0.06\n0.57\nPolitical Ideology\n0.33\n0.23\n0.19\n0.25\nReligious Ideology\n0.16\n0.27\n0.14\n0.43\nProfession\n0.07\n0.58\n0.06\n0.3\nCodeLlama-7B\nRace\n0.06\n0.42\n0.06\n0.46\nGender\n0.05\n0.38\n0.06\n0.5\nPolitical Ideology\n0.3\n0.28\n0.19\n0.24\nReligious Ideology\n0.15\n0.33\n0.13\n0.39\nProfession\n0.07\n0.61\n0.06\n0.27\nDeepSeekCoder-6.7B\nRace\n0.05\n0.46\n0.05\n0.44\nGender\n0.04\n0.34\n0.06\n0.56\nPolitical Ideology\n0.3\n0.28\n0.19\n0.23\nReligious Ideology\n0.21\n0.22\n0.16\n0.42\nProfession\n0.09\n0.51\n0.07\n0.33\nStarCoder2-15B\nRace\n0.07\n0.39\n0.07\n0.47\nGender\n0.05\n0.36\n0.07\n0.53\nPolitical Ideology\n0.25\n0.02\n0.1\n0.09\nReligious Ideology\n0.16\n0.31\n0.13\n0.41\nProfession\n0.07\n0.61\n0.06\n0.26\nStarCoderBase-15B\nRace\n0.06\n0.46\n0.06\n0.43\nGender\n0.04\n0.38\n0.06\n0.53\nPolitical Ideology\n0.32\n0.28\n0.19\n0.22\nReligious Ideology\n0.17\n0.24\n0.14\n0.45\nProfession\n0.07\n0.54\n0.06\n0.33\nCodeLlama-13B\nRace\n0.07\n0.36\n0.07\n0.5\nGender\n0.05\n0.35\n0.06\n0.53\nPolitical Ideology\n0.3\n0.23\n0.19\n0.28\nnew dataset.31 This tool allows for data inspection at the username and repository level. Marone & Van\nDurme (2023) recommend releasing a documentation artifact called a Data Portrait to support lightweight\nmembership inspection. 
Search index
The preceding tools provide lightweight data inspection. However, it may be necessary to perform full-text searches that support fuzzy matching and retrieval. Following StarCoder1 (Li et al., 2023), we build an Elasticsearch index on the source code subset of The Stack v2 and make it available at https://huggingface.co/spaces/bigcode/search-v2.
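For readers who want to reproduce this kind of fuzzy retrieval against their own index, the snippet below shows what a query could look like with the elasticsearch Python client (8.x). The index name ("the-stack-v2"), field name ("content"), and local endpoint are assumptions for illustration; the hosted tool linked above exposes the same functionality through a web UI.

# Hypothetical fuzzy full-text query against a self-hosted Elasticsearch index.
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="the-stack-v2",          # assumed index name
    query={
        "match": {
            "content": {           # assumed field holding file contents
                "query": "def quicksort(arr):",
                "fuzziness": "AUTO",   # tolerate small edits, unlike exact Bloom-filter matching
            }
        }
    },
    size=5,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("path", "<unknown path>"))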
9   Social Impact and Limitations

Social impact and limitations have already been documented in the BigCode project (Kocetkov et al., 2023; Ben Allal et al., 2023; Li et al., 2023; BigCode collaboration et al., 2023). In the following sections, we cover our project approach towards the responsible development of large language models for code and highlight some more recent advances.

9.1   Project Approach

Open-science
StarCoder2 is the output of a community research project. The project is conducted in the spirit of Open Science (Woelfle et al., 2011; Mendez et al., 2020), focused on the responsible development and use of Code LLMs. Through open-governance practices, priority in decision-making has always yielded to the more responsible option, even if this meant introducing limitations that might impact adoption or future research (BigCode collaboration et al., 2023).

Ethical data sourcing
Significant efforts from the BigCode community went into the careful curation, validation, decontamination, malware removal, license filtering, opt-out process, PII removal, structuring, packaging, hosting, licensing, and publishing of a Dataset Card (Project, 2024) for the data used to train StarCoder2, with full transparency about the training data. A significant portion of the training dataset was sourced under license from Software Heritage (Software Heritage, 2024a).

Accelerating research
BigCode's open approach to scientific collaboration (BigCode collaboration et al., 2023), open access model distribution and licensing (BigCode Project, 2023a; Malfa et al., 2023), and openness and disclosures of training data, architectures, and development are essential for the research community to have access to powerful, truly open LLMs, helping to accelerate future research (Groeneveld et al., 2024; Xu et al., 2024; Soldaini et al., 2024; Singh et al., 2024; Üstün et al., 2024; Luukkonen et al., 2023; Woelfle et al., 2011).

Open, but responsible
The BigCode Open RAIL-M license (BigCode Project, 2023a) contains important use restrictions and is accompanied by an FAQ to help guide the responsible deployment and use of the model by downstream users (BigCode Project, 2023b).

Community of practice
BigCode is very much a community of practice, with over 1,200 multi-disciplinary members from more than 60 countries working towards the responsible development of large language models for code (Sholler et al., 2019; Kocetkov et al., 2023; Ben Allal et al., 2023; Li et al., 2023; Muennighoff et al., 2024a; Zhuo et al., 2024). Of these members, 417 were active in the BigCode community collaboration tools between 27 October 2023 and 24 February 2024, the period aligning with StarCoder2 development. There has also been considerable downstream adoption of BigCode outputs, with millions of downloads collectively reported via the Hugging Face API (BigCode, 2024).

Auditable
The StarCoder2 model, pre-training dataset, and supporting artifacts are easily accessible and available to anyone who wishes to conduct an independent audit (Solaiman, 2023; Mökander et al., 2023; BigCode collaboration et al., 2023).

9.2   Advancements in Code LLMs

Governance Card
The BigCode Governance Card (BigCode collaboration et al., 2023) serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public, and to serve as an example of intentional governance (Sholler et al., 2019) of an open research project that future endeavors can leverage to shape their own approach. The first section, Project Structure, covers the project organization, its stated goals and values, its internal decision processes, and its funding and resources. The second section, Data and Model Governance, covers decisions relating to the questions of data subject consent, privacy, and model release.

Archival of software metadata
Software metadata is vital for the classification, curation, and sharing of free and open-source software (FOSS). The source code landscape is very diverse. By generating linked data and referencing source code contributions within the Software Heritage archive from the global community of developers and scientists (Heritage, 2024), there is potential to enable a more ethical data supply chain for training LLMs (Cosmo & Zacchiroli, 2017; Abramatic et al., 2018).

Acceptable ML use
On October 19, 2023, Software Heritage published a statement that defines the acceptable machine learning use of the Software Heritage archive. This is a significant milestone that opens the door for more responsible data sourcing and licensing of AI training data (Software Heritage, 2023).

SoftWare Hash IDentifiers (SWHID)
Software Heritage provides SWHIDs: unique identifiers that are intrinsically bound to the software components and require no central registry, ensuring that a resilient web of knowledge can be built on top of the Software Heritage archive (The SWHID Specification Project, 2024). Downstream developers can also use SWHIDs to support companies that prioritize a "software bill of materials" (SBOM) as a key building block in software security and software supply chain transparency and risk management (Cybersecurity & Infrastructure Security Agency, 2024; Mirakhorli et al., 2024), for example by including the SWHIDs in the SBOM alongside other relevant information such as component names, versions, licenses, and source locations.
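To illustrate what "intrinsically bound, no central registry" means in practice, the sketch below computes a content-level identifier following the Git blob hashing convention that the SWHID specification adopts for content objects. It is only an illustration; the official swh.model package provides the reference implementation, and directory-, revision-, and snapshot-level identifiers involve additional structure not shown here.

# Minimal sketch: computing an intrinsic SWHID for a file's raw contents.
import hashlib

def content_swhid(raw: bytes) -> str:
    # Content identifiers hash a "blob <length>\0" header plus the bytes themselves,
    # so anyone can recompute the identifier without consulting a registry.
    header = b"blob " + str(len(raw)).encode() + b"\0"
    digest = hashlib.sha1(header + raw).hexdigest()
    return f"swh:1:cnt:{digest}"

# Example usage: such identifiers can be listed alongside component names, versions,
# and licenses when assembling an SBOM or documenting training data.
print(content_swhid(b"print('hello world')\n"))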
9.3   Challenges and Risks

Openness and safety risks
Solaiman (2023) explains how the degree of openness in the LLM development process is connected to the potential risks associated with a model release. When systems are developed in a fully closed manner, it is more likely for power to become concentrated among high-resourced organizations, and the small development team may not fully comprehend the impact and long-term consequences of the model being deployed. In addition, closed-development systems are often less auditable by external experts and can impede scientific progress, since researchers cannot build upon each other's work. On the other hand, fully open development allows for community research, democratizes access to the models, and enables audits throughout the whole development process. However, without appropriate guardrails, open LLM development poses a higher risk of misuse, as increased model access also increases the likelihood of harm caused by the model. Even though a released API can be shut down, once the model weights are released, it is nearly impossible to retract them. Discussing and implementing responsible AI practices has, therefore, been front and center during the development of our project's LLMs.

Privacy compliant generated code
It is difficult to correctly identify and classify the different types of PII so that personal data processing, transformations, and flows through code can be evaluated (Tang et al., 2023). Where privacy-relevant methods are invoked in generated code, it will be necessary to check for PII leaks to the internet and for the use of encrypted data and anonymous IDs (Tang & Østvold, 2024). Downstream users are advised to implement additional PII scanning, filtering, cleansing, and mitigation to ensure compliance with their intended use cases (Yang et al., 2023; Albalak et al., 2024).

Security
As with any open scientific research that provides open access to model weights, hyper-parameters, data processing code, training code, training data, and documentation, any actor can run or fine-tune the optimized model with very low computing costs (Governance AI, 2024). Even with the use restrictions set forth within the BigCode Open RAIL-M license, this will not prevent bad actors with malicious intent from attempting to cause harm (Mozes et al., 2023).
For example, code LLMs with API access could be used to create sophisticated polymorphic malware (CrowdStrike, 2024) that would be highly evasive to security products relying on signature-based detection and would be able to bypass measures such as the Anti-Malware Scanning Interface (AMSI) as it eventually executes and runs code (CyberArk, 2024; Gupta et al., 2023).

Societal bias
As has been previously established in evaluations of coding models, code LLMs can generate code with a structure that reflects stereotypes about gender, race, emotion, class, the structure of names, and other characteristics (Chen et al., 2021; Zhuo et al., 2023a). Further evaluation and guardrail mitigations are required in the context of downstream use cases (Huang et al., 2023; Dong et al., 2024).

Representation bias
As discussed in previous sections, there is much more data in the training dataset for popular programming languages like Python and Java than for niche languages like Haskell and Fortran. As such, the model performs better on such high-resource languages, which may reinforce the preference of developers towards using them. Fortunately, there is much ongoing research on how to improve the performance of Code LLMs on low-resource languages (Cassano et al., 2023a; Zhuo et al., 2023b). Furthermore, the predominant natural language in source code and the other datasets used is English, although other languages are also present. As such, the model can generate code snippets provided some non-English context, but the generated code is not guaranteed to work as intended or equally well for all languages. This could limit the model's fairness and effectiveness across different coding tasks and environments (Alyafeai et al., 2024).

Traceability
Using the SWHID to trace software components is not an easy task and will challenge most, if not all, downstream developers. Future development and advancement of tools that make it easier to trace software components will be necessary to enable more transparent and responsible data supply chains (Cosmo et al., 2020).

Job augmentation vs. automation
Code LLMs serve as powerful foundation models that can be fine-tuned to generate high-quality code, documentation, unit tests, text summaries, automation workflows, and more. Chen et al. (2023) find a positive correlation between occupation exposure and wage levels/experience premiums, suggesting higher-paying and experience-intensive jobs may face greater displacement risks from LLM-powered software. Goldman Sachs (2024) suggest that AI has the potential to automate 25% of labor tasks in advanced economies and 10-20% in emerging economies; however, they also state that "those fears should be counterbalanced, since AI has the potential to create new job tasks or categories requiring specialized human expertise". Autor et al. (2022) report that "Roughly 60% of employment in 2018 is found in job titles that did not exist in 1940" and that "augmentation innovations boost occupational labor demand, while automation innovations erode it". Results from the task-based analysis in World Economic Forum (2024) reveal that jobs with the highest potential for automation of tasks by LLMs emphasize routine and repetitive procedures and do not require a high degree of interpersonal communication.
Jobs with the highest potential for augmentation by LLMs emphasize critical thinking and complex problem-solving skills, especially those in science, technology, engineering, and mathematics (STEM) fields. Ziegler et al. (2024) report that the benefits of receiving AI suggestions while coding span the full range of typically investigated aspects of productivity, such as task time, product quality, cognitive load, enjoyment, and learning. Peng et al. (2023), describing a two-year collaboration between Google Core and Google Research (Brain Team), find that the 10k+ Google-internal developers using the code completion setup in their IDE had a code acceptance rate of 25-34%. Yahoo Finance (2024) covered the ServiceNow, Inc. (NYSE: NOW) 2024 Q4 earnings announcement, reporting that the ServiceNow platform's Now Assist skills using text-to-code (ServiceNow, 2024b) and text-to-workflow (ServiceNow, 2024a) LLMs (based on StarCoder) augmented developer productivity and increased the speed of innovation by 52%.

10   Conclusion

We introduced StarCoder2, a family of LLMs designed for code generation, along with The Stack v2, the largest pre-training corpus for Code LLMs, built on the foundations of the Software Heritage archive. The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3 TB (900B+ tokens). Leveraging this new dataset, we trained StarCoder2 models with 3B, 7B, and 15B parameters. Our extensive Code LLM evaluations, assessing code completion, editing, and reasoning capabilities, revealed that StarCoder2-3B and StarCoder2-15B are state-of-the-art models within their respective size classes. By not only releasing the model weights but also ensuring complete transparency regarding the training data, we hope to increase trust in the developed models and empower other engineering teams and scientists to build upon our efforts.

11   Acknowledgements

This work was made possible by Software Heritage, the great library of source code (https://www.softwareheritage.org), and all the developers and scientists who contribute to the open source archives. We thank Joydeep Biswas (UT Austin), Northeastern Research Computing, and NCSA Delta for providing computing resources used for evaluation. Carolyn Jane Anderson and Arjun Guha were partially sponsored by the U.S. National Science Foundation awards SES-2326173 and SES-2326174. Jiawei Liu, Yuxiang Wei, and Lingming Zhang were partially sponsored by the U.S.
National Science Foundation award CCF-2131943.\nFederico Cassano was partly sponsored by Roblox.\nWe thank Jenny Hui, ServiceNow, for her leadership in executing the StarCoder2 Research Collaboration\nAgreement between ServiceNow, Hugging Face, and NVIDIA to enable the training of all 3 models.\nWe thank the extended members of the BigCode community for the ongoing support and for their downstream\ncontributions back to the community.\nWe also thank Hessie Jones and the Privacy Protection Collab that shared insights and lessons learned from\ntheir work in Defining Personal Information and the Remediation Framework during early exploration and\nconsideration of PII redaction.\nEvgenii Zheltonozhskii is supported by the Adams Fellowships Program of the Israel Academy of Sciences\nand Humanities.\n40\n Under review as submission to TMLR\nReferences\nJean-Fran\u00e7ois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli. Building the universal archive of source\ncode. Communications of the ACM, 61(10):29\u201331, 2018. doi: 10.1145/3183558. URL https://cacm.acm.org/\nmagazines/2018/10/231366-building-the-universal-archive-of-source-code/fulltext.\n(cited on\npp. 3 and 37)\nJoshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai.\nGQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor,\nJuan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural\nLanguage Processing, pp. 4895\u20134901, Singapore, December 2023. Association for Computational Linguistics.\ndoi: 10.18653/v1/2023.emnlp-main.298. URL https://aclanthology.org/2023.emnlp-main.298. (cited\non p. 20)\nChristopher Akiki, Giada Pistilli, Margot Mieskes, Matthias Gall\u00e9, Thomas Wolf, Suzana Ilic, and Yacine\nJernite. BigScience: a case study in the social construction of a multilingual large language model. In\nWorkshop on Broadening Research Collaborations 2022, 2022. URL https://openreview.net/forum?id=\n2e346l2PPOm. (cited on p. 2)\nChristopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin,\nand Martin Potthast. Spacerini: Plug-and-play search engines with pyserini and Hugging Face. In Yansong\nFeng and Els Lefever (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language\nProcessing: System Demonstrations, pp. 140\u2013148, Singapore, December 2023. Association for Computational\nLinguistics. doi: 10.18653/v1/2023.emnlp-demo.12. URL https://aclanthology.org/2023.emnlp-demo.12.\n(cited on p. 34)\nAlon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas\nMuennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto,\nand William Yang Wang. A survey on data selection for language models. arXiv preprint, February 2024.\nURL https://arxiv.org/abs/2402.16827. (cited on p. 38)\nZaid Alyafeai, Khalid Almubarak, Ahmed Ashraf, Deema Alnuhait, Saied Alshahrani, Gubran A. Q. Ab-\ndulrahman, Gamil Ahmed, Qais Gawah, Zead Saleh, Mustafa Ghaleb, Yousef Ali, and Maged S. Al-\nShaibani. CIDAR: culturally relevant instruction dataset for Arabic. arXiv preprint, February 2024. URL\nhttps://arxiv.org/abs/2402.03177. (cited on p. 
39)\nNaveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu\nChen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu.\nMassively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint,\nJuly 2019. URL https://arxiv.org/abs/1907.05019. (cited on p. 14)\nArxiv, 2024. URL https://info.arxiv.org/help/bulk_data_s3.html. (cited on p. 12)\nJacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen\nJiang, Carrie Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language\nmodels. arXiv preprint, August 2021. URL https://arxiv.org/abs/2108.07732. (cited on pp. 3 and 23)\nDavid Autor, Caroline Chin, Anna M Salomons, and Bryan Seegmiller. New frontiers: The origins and\ncontent of new work, 1940\u20132018. Technical Report 30389, National Bureau of Economic Research, August\n2022. URL http://www.nber.org/papers/w30389. (cited on p. 39)\nZhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q.\nJiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In\nThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/\nforum?id=4WnqRR915j. (cited on p. 12)\nMohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and\nMark Chen. Efficient training of language models to fill in the middle. arXiv preprint, July 2022. URL\nhttps://arxiv.org/abs/2207.14255. (cited on p. 16)\n41\n Under review as submission to TMLR\nLoubna Ben Allal. Big code models leaderboard, 2023. URL https://huggingface.co/spaces/bigcode/\nbigcode-models-leaderboard. (cited on p. 24)\nLoubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. A\nframework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-\nevaluation-harness, 2022. (cited on p. 26)\nLoubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis,\nNiklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane\nAnderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel\nRomero, Michael Lappert, Francesco De Toni, Bernardo Garc\u00eda del R\u00edo, Qian Liu, Shamik Bose, Urvashi\nBhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky,\nHuu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes,\nDaniel Fried, Arjun Guha, Harm de Vries, and Leandro von Werra. SantaCoder: don\u2019t reach for the stars!\narXiv preprint, August 2023. URL https://arxiv.org/abs/2301.03988. (cited on pp. 2, 13, 30, 31, 36,\nand 37)\nManish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel\nSong, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval\nKapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun\nVontimitta, Spencer Whitman, and Joshua Saxe. Purple llama CyberSecEval: A secure coding benchmark\nfor language models. arXiv preprint, December 2023. URL https://arxiv.org/abs/2312.04724. (cited on\np. 
34)\nStella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O\u2019Brien, Eric Hallahan,\nMohammad Aflah Khan, Shivanshu Purohit, Usvsn Sai Prashanth, Edward Raff, Aviya Skowron, Lintang\nSutawika, and Oskar Van Der Wal. Pythia: A suite for analyzing large language models across training\nand scaling. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato,\nand Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning,\nvolume 202 of Proceedings of Machine Learning Research, pp. 2397\u20132430. PMLR, 23\u201329 Jul 2023. URL\nhttps://proceedings.mlr.press/v202/biderman23a.html. (cited on p. 2)\nBigCode. Models by BigCode on Hugging Face, 2024. URL https://huggingface.co/api/models?author=\nbigcode&expand[]=downloadsAllTime. Accessed: 2024. (cited on p. 37)\nBigCode collaboration, Sean Hughes, Harm de Vries, Jennifer Robinson, Carlos Mu\u00f1oz Ferrandis, Loubna Ben\nAllal, Leandro von Werra, Jennifer Ding, Sebastien Paquet, and Yacine Jernite. The BigCode project\ngovernance card. arXiv preprint, December 2023. URL https://arxiv.org/abs/2312.03872. (cited on pp.\n2, 36, and 37)\nBigCode Project. Bigcode model license agreement, 2023a. URL https://huggingface.co/spaces/bigcode/\nbigcode-model-license-agreement. Accessed: 2023. (cited on p. 37)\nBigCode Project.\nBigCode open RAIL: Responsible AI licensing framework, 2023b.\nURL https:\n//www.bigcode-project.org/docs/pages/bigcode-openrail/. Accessed: 2023. (cited on p. 37)\nBigScience Workshop. BLOOM (revision 4ab0472), 2022. URL https://huggingface.co/bigscience/bloom.\n(cited on p. 2)\nBlue Oak Council, 2024. URL https://blueoakcouncil.org/list. (cited on p. 4)\nAndrei Z. Broder. Identifying and filtering near-duplicate documents. In Annual symposium on combinatorial\npattern matching, pp. 1\u201310. Springer, 2000. URL https://link.springer.com/chapter/10.1007/3-540-\n45123-4_1. (cited on p. 13)\nEthan Caballero, OpenAI, and Ilya Sutskever.\nDescription2Code dataset, August 2016.\nURL https:\n//github.com/ethancaballero/description2code. (cited on p. 11)\n42\n Under review as submission to TMLR\nFederico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael\nGreenberg, Abhinav Jangda, and Arjun Guha. Knowledge transfer from high-resource to low-resource\nprogramming languages for code LLMs. arXiv preprint, August 2023a. URL https://arxiv.org/abs/\n2308.09895. (cited on pp. 12 and 38)\nFederico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney,\nMing-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg,\nand Abhinav Jangda. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation.\nIEEE Transactions on Software Engineering, 49(7):3675\u20133691, 2023b. doi: 10.1109/TSE.2023.3267446. URL\nhttps://www.computer.org/csdl/journal/ts/2023/07/10103177/1MpWUtj7Rwk. (cited on pp. 3 and 24)\nFederico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Anton Lozhkov, Carolyn Jane\nAnderson, and Arjun Guha. Can it edit? evaluating the ability of large language models to follow code\nediting instructions. In The First International Workshop on Large Language Model for Code, 2024. URL\nhttps://arxiv.org/abs/2312.12450. (cited on pp. 3, 27, and 28)\nYekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu.\nERNIE-code: Beyond\nEnglish-centric cross-lingual pretraining for programming languages. 
In Anna Rogers, Jordan Boyd-\nGraber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL\n2023, pp. 10628\u201310650, Toronto, Canada, July 2023. Association for Computational Linguistics. doi:\n10.18653/v1/2023.findings-acl.676. URL https://aclanthology.org/2023.findings-acl.676. (cited on p.\n2)\nMark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri\nEdwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael\nPetrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov,\nAlethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such,\nDave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen\nGuss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,\nWilliam Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan\nMorikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder,\nBob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large\nlanguage models trained on code. arXiv preprint, July 2021. URL https://arxiv.org/abs/2107.03374.\n(cited on pp. 2, 3, 23, 31, and 38)\nQin Chen, Jinfeng Ge, Huaqing Xie, Xingcheng Xu, and Yanqing Yang. Large language models at work in\nChina\u2019s labor market. arXiv preprint, August 2023. URL https://arxiv.org/abs/2308.08776. (cited on\np. 39)\nAakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,\nPaul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi,\nSasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar\nPrabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael\nIsard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk\nMichalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito,\nDavid Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani\nAgrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor\nLewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang,\nBrennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas\nEck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. Journal\nof Machine Learning Research, 24(240):1\u2013113, 2023. URL http://jmlr.org/papers/v24/22-1144.html.\n(cited on p. 28)\nPaul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement\nlearning from human preferences. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus,\nS. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30.\n43\n Under review as submission to TMLR\nCurran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/hash/\nd5e2c0adad503c91f91df240d0cd4e49-Abstract.html. (cited on p. 27)\nClamAV, 2024. URL https://www.clamav.net/. (cited on p. 
14)\nKarl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias\nPlappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training\nverifiers to solve math word problems. arXiv preprint, October 2021. URL https://arxiv.org/abs/\n2110.14168. (cited on pp. 3, 12, and 28)\nAlexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco\nGuzm\u00e1n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual\nrepresentation learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.),\nProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440\u20138451,\nOnline, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL\nhttps://aclanthology.org/2020.acl-main.747. (cited on p. 14)\nRoberto Di Cosmo and Stefano Zacchiroli. Software heritage: Why and how to preserve software source code.\nIn iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 2017. URL https:\n//www.softwareheritage.org/wp-content/uploads/2020/01/ipres-2017-swh.pdf. https://hal.archives-\nouvertes.fr/hal-01590958. (cited on p. 37)\nRoberto Di Cosmo, Morane Gruenpeter, and Stefano Zacchiroli. Referencing source code artifacts: A\nseparate concern in software citation. Computing in Science & Engineering, 22(2):33\u201343, 2020. doi:\n10.1109/MCSE.2019.2963148. (cited on p. 39)\nCrowdStrike.\nPolymorphic\nvirus.\nhttps://www.crowdstrike.com/cybersecurity-101/malware/\npolymorphic-virus/, 2024. Accessed: 2024. (cited on p. 38)\nCyberArk. Chatting our way into creating a polymorphic malware. https://www.cyberark.com/resources/\nthreat-research-blog/chatting-our-way-into-creating-a-polymorphic-malware, 2024.\nAccessed:\n2024. (cited on p. 38)\nCybersecurity & Infrastructure Security Agency. Secure by design, 2024. URL https://www.cisa.gov/\nresources-tools/resources/secure-by-design. Accessed: 2024. (cited on p. 38)\nTri Dao. FlashAttention-2: faster attention with better parallelism and work partitioning. In The Twelfth\nInternational Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=\nmZn2Xyh9Ec. (cited on p. 21)\nTri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. FlashAttention: fast and memory-\nefficient exact attention with IO-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho,\nand A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 16344\u201316359.\nCurran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/\n67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html. (cited on p. 21)\nHarm de Vries.\nGo smol or go home.\nhttps://www.harmdevries.com/post/model-size-vs-compute-\noverhead/, 2023. (cited on p. 3)\nJwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and\nRahul Gupta. BOLD: dataset and metrics for measuring biases in open-ended language generation. In\nProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT \u201921, pp.\n862\u2013872, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi:\n10.1145/3442188.3445924. URL https://doi.org/10.1145/3442188.3445924. (cited on pp. 3 and 34)\nJennifer Ding, Christopher Akiki, Yacine Jernite, Anne Lee Steele, and Temi Popo. 
Towards openness\nbeyond open access: User journeys through 3 open AI collaboratives. In Workshop on Broadening Research\nCollaborations 2022, 2022. URL https://openreview.net/forum?id=slU-5h8rrCz. (cited on p. 2)\n44\n Under review as submission to TMLR\nYangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna\nRamanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. CrossCodeEval: a\ndiverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural\nInformation Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/\nforum?id=wgDcbBMSfh. (cited on pp. 3, 31, 32, and 33)\nYi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, and Xiaowei\nHuang. Building guardrails for large language models. arXiv preprint, February 2024. URL https:\n//arxiv.org/abs/2402.01822. (cited on p. 38)\nKawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: model alignment as\nprospect theoretic optimization. arXiv preprint, February 2024. URL https://arxiv.org/abs/2402.01306.\n(cited on p. 27)\nAngela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang.\nLarge language models for software engineering: Survey and open problems. arXiv preprint, October 2023.\nURL https://arxiv.org/abs/2310.03533. (cited on p. 2)\nLuyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham\nNeubig. PAL: Program-aided language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho,\nBarbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International\nConference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 10764\u201310799.\nPMLR, 23\u201329 Jul 2023. URL https://proceedings.mlr.press/v202/gao23f.html. (cited on p. 28)\nSamuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts:\nevaluating neural toxic degeneration in language models. In Trevor Cohn, Yulan He, and Yang Liu\n(eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356\u20133369, Online,\nNovember 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301.\nURL https://aclanthology.org/2020.findings-emnlp.301. (cited on pp. 3 and 34)\nGemini Team et al. Gemini: a family of highly capable multimodal models. arXiv preprint, 2023. URL\nhttps://arxiv.org/abs/2312.11805. (cited on p. 2)\nGithub Archive, 2024. URL https://gharchive.org. (cited on pp. 3, 6, and 7)\ngo-enry, 2024. URL https://github.com/go-enry/go-enry. (cited on pp. 4 and 6)\nGoldman\nSachs.\nThe\ngenerative\nworld\norder:\nAI,\ngeopolitics,\nand\npower,\n2024.\nURL\nhttps://www.goldmansachs.com/intelligence/pages/the-generative-world-order-ai-geopolitics-\nand-power.html. (cited on p. 39)\nGovernance AI. Open sourcing highly capable foundation models, 2024. URL https://www.governance.ai/\nresearch-paper/open-sourcing-highly-capable-foundation-models. Accessed: 2024. (cited on p. 
38)\nDirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh\nJha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur,\nKhyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar\nKhot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E.\nPeters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma\nStrubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson,\nLuke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. OLMo:\naccelerating the science of language models. arXiv preprint, February 2024. URL https://arxiv.org/abs/\n2402.00838. (cited on pp. 2 and 37)\nAiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William Moses, Jose\nM. Monsalve Diaz, Mircea Trofin, and Johannes Doerfert. ComPile: a large IR dataset from production\nsources. arXiv preprint, September 2023. URL https://arxiv.org/abs/2309.15432. (cited on p. 11)\n45\n Under review as submission to TMLR\nAlex Gu, Baptiste Rozi\u00e8re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang.\nCRUXEval: a benchmark for code reasoning, understanding and execution. arXiv preprint, January 2024.\nURL https://arxiv.org/abs/2401.03065. (cited on pp. 3 and 29)\nDaya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi,\nY. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang.\nDeepSeek-Coder: when the large\nlanguage model meets programming \u2013 the rise of code intelligence. arXiv preprint, 2024. URL https:\n//arxiv.org/abs/2401.14196. (cited on pp. 2, 20, 22, 23, and 25)\nMaanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From ChatGPT to\nThreatGPT: impact of generative AI in cybersecurity and privacy. IEEE Access, 11:80218\u201380245, 2023. ISSN\n2169-3536. doi: 10.1109/access.2023.3300381. URL http://dx.doi.org/10.1109/ACCESS.2023.3300381.\n(cited on p. 38)\nAsier Guti\u00e9rrez-Fandi\u00f1o, David P\u00e9rez-Fern\u00e1ndez, Jordi Armengol-Estap\u00e9, David Griol, and Zoraida Calle-\njas.\nesCorpius: a massive spanish crawling corpus.\nIn IberSPEECH 2022, pp. 126\u2013130, 2022.\ndoi:\n10.21437/IberSPEECH.2022-26. URL https://www.isca-speech.org/archive/pdfs/iberspeech_2022/\ngutierrezfandino22_iberspeech.pdf. (cited on p. 10)\nDan Hendrycks,\nSteven Basart,\nSaurav Kadavath,\nMantas Mazeika,\nAkul Arora,\nEthan Guo,\nCollin\nBurns,\nSamir\nPuranik,\nHorace\nHe,\nDawn\nSong,\nand\nJacob\nSteinhardt.\nMeasuring\ncoding\nchallenge\ncompetence\nwith\napps.\nIn\nJ.\nVanschoren\nand\nS.\nYeung\n(eds.),\nProceed-\nings of the Neural Information Processing Systems Track on Datasets and Benchmarks,\nvol-\nume 1. Curran, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/\nc24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html. (cited on p. 12)\nSoftware Heritage. Software heritage community. https://www.softwareheritage.org/community/, 2024.\nAccessed: 2024. (cited on p. 37)\nXinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy,\nand Haoyu Wang. Large language models for software engineering: A systematic literature review. arXiv\npreprint, August 2023. URL https://arxiv.org/abs/2308.10620. (cited on p. 2)\nDong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, and Heming Cui. 
Bias testing and mitigation\nin LLM-based code generation. arXiv preprint, 2023. URL https://arxiv.org/abs/2309.14345. (cited on\np. 38)\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego\nde las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L\u00e9lio Renard Lavaud,\nMarie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth\u00e9e Lacroix, and\nWilliam El Sayed. Mistral 7B. arXiv preprint, 2023. URL https://arxiv.org/abs/2310.06825. (cited on\np. 2)\nAlbert Qiaochu Jiang, Wenda Li, Jesse Michael Han, and Yuhuai Wu. LISA: language models of ISAbelle\nproofs. In 6th Conference on Artificial Intelligence and Theorem Proving, pp. 378\u2013392, 2021. URL\nhttp://aitp-conference.org/2021/abstract/paper_17.pdf. (cited on p. 12)\nDiederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and\nYann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego,\nCA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.\n(cited on p. 21)\nDenis Kocetkov, Raymond Li, Loubna Ben allal, Jia LI, Chenghao Mou, Yacine Jernite, Margaret Mitchell,\nCarlos Mu\u00f1oz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro Von Werra, and Harm\nde Vries. The stack: 3 TB of permissively licensed source code. Transactions on Machine Learning\nResearch, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=pxpbTdUEpD. (cited on pp. 2,\n4, 36, and 37)\n46\n Under review as submission to TMLR\nAlexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon\nemissions of machine learning. arXiv preprint, October 2019. URL https://arxiv.org/abs/1910.09700.\n(cited on p. 21)\nYuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih,\nDaniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code\ngeneration. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato,\nand Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning,\nvolume 202 of Proceedings of Machine Learning Research, pp. 18319\u201318345. PMLR, 23\u201329 Jul 2023. URL\nhttps://proceedings.mlr.press/v202/lai23b.html. (cited on pp. 3 and 25)\nChris Lattner and Vikram Adve. LLVM: a compilation framework for lifelong program analysis & transfor-\nmation. In International symposium on code generation and optimization, 2004. CGO 2004., pp. 75\u201386.\nIEEE, 2004. URL https://ieeexplore.ieee.org/document/1281665. (cited on p. 
11)\nRaymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc\nMarone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas\nWang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, Jo\u00e3o Monteiro, Oleh Shliazhko, Nicolas\nGontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin\nLipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry\nAbulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao\nYu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero,\nTony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank\nMishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor,\nSiva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Mu\u00f1oz Ferrandis, Sean Hughes,\nThomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. StarCoder: may the source be with\nyou! arXiv preprint, May 2023. URL https://arxiv.org/abs/2305.06161. (cited on pp. 2, 6, 8, 13, 14,\n22, 33, 34, 36, and 37)\nYujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R\u00e9mi Leblond, Tom Eccles, James\nKeeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d\u2019Autume, Igor\nBabuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy,\nDaniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu,\nand Oriol Vinyals. Competition-level code generation with alphacode. Science, 378(6624):1092\u20131097, 2022.\ndoi: 10.1126/science.abq1158. URL https://www.science.org/doi/abs/10.1126/science.abq1158. (cited\non p. 12)\nJiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really\ncorrect? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference\non Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.\n(cited on pp. 3 and 23)\nTianyang Liu, Canwen Xu, and Julian McAuley. RepoBench: Benchmarking repository-level code auto-\ncompletion systems. arXiv preprint, June 2023b. URL https://arxiv.org/abs/2306.03091. (cited on pp.\n3, 31, and 32)\nShayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon,\nNiklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt\nBollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, and Sara Hooker. The data provenance initiative:\nA large scale audit of dataset licensing & attribution in AI. arXiv preprint, 2023. URL https://arxiv.org/\nabs/2310.16787. (cited on p. 25)\nRisto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip\nGinter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, et al. Fingpt: Large generative models\nfor a small language. arXiv preprint arXiv:2311.05640, 2023. URL https://arxiv.org/abs/2311.05640.\n(cited on p. 37)\n47\n Under review as submission to TMLR\nEmanuele La Malfa, Aleksandar Petrov, Simon Frieder, Christoph Weinhuber, Ryan Burnell, Raza Nazar,\nAnthony G. Cohn, Nigel Shadbolt, and Michael Wooldridge. Language models as a service: Overview of a\nnew paradigm and its challenges. arXiv preprint, 2023. URL https://arxiv.org/abs/2309.16573. (cited\non p. 
37)\nMarc Marone and Benjamin Van Durme. Data portraits: Recording foundation model training data. In\nThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track,\n2023. URL https://arxiv.org/abs/2303.03919. (cited on pp. 34 and 35)\nThe mathlib Community.\nThe lean mathematical library.\nIn Proceedings of the 9th ACM SIGPLAN\nInternational Conference on Certified Programs and Proofs, POPL \u201920. ACM, January 2020. doi: 10.1145/\n3372885.3373824. URL http://dx.doi.org/10.1145/3372885.3373824. (cited on p. 12)\nDaniel Mendez, Daniel Graziotin, Stefan Wagner, and Heidi Seibold. Open Science in Software Engineering,\npp. 477\u2013501. Springer International Publishing, 2020. doi: 10.1007/978-3-030-32489-6_17. URL http:\n//dx.doi.org/10.1007/978-3-030-32489-6_17. (cited on p. 37)\nRalph C. Merkle. A digital signature based on a conventional encryption function. In Conference on the\ntheory and application of cryptographic techniques, pp. 369\u2013378. Springer, 1987. (cited on p. 3)\nMehdi Mirakhorli, Derek Garcia, Schuyler Dillon, Kevin Laporte, Matthew Morrison, Henry Lu, Viktoria\nKoscinski, and Christopher Enoch. A landscape study of open source and proprietary tools for software\nbill of materials (sbom). arXiv preprint, 2024. URL https://arxiv.org/abs/2402.11151. (cited on p. 38)\nMike Mirzayanov. Codeforces: Results of 2020 [annual report]. https://codeforces.com/blog/entry/89502,\n2020. (cited on p. 11)\nMaximilian Mozes, Xuanli He, Bennett Kleinberg, and Lewis D. Griffin. Use of LLMs for illicit purposes:\nThreats, prevention measures, and vulnerabilities. arXiv preprint, 2023. URL https://arxiv.org/abs/\n2308.12833. (cited on p. 38)\nMSFT Q2 Earning Call, 2024.\nURL https://www.microsoft.com/en-us/investor/events/fy-2024/\nearnings-fy-2024-q2.aspx. (cited on p. 2)\nNiklas Muennighoff, Nouamane Tazi, Lo\u00efc Magne, and Nils Reimers.\nMteb: Massive text embedding\nbenchmark. arXiv preprint arXiv:2210.07316, 2022a. doi: 10.48550/ARXIV.2210.07316. URL https:\n//arxiv.org/abs/2210.07316. (cited on p. 12)\nNiklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao,\nM Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri\nAji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel.\nCrosslingual generalization through multitask finetuning, 2022b. URL https://arxiv.org/abs/2211.01786.\n(cited on p. 25)\nNiklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus,\nSampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Thirty-seventh\nConference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=\nj5BuTrEj35. (cited on pp. 2 and 21)\nNiklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam\nSingh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. OctoPack: instruction tuning code large\nlanguage models. In The Twelfth International Conference on Learning Representations, 2024a. URL\nhttps://openreview.net/forum?id=mw1PWNSWZP. (cited on pp. 3, 25, 26, 27, and 37)\nNiklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and\nDouwe Kiela.\nGenerative representational instruction tuning.\narXiv preprint, 2024b.\nURL https:\n//arxiv.org/abs/2402.09906. (cited on p. 25)\n48\n Under review as submission to TMLR\nJ. M\u00f6kander, J. Schuett, H.R. 
Kirk, et al. Auditing large language models: A three-layered approach. AI\nEthics, 2023. URL https://doi.org/10.1007/s43681-023-00289-2. (cited on p. 37)\nSebastian Nanz and Carlo A. Furia. A comparative study of programming languages in Rosetta code. In\n2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pp. 778\u2013788.\nIEEE, 2015. URL https://ieeexplore.ieee.org/document/7194625. (cited on p. 12)\nErik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming\nXiong. CodeGen: an open large language model for code with multi-turn program synthesis. In The Eleventh\nInternational Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=\niaYcJKpY2B_. (cited on p. 2)\nDebora Nozza, Federico Bianchi, and Dirk Hovy. HONEST: Measuring hurtful sentence completion in language\nmodels. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy,\nSteven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021\nConference of the North American Chapter of the Association for Computational Linguistics: Human\nLanguage Technologies, pp. 2398\u20132406, Online, June 2021. Association for Computational Linguistics. doi:\n10.18653/v1/2021.naacl-main.191. URL https://aclanthology.org/2021.naacl-main.191. (cited on pp.\n3 and 34)\nOpenAI et al. GPT-4 technical report. arXiv preprint, March 2023. URL https://arxiv.org/abs/2303.08774.\n(cited on p. 2)\nPedro Javier Ortiz Su\u00e1rez, Beno\u00eet Sagot, and Laurent Romary. Asynchronous pipelines for processing huge\ncorpora on medium to low resource infrastructures. In Piotr Ba\u0144ski, Adrien Barbaresi, Hanno Biber, Evelyn\nBreiteneder, Simon Clematide, Marc Kupietz, Harald L\u00fcngen, and Caroline Iliadi (eds.), Proceedings of\nthe Workshop on Challenges in the Management of Large Corpora, pp. 9 \u2013 16, Mannheim, July 2019.\nLeibniz-Institut f\u00fcr Deutsche Sprache. doi: 10.14618/ids-pub-9021. URL http://nbn-resolving.de/urn:\nnbn:de:bsz:mh39-90215. (cited on p. 10)\nKeiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. OpenWebMath: an open dataset of high-\nquality mathematical web text. arXiv preprint, October 2023. URL https://arxiv.org/abs/2310.06786.\n(cited on p. 13)\nHammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. Asleep at\nthe keyboard? assessing the security of github copilot\u2019s code contributions. In 2022 IEEE Symposium on\nSecurity and Privacy (SP), pp. 754\u2013768. IEEE, 2022. (cited on pp. 3 and 33)\nGuilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro\nCappelli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon\nLLM: Outperforming curated corpora with web data only.\nIn Thirty-seventh Conference on Neural\nInformation Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/\nforum?id=kM5eGcdCzq. (cited on p. 10)\nSida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity:\nEvidence from GitHub Copilot. arXiv preprint, 2023. URL https://arxiv.org/abs/2302.06590. (cited on\npp. 2 and 39)\nAntoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli.\nThe software heritage graph dataset:\nLarge-scale analysis of public software development history.\nIn MSR 2020:\nThe 17th Inter-\nnational Conference on Mining Software Repositories,\npp. 1\u20135. 
IEEE, 2020.\ndoi:\n10.1145/\n3379597.3387510.\nURL\nhttps://arxiv.org/abs/2011.07824https://www.softwareheritage.org/wp-\ncontent/uploads/2021/03/msr-2020-challenge.pdf. (cited on p. 3)\nAleksandra Piktus, Christopher Akiki, Paulo Villegas, Hugo Lauren\u00e7on, G\u00e9rard Dupont, Sasha Luccioni,\nYacine Jernite, and Anna Rogers. The ROOTS search tool: Data transparency for LLMs. In Danushka\nBollegala, Ruihong Huang, and Alan Ritter (eds.), Proceedings of the 61st Annual Meeting of the Association\nfor Computational Linguistics (Volume 3: System Demonstrations), pp. 304\u2013314, Toronto, Canada, July\n49\n Under review as submission to TMLR\n2023a. Association for Computational Linguistics.\ndoi: 10.18653/v1/2023.acl-demo.29.\nURL https:\n//aclanthology.org/2023.acl-demo.29. (cited on p. 34)\nAleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey\nSchoelkopf, Stella Biderman, Martin Potthast, and Jimmy Lin. GAIA search: Hugging Face and pyserini\ninteroperability for NLP training data exploration. In Danushka Bollegala, Ruihong Huang, and Alan Ritter\n(eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume\n3: System Demonstrations), pp. 588\u2013598, Toronto, Canada, July 2023b. Association for Computational\nLinguistics.\ndoi: 10.18653/v1/2023.acl-demo.57.\nURL https://aclanthology.org/2023.acl-demo.57.\n(cited on p. 34)\nNikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, , and Nathan Cooper.\nStable code 3B: Coding on the edge. Stability AI, 2024. URL https://stability.ai/news/stable-code-\n2024-llm-code-completion-release. (cited on p. 22)\nBigCode Project. The stack v2, 2024. URL https://huggingface.co/datasets/bigcode/the-stack-v2/.\nAccessed: 2024. (cited on p. 37)\nRuchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian\nDolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam\nRamji, Ulrich Finkler, Susan Malaika, and Frederick Reiss. CodeNet: a large-scale AI for code dataset for\nlearning a diversity of coding tasks. In Thirty-fifth Conference on Neural Information Processing Systems\nDatasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=6vZVBkCDrHT.\n(cited on p. 11)\nRedPajama Wiki, 2024.\nURL https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1/\ndata_prep/wiki. (cited on p. 13)\nNils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In\nKentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on\nEmpirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural\nLanguage Processing (EMNLP-IJCNLP), pp. 3982\u20133992, Hong Kong, China, November 2019. Association\nfor Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410.\n(cited on p. 12)\nShuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio\nBlanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint,\n2020. URL https://arxiv.org/abs/2009.10297. (cited on p. 31)\nRosetta Code, 2023. URL https://rosettacode.org/. (cited on pp. 
11 and 12)\nBaptiste Rozi\u00e8re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi,\nJingyu Liu, Tal Remez, J\u00e9r\u00e9my Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt,\nCristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre D\u00e9fossez, Jade Copet, Faisal Azhar,\nHugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama:\nOpen foundation models for code. arXiv preprint, August 2023. URL https://arxiv.org/abs/2308.12950.\n(cited on pp. 2, 20, 22, 23, 25, and 31)\nSane Security, 2024. URL https://sanesecurity.com/usage/signatures. (cited on p. 14)\nVictor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin,\nArnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma\nSharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang,\nMike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel\nBawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry,\nJason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M\nRush. Multitask prompted training enables zero-shot task generalization. In International Conference on\nLearning Representations, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4. (cited on p. 25)\n50\n Under review as submission to TMLR\nDavid Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning\nabilities of neural models. In International Conference on Learning Representations, 2019. URL https:\n//openreview.net/forum?id=H1gR5iR5FX. (cited on p. 12)\nScanCode, 2024. URL https://github.com/nexB/scancode-toolkit. (cited on p. 3)\nScanCode License Categories, 2024. URL https://scancode-licensedb.aboutcode.org/help.html#license-\ncategories. (cited on p. 4)\nTeven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagn\u00e9,\nAlexandra Sasha Luccioni, Fran\u00e7ois Yvon, Matthias Gall\u00e9, Jonathan Tow, Alexander M. Rush, Stella Bider-\nman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Beno\u00eet Sagot, Niklas Muennighoff,\nAlbert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major,\nIz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Lau-\nren\u00e7on, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor\nSoroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou,\nChris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al.\nBLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022a. doi:\n10.48550/ARXIV.2211.05100. URL https://doi.org/10.48550/arXiv.2211.05100. (cited on p. 2)\nTeven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella Biderman,\nHady Elsahar, Niklas Muennighoff, Jason Phang, et al. What language model to train if you have one\nmillion gpu hours? arXiv preprint arXiv:2210.15424, 2022b. (cited on p. 14)\nServiceNow.\nText2flow LLM: Automating workflow generation from descriptive text.\nhttps://\ndownloads.docs.servicenow.com/resource/enus/infocard/text2flow-llm.pdf, 2024a. (cited on p. 39)\nServiceNow. 
Text-to-code LLM: transforming natural language into executable code, 2024b. URL https:\n//downloads.docs.servicenow.com/resource/enus/infocard/text-to-code-llm.pdf. (cited on p. 39)\nNoam Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150, 2019.\nURL http://arxiv.org/abs/1911.02150. (cited on p. 20)\nEmily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter:\nOn biases in language generation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.),\nProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th\nInternational Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3407\u20133412, Hong\nKong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1339.\nURL https://aclanthology.org/D19-1339. (cited on p. 34)\nDan Sholler, Igor Steinmacher, Denise Ford, Mara Averick, Mike Hoye, and Greg Wilson. Ten simple rules\nfor helping newcomers become contributors to open projects. PLoS Computational Biology, 15(9):e1007296,\n2019. doi: 10.1371/journal.pcbi.1007296. URL https://doi.org/10.1371/journal.pcbi.1007296. (cited\non p. 37)\nShivalika Singh, Freddie Vargus, Daniel Dsouza, B\u00f6rje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko,\nHerumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi,\nJoseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemi\u0144ski, Hakimeh Fadaei, Irem Erg\u00fcn,\nIfeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya\nGuthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer,\nAhmet \u00dcst\u00fcn, Marzieh Fadaee, and Sara Hooker. Aya dataset: An open-access collection for multilingual\ninstruction tuning. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.06619. (cited on p. 37)\nSoftware Heritage. Swh statement on llm for code, 2023. URL https://www.softwareheritage.org/2023/\n10/19/swh-statement-on-llm-for-code/. (cited on p. 38)\nSoftware Heritage. Bulk access terms of use, 2024a. URL https://www.softwareheritage.org/legal/bulk-\naccess-terms-of-use/. (cited on p. 37)\n51\n Under review as submission to TMLR\nSoftware Heritage, 2024b. URL https://www.softwareheritage.org. (cited on p. 7)\nIrene Solaiman. The gradient of generative AI release: Methods and considerations. arXiv preprint, 2023.\nURL https://arxiv.org/abs/2302.04844. (cited on pp. 2, 37, and 38)\nLuca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin,\nKhyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar,\nLi Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik,\nCrystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell,\nNishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi,\nIz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo. Dolma: an open corpus of three trillion tokens\nfor language model pretraining research. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.00159.\n(cited on p. 37)\nStackExchange Archive, 2024. URL https://archive.org/details/stackexchange. (cited on p. 12)\nJianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced\ntransformer with rotary position embedding. arXiv preprint, April 2021. 
URL https://arxiv.org/abs/\n2104.09864. (cited on p. 20)\nMarc Szafraniec, Baptiste Roziere, Hugh James Leather, Patrick Labatut, Francois Charton, and Gabriel\nSynnaeve. Code translation with compiler representations. In The Eleventh International Conference on\nLearning Representations, 2023. URL https://openreview.net/forum?id=XomEU3eNeSQ. (cited on p. 11)\nFeiyang Tang, Bjarte M. \u00d8stvold, and Magiel Bruntink. Helping code reviewer prioritize: Pinpointing\npersonal data and its processing. Frontiers in Artificial Intelligence and Applications, 371:109\u2013124, 2023.\ndoi: 10.3233/FAIA230228. (cited on p. 38)\nFeiyang Tang and Bjarte M. \u00d8stvold. Finding privacy-relevant source code. arXiv preprint, 2024. URL\nhttps://arxiv.org/abs/2401.07316. (cited on p. 38)\nThe SWHID Specification Project. The SWHID specification, 2024. URL https://www.swhid.org/. (cited\non p. 38)\nTogether Computer. RedPajama: an open dataset for training large language models, October 2023. URL\nhttps://github.com/togethercomputer/RedPajama-Data. (cited on p. 12)\nHugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay\nBashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton\nFerrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller,\nCynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan\nInan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh\nKoura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier\nMartinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein,\nRashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian,\nXiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan,\nIliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert\nStojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.\narXiv preprint, 2023. URL https://arxiv.org/abs/2307.09288. (cited on p. 12)\nAhmet \u00dcst\u00fcn, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D\u2019souza, Gbemileke Onilude, Neel\nBhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas\nMuennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned\nopen-access multilingual language model. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.07827.\n(cited on p. 37)\nBertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. Learning from the worst: Dynamically\ngenerated datasets to improve online hate detection. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto\n52\n Under review as submission to TMLR\nNavigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and\nthe 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1667\u2013\n1682, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.132.\nURL https://aclanthology.org/2021.acl-long.132. (cited on p. 34)\nJunjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with\nlarge language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, pp.\n1\u201327, 2024. 
doi: 10.1109/TSE.2024.3368208. URL https://arxiv.org/abs/2307.07221. (cited on p. 2)\nJason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M.\nDai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on\nLearning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR. (cited on p. 25)\nMichael Woelfle, Piero L Olliaro, and Matthew H. Todd. Open science is a research accelerator. Nature\nchemistry, 3 10:745\u20138, 2011. URL https://api.semanticscholar.org/CorpusID:205289283. (cited on p.\n37)\nWorld Economic Forum.\nJobs of tomorrow:\nLarge language models and jobs, 2024.\nURL https:\n//www.weforum.org/publications/jobs-of-tomorrow-large-language-models-and-jobs/. (cited on p.\n39)\nYiheng Xu, Hongjin Su, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu,\nTianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang, Caiming Xiong, and Tao\nYu. Lemur: Harmonizing natural language and code for language agents. In The Twelfth International\nConference on Learning Representations, 2024. URL https://openreview.net/forum?id=hNhwSmtXRh.\n(cited on p. 37)\nYahoo Finance.\nServiceNow Inc (NYSE: NOW) Q4 earnings: What to expect, 2024.\nURL https://\nfinance.yahoo.com/news/servicenow-inc-nyse-now-q4-154816487.html. (cited on pp. 2 and 39)\nZhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsum Kim, Donggyun Han, and David Lo. Gotcha!\nthis model uses my code! evaluating membership leakage risks in code models. arXiv preprint, 2023. URL\nhttps://arxiv.org/abs/2310.01166. (cited on p. 38)\nZheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and\nJingren Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv\npreprint, August 2023. URL https://arxiv.org/abs/2308.01825. (cited on p. 12)\nJieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference\nresolution: Evaluation and debiasing methods. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.),\nProceedings of the 2018 Conference of the North American Chapter of the Association for Computational\nLinguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 15\u201320, New Orleans, Louisiana,\nJune 2018. Association for Computational Linguistics.\ndoi: 10.18653/v1/N18-2003.\nURL https://\naclanthology.org/N18-2003. (cited on p. 34)\nTerry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. Red teaming ChatGPT via jailbreaking:\nBias, robustness, reliability and toxicity. arXiv preprint, 2023a. URL https://arxiv.org/abs/2301.12867.\n(cited on p. 38)\nTerry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, and David\nLo. Source code data augmentation for deep learning: A survey. arXiv preprint, May 2023b. URL\nhttps://arxiv.org/abs/2305.19915. (cited on pp. 2, 11, and 38)\nTerry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu,\nand Niklas Muennighoff. Astraios: Parameter-efficient instruction tuning code large language models.\narXiv preprint, August 2024. URL https://arxiv.org/abs/2401.00788. (cited on pp. 25 and 37)\n53\n Under review as submission to TMLR\nAlbert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh\nSittampalam, and Edward Aftandilian. Measuring GitHub Copilot\u2019s impact on productivity. Commun. ACM,\n67(3):54\u201363, feb 2024. ISSN 0001-0782. 
doi: 10.1145/3633453. URL https://doi.org/10.1145/3633453.\n(cited on pp. 2 and 39)\n54\n Under review as submission to TMLR\nA\nData Curation\nA.1\nExcluded Extensions\nAL (al), AngelScript (as), AsciiDoc (asc), AspectJ (aj), Bison (bison), Boogie (bpl),\nC++ (), Cabal Config (project), ChucK (ck), CODEOWNERS (),\nCommon Lisp (l, sexp), Common Workflow Language (cwl), CoNLL-U (conll, conllu), Cue Sheet (cue),\nCWeb (w), desktop (desktop, in, service), DIGITAL Command Language (com), DTrace (d), edn (edn),\nElixir (lock), Factor (factor), GAP (g, gd), Gemfile.lock (lock), Gettext Catalog (pot),\nGit Config (gitmodules), GLSL (geo), Glyph Bitmap Distribution Format (bdf), GN (gn),\nIgnore List (dockerignore, eslintignore, gitignore, npmignore), INI (cfg, prefs, url),\nJAR Manifest (mf), Java Properties (properties), Jest Snapshot (snap), JetBrains MPS (mps),\nJSONLD (jsonld), LiveScript (ls), Makefile (d, make), Mathematica (cdf, nb), MAXScript (ms),\nmIRC Script (mrc), NASL (inc), nesC (nc), Nunjucks (njk), OpenEdge ABL (p, w),\nPascal (, dpr, inc, pp), Perl (al, ph), PLSQL (pck, pls, tps, trg, vw),\nProtocol Buffer Text Format (pbt), Puppet (), PureBasic (pb), Racket (rkt, rktd),\nReScript (res), reStructuredText (rest), Rich Text Format (rtf), Roff (, 1, 1d, 2,\n5, 7, 8, 9, in), Roff Manpage (, 1d, 2, 3d, 4, 6, 9, man), Scala (sc), Scilab (tst),\nSELinux Policy (te), Shell (env), Slash (sl), Smalltalk (cs), SmPL (cocci), SQL (tab), Standard ML (sig),\nStata (ihlp, sthlp), SuperCollider (sc), SWIG (i), TeX (aux, ltx, toc), TOML (lock), Turtle (ttl),\nVBA (frm, frx), Vim Snippet (snippet), Wavefront Material (mtl), Wikitext (wikitext),\nWindows Registry Entries (reg), wisp (w), World of Warcraft Addon Data (toc), X BitMap (xbm),\nXML (kml, pt, resx, rss), XML Property List (plist, tmcommand, tmlanguage, tmsnippet, tmtheme), Yacc (yy).\nA.2\nExcluded Programming Languages\n2-Dimensional Array,AGS Script,Bicep,Checksums,COLLADA,CSV,Diff,DirectX 3D File,E-mail,G-code,\nGerber Image,Git Revision List,Gnuplot,Go,Checksums,IRC log,Jupyter Notebook,KiCad Layout,\nKiCad Legacy Layout,KiCad Schematic,Lasso,Linux,Kernel Module,Max,\nMicrosoft Developer Studio Project,Microsoft Visual Studio Solution,Pickle,PostScript,\nPOV-Ray SDL,Public Key,Pure Data,Raw token data,robots.txt,STL,SubRip Text,SVG,TSV,\nUnity3D Asset,Wavefront Object,WebVTT,X PixMap\nA.3\nLicense detection\nlicense_file_names = [\n\"li[cs]en[cs]e(s?)\",\n\"legal\",\n\"copy(left|right|ing)\",\n\"unlicense\",\n\"[al]?gpl([-_ v]?)(\\d\\.?\\d?)?\",\n# AGPLv3\n\"bsd(l?)\",\n# BSDL\n\"mit(x?)\",\n# MITX\n\"apache\",\n\"artistic\",\n# Artistic.txt\n\"copying(v?)(\\d?)\",\n# COPYING3, COPYINGv3\n\"disclaimer\",\n\"eupl\",\n\"gfdl\",\n\"[cm]pl\",\n\"cc0\",\n\"al([-_ v]?)(\\d\\.?\\d)?\",\n# AL2.0\n\"about\",\n\"notice\",\n55\n Under review as submission to TMLR\n\"readme\",\n\"guidelines\",\n]\nlicense_file_re = re.compile(\nrf\"^(|.*[-_. ])({\u2019|\u2019.join(license_file_names)})(|[-_. 
].*)$\", re.IGNORECASE\n)\nA.4\nPermissive licenses\nSPDX-recognized license IDs\n0BSD, AAL, Abstyles, AdaCore-doc, Adobe-2006, Adobe-Glyph, ADSL,\nAFL-1.1, AFL-1.2, AFL-2.0, AFL-2.1, AFL-3.0, Afmparse, AMDPLPA, AML, AMPAS, ANTLR-PD, Apache-\n1.0, Apache-1.1, Apache-2.0, APAFML, App-s2p, Artistic-1.0, Artistic-1.0-cl8, Artistic-1.0-Perl, Artistic-2.0,\nBaekmuk, Bahyph, Barr, Beerware, Bitstream-Charter, Bitstream-Vera, BlueOak-1.0.0, Boehm-GC, Borceux,\nBrian-Gladman-3-Clause, BSD-1-Clause, BSD-2-Clause, BSD-2-Clause-Patent, BSD-2-Clause-Views, BSD-3-\nClause, BSD-3-Clause-Attribution, BSD-3-Clause-Clear, BSD-3-Clause-LBNL, BSD-3-Clause-Modification,\nBSD-3-Clause-No-Nuclear-License-2014, BSD-3-Clause-No-Nuclear-Warranty, BSD-3-Clause-Open-MPI, BSD-\n4-Clause, BSD-4-Clause-Shortened, BSD-4-Clause-UC, BSD-4.3RENO, BSD-4.3TAHOE, BSD-Advertising-\nAcknowledgement, BSD-Attribution-HPND-disclaimer, BSD-Source-Code, BSL-1.0, bzip2-1.0.6, Caldera,\nCC-BY-1.0, CC-BY-2.0, CC-BY-2.5, CC-BY-2.5-AU, CC-BY-3.0, CC-BY-3.0-AT, CC-BY-3.0-DE, CC-BY-3.0-\nNL, CC-BY-3.0-US, CC-BY-4.0, CDLA-Permissive-1.0, CDLA-Permissive-2.0, CECILL-B, CERN-OHL-1.1,\nCERN-OHL-1.2, CERN-OHL-P-2.0, CFITSIO, checkmk, ClArtistic, Clips, CMU-Mach, CNRI-Jython, CNRI-\nPython, CNRI-Python-GPL-Compatible, COIL-1.0, Community-Spec-1.0, Condor-1.1, Cornell-Lossless-JPEG,\nCrossword, CrystalStacker, Cube, curl, DL-DE-BY-2.0, DOC, Dotseqn, DRL-1.0, DSDP, dtoa, dvipdfm,\nECL-1.0, ECL-2.0, EFL-1.0, EFL-2.0, eGenix, Entessa, EPICS, etalab-2.0, EUDatagrid, Fair, FreeBSD-DOC,\nFSFAP, FSFULLR, FSFULLRWD, FTL, GD, Giftware, Glulxe, GLWTPL, Graphics-Gems, GStreamer-\nexception-2005, HaskellReport, HP-1986, HPND, HPND-Markus-Kuhn, HPND-sell-variant, HPND-sell-\nvariant-MIT-disclaimer, HTMLTIDY, IBM-pibs, ICU, IJG, IJG-short, ImageMagick, iMatix, Info-ZIP, Intel,\nIntel-ACPI, ISC, Jam, JasPer-2.0, JPNIC, JSON, Kazlib, Knuth-CTAN, Latex2e, Latex2e-translated-notice,\nLeptonica, Libpng, libpng-2.0, libtiff, Linux-OpenIB, LLVM-exception, LOOP, LPL-1.0, LPL-1.02, LPPL-1.3c,\nMartin-Birgmeier, metamail, Minpack, MirOS, MIT, MIT-0, MIT-advertising, MIT-CMU, MIT-enna, MIT-\nfeh, MIT-Festival, MIT-Modern-Variant, MIT-open-group, MIT-Wu, MITNFA, mpich2, mplus, MS-LPL,\nMS-PL, MTLL, MulanPSL-1.0, MulanPSL-2.0, Multics, Mup, NAIST-2003, NASA-1.3, Naumen, NBPL-1.0,\nNCSA, Net-SNMP, NetCDF, Newsletr, NICTA-1.0, NIST-PD-fallback, NIST-Software, NLOD-1.0, NLOD-2.0,\nNRL, NTP, NTP-0, O-UDA-1.0, ODC-By-1.0, OFFIS, OFL-1.0, OFL-1.0-no-RFN, OFL-1.0-RFN, OFL-\n1.1-no-RFN, OFL-1.1-RFN, OGC-1.0, OGDL-Taiwan-1.0, OGL-Canada-2.0, OGL-UK-1.0, OGL-UK-2.0,\nOGL-UK-3.0, OGTSL, OLDAP-1.1, OLDAP-1.2, OLDAP-1.3, OLDAP-1.4, OLDAP-2.0, OLDAP-2.0.1,\nOLDAP-2.1, OLDAP-2.2, OLDAP-2.2.1, OLDAP-2.2.2, OLDAP-2.3, OLDAP-2.4, OLDAP-2.5, OLDAP-2.6,\nOLDAP-2.7, OLDAP-2.8, OML, OpenSSL, OPUBL-1.0, PHP-3.0, PHP-3.01, Plexus, PostgreSQL, PSF-2.0,\npsfrag, psutils, Python-2.0, Python-2.0.1, Qhull, Rdisc, RSA-MD, Ruby, Saxpath, SCEA, SchemeReport,\nSendmail, SGI-B-1.1, SGI-B-2.0, SGP4, SHL-0.5, SHL-0.51, SHL-2.0, SHL-2.1, SMLNJ, snprintf, Spencer-86,\nSpencer-94, Spencer-99, SSH-OpenSSH, SSH-short, SunPro, Swift-exception, SWL, TCL, TCP-wrappers,\nTermReadKey, TPDL, TTWL, TU-Berlin-1.0, TU-Berlin-2.0, UCAR, Unicode-DFS-2015, Unicode-DFS-2016,\nUnixCrypt, UPL-1.0, Vim, VSL-1.0, W3C, W3C-19980720, W3C-20150513, w3m, Widget-Workshop, Wsuipa,\nX11, X11-distribute-modifications-variant, Xdebug-1.03, Xerox, Xfig, XFree86-1.1, xinetd, xlock, Xnet, xpp,\nXSkat, Zed, Zend-2.0, Zlib, 
zlib-acknowledgement, ZPL-1.1, ZPL-2.0, ZPL-2.1\nScanCode-specific license IDs\nLicenseRef-scancode-{3com-microcode, 3dslicer-1.0, 4suite-1.1, accellera-\nsystemc, adi-bsd, adrian, agere-bsd, alexisisaac-freeware, amd-historical, ams-fonts, anu-license, apache-patent-\nexception, apple-attribution, apple-attribution-1997, apple-excl, apple-sscl, aravindan-premkumar, argouml,\narm-llvm-sga, array-input-method-pl, asmus, asn1, atkinson-hyperlegible-font, bakoma-fonts-1995, bea-2.1,\nbeal-screamer, beri-hw-sw-1.0, bigdigits, bigelow-holmes, biopython, bitzi-pd, blas-2017, bohl-0.2, boost-\noriginal, boutell-libgd-2021, bpmn-io, brent-corkum, brian-clapper, brian-gladman, brian-gladman-3-clause,\nbroadcom-cfe, broadcom-linux-timer, brocade-firmware, bruno-podetti, bsd-1-clause-build, bsd-1988, bsd-2-\n56\n Under review as submission to TMLR\nclause-plus-advertizing, bsd-3-clause-devine, bsd-3-clause-fda, bsd-3-clause-jtag, bsd-3-clause-no-change, bsd-\n3-clause-no-trademark, bsd-3-clause-sun, bsd-ack-carrot2, bsd-artwork, bsd-atmel, bsd-axis-nomod, bsd-credit,\nbsd-dpt, bsd-export, bsd-innosys, bsd-mylex, bsd-new-derivative, bsd-new-nomod, bsd-new-tcpdump, bsd-no-\ndisclaimer, bsd-no-disclaimer-unmodified, bsd-original-muscle, bsd-original-voices, bsd-plus-mod-notice, bsd-\nsimplified-darwin, bsd-simplified-intel, bsd-simplified-source, bsd-top, bsd-top-gpl-addition, bsd-unchanged,\nbsd-unmodified, bsd-x11, bsla-no-advert, bytemark, can-ogl-alberta-2.1, can-ogl-british-columbia-2.0, can-ogl-\nnova-scotia-1.0, can-ogl-ontario-1.0, can-ogl-toronto-1.0, careware, carnegie-mellon, cavium-malloc, cc-by-2.0-\nuk, cecill-b-en, cern-attribution-1995, cgic, chicken-dl-0.2, chris-maunder, chris-stoy, classic-vb, clear-bsd-1-\nclause, click-license, cmu-mit, cmu-simple, cmu-template, code-credit-license-1.0.1, code-credit-license-1.1.0,\ncodeguru-permissions, codesourcery-2004, commonj-timer, compass, componentace-jcraft, compuphase-linking-\nexception, cosl, cpm-2022, cpp-core-guidelines, crcalc, cryptopp, csprng, cve-tou, cwe-tou, cximage, d-zlib,\ndamail, dante-treglia, dbad-1.1, delorie-historical, dhtmlab-public, dl-de-by-1-0-de, dl-de-by-1-0-en, dl-de-by-\n2-0-en, dmalloc, dmtf-2017, docbook, douglas-young, drl-1.1, dropbear, dropbear-2016, dtree, dwtfnmfpl-3.0,\ndynamic-drive-tou, ecfonts-1.0, egenix-1.0.0, ellis-lab, emit, emx-library, energyplus-bsd, epaperpress, eric-glass,\nerrbot-exception, etalab-2.0-en, fabien-tassin, far-manager-exception, fastbuild-2012-2020, fatfs, fftpack-2004,\nfilament-group-mit, flex-2.5, flora-1.1, font-alias, fpl, fplot, fraunhofer-iso-14496-10, free-art-1.3, freebsd-\nboot, freebsd-first, freemarker, fsf-notice, fujion-exception-to-apache-2.0, gareth-mccaughan, gary-s-brown,\ngdcl, geoff-kuenning-1993, ghostpdl-permissive, glut, good-boy, greg-roelofs, gregory-pietsch, gtpl-v1, gtpl-\nv2, gtpl-v3, happy-bunny, hdf4, hdf5, hdparm, hidapi, historical-ntp, homebrewed, hp-snmp-pp, html5,\nhttpget, ian-kaplan, ian-piumarta, ibm-as-is, ibm-dhcp, ibm-icu, ibm-nwsc, ibm-sample, ibpp, icot-free,\nidt-notice, ietf, ietf-trust, ilmid, indiana-extreme, infineon-free, info-zip-1997-10, info-zip-2001-01, info-zip-\n2002-02, info-zip-2003-05, info-zip-2004-05, info-zip-2005-02, info-zip-2007-03, info-zip-2009-01, inno-setup,\nintel-bsd, intel-bsd-2-clause, intel-osl-1989, intel-osl-1993, intel-royalty-free, iso-14496-10, iso-8879, itu, ja-sig,\njason-mayes, jasper-1.0, java-app-stub, jdbm-1.00, jdom, jetty, jgraph, jpnic-mdnkit, jpython-1.1, jscheme,\njsfromhell, jython, 
kalle-kaukonen, keith-rule, kerberos, kevan-stannard, kevlin-henney, khronos, kumar-\nrobotics, lcs-telegraphics, ldap-sdk-free-use, libgeotiff, libmib, libmng-2007, libsrv-1.0.2, lil-1, lilo, linux-device-\ndrivers, linuxbios, linuxhowtos, llnl, logica-1.0, lucre, make-human-exception, matt-gallagher-attribution,\nmatthew-kwan, mattkruse, mediainfo-lib, mgopen-font-license, michael-barr, michigan-disclaimer, mit-1995,\nmit-license-1998, mit-modification-obligations, mit-nagy, mit-no-advert-export-control, mit-no-trademarks,\nmit-old-style, mit-old-style-sparse, mit-readme, mit-specification-disclaimer, mit-synopsys, mit-taylor-variant,\nmit-veillard-variant, mod-dav-1.0, motorola, mpeg-iso, mpeg-ssg, ms-sspl, ms-ws-routing-spec, msj-sample-\ncode, mulanpsl-1.0-en, mulanpsl-2.0-en, mulle-kybernetik, musl-exception, mx4j, netcat, netcomponents,\nnetron, newlib-historical, newran, nice, niels-ferguson, nilsson-historical, nist-srd, node-js, nonexclusive,\nnortel-dasa, notre-dame, nrl-permission, ntlm, ntpl-origin, nvidia, nvidia-2002, nvidia-gov, nwhm, nysl-\n0.9982, nysl-0.9982-jp, o-young-jong, oasis-ws-security-spec, object-form-exception-to-mit, odl, odmg, ogc,\nogl-1.0a, ogl-canada-2.0-fr, ogl-wpd-3.0, openmarket-fastcgi, openorb-1.0, opensaml-1.0, openssl, opml-1.0,\nopnl-1.0, opnl-2.0, oreilly-notice, oswego-concurrent, other-permissive, owtchart, ozplb-1.0, ozplb-1.1, paolo-\nmessina-2000, paraview-1.2, patent-disclaimer, paul-mackerras, paul-mackerras-binary, paul-mackerras-new,\npaul-mackerras-simplified, paulo-soares, paypal-sdk-2013-2016, pcre, pd-mit, pd-programming, perl-1.0, peter-\ndeutsch-document, philippe-de-muyter, phorum-2.0, php-2.0.2, pine, pngsuite, politepix-pl-1.0, ppp, protobuf,\npsf-3.7.2, psytec-freesoft, purdue-bsd, pybench, pycrypto, pygres-2.2, python-cwi, qlogic-microcode, qpopper,\nqualcomm-turing, quirksmode, radvd, red-hat-attribution, red-hat-bsd-simplified, reportbug, ricebsd, richard-\nblack, robert-hubley, rsa-1990, rsa-cryptoki, rsa-demo, rsa-md4, rtools-util, rute, ryszard-szopa, saas-mit, saf,\nsash, sata, sbia-b, scancode-acknowledgment, scanlogd-license, scansoft-1.2, scintilla, scribbles, script-asylum,\nsecret-labs-2011, service-comp-arch, sgi-cid-1.0, sgi-glx-1.0, sglib, shital-shah, simpl-1.1, softfloat, softfloat-2.0,\nsoftsurfer, sparky, speechworks-1.1, ssleay, ssleay-windows, stanford-pvrg, stlport-2000, stlport-4.5, stream-\nbenchmark, stu-nicholls, sun-rpc, sun-source, sunsoft, supervisor, svndiff, swig, symphonysoft, synopsys-mit,\nsynthesis-toolkit, takao-abe, takuya-ooura, tcg-spec-license-v1, tekhvc, tested-software, tex-live, things-i-\nmade-public-license, tiger-crypto, tigra-calendar-3.2, tigra-calendar-4.0, tim-janik-2003, timestamp-picker,\ntso-license, ttcl, ttyp0, tumbolia, twisted-snmp, ubc, unicode, unicode-icu-58, unicode-mappings, unlimited-\nbinary-use-exception, unpbook, us-govt-unlimited-rights, usrobotics-permissive, utopia, vcalendar, vince,\nvisual-idiot, visual-numerics, vixie-cron, w3c-03-bsd-license, westhawk, whistle, whitecat, wide-license, william-\nalexander, wingo, wol, wordnet, wrox, ws-addressing-spec, ws-policy-specification, ws-trust-specification,\nwtfnmfpl-1.0, wxwidgets, wxwindows-u-3.0, x11-acer, x11-adobe, x11-adobe-dec, x11-dec1, x11-dec2, x11-doc,\n57\n Under review as submission to TMLR\nx11-dsc, x11-hanson, x11-lucent-variant, x11-oar, x11-opengl, x11-quarterdeck, x11-realmode, x11-sg, x11-\nstanford, x11-tektronix, x11-x11r5, x11-xconsortium-veillard, xfree86-1.0, xmldb-1.0, xxd, yale-cas, 
yensdesign,\nzeusbench, zpl-1.0, zsh, zuora-software, zveno-research}\nNon-licenses\nThe following contributor license agreements, warranty disclaimers, and other license amend-\nments were not considered during license labeling: LicenseRef-scancode-{dco-1.1, generic-cla, google-cla,\njetty-ccla-1.1, newton-king-cla, generic-exception, generic-export-compliance, generic-tos, generic-trademark,\nwarranty-disclaimer}\nA.5\nPull Requests\nTable 24 shows the volume of PR renderings for various sequence lengths (measured in characters). We list\nthe volume of the base files for the top 20 languages in Table 25.\nA.6\nStackOverflow\nWe used the following prompt to\nBelow is an instruction from a user and a candidate\u2019s answer. Evaluate whether or not the answer is\na good example of how AI Assistant should respond to the user\u2019s instruction. Please assign a score\nusing the following 10-point scale:\n1: The response is entirely off-topic, contains significant inaccuracies, or is incomprehensible.\nIt fails to address the user\u2019s query in any meaningful way.\n2: The answer is largely irrelevant, vague, or controversial. It contains some elements that relate\nto the topic but misses the core of the user\u2019s question or includes substantial misinformation.\n3: The response is somewhat relevant but remains incomplete or contains elements that are\noff-topic or controversial. Key aspects of the user\u2019s query are left unaddressed.\n4: The answer addresses the user\u2019s question to some extent but lacks depth or clarity. It may be\nsomewhat helpful but is not comprehensive or detailed.\n5: The response is relevant and offers a basic answer to the user\u2019s question but lacks detail or\nspecificity. It\u2019s helpful but not fully developed or insightful.\n6: The answer is moderately helpful and addresses most aspects of the user\u2019s question. It might\nlack some depth or contain minor inaccuracies or irrelevant information.\n7: The response is quite helpful and addresses the user\u2019s query well, but it might not be from an\nAI Assistant\u2019s perspective. It could resemble content from other sources like blog posts or web pages.\n8: The answer is comprehensive and relevant, written from an AI assistant\u2019s perspective. It\naddresses the user\u2019s query effectively but may have minor areas for improvement in focus,\nconciseness, or organization.\n9: The response is almost perfect, providing a clear, comprehensive, and well-organized answer from an\nAI assistant\u2019s perspective. 
It might have very minor areas for improvement in terms of engagement or\ninsight.\n10: The answer is exemplary, perfectly addressing the user\u2019s query from an AI Assistant\u2019s perspective.\nIt is highly informative, expertly written, engaging, and insightful, with no discernible areas\nfor improvement.\n58\n Under review as submission to TMLR\nTable 24: Volume of the pull requests dataset\nwhen we restrict the sequence length.\nSeqlen (characters)\nVolume (GB)\n25000\n19.6\n50000\n38.7\n75000\n54.34\n100000\n67.31\n200000\n103.52\n300000\n126.8\n400000\n143.65\n500000\n156.76\n600000\n167.21\n700000\n175.94\n800000\n183.18\n900000\n189.32\n1000000\n194.58\nTable 25: Size of base files range of changes for\ntop 20 languages in Pull Requests.\nLanguage\nVolume (GB)\nPython\n13.46\nJavaScript\n9.55\nJava\n8.37\nMarkdown\n7.34\nC++\n5.89\nGo\n5.59\nJSON\n4.13\nTypeScript\n3.96\nC#\n3.76\nYAML\n3.1\nXML\n2.55\nC\n2.34\nHTML\n2.31\nRust\n2.27\nPHP\n2.09\nRuby\n1.73\nproject.pbxproj\n1.51\nScala\n1.25\nTSX\n1.2\nSwift\n0.9\nPlease write \"Score: \" in the last line, and then provide a brief reasoning you used to derive\nthe rating score.\nA.7\nKaggle Notebooks templates\nWe remove the following templates if they appear at the beginning of a Kaggle notebook:\nTEMPLATE_1 = \u2019# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-\npython\nimport numpy as np\n# linear algebra\nimport pandas as pd\n# data processing, CSV file I/O (e.g. pd.read_csv)\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the\ninput directory\nimport os\nfor dirname, _, filenames in os.walk(\"/kaggle/input\"):\nfor filename in filenames:\nprint(os.path.join(dirname, filename))\n# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output\nwhen you create a version using \"Save & Run All\"\n# You can also write temporary files to /kaggle/temp/, but they won\u2019t be saved outside of the current\nsession\u2019\nTEMPLATE_2 = \u2019# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-\npython\\n\u2019\n59\n Under review as submission to TMLR\nTable 26: Top 10 detected malware signatures.\nSignature\nCount\nSanesecurity.Malware.28845.BadVBS\n11876\nwinnow.compromised.ts.jsexploit.5\n2251\nSanesecurity.Malware.26492.JsHeur\n2247\nSanesecurity.Spam.8879\n1597\nSanesecurity.Malware.25834.JsHeur\n1560\nSanesecurity.Malware.27112.JsHeur\n1258\nSanesecurity.Malware.26222.JsHeur\n888\nPorcupine.Malware.52833\n814\nSanesecurity.SpamL.8887\n792\nSanesecurity.Malware.26557.JsHeur\n728\nTable 27: Top 10 languages by the number of\npotentially malicious files.\nLanguage\nCount\nText\n13281\nHTML\n11336\nJavaScript\n10210\nVBScript\n7947\nLogos\n3283\nMarkdown\n2736\nLinker Script\n1390\nXML\n1260\nVBA\n990\nJSON\n547\nB\nProcessing Pipeline\nB.1\nMalware removal\nWe show the top-10 detected malware signatures in Table 26 and the top-10 languages by potentially malicous\nfiles in Table 27.\nC\nData Composition\nC.1\nTheStackV2-train-smol\n\u2022 Configuration languages\n\u2013 Ant Build System\n\u2013 CMake\n\u2013 Dockerfile\n\u2013 Go Module\n\u2013 Gradle\n\u2013 INI\n\u2013 Java Properties\n\u2013 Makefile\n\u2013 Maven POM\n\u2013 TOML\n\u2022 Configuration files:\n\u2013 CMakeLists.txt\n\u2013 Cargo.toml\n\u2013 DESCRIPTION\n\u2013 Gemfile\n\u2013 Makefile\n\u2013 Makefile.am\n\u2013 NAMESPACE\n\u2013 
Package.swift
– Pipfile
– build.gradle
– build.gradle.kts
– composer.json
– conda.yml
– configure.ac
– docker-compose.yaml
– docker-compose.yml
– go.mod
– package.json
– pom.xml
– pyproject.toml
– requirements-dev.txt
– requirements-prod.txt
– requirements.in
– requirements.test.txt
– requirements.txt
– setup.cfg
– tsconfig.json
– yarn.lock

C.2 TheStackV2-train-full
In Table 28, we summarize the data volume for the subsampled languages.

Table 28: Subsampling volumes for languages in the Stack v2 dataset.

Final volume    Languages
200GB           Java, JavaScript
100GB           HTML
8GB             CSS, Java Server Pages, JSON, SCSS, Smali, XML, YAML
1GB             BibTeX, Gettext Catalog, Graphviz (DOT), Java Properties, Roff, Roff Manpage, Web Ontology Language

1 Introduction
Large Language Models for Code (Code LLMs; Chen et al., 2021; Nijkamp et al., 2023; Rozière et al., 2023; Guo et al., 2024) have rapidly emerged as powerful assistants for writing and editing code. As of January 30, 2024, GitHub Copilot has garnered over 1.3 million paying subscribers, with over 50,000 organisations opting for the enterprise version (MSFT Q2 Earning Call, 2024), estimated to increase developer productivity by up to 56% as well as developer satisfaction (Peng et al., 2023; Ziegler et al., 2024). ServiceNow recently disclosed that their "text-to-code" solution, built from fine-tuning StarCoderBase models (Li et al., 2023), results in a 52% increase in developer productivity (Yahoo Finance, 2024). Despite the initial focus on generating code snippets from natural language instructions or other code snippets, Code LLMs exhibit the potential to enhance all phases of the software development cycle (Hou et al., 2023; Fan et al., 2023; Wang et al., 2024; Zhuo et al., 2023b; Chai et al., 2023). This includes speeding up the implementation of new projects, improving quality assurance for developed software, helping detect and fix bugs, simplifying maintenance tasks, and easing migration to newer software.

The development process of LLMs can exhibit different levels of openness (Solaiman, 2023; Ding et al., 2022; Akiki et al., 2022). Proprietary models like OpenAI's GPT-4 (OpenAI et al., 2023) and Google's Gemini (Gemini Team et al., 2023) provide access to the model through a paid API but do not disclose development details. On the other hand, open-weight models like Code LLaMa (Rozière et al., 2023), Mistral (Jiang et al., 2023), and DeepSeekCoder (Guo et al., 2024) have released the model weights. This enables the open-source community to run these models locally, inspect the model representations, and fine-tune them on their tasks. However, the model developers have not disclosed their training data. Consequently, content creators do not know if their data was used for training, social scientists cannot scrutinize the dataset for bias and toxicity, and LLM developers lack information as to what extent the training set is contaminated with test benchmarks. More broadly, this practice hinders scientific progress as other research teams cannot readily reuse each other's training data.
Other LLM development projects, like Allen AI\u2019s OLMo (Groeneveld\net al., 2024), Eleuther AI\u2019s Pythia (Biderman et al., 2023), and BigScience\u2019s BLOOM (BigScience Workshop,\n2022; Scao et al., 2022a), have adopted a fully open development approach by releasing training data, training\nframeworks, and evaluation suites.\nThe BigCode project was established in September 2022 as an open scientific collaboration focused on the\nopen and responsible development of Code LLMs. BigCode is stewarded by ServiceNow and Hugging Face in\nthe spirit of open governance (BigCode collaboration et al., 2023) and has brought together more than 1,100\nmembers from diverse academic institutes and industry labs. The community previously released The Stack\nv1 (Kocetkov et al., 2023), a 6.4 TB dataset of permissively licensed source code in 384 programming languages.\nThe Stack v1 includes a governance tool called \u201cAm I in The Stack,\u201d designed for developers to verify if their\nsource code is included in the dataset. It also provides an opt-out process for those who prefer to exclude their\ncode from the dataset. In December 2022, the BigCode community released SantaCoder (Ben Allal et al.,\n2023), a strong-performing 1.1B parameter model trained on Java, JavaScript, and Python code from The\nStack v1. Building upon this success, the community further scaled up its effort and released StarCoder on\nMay 4th, 2023 (Li et al., 2023). At its release, the 15B parameter StarCoder model was the best open-access\nLLM for code.\nThis technical report describes the development process of The Stack v2 and StarCoder2. The Stack v2 builds\nupon the foundation of Software Heritage\u2019s vast source code archive, which spans over 600 programming\nlanguages. In addition to code repositories, we curate other high-quality open data sources, including Github\nissues, pull requests, Kaggle and Jupyter notebooks, code documentation, and other natural language datasets\nrelated to math, coding, and reasoning. To prepare the data for training, we perform deduplication, create\nfilters to eliminate low-quality code, redact Personally Identifiable Information (PII), remove malicious code,\nand handle opt-outs from developers who requested to have their code removed from the dataset. With this\nnew training set of 900B+ unique tokens, 4\u00d7larger than the first StarCoder dataset, we develop the next\ngeneration of StarCoder models. We train Code LLMs with 3B, 7B, and 15B parameters using a two-stage\ntraining process (Rozi\u00e8re et al., 2023; Guo et al., 2024). We start base model training with a 4k context\nwindow and subsequently fine-tune the model with a 16k context window. We ensure that the training\nprocess does not exceed more than 5 epochs over the dataset (Muennighoff et al., 2023). However, we push\n2 Under review as submission to TMLR\nthe number of training tokens far beyond the compute-optimal number suggested by Chinchilla (Harm\u2019s law;\nde Vries, 2023) and train relatively small models within the range of 3.3 to 4.3 trillion tokens. 
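As a quick sanity check on these budgets, using the 900B+ unique tokens mentioned above as a lower bound on the dataset size, the implied number of passes over the data stays within the 5-epoch limit. The snippet below is illustrative arithmetic only, not part of the training code.

```python
# Illustrative arithmetic: epochs implied by the stated token budgets,
# assuming roughly 900B unique training tokens (a lower bound, "900B+").
unique_tokens = 900e9

for total_tokens in (3.3e12, 4.3e12):  # range of training budgets reported above
    epochs = total_tokens / unique_tokens
    print(f"{total_tokens / 1e12:.1f}T tokens -> ~{epochs:.1f} epochs")

# 3.3T tokens -> ~3.7 epochs
# 4.3T tokens -> ~4.8 epochs  (both under the 5-epoch cap)
```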
We thoroughly\nassess and compare the performance of these models on a suite of code LLM benchmarks (Cassano et al.,\n2023b; Austin et al., 2021; Chen et al., 2021; Liu et al., 2023a; Lai et al., 2023; Muennighoff et al., 2024a;\nCassano et al., 2024; Liu et al., 2023b; Ding et al., 2023; Gu et al., 2024; Cobbe et al., 2021; Pearce et al.,\n2022; Dhamala et al., 2021; Nozza et al., 2021; Gehman et al., 2020), finding that:\n\u2022The StarCoder2-3B model outperforms other Code LLMs of similar size (StableCode-3B and\nDeepSeekCoder-1.3B) on most benchmarks. Moreover, it matches or surpasses the performance of\nStarCoderBase-15B.\n\u2022The StarCoder2-15B model significantly outperforms other models of comparable size (CodeLlama-\n13B), and matches or outperforms CodeLlama-34B. DeepSeekCoder-33B is the best model at\ncode completion benchmarks for high-resource languages. However, StarCoder2-15B matches or\noutperforms DeepSeekCoder-33B on low-resource programming languages (e.g., D, Julia, Lua,\nand Perl). Moreover, when we consider benchmarks that require models to reason about code\nexecution (Gu et al., 2024) or mathematics (Cobbe et al., 2021), we find that StarCoder2-15B\noutperforms DeepSeekCoder-33B.\n\u2022The StarCoder2-7B model outperforms CodeLlama-7B but is behind DeepSeekCoder-6.7B. It is not\nclear to this report\u2019s authors why StarCoder2-7B does not perform as well as StarCoder2-3B and\nStarCoder2-15B for their size.\n2 Data Sources\nIn this section, we elaborate on the process of obtaining training data, encompassing not just the data\nsourced from Software Heritage (\u00a72.1) but also GitHub issues (\u00a72.2), pull requests (\u00a72.3), Jupyter and Kaggle\nnotebooks (\u00a72.4), documentation (\u00a72.5), intermediate representations (\u00a72.6), small math and coding datasets\n(\u00a72.7), and other natural language datasets (\u00a72.8).\n2.1 Source Code\nSoftware Heritage We build the Stack v2 on top of the Software Heritage (SH) archive (Abramatic et al.,\n2018), maintained by the non-profit organization of the same name. The mission of Software Heritage is to\ncollect and preserve all knowledge taking the form of source code. We work with the SH graph dataset (Pietri\net al., 2020), a fully deduplicated Merkle DAG (Merkle, 1987) representation of the full archive. The SH\ngraph dataset links together file identifiers, source code directories, and git commits, up to the entire states\nof repositories, as observed during periodic crawls by Software Heritage.\nExtracting repositories We leverage the 2023-09-06 version of the SH graph dataset as the primary\nsource. We start by extracting the most recently crawled versions of all GitHub repositories and filtering\nthem to retain only the main branch. The branch is considered main if the repository metadata in GHArchive\nlists it as the default branch or if its name is mainormaster. We only extract the latest revision (commit)\nfrom the main branch and deduplicate the repositories based on the unique hashes of their contents (column\ndirectory_id of the SH dataset). The repositories\u2019 directory structure is reconstructed by recursively\njoining the directory_entry table of the dataset to itself using the directory_id and targetcolumns and\nconcatenating the directory and file names (column name) into full paths. We only traverse the directory tree\nup to level 64. 
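For illustration, the path-reconstruction step described above might look roughly like the sketch below. It is a simplified, in-memory approximation: the actual pipeline performs the self-join on the directory_entry table of the SH graph dataset at scale, whereas here the table is mocked as a plain dictionary, and the names `entries` and `reconstruct_paths` are hypothetical.

```python
# Hedged sketch: recursively expand a directory tree into full file paths,
# mirroring the join on directory_entry(directory_id, target, name) and
# stopping at depth 64 as described above. Names are illustrative only.
from typing import Dict, Iterator, List, Optional, Tuple

# directory_id -> list of (name, target directory_id, or None for a file)
DirectoryTable = Dict[str, List[Tuple[str, Optional[str]]]]

MAX_DEPTH = 64

def reconstruct_paths(root_id: str, entries: DirectoryTable,
                      prefix: str = "", depth: int = 0) -> Iterator[str]:
    """Yield full file paths under a root directory, up to MAX_DEPTH levels."""
    if depth > MAX_DEPTH:
        return
    for name, target in entries.get(root_id, []):
        path = f"{prefix}{name}"
        if target is None:      # leaf entry: a file
            yield path
        else:                   # subdirectory: recurse one level deeper
            yield from reconstruct_paths(target, entries,
                                         prefix=f"{path}/", depth=depth + 1)

# Toy example
entries: DirectoryTable = {
    "d0": [("src", "d1"), ("README.md", None)],
    "d1": [("main.py", None)],
}
print(list(reconstruct_paths("d0", entries)))  # ['src/main.py', 'README.md']
```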
The individual file contents are downloaded from the SH content S3 bucket if the compressed file size is less than 10MB.

License detection We extract repository-level license information from GHArchive (Github Archive, 2024) for all repositories with matching names in the SWH dataset. When the repo-level license is not available, i.e., for 96.93% of repositories, we use the ScanCode Toolkit (ScanCode, 2024) to detect file-level licenses as follows:

• Find all files that could contain a license using a regular expression in Appendix A.3. This allows us to gather files that either explicitly contain a license (e.g., LICENSE, MIT.txt, Apache2.0) or contain a reference to the license (e.g., README.md, GUIDELINES);
• Apply ScanCode's license detection to the matching files and gather the SPDX IDs of the detected licenses;
• Propagate the detected licenses to all files that have the same base path within the repository as the license file.

[Figure 1: File-level license assignment logic. The decision tree checks, in order: is the GitHub license empty? Is it permissive? Did ScanCode detect licenses? Are all detected licenses permissive? Each file is labeled permissive, non-permissive, or no license accordingly.]

Once the file-level license information is gathered, we decide whether the file is permissively licensed, non-permissively licensed, or unlicensed, following the algorithm described in Figure 1.

The licenses we consider permissive are listed in Appendix A.4. This list was compiled from the licenses approved by the Blue Oak Council (Blue Oak Council, 2024), as well as licenses categorized as "Permissive" or "Public Domain" by ScanCode (ScanCode License Categories, 2024).

Data licenses We consider three types of files: permissively licensed, non-permissively licensed (e.g., copyleft), and unlicensed files. The main difference between the Stack v2 and the Stack v1 is that we include both permissively licensed and unlicensed files. We exclude commercial licenses since their creators do not intend their code to be used for commercial purposes. We also exclude copyleft-licensed code due to uncertainty regarding the community's stance on using such data for LLM training and its relatively low volume.

Language detection While the Stack v1 (Kocetkov et al., 2023) detects programming languages by their file extension, we instead rely on a language classifier. Specifically, we use go-enry, based on GitHub's library linguist (go-enry, 2024), to detect the programming language for each file. We detect 658 unique languages in TheStackV2-dedup, some of which get removed at the data inspection stage (see next paragraph).

3 System Package Data Exchange, https://spdx.dev.

Table 1: A comparison of The Stack v1 and v2 on 32 popular programming languages. 
We show the size\nand number of files for different data splits: The Stack v1 deduped, The Stack v2 deduped, and the training\ndata used for StarCoder2-15B.\nThe-stack-v1-dedup The-stack-v2-dedup The-stack-v2-swh-full\nLanguage Size (GB) Files (M) Size (GB) Files (M) Size (GB) Files (M)\nAssembly 1.58 0.25 13.02 0.77 7.74 0.70\nBatchfile 0.29 0.25 2.11 1.13 1.02 0.99\nC 57.43 8.53 202.05 20.78 114.92 19.18\nC# 46.29 10.84 239.89 51.23 169.75 48.49\nC++ 50.89 6.37 353.89 43.18 211.33 42.23\nCMake 0.45 0.19 2.58 1.74 2.27 1.70\nCSS 22.61 2.99 161.68 23.87 8.00 1.88\nDockerfile 0.572 0.42 1.27 1.90 1.21 1.88\nFortran 0.17 1.84 4.66 0.27 3.61 0.26\nGo 25.74 4.73 54.60 9.30 25.83 8.62\nHaskell 2.36 0.54 5.11 1.25 4.17 1.23\nHTML 146.76 9.53 2,419.87 90.23 99.09 5.23\nJava 89.30 20.15 548.00 154.28 199.68 62.27\nJavaScript 141.65 21.11 1,115.42 108.87 199.99 66.91\nJulia 1.54 0.30 6.12 0.45 1.83 0.43\nLua 3.28 0.56 33.91 2.35 15.22 2.24\nMakefile 1.49 0.66 21.30 4.22 5.19 2.78\nMarkdown 75.25 21.0 281.04 82.78 244.17 81.42\nPerl 2.63 0.39 7.82 1.15 5.66 1.06\nPHP 66.84 15.90 224.59 46.03 183.70 45.14\nPowerShell 1.25 0.27 3.97 0.68 2.46 0.66\nPython 64.30 12.96 233.29 56.93 191.61 56.19\nR 0.30 0.04 22.39 5.15 19.05 4.29\nRuby 7.14 3.41 31.70 17.79 23.38 17.51\nRust 9.53 1.38 15.60 2.22 12.43 2.19\nScala 4.86 1.36 12.73 4.45 11.30 4.32\nShell 3.38 22.69 19.82 10.68 13.51 10.01\nSQL 12.22 0.99 281.45 5.29 35.75 4.52\nSwift 0 0 23.76 7.23 22.32 7.16\nTeX 5.44 0.55 35.86 3.19 30.01 2.86\nTypeScript 28.82 10.64 61.01 23.85 49.14 23.28\nVisual Basic 1.49 0.16 16.63 1.06 7.48 0.81\nTotal 875.85 181.00 6,457.14 784.30 1,922.82 528.44\nVisual data inspection Similar to the first StarCoder, we involve the BigCode community in a data\ninspection sprint to remove extensions with low-quality training data. We start from the annotations of the\nprevious iteration that eliminated 36 out of the 300 extensions (of the 86 included programming languages).\nFor StarCoder2, we only ran the data inspection for the not-yet-annotated programming languages (i.e.,\nexcluding the 86 languages of StarCoderBase). To streamline this process, we limited our inspection to\nextensions that include over 1,000 files and represent over 0.5% of the files in their respective languages. The\nremaining extensions were retained without further inspection, as they only make up a small volume. With\nthe help of 15 annotators from the BigCode community, we visually inspected around 1000 extensions and\nexcluded 130 (see appendix A.1 for the complete list). Our data inspection step excluded 39 programming\nlanguages from the dataset (appendix A.2), resulting in a final count of 619 programming languages.\nBasic filters We apply a set of basic filters to the dataset to remove autogenerated files, data files, or other\nlow-quality training data.\n5 Under review as submission to TMLR\n\u2022Long line filters : we first remove all files with more than 100k lines as those files are likely to be data\nor generated code. We also remove files with an average line length of more than 100 characters or\na maximum line length of more than 1000 characters for all languages, excluding HTML, JSON,\nMarkdown, Roff, Roff Manpage, SMT, TeX, Text, and XML. For the mentioned languages, we\nremove files where the longest line exceeds 100k characters.\n\u2022Autogenerated filter : we remove files classified as auto-generated by the is_generated function\nofgo-enry (go-enry, 2024). 
Additionally, we exclude files containing one of {\u201cauto-generated\u201d,\n\u201cautogenerated\u201d, \u201cautomatically generated\u201d, \u201cgenerated automatically\u201d, \u201cthis file is generated\u201d} in\nthe first 5 lines of the file.\n\u2022Alpha filter : we remove files with less than 25% of alphabetic characters for all languages except\nMotorola 68K Assembly and WebAssembly, where we only remove files with less than 25% of\nalpha-numeric characters due to the syntax of those languages.\n\u2022Encoded data filter : we detect files with inline encoded data using the following regular expressions:\n\u2013Base64 strings: [a-zA-Z0-9+/\\n=]{64,}\n\u2013Hexadecimal sequences: (?:\\b(?:0x|\\\\x)?[0-9a-fA-F]{2}(?:,|\\b\\s*)){8,}\n\u2013Unicode strings: (?:\\\\u[0-9a-fA-F]{4}){8,}\nWe remove the file if any of the substrings matching these expressions is longer than 1024 characters\nor if the fraction of matched characters is more than 50% of the file.\nLanguage-specific filters In addition to the basic filters, we apply the following set of language-specific\nfilters.\n\u2022For Text, JSON, YAML, Web Ontology Language, and Graphviz (DOT), we remove files with more\nthan 512 lines to minimize the impact of repeated tokens in data files.\n\u2022For HTML, we keep only the files where visible text is at least 100 characters long and makes up at\nleast 20% of the code, similar to the processing pipeline of StarCoder (Li et al., 2023).\n\u2022For Text, we keep only files with \u201crequirement\u201d in the lowercased filename, or if the filename without\nthe extension is one of {\u201creadme\u201d, \u201cnotes\u201d, \u201ctodo\u201d, \u201cdescription\u201d, \u201ccmakelists\u201d}.\n2.2 Github Issues\nWe incorporate GitHub issues collected from GHArchive (Github Archive, 2024). We exclude pull requests\nhere as we process them separately in \u00a72.3.\nA Github issue consists of a series of events with actions, such as opening the issue, creating a comment, or\nclosing the issue. Each event includes the author\u2019s username, a message, an action, and a creation date. We\nfollow the processing pipeline of StarCoder (Li et al., 2023), which we recap below:\n\u2022First, we removed auto-generated text when users replied to issues via email (for more information,\nsee Li et al., 2023, Appendix A). We also deleted issues with a short message (less than 200 characters)\nand truncated long comments in the middle to a maximum of 100 lines while retaining the last\n20 lines. This removed 17% of the volume \u2014 a similar percentage as in StarCoderBase.\n\u2022Next, we excluded comments from bots. To do so, we searched for keywords in the username of\nthe comment\u2019s author (for more information, see Li et al., 2023, Appendix A). This step eliminated\n3% of the issues, much less than the 17% reported in StarCoder (Li et al., 2023). This discrepancy\nis primarily because our dataset does not include pull requests, which are often the source of a\nsignificant proportion of bot-generated content.\n6 Under review as submission to TMLR\n\u2022We used the number of users engaged in the conversation as an indicator of quality. Our criterion was\nto include conversations that have two or more users. 
However, we also preserved conversations that\ninvolved a single user if the total text within comments was less than 7,000 characters (96th percentile).\nAdditionally, we excluded issues authored by a single user if they contained more than ten events, as\nthey tended to be of poor quality or originate from overlooked bots. By implementing these filters,\nwe removed 38% of the remaining issues. Lastly, we anonymized the usernames in the conversations\nby replacing them with a participant counter within the conversation (following the process of\nStarCoder).\n2.3 Pull Requests\nWe include code reviews by gathering pull request events from GHArchive (Github Archive, 2024) and the\ncorresponding source code from Software Heritage (Software Heritage, 2024b). Pull requests are requests to\nmerge particular code changes from one branch into another on GitHub. Typically, they involve multiple\nrounds of code review discussions and additional cycles of code changes before they get merged into the\ntarget branch.\nData collection Specifically, for each pull request, we aggregate the PullRequestEvent, PullRequestReview-\nEvent, PullRequestReviewCommentEvent, IssueCommentEvent, and IssuesEvent events found on GHArchive.\nMore details about the differences between these events can be found in the Github documentation. Next,\nwe extract all base and head commit IDs from these events and retrieve the corresponding code files from\nSoftware Heritage. As we do not have access to the commit diffs, we generate them by identifying changes\nbetween files at the same path. We consider files present in the base but absent in the head as deletions, while\nwe consider files absent in the base but present in the head as additions. This process yields approximately\n300M PRs, accompanied by a volume of 15 TB of base code. Among these, there are 215M closed PRs\noriginating from around 24M repositories.\nPR filters We remove PRs that 1) have been opened by bots, 2) consist only of comments by bots, 3) have\na non-permissive license, 4) have been opted out, 5) changes the base during the PR, 6) are not approved or\nmerged, or 7) lack initial diffs (either due to absent data from Software Heritage or because all data have\nbeen filtered in other steps).\nFile filters We remove files from the base commit if they satisfy one of the following conditions: 1) the\nfile is a deletion or addition, 2) the file length exceeds 1 million characters, 3) the fraction of alphanumeric\ncharacters is less than 0.25, 4) the fraction of hexadecimal characters is greater than 0.25, 5) the max number\nof lines surpasses 100,000, 6) the average line length exceeds 100, 7) the max line length surpasses 1,000, or\n8) the presence of non-English text in Markdown\nTitle and description filtering We apply the following heuristic filters to clean up the PRs further. We\nexclude PRs with changes to the base, those not approved or merged, and those lacking initial diffs (either\ndue to absent data from Software Heritage or being filtered out in previous steps). We also exclude PRs\nwhen the title is less than 10 characters or contains the words \u2019dependencies\u2019, \u2019dependency\u2019, \u2019depend\u2019, or\n\u2019release\u2019. We exclude PRs when the description is less than 20 characters or contains \u2019Qwiet\u2019.\nTruncating inputs We shorten lengthy input fields in the PRs as follows. We truncate titles to 500\ncharacters and descriptions to 80 lines, only displaying the first 60 and the last 20 lines. 
If the description\nlength still exceeds 1000 characters, we truncate it.\nProcessing comments Following the processing of GitHub issues (\u00a72.2), we remove comments from bots\nand strip auto-generated text when users post via email reply. We anonymize the usernames of authors as\ndescribed in \u00a73.2. We remove comments from PRs with less than 20 characters unless they are PR review\ncomments. For code review comments, we remove the full diff hunk if it exceeds 10,000 characters while\nkeeping the filename and comment.\n7 Under review as submission to TMLR\nSubsampling PRs To increase the diversity in the PRs, we sub-sample them on a per-repository basis.\nFor repositories with 1 PR (after filtering), we retain it with a probability of 0.8. We linearly decrease this\nretention probability to 0.1 for repositories with 1,000 PRs. For repositories with more than 1,000 PRs, we\nset the retention probability such that we retain only 100 PRs. Finally, we sub-sample YAML and JSON files\nwith 10% retention probability when their file size exceeds 50% of the total base files size or when the file\npath contains one of the keywords: \u2019pack\u2019, \u2019lock\u2019, \u2019yarn\u2019, \u2019output\u2019, \u2019swagger\u2019, \u2019openapi\u2019, or \u2019output\u2019.\nMax sequence length We determine the maximum sequence length of PRs by first investigating the\ndata distribution after the processing steps mentioned above. We find 3.7M PRs with up to 1M characters,\nresulting in 194 GB of data. This reduces to 3.3M PRs when we set a limit of 100K characters, resulting in a\ndataset size of 67.3 GB. (appendix A.5 has more details about sequence length statistics.) For the StarCoder2\nmodels, we opt to include PRs with up to 100K characters (translating to roughly 25k tokens). Since we\nare pre-training with a limited context of 4K tokens, not all PRs fit into the context window. However, as\ndescribed in \u00a75.2, we format the PRs so that the diffs are local and do not require long context.\n2.4 Notebooks\nWe include notebooks from two separate sources: Jupyter notebooks extracted from the Software Heritage\narchive and notebooks released by the Kaggle platform.\n2.4.1 Jupyter Notebooks\nWe transform Jupyter Notebooks into scripts and structured notebooks following the same pipeline as\nStarCoder (Li et al., 2023). One key difference is that we keep the markdown structure of the text blocks\nwhile it is removed in StarCoder. For completeness, we recap these preprocessing steps below.\nJupyter \u2013 scripts We utilize Jupytext4to convert notebooks to scripts. To initiate the conversion process,\nJupytext requires the identification of the specific programming languages within each notebook. This\ninformation is typically available in the metadata of most notebooks. In cases where it is not, we use the\nGuesslang library5to identify the programming language, using a probability threshold of 0.5 or higher. Our\ninitial dataset comprised 11 million notebooks, of which 3 million were excluded due to parsing errors. After\nnear-deduplication, the dataset was reduced to 4 million notebooks converted to scripts.\nJupyter \u2013 structured To create this dataset, we first filtered out notebooks that did not contain any\nPython code or Markdown text using the metadata information of each notebook. Only notebooks explicitly\nmarked as \u2018Python\u2019 in the metadata were kept. 
Then, for each notebook, consecutive Markdown blocks\nor code blocks were merged into a single Markdown or code block, respectively. Eventually, we ended up\nwith consecutive code-text pairs in temporal order grouped by each notebook. Each Jupyter code-text pair\ncontained the Markdown text immediately preceding the code block and the Python code, forming a natural\ninstruction pair. We also included the formatted output of a code block if the output cell was non-empty;\notherwise, it was marked by a special token. If consecutive code blocks have multiple output\ncells before merging, we only retain the output of the last code block. After these preprocessing steps and\nnear-deduplication, we ended up with 4.6M structured Jupyter notebooks.\n2.4.2 Kaggle Notebooks\nWe include Python notebooks released by the Kaggle platform6under an Apache 2.0 license, starting with an\ninitial dataset of 3.6M notebooks. Note that this Kaggle dataset does not include the output cells, only the\nmarkdown and code cells.\nCleaning We start the data cleaning process by dropping notebooks with less than 100 characters and\nthose with syntax errors. We also remove the templated text at the beginning of notebooks (see appendix A.7\n4https://jupytext .readthedocs .io/\n5https://guesslang .readthedocs .io/\n6https://www .kaggle .com/datasets/kaggle/meta-kaggle-code\n8 Under review as submission to TMLR\nfor the templates). These steps remove 18% of the notebooks. Next, we convert the notebooks to the\nstructured and script format, following the processing of the Jupyter notebooks in \u00a72.4.1. Finally, we remove\nnear-duplicates using the pipeline described in \u00a73.1, eliminating 78% of the notebooks and leaving us with\n580k notebooks.\nDataset description To provide the model with more context regarding the content and objectives of the\nnotebook, we include metadata about the Kaggle dataset whenever this information is available. We find\nthat 42% of the notebooks are associated with a Kaggle dataset and include its title and description at the\nbeginning of each notebook.\nDataset schema In addition to these high-level dataset descriptions, we scanned the code inside the\nnotebooks for instances of read_csv . We found that 25% of the samples were loading CSV datasets. We\nextracted and incorporated detailed information about these datasets as follows. First, we used the Kaggle\nAPI to download the datasets and successfully retrieved 8.6% of the notebooks. The remaining cases\nwere attributed to either the dataset being unavailable or encountering challenges downloading it within a\nreasonable time frame. For the downloaded datasets, we prefix the output of df.info() to the notebook,\nwhich displays the column names and their dtypes, the non-null values count, and the memory usage. We\nalso include four sample rows from the dataset.\n2.5 Documentation\nDocumentation from package managers We crawl documentation from several package manager\nplatforms, including npm, PyPI, Go Packages, Packagist, Rubygems, Cargo, CocoaPods, Bower, CPAN,\nClojars, Conda, Hex and Julia. We first retrieve the names of the most popular libraries across various\nplatforms from libraries.io. These library names are then used to search through individual package managers,\nenabling us to obtain the respective homepages for each library. We systematically crawled the documentation\nfiles from the obtained homepage links or, alternatively, extracted information from the provided README\nor documentation files on the platform. 
For documents obtained through homepage links, we adhere to the same processing strategy outlined below in the paragraph titled “Documentation from websites”. When extracting documents from the README or documentation files on the platform, we employ distinct heuristics to extract the text using markdown formats whenever feasible, aiming to maintain a simple and effective format. It is worth noting that many libraries available on PyPI and Conda have their associated documentation hosted on Read the Docs, which typically offers more comprehensive documentation. Consequently, we prioritize utilizing Read the Docs as the primary source of documentation for these libraries. For these documents hosted on Read the Docs, we follow the same processing procedure outlined in the paragraph titled “Documentation from websites”.

PDFs from package managers  For documents related to the R language, we extracted text from all PDF files hosted on CRAN using the pdftotext library.7 This library is particularly effective in preserving the formatting, including spaces within code snippets. For LaTeX-related documentation, we extracted the documentation, tutorial, and usage guide PDFs of LaTeX packages from CTAN, filtered out image-heavy PDFs, and converted the rest into markdown using the Nougat neural OCR tool.

Documentation from websites  We collect code documentation from a carefully curated list of websites as detailed in Table 2. We start by systematically exploring the website from its initial URL listed in Table 2, using a queue to store URLs within the same domain. This queue expands dynamically as we discover new links during the crawl. Given that most documents comprise HTML pages, we focus our processing pipeline on (1) content extraction and (2) content concatenation. To extract the content, we utilize the trafilatura library8 to convert each HTML page into XML format, simultaneously eliminating redundant navigation and index bars, elements that often recur in documentation. Next, we converted the XML format to markdown using our XML-to-Markdown conversion script. In the second stage, to compile these documents into a single text, we first do a near-deduplication of the content extracted from different HTML pages. This step was essential since we have observed that certain document pages only comprise website layouts (e.g., navigation bars) instead of fruitful information for documents, resulting in a substantial amount of duplicated content. To accomplish this, we treat each HTML page from a single website as a cluster and apply the minhash locality-sensitive hashing technique to identify and eliminate similar pages, using a threshold of 0.7. Finally, we assemble the gathered content from different pages of the same website in the order of web page crawling, ensuring a cohesive narrative. This parallels the “breadth-first search” approach, where all nodes at the current depth are explored before proceeding to the next depth level.

7 https://github.com/jalan/pdftotext
8 https://github.com/adbar/trafilatura

Figure 2: The distribution of the top 20 programming languages in our crawled documentation collection.
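To make the page-level near-deduplication concrete, below is a minimal sketch of grouping near-duplicate HTML pages within one website using MinHash LSH at the 0.7 threshold described above. The use of the datasketch library and word-level 5-gram shingles are our own illustrative assumptions, not necessarily the exact tooling of the pipeline.

from datasketch import MinHash, MinHashLSH

def page_signature(text, num_perm=128):
    # Build a MinHash signature from word 5-grams of the extracted page text.
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def dedup_website(pages, threshold=0.7):
    # pages: {url: extracted_text}, iterated in crawl (breadth-first) order.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for url, text in pages.items():
        sig = page_signature(text)
        if lsh.query(sig):        # a near-duplicate page was already kept
            continue
        lsh.insert(url, sig)
        kept.append(url)
    return kept

The pages retained by such a routine are then concatenated in crawl order, matching the breadth-first assembly described above.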
Also, we collected code-relevant\ndata from existing web crawls such as RefinedWeb (Penedo et al., 2023), OSCAR (Ortiz Su\u00e1rez et al.,\n2019), and esCorpius (Guti\u00e9rrez-Fandi\u00f1o et al., 2022). We use regular expressions to identify programming\nlanguage-specific constructs within the documents and to detect the \u201cdocs.\u201d substring in the page URLs.\nThe resulting dataset primarily comprises content sourced from programming blogs, coding tutorials, and\nplatforms like Read the Docs, with the exclusion of the documents gathered above.\nFree textbooks We scraped free programming books compiled in the Free Programming Books project,\nwhich aims at promoting the distribution of free programming e-books. First, we extract all links and identify\nthose with a PDF extension. Subsequently, we downloaded all available PDF files and utilized the pdf2text\nlibrary to extract text from these PDF files. Finally, we parsed 3,541 books whose languages span across\ndifferent regions, including English, Chinese, Japanese, Spanish, and others.\nLanguage identification Finally, we have employed a dual approach to identify the main programming\nlanguage used by each document. We leverage predefined rules when the source of the document unequivocally\ncorresponds to a specific programming language and resort to the guesslang9library in cases where such\ncorrespondence is not explicit. The resultant programming language distribution is graphically represented in\nFigure 2.\n2.6 Intermediate Representations\nWe augment source code by pairing its intermediate representations (IR) to enhance the model\u2019s understanding\nof low-resource programming languages. The key rationale behind this approach is that a shared intermediate\n9https://github .com/yoeo/guesslang\n10 Under review as submission to TMLR\nTable 2: The websites scraped for the code documentation dataset.\nWebsite Name URL\nDevDocs API Documentation https://devdocs .io\nMDN Web Docs https://developer .mozilla .org\nTensorFlow Docs https://www .tensorflow .org\nLinux Docs https://www .kernel .org/doc/Documentation\nSwift Programming Language https://docs .swift .org/swift-book/documentation/the-swift-programming-language\nFlutter API Reference https://api .flutter .dev\nTypeScript https://www .typescriptlang .org/docs/handbook\nJson.NET Documentation https://www .newtonsoft .com/json/help/html\nNVIDIA Documentation Hub https://docs .nvidia .com\nOracle Java Tutorial https://docs .oracle .com/javase/tutorial/java\nQiskit Documentation https://qiskit .org/documentation\nQ# Quantum Programming https://learn .microsoft .com/en-us/azure/quantum/user-guide\nPony Tutorial https://tutorial .ponylang .io\nZephir Documentation https://docs .zephir-lang .com/0 .12/en/introduction\nQemu Documentation https://www .qemu .org/documentation\nC# Documentation https://learn .microsoft .com/en-us/dotnet/csharp\nHugging Face Documentation https://huggingface .co/docs\nLLVM Doc https://llvm .org/docs\nGCC Online Documentation https://gcc .gnu.org/onlinedocs\nMatlab Documentation https://www .mathworks .com/help/matlab\nBoost C++ Libraries https://www .boost .org/doc\nMaxima Manual https://maxima .sourceforge .io/docs/manual/maxima_singlepage .html\nQt Documentation https://doc .qt.io\nrepresentation might help to anchor low-resource constructs to similar ones in high-resource languages (Zhuo\net al., 2023b).\nLLVM We select LLVM (Lattner & Adve, 2004) as the intermediate representation due to its widespread\navailability on GitHub, increasing the 
probability that there is sufficient training data to learn the semantics\nof the language. In addition, LLVM is widely adopted as an IR and is the target representation of many\ncompiler frontends across several programming languages.10\nData collection Existing attempts to extract IR from free-form source code either suffer from low\ncompilation success rates (Szafraniec et al., 2023) or use bespoke language-specific mechanisms to track\ndependency code to compile successfully (Grossman et al., 2023). We sidestep this by sourcing self-contained\ncompilation units from accepted solutions to programming word problems (Rosetta Code, 2023; Mirzayanov,\n2020; Puri et al., 2021; Caballero et al., 2016). We compile \u22484M sources in total across C++, C, Objective-C,\nPython, Rust, Go, Haskell, D, Fortran, Swift, and Nim in size optimized ( -OZequivalent) and performance\noptimized ( -O3equivalent) mode. We opt to use the size-optimized IR in most of the pairs due to context\nlength considerations. However, for 20% of the pairs, we use the performance-optimized IR. This is done to\nmaximize transfer from the pre-training stage, where the model sees LLVM code in the wild, which is more\nlikely to be in this form. We use clang11for compiling C++, C and Objective-C, codon12for compiling\nPython, rustc13for compiling Rust, gollvm14for compiling Go, ghc15for compiling Haskell, ldc16for\ncompiling D, flang17for compiling Fortran, and nlvm18for compiling Nim. We clean headers along with\nsuperfluous platform, vendor, and memory layout-specific information from the IR before pairing it with its\nsource.\n10https://llvm .org/ProjectsWithLLVM/\n11https://clang.llvm.org/\n12https://docs.exaloop.io/codon\n13https://www.rust-lang.org/\n14https://go.googlesource.com/gollvm/\n15https://www.haskell.org/ghc/\n16https://wiki.dlang.org/LDC\n17https://flang.llvm.org/docs/\n18https://github.com/arnetheduck/nlvm\n11 Under review as submission to TMLR\n2.7 LHQ19\nWe include several small high-quality datasets for math and coding:\n\u2022APPS (train) (Hendrycks et al., 2021) is a popular text2code benchmark in Python with a train\nset of 5,000 examples. We include one solution per programming problem.\n\u2022Code Contest (Li et al., 2022) is similar to APPS but includes solutions in several programming\nlanguages, namely Python 2/3, C++, and Java. We include one solution per problem and language\nand arrive at a dataset of 13k+ examples.\n\u2022GSM8K (train) (Cobbe et al., 2021) is the train split of GSM8K, a popular evaluation benchmark\nfor testing the math reasoning capabilities of LLMs. The dataset consists of 7k+ examples.\n\u2022GSM8K (SciRel) (Yuan et al., 2023) is an augmented version of GSM8K that includes alternative\nreasoning paths for the questions in GSM8K. The extended version contains 110k examples.\n\u2022Deepmind Mathematics (Saxton et al., 2019) is a synthetic dataset of math questions and\nanswers across various domains (algebra, arithmetic, calculus, comparison, measurement, numbers,\npolynomials, probability) and varying difficulty (easy-medium-hard). The dataset consists of 110M+\n(short) examples.\n\u2022Rosetta Code (Rosetta Code, 2023; Nanz & Furia, 2015) is a dataset with over 1100 everyday\nprogramming tasks with solutions in as many different programming languages as possible.\n\u2022MultiPL-T (Cassano et al., 2023a) is high-quality data in Lua, Racket, and OCaml based on\nautomatically translating extracted Python functions and validating them with unit tests. 
The total dataset comprises over 200k examples.

• Proofsteps is part of the AlgebraicStack (Azerbayev et al., 2024), a dataset used to train the Llemma family of models. We also include proofsteps-lean, which was extracted from mathlib 4 (mathlib Community, 2020), and proofsteps-isabelle, which was built on top of the PISA dataset (Jiang et al., 2021). Proofsteps-lean contains over 3k examples, while proofsteps-isabelle contains over 250k examples.

2.8 Other Natural Language Datasets

StackOverflow  We include 11 million questions and their corresponding multiple responses from the Stack Overflow dump dated 2023-09-14 (StackExchange Archive, 2024). We filtered out questions with fewer than three answers. Upon inspecting the dataset, we found many mismatches between questions and answers due to inherent format errors in the Stack Overflow dump. We leveraged Llama-2-70b-chat-hf (Touvron et al., 2023) to increase the quality of the dataset as follows. We selected 20,000 examples and asked Llama-2-70b-chat-hf to rate the question-answer pairs (see Appendix A.6 for the exact prompt). Next, we picked the 10,000 highest-scoring pairs as positive examples and used the remaining 10,000 answers to create negative examples by randomly pairing them with other questions. We used this dataset to train a binary classifier by embedding the question and answer with a well-performing sentence embedding model (sentence-transformers/all-MiniLM-L12-v2;20 Reimers & Gurevych, 2019; Muennighoff et al., 2022a) and minimizing the cosine distance between them. Next, we plotted the embedding scores for a subset of the question-answer pairs and manually set the threshold to 0.1. As a question can have multiple answers, we average the scores of its question-answer pairs and remove all questions with an average score below 0.1. We end up with 11.4 million questions and over 10B tokens.

ArXiv  We include the ArXiv subset of the RedPajama dataset (Together Computer, 2023). This dataset is downloaded from the publicly available Amazon S3 bucket (Arxiv, 2024). We further processed the dataset to retain only LaTeX source files and to remove preambles, comments, macros, and bibliographies from these files. The final dataset is roughly 30B tokens.

19 Leandro's High-Quality dataset
20 https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
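As an illustration of the arXiv preprocessing described above, the following minimal sketch removes comments, the preamble, simple macro definitions, and bibliography material from a LaTeX source file. The specific regular expressions are our own simplifying assumptions rather than the exact rules used in the pipeline.

import re

def clean_latex_source(tex):
    # Drop LaTeX comments: '%' to end of line, unless the '%' is escaped.
    tex = re.sub(r"(?<!\\)%.*", "", tex)
    # Keep only the document body when a preamble is present.
    body = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", tex, flags=re.S)
    if body:
        tex = body.group(1)
    # Remove simple user macro definitions.
    tex = re.sub(r"\\(?:newcommand|renewcommand|def)\b.*", "", tex)
    # Remove bibliography environments and commands.
    tex = re.sub(r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}", "", tex, flags=re.S)
    tex = re.sub(r"\\bibliography(?:style)?\{[^}]*\}", "", tex)
    return tex.strip()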
We do not deduplicate\nthe already preprocessed natural language datasets, such as Arxiv, StackExchange, OpenWebMath, Wikipedia,\nand the small high-quality math and reasoning datasets.\nWe followed the deduplication pipeline of SantaCoder (Ben Allal et al., 2023). This process first calculates\nthe MinHashes (Broder, 2000) of all code files and then utilizes Locally Sensitive Hashing (LSH) to group\nfiles based on their MinHash fingerprints. During the LSH stage, \u201csimilar\u201d files are assigned to the same\nbuckets, identifying them as duplicates. Only one file from each duplicate group is chosen. In addition to the\nSantaCoder approach, to preserve repository context, we prioritize files from repositories with higher star\nand fork counts or from the latest commit date as a tiebreaker. We used 5-grams and a Jaccard similarity of\n0.7. We refer to this blogpost for more background information regarding the deduplication pipeline.\n3.2 PII Redaction\nTo reduce the likelihood of re-distributing Personally Identifiable Information (PII) present in the training data,\nwe make diligent efforts to redact PII from the training set. We largely follow the steps from StarCoder (Li\net al., 2023) and leverage the StarPII model to redact various PII entities. Below, we provide more details on\nhow we apply it to each data source.\nRedacting PII entities We use StarPII to redact names, emails, keys, passwords, IP addresses, and\nusernames from source code, pull requests, issues, and StackOverflow. We do not make any modifications\nto the model or redaction logic described in the StarCoder paper (Li et al., 2023). For OpenWebMath and\ndocumentation, we only redact names, keys, and emails, while we only redact emails for arXiv using the regex\ndescribed in Ben Allal et al. (2023).\nRedacting usernames The conversations in issues, pull requests, and StackOverflow often contain\nusernames in the message thread. We anonymize the author usernames by substituting them with a\nparticipant counter specific to the conversation, like username_1 to represent the second participant. These\npseudonyms are added at the start of each comment to maintain the speaker\u2019s identity. Moreover, any\nreferences to these usernames in the messages are removed. Only the usernames of actively participating\nindividuals in the conversation are masked, and mentions of non-participating users remain unaffected.\n3.3 Decontamination\nTo ensure the performance of StarCoder is not artificially inflated on our test benchmarks, we decontaminate\nthe training set from our test sets. Specifically, we remove files that contain docstrings or solutions from\nHumanEval and MBPP, docstrings from APPS, questions from GSM8K, or prompts from DS1000. In contrast\n13 Under review as submission to TMLR\nTable 3: Overview of the data processing steps applied to each data source.\nDataset Dedup Malicious Code Decontaminate Opt-out PII\nSource Code Yes Yes Yes Yes StarPII\nPull Requests Yes Yes Yes Yes StarPII + Usernames\nJupyter/Kaggle Notebooks Yes Yes Yes Yes/No StarPII\nIssues Yes Yes Yes Yes StarPII + Usernames\nDocs Yes No No No StarPII: Names, Keys, Emails\nLHQ No No No No No\nArxiv No No No No Email\nOpenWebMath No No Yes No StarPII: Names, Keys, Emails\nWikipedia No No No No No\nStackExchange No No Yes No StarPII + Usernames\nto the first iteration of StarCoder (Li et al., 2023), we further enhance the recall of the decontamination\nprocess by removing whitespace during string matching. 
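The matching step itself is simple; below is a minimal sketch of whitespace-insensitive substring matching of the kind used for decontamination, written as our own illustration rather than the released pipeline code.

import re

_WHITESPACE = re.compile(r"\s+")

def _normalize(text):
    # Remove all whitespace so that formatting differences do not hide matches.
    return _WHITESPACE.sub("", text)

def is_contaminated(file_content, benchmark_strings):
    # Flag a training file if it contains any benchmark docstring, solution,
    # question, or prompt once whitespace is stripped from both sides.
    haystack = _normalize(file_content)
    return any(_normalize(s) in haystack for s in benchmark_strings if s.strip())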
Note that we exclude docs, LHQ, arXiv, and\nWikipedia from this decontamination step.\n3.4 Malware Removal\nWe scan our training set to identify possible instances of malware in the source code, pull requests, notebooks,\nand issues. To this end, we use ClamAV 1.2 (ClamAV, 2024) with additional unofficial malware signatures\npublished by SaneSecurity (Sane Security, 2024) as of 2023-11-16. Signatures with a high risk of False\nPositives (as determined by SaneSecurity) were not used. See Table 26 for the most frequently detected\nmalware signatures in the unfiltered code dataset. In summary, this step eliminates 59,442 files from the\ndataset, constituting only 0.009% of the 654M files.\n3.5 Removing Opt-outs\nWe announced the upcoming training run of StarCoder2 on X21and updated the \"Am I in the stack\"\ngovernance tool with the new repositories from The Stack v2. Developers were granted until November 20,\n2023, to submit their opt-out requests. After the cut-off date, we eliminated 1,561 repositories associated\nwith 91 users and organizations. A total of 22,066 files were removed from the source code dataset (excluding\nissues and PRs).\n4 Data Composition\nModel capacity With a much larger training set available, we decided to tailor our data composition to\neach model size. We reason that smaller models, having limited capacity, should be exposed to a less diverse\ndataset. This intuition is supported by research in multi-lingual NLP showing that languages compete for\nmodel capacity (Arivazhagan et al., 2019; Conneau et al., 2020; Scao et al., 2022b). Hence, we first create a\nsmaller version of the SWH code dataset, selecting a subset of 17 widely-used programming languages. We\nuse this variant to train the 3B and 7B models, whereas we use the full version with all 619 programming\nlanguages for the 15B model. To further limit the diversity in the training set for the 3B model, we also\nexclude some natural language datasets (see \u201cData composition per model size\u201d).\nDownsampling languages Similar to StarCoderBase, we adhere to the natural distribution of the data as\nmuch as possible. Before constructing the source code datasets, we examined the data distribution among\nthe programming languages. Compared to StarCoderBase, we found slightly larger variations among the\nhigh-resource languages. The observed data volume (in GB) is as follows: Java (479.68), JavaScript (277.25),\nC++ (204.49), Python (190.99), PHP (171.57), C# (166.22), and C (114.49). We decided to downsample both\nJava and Javascript to 200GB to put these high-resource languages on a more equal footing. Furthermore, we\n21https://x .com/BigCodeProject/status/1721583097580249254?s=20\n14 Under review as submission to TMLR\nTable 4: Overview of the data composition of StarCoder2 models. We refer to the training set of the 3B\nmodel as the-stack-v2-train-3B.\nDataset Tokens (B) 3B 7B 15B\nthe-stack-v2-train-smol 525.5 \u2713 \u2713 \u2717\nthe-stack-v2-train-full 775.48 \u2717 \u2717 \u2713\nPull requests 19.54 \u2713 \u2713 \u2713the-stack-v2-train-extrasIssues 11.06 \u2713 \u2713 \u2713\nJupyter structured 14.74 \u2713 \u2713 \u2713\nJupyter scripts 16.29 \u2713 \u2713 \u2713\nKaggle scripts 1.68 \u2713 \u2713 \u2713\nDocumentation 1.6 \u2713 \u2713 \u2713\nOpenWebMath 14.42 \u2717 \u2713 \u2713\nWikipedia 6.12 \u2717 \u2713 \u2713\nStackOverflow 10.26 \u2713 \u2713 \u2713\nArxiv 30.26 \u2717 \u2713 \u2713\nLHQ 5.78 \u2713 \u2713 \u2713\nIntermediate Repr. 
6 \u2713 \u2713 \u2713\nUnique tokens (B) 622.09 658.58 913.23\npreserved 254GB of markdown data while reducing the size of HTML to 100 GB. This decision was driven by\nthe anticipation that markdown would likely contain more code documentation, whereas HTML is commonly\nassociated with webpages. Lastly, we subsampled data files like JSON, XML, and YAML to 8GB and a few\nother data formats to 1 GB. See Table 28 in Appendix C.2 for the full list of subsampled languages.\nRepository-context After subsampling some programming languages, we compile the source code from\nSoftware Heritage into repository-context-aware datasets. Each example in the dataset is a full repository\nwith files arranged in a random order. As previously noted, we create two versions of the SWH dataset,\nthe-stack-v2-train-smol andthe-stack-v2-train-full , as further detailed in the subsequent paragraphs.\nThe-stack-v2-train-smol For the small variant, we select 17 widely used programming languages and\ninclude a curated set of documentation and configuration languages.\n\u2022Specifically, we include the following programming languages:\n\u2013C\n\u2013C#\n\u2013C++\n\u2013Go\n\u2013Java\n\u2013JavaScript\u2013Kotlin\n\u2013Lua\n\u2013PHP\n\u2013Python\n\u2013R\n\u2013Ruby\u2013Rust\n\u2013SQL\n\u2013Shell\n\u2013Swift\n\u2013TypeScript\n\u2022And incorporate the following languages associated with code documentation:\n\u2013AsciiDoc\n\u2013HTML\n\u2013Markdown\u2013RDoc\n\u2013RMarkdown\u2013Text\n\u2013reStructuredText\n\u2022We also include several configuration languages and files, which we list in Appendix C.1.\n\u2022Despite limiting the languages to this subset, we obtain a dataset of 525B+ unique tokens.\nThe-stack-v2-train-full For the full variant, we include all 619 programming languages. Although this\nsubset significantly enhances language diversity (adding 600+ programming languages), it contributes only\naround 250B tokens to the dataset, culminating in 775B+ tokens.\n15 Under review as submission to TMLR\nData composition per model size In Table 4, we summarize the data composition for the 3B, 7B,\nand 15B models. We use the-stack-v2-train-extras to denote all supplementary sources gathered for\nStarCoder2, excluding the source code obtained from SWH. For the 3B, we use the-stack-v2-train-smol\nand exclude OpenWebMath, Wikipedia, and Arxiv from the extra data sources in \u00a72. This leads to a dataset\nof 622B+ unique tokens. For the 7B, we include OpenWebMath, Wikipedia, and Arxiv, leading to a slightly\nlarger dataset of 658B+ unique tokens. For the 15B, we include the-stack-v2-train-full dataset and all\nextra data sources listed in \u00a72, resulting in a dataset with 913B+ unique tokens. The size of this dataset is\n4\u00d7the size of the training dataset for StarCoderBase.\n5 Data Formatting\nWe present the formatting guidelines for each of the data sources below. We provide the templates below\nin which\u27e8token\u27e9refers to a sentinel token, and metadata and data refer to placeholders for data fields,\nrespectively.\n5.1 Source Code\nWe prepend the repository name and file paths to the context of the code file. We only add this metadata\nwith a 50% probability to enable the model to operate without this information. We use the following format\nwhen adding the repository name and file paths:\nreponamefilepath1\\ncode1filepath2\\ncode2 ... <|endoftext|>.\nWe use the following format when we do not include this meta-data:\ncode1code2 ... 
<|endoftext|>.

Repository-context  StarCoder1 was trained with file-context, i.e., the setting where random files are joined into the context window. In this work, we explore training with repository-context, wherein files from the same repository are grouped together. While we considered various methods for grouping files within the repository, we ultimately arranged them in a random order within the same repository.

FIM  To enable the model to perform code infilling tasks, we apply the fill-in-the-middle transformation (FIM; Bavarian et al., 2022) to the source code. While we explored several FIM variants in preliminary experiments, we opted for repo-context file-level FIM in the StarCoder2 models. In this FIM variant, repositories are selected with a 50% chance of being candidates for FIM. The selected repository examples are split by <|endoftext|> and tokens. Next, we apply the FIM transformation to each chunk with a 50% probability. We do not apply FIM to the repository metadata ( reponame ). Below, we provide an example of the FIM format when it's only applied to the second source file:

reponamefilepath0\ncode0filepath1\n
code1_precode1_sufcode1_mid ...<|endoftext|>

5.2 Pull Requests

Formatting pull requests is challenging as we aim to create a compact representation of a potentially long sequence of code changes and comments. We refer to §2.3 for details on how we removed and truncated long input fields of the pull request. Here, we focus on how to render the PR into a structured format that can be consumed by the LLM.

For files part of the base commit, we include the entire file with 0.2 probability; otherwise, we display a range of changes in the base files across all commit heads of the PR.22 We randomly add up to 32 lines before and after the changes.

22 We take the union of file line changes in all commits

We use diff hunks to display modifications between the before and after state of the file, ensuring that changes are reasonably localized. Additionally, within the diff hunks, we incorporate 3–10 randomly selected context lines both before and after the specific change.

We structure the PR format as follows. The first block presents the title, description, and complete base files or modifications made to them. Subsequently, we outline the first set of head diff hunks:

Title: title\nusername_0: description
opened
reponame

filepath_1
file_content/changes_1
...
filepath_N
file_content/changes_N

filepath_1
diff_hunk_1
...
diff_hunk_K
...
filepath_M
diff_hunk_1
...
diff_hunk_J

The second block is repeated for each new head commit in the PR, covering general comments, review comments, and code review comments. The block concludes with the diff hunks between the pull request base and the new head, reflecting the outcome of discussions and comments. Note that it's also possible for users to close and reopen the pull request.
As in Github issues, we refer to authors by their participant\ncounter within the conversation, e.g., username_1, to refer to the second participant in the issue.\nusername_id: comment\ncomment_id\n...\n...\n...\nusername_id: review_comment\\n\nreview_id\n[approved, rejected, commented, changes_required]\n...\n...\n...\n\ncomment_id\nreview_id (opt)\ncomment_id (opt)\nfilepath\nline_number\ndiff_hunk_content\nusername_id: comment\n17 Under review as submission to TMLR\n...\n...\n...\nusername_id\nclosed\nFalse\n...\nTitle: title\\nusername_id: description\n[opened, reopened, edited]\n...\n...\n...\nfilepath_1\ndiff_hunk_1\n...\ndiff_hunk_K\n...\nfilepath_M\ndiff_hunk_1\n...\ndiff_hunk_J\nWe only add the following final block when the PR is closed.\nusername_id\nclosed\nTrue\n<|endoftext|>\n5.3 GitHub Issues\nWe use sentinel tokens to mark the opening of an issue and subsequently include its title. We separate the\nsequence of comments by a token and include an anonymized speaker identifier before\nthe comment. Specifically, we refer to authors by their participant counter within the conversation, e.g.,\nusername_1, to refer to the second participant in the issue. To distinguish between the different turns,\nwe use comment_1, id1 to refer to the second comment and its anonymized speaker id, respectively. The\n token is added if the issue is closed.\nTitle: title\\nusername_id0: comment_0username_id1: comment_1\n... username_idn: comment_n<|endoftext|>\n5.4 Notebooks\nJupyter \u2013 scripts We format Jupyter scripts as a single code block, starting with a \ntoken.\ncode<|endoftext|>\nJupyter \u2013 structured Parsed Jupyter notebooks are chains of text, code, and outputs. We separate the\ncells with sentinel tokens. Note that we use text2, code2, output2 to refer to the 3rd triplet in the notebook.\ntext0code0\noutput0 ... <|endoftext|>\n18 Under review as submission to TMLR\nKaggle \u2013 scripts When available, we prepend the associated dataset title and description to Kaggle\nnotebooks (42% of the samples). For 8.6% of the notebooks, we add granular information on the dataset\u2019s\nschema. Below is the format we use:\ntitle\\ndescription\\nKaggle dataset identifier: data_identifier\nimport pandas as pd\\n\\ndf = pd.read_csv(data_path1)\\ndf.info()\ndf_info_output1\nExamples:\\nexample1_1\\n..example1_4\n...\ncode<|endoftext|>\nSome notebooks might load more than one csvfile, so we repeat the blocks of data information content for\nall files.\nNote that we introduce a new special token to append the final script of the converted\nKaggle notebook. This token helps differentiate the script, which is usually long, from code that follows\n token, typically shorter.\nKaggle \u2013 structured Structured Kaggle notebooks are similar to structured Jupyter notebooks, except\nthat they don\u2019t have an output cell, so we only include text and code blocks and keep the tokens used in\nJupyter Notebooks:\ntext0code0 ... <|endoftext|>\n5.5 StackExchange\nWe concatenate questions and answers in the StackOverflow dataset using a format similar to the GitHub\nissues. We start with the question and then add answers in random order. We include the upvote score\nalongside the answer and, if applicable, denote it as the selected answer. 
Note that we do not have the title\nof the conversations for the StackExchange dataset.\nusername_id0: question\nusername_id1: answer_1\\nUpvotes: score [selected answer](Optional)\n...\nusername_idn: answer_n\\nUpvotes: score [selected answer](Optional)<|endoftext|>\n5.6 Intermediate Representations\nWe split 50/50 between translating from source code to intermediate representation ( code->intermediate )\nand vice-versa ( intermediate->code ). Regarding the intermediate representation, we use the size-optimized\nversion 80% of the time and the performance-optimized version 20% of the time. We use separate sentinel\ntokens to indicate the direction of the translation.\ncodeintermediate_representation\nintermediate_representationcode\n6 Model architecture and training details\nIn this section, we provide all details regarding the model architecture (\u00a76.1), tokenizer (\u00a76.2), training details\n(\u00a76.3), and CO 2emissions during training (\u00a76.4).\n23Estimated with 6ND, where N is the number of parameters and D is the number of training tokens. Includes base and\nlong-context training.\n19 Under review as submission to TMLR\nTable 5: Overview of the sentinel tokens.\nToken Description\n<|endoftext|> end of text/sequence\n FIM prefix\n FIM middle\n FIM suffix\n FIM pad\n repository name\n file separator\n start of GitHub issue\n start of GitHub issue comment\n GitHub issue closed event\n start of Jupyter notebook\n start of Jupyter text cell\n start of Jupyter code cell\n start of Jupyter output cell\n start of Jupyter script (converted kaggle notebook)\n output cell without content\n translate source code to intermediate representation\n translate intermediate representation to source code\n start of pull request\n status of pull request\n whether pr is merged\n start of list of base files\n path of pull request file\n code that is part of the base commit in the PR\n start of a diff\n diff hunk\n general comment\n GitHub id of review comment or code review comment\n start of review\n review state (e.g. approved, rejected)\n code review comment\n GitHub event id of review\n GitHub event id of comment\n line number of code review comment\n6.1 Model Architecture\nWe introduce a few architectural changes compared to StarCoderBase. First, we replace learned positional\nembeddings with Rotary Positional Encodings (RoPE; Su et al., 2021), as we confirmed significant performance\ngains in a preliminary ablation study. Following DeepseekCoder (Guo et al., 2024) and Code LLaMA (Rozi\u00e8re\net al., 2023), we use a base period \u03b8= 1e5. The second architectural modification we make is replacing\nMulti-Query Attention (MQA; Shazeer, 2019) with Grouped Query Attention (Ainslie et al., 2023, GQA;\n). 
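To make this change concrete, here is a minimal sketch of grouped-query attention in which a small number of key-value heads is shared across groups of query heads; RoPE and masking details are omitted, and the module is our own illustration rather than the actual training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_dim, n_heads, n_kv_heads):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = hidden_dim // n_heads
        self.q_proj = nn.Linear(hidden_dim, n_heads * self.head_dim)
        self.k_proj = nn.Linear(hidden_dim, n_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(hidden_dim, n_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(n_heads * self.head_dim, hidden_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each key-value head serves n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

For instance, GroupedQueryAttention(hidden_dim=6144, n_heads=48, n_kv_heads=4) corresponds to the 15B configuration summarized in Table 6.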
However, we keep the number of key-value heads relatively low\u20142 for the 3B, 4 for the 7B and 15B\u2014to\nprevent significantly slowing down inference.\nWe summarize all other hyperparameters, such as the number of layers and hidden dimension, in Table 6.\n20 Under review as submission to TMLR\nTable 6: Model architecture details of the StarCoder2 models.\nParameter StarCoder2-3B StarCoder2-7B StarCoder2-15B\nhidden_dim 3072 4608 6144\nn_heads 24 36 48\nn_kv_heads 2 4 4\nn_layers 30 32 40\nvocab size 49152 49152 49152\nseq_len base-4k/long-16k base-4k/long-16k base-4k/long-16k\npositional encodings RoPE RoPE RoPE\nFLOPs235.94e+22 1.55e+23 3.87e+23\nTable 7: Training details of StarCoder2 base models.\nModel learning rate RoPE \u03b8batch size niterations ntokensnepochs\nStarCoder2-3B 3\u00d710\u221241e5 2.6M 1.2M 3.1T 4.98\nStarCoder2-7B 3\u00d710\u221241e5 3.5M 1M 3.5T 5.31\nStarCoder2-15B 3\u00d710\u221241e4 4.1M 1M 4.1T 4.49\n6.2 Tokenizer\nWe follow the procedure of StarCoderBase and train a byte-level Byte-Pair-Encoding tokenizer on a small\nsubset of The Stack v1.24In our preliminary experiments, we observed that increasing the vocabulary size\nto 100K did not improve performance. Hence, we decided to maintain a vocabulary size of 49,152 tokens,\nincluding the sentinel tokens from Table 5. The pre-tokenization step includes a digit-splitter and the regex\nsplitter from the GPT-2 pre-tokenizer.\n6.3 Training Details\nBase models The models were trained with a sequence length of 4,096 using Adam (Kingma & Ba, 2015)\nwith\u03b21= 0.9,\u03b22= 0.95,\u03f5= 10\u22128and a weight decay of 0.1, without dropout. The learning rate followed a\ncosine decay after a linear warmup of 1,000 iterations. Table 7 details the training hyper-parameters for each\nmodel. RoPE \u03b8values are different for StarCoder2-15B due to a bug in parsing the training configuration.\nMoreover, StarCoder2-15B was scheduled to train for 1.1M iterations but was early stopped after 1M iterations.\nFollowing Muennighoff et al. (2023), we repeat data for around four to five epochs.\nLong context We further pre-trained each model for long-context on 200B tokens from the same pre-\ntraining corpus, using a 16,384 context length with a sliding window of 4,096, with FlashAttention-2 (Dao\net al., 2022; Dao, 2024). We increase RoPE \u03b8and use the same configuration for the optimizer. The other\ntraining hyperparameters are provided in Table 8.\n6.4 CO2 Emissions\nWe provide estimations of the CO 2emission of the StarCoder2 training using the Machine Learning Impact\ncalculator presented in Lacoste et al. (2019). Note that we calculate the CO 2emissions by considering the\ntotal GPU hours of the base-model training. We then extrapolate this number to the long-context fine-tuning\nbased on the number of tokens.\n3BThe compute infrastructure provided by ServiceNow had a carbon efficiency of 0.386 kgCO 2eq/kWh. A\ncumulative of 97,120 hours of computation was performed on hardware of type A100 SXM4 80 GB (TDP of\n24https://huggingface .co/datasets/bigcode/the-stack-march-sample-special-tokens-stripped\n21 Under review as submission to TMLR\nTable 8: Training details for the long context training of StarCoder2 models.\nModel learning rate RoPE \u03b8batch size niterations ntokens\nStarCoder2-3B 3\u00d710\u221251e6 2.6M 80k 200B\nStarCoder2-7B 2\u00d710\u221251e6 3.5M 56k 200B\nStarCoder2-15B 3\u00d710\u221251e5 4.1M 50k 200B\nTable 9: Pass@1 on HumanEval(+) and MBPP(+). 
These results were generated using greedy decoding.\nModel HumanEval HumanEval+ MBPP MBPP+\nStarCoderBase-3B 21.3 17.1 42.6 35.8\nDeepSeekCoder-1.3B 28.7 23.8 55.4 46.9\nStableCode-3B 28.7 24.4 53.1 43.1\nStarCoder2-3B 31.7 27.4 57.4 47.4\nStarCoderBase-7B 30.5 25.0 47.4 39.6\nCodeLlama-7B 33.5 25.6 52.1 41.6\nDeepSeekCoder-6.7B 47.6 39.6 70.2 56.6\nStarCoder2-7B 35.4 29.9 54.4 45.6\nStarCoderBase-15B 29.3 25.6 50.6 43.6\nCodeLlama-13B 37.8 32.3 62.4 52.4\nStarCoder2-15B 46.3 37.8 66.2 53.1\nCodeLlama-34B 48.2 44.3 65.4 52.4\nDeepSeekCoder-33B 54.3 46.3 73.2 59.1\n400W). Total emissions are estimated to be 14,995.33 kgCO 2eq. The long-context fine-tuning stage adds\n1,111.68 kgCO 2eq, resulting in a total of 16,107.01 kgCO 2eq.\n7BThe compute infrastructure provided by Hugging Face had a carbon efficiency of 0.2925 kgCO 2eq/kWh.\nA cumulative of 145,152 hours of computation was performed on hardware of type H100 (TDP of 660W).\nTotal emissions are estimated to be 28,021.6 kgCO 2eq. The long-context fine-tuning stage adds 1601.23,\nresulting in a total of 29,622.83 kgCO 2eq.\n15BThe paper will soon be updated with estimates for the 15B model.\n7 Evaluation\nWe evaluate the StarCoder2 models on a variety of benchmarks from the literature and compare them to\nrecent state-of-the-art open Code LLMs: StableCode (Pinnaparaju et al., 2024), Code Llama (Rozi\u00e8re et al.,\n2023), DeepSeekCoder (Guo et al., 2024), and original StarCoder (Li et al., 2023). Since StarCoder2 is a base\nmodel, we only compare it with the base models of the model families mentioned above.\nWe group all our comparisons by model sizes. The smallmodels have 3B or fewer parameters, the medium\nmodels have 7B or fewer parameters, and the largemodels have 15B or fewer parameters. Finally, we include\ntwoextra large models: CodeLlama-34B and DeepSeekCoder-33B. These models are more than twice the\nsize of the large StarCoder2 model. But, as we shall see below, StarCoder2-15B comes close to or even\noutperforms the extra-large models in several benchmarks.\n7.1 Code Completion\nWe first evaluate the StarCoder2 models on code completion tasks, which have been widely studied in Code\nLLM work.\n22 Under review as submission to TMLR\n7.1.1 HumanEval, MBPP, and EvalPlus\nAbout the benchmarks HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) are two of the\nmost widely studied benchmarks for Code LLMs. Each benchmark has a few hundred programming problems.\nEach HumanEval problem has a prompt\u2014a function signature and docstring\u2014and a set of hidden unit tests.\nThe prompt for each MBPP problem includes a natural language description followed by a few tests. The\nmodel under evaluation will complete the function given the prompt, and we test that function with the\nhidden unit tests. The result is considered a success only if all hidden tests pass.\nRecently, Liu et al. (2023a) identified several issues with both benchmarks. (1) Most problems have inadequate\nhidden tests that cannot detect subtle bugs in solutions (See Listings 1 and 2); and (2) Several problems\nhave wrong test cases and ambiguous descriptions, which unfairly penalize the models that interpret the\nstatements in other reasonable ways (See Listings 2). They introduce the EvalPlus framework to address\nthese problems. The resulting benchmarks (HumanEval+ and MBPP+) have 80 \u00d7and 35\u00d7more tests than\nthe original benchmarks. 
For rigorous evaluation, we adopt the EvalPlus framework in this study.\nListing 1: A HumanEval task with insufficient tests\ndef common(l1: list, l2: list) -> list:\n\"\"\"Return sorted unique common elements for 2 lists\"\"\"\ncommon_elems = list(set(l1).intersection(set(l2)))\ncommon_elems.sort()\nreturn list(set(common_elems))\nassert common([4,3,2,8], []) == []\nassert common([5,3,2,8], [3,2]) == [2,3]\n...\n# [Explanation] This solution is wrong as applying set\n# to the sorted common_elems does not preserve the\n# order. Base HumanEval test inputs are too short to\n# easily manifest the flakiness.Listing 2: An MBPP task with problematic tests\n\"\"\"Write a function to check whether all dictionaries\n\u25c1arrowhookleft\u2192in a list are empty or not.\"\"\"\ndef empty_dit(list1): return all(not d for d in list1)\nassert empty_dit([{},{},{}]) == True\nassert empty_dit([{1,2},{},{}]) == True # Wrong test!\nassert empty_dit([{}]) == True\n# [Explanation] First, the second base test is wrong,\n# falsifying any correct solutions. Second, the tests\n# are weak, passing the wrong solution above. The wrong\n# solution mistakingly yields False given [{}, {}, [1]]\n# where we expect True as all dictionaries are empty\n# and the non-empty is an array, not a dictionary.\nHyperparameters Following recent work on Code LLMs (Rozi\u00e8re et al., 2023; Guo et al., 2024), we use\ngreedy decoding and report the mean pass@1 (mean success rate) for all problems in the benchmark.\nResults The results for HumanEval, MBPP, and their EvalPlus variants are presented in Table 9.25From\nthe table, we can make the following observations:\n1.StarCoder2-3B is the best-performing small model on all the datasets (HumanEval, MBPP, Hu-\nmanEval+, and MBPP+). The model is significantly better than its predecessor, StarCoderBase-3B,\nexhibiting improvements of 60.2% on HumanEval+ and 32.4% on MBPP+, respectively.\n2.StarCoder2-7B comes in second place of the medium models. DeepSeekCoder-6.7B is stronger,\noutperforming StarCoder2-7B by 32.4% and 24.1% on HumanEval+ and MBPP+, respectively.\nHowever, StarCoder2-7B consistently outperforms all the other medium models, including both\nStarCoderBase-7B and CodeLlama-7B. StarCoder2-7B outperforms StarCoderBase-7B by 19.6%\nand 15.2% on HumanEval+ and MBPP+, respectively. Additionally, it surpasses CodeLlama-7B by\n16.8% and 9.6% on these benchmarks.\n3.StarCoder2-15B is the best-performing large model by a significant margin. For example, it scores\n46.3, whereas CodeLlama-13B scores 37.8 on HumanEval. The results on EvalPlus are also consistent.\nFor example, on HumanEval+, it significantly improves over StarCoderBase-15B and CodeLlama-13B\nby 47.7% and 17.0%, respectively.\n25Note that EvalPlus omits a few ill-formed and noisy problems from the MBPP dataset. It uses 399 out of the 427 problems\nfrom the MBPP subset that was sanitized by the original authors (Austin et al., 2021). For HumanEval, we kept all 164 problems\nfrom the original dataset.\n23 Under review as submission to TMLR\nTable 10: Pass@1 results on MultiPL-E averaged over 50 samples for each problem. 
All models are evaluated\nat temperature 0.2and top-p 0.95.\nModel C++ C# D Go Java Julia JavaScript Lua PHP\nStableCode-3B 28.4 14.4 13.4 19.3 27.8 20.6 32.0 17.1 23.7\nDeepSeekCoder-1.3B 28.3 21.3 10.4 19.1 29.2 15.0 28.3 19.2 23.2\nStarCoderBase-3B 19.4 13.3 5.0 13.3 19.2 16.1 21.3 18.0 18.6\nStarCoder2-3B 27.2 20.5 12.6 23.6 27.4 19.9 35.4 28.0 27.6\nCodeLlama-7B 26.4 21.0 11.6 20.9 28.2 25.9 31.6 30.4 25.1\nDeepSeekCoder-6.7B 46.7 32.9 18.4 31.0 39.7 31.4 46.6 34.2 32.6\nStarCoderBase-7B 23.3 19.3 8.1 19.6 24.4 21.8 27.4 23.4 22.1\nStarCoder2-7B 33.6 20.7 15.1 20.2 29.4 20.4 35.4 30.7 30.6\nCodeLlama-13B 37.4 24.8 15.5 26.6 37.5 27.9 39.3 31.6 33.9\nStarCoderBase-15B 30.6 20.6 10.0 21.5 28.5 21.1 31.7 26.6 26.8\nStarCoder2-15B 41.4 29.2 23.6 26.2 33.9 33.2 44.2 43.8 39.5\nCodeLlama-34B 41.4 30.7 15.3 28.7 40.2 31.4 41.7 37.5 40.4\nDeepSeekCoder-33B 51.2 35.3 17.4 34.2 43.8 32.8 51.3 36.5 41.8\nModel Perl R Ruby Racket Rust Scala Bash Swift TypeScript\nStableCode-3B 9.4 11.5 0.8 7.0 22.9 5.9 8.6 13.2 29.6\nDeepSeekCoder-1.3B 12.5 9.8 24.6 9.1 18.6 19.6 9.7 11.0 27.4\nStarCoderBase-3B 11.3 10.1 4.2 7.9 16.3 16.8 3.8 10.0 22.8\nStarCoder2-3B 13.6 14.2 31.3 7.8 24.5 18.9 12.3 25.1 34.4\nCodeLlama-7B 16.9 14.9 29.5 11.4 25.5 22.8 9.6 24.9 33.4\nDeepSeekCoder-6.7B 30.4 20.5 46.2 17.4 37.7 35.2 22.2 30.3 39.5\nStarCoderBase-7B 15.2 14.5 19.6 11.1 22.6 20.9 7.3 15.1 27.5\nStarCoder2-7B 16.6 16.7 28.3 11.6 29.6 19.5 12.2 26.1 36.3\nCodeLlama-13B 23.4 14.1 31.9 13.0 31.0 29.7 13.3 30.1 40.1\nStarCoderBase-15B 16.3 10.2 17.2 11.8 24.5 28.8 11.0 16.7 32.1\nStarCoder2-15B 37.2 19.8 41.5 22.4 38.0 37.4 18.9 34.2 43.8\nCodeLlama-34B 28.5 22.7 37.8 16.9 38.7 36.7 16.4 35.3 42.1\nDeepSeekCoder-33B 31.0 20.5 44.0 23.4 43.8 43.9 28.7 35.8 48.4\n4.StarCoder2-15B is even competitive with models that are more than twice its size. For example,\nStarCoder2-15B outperforms CodeLlama-34B on both MBPP and MBPP+.\nAlthough EvalPlus makes HumanEval and MBPP far more robust, the problems in these benchmarks only\nexercise basic Python built-ins. They do not test them on other programming languages and do not test\nmodels\u2019 knowledge of other Python libraries. We address these limitations in the rest of this subsection with\nmore comprehensive evaluations on code completion.\n7.1.2 MultiPL-E: Multilingual Code Completion\nAbout the benchmark MultiPL-E (Cassano et al., 2023b) uses a suite of lightweight, rule-based compilers\nto translate HumanEval from Python to 18 other programming languages. Thus MultiPL-E is a multi-language\nbenchmark with the same problems translated to different languages.26\nHyperparameters We sample 50 completions per prompt at temperature 0.2 with top-p 0.95. This is how\nMultiPL-E results are reported on the BigCode Models Leaderboard (Ben Allal, 2023).\nResults The results on MultiPL-E appear in Table 10. We make the following observations:\n26MultiPL-E makes some small changes to the HumanEval prompts, and a few prompts fail to translate to certain languages.\nWe refer the reader to Cassano et al. (2023b) for more information.\n24 Under review as submission to TMLR\n1.Across all size classes, there is no single model that is best at every language. Nevertheless, the\nStarCoder2 models perform well as described below.\n2. Of the small models, StarCoder2-3B performs the best on 11/18 programming languages.\n3.Of the medium models, DeepSeekCoder-6.7B performs best. 
StarCoder2-7B does better than\nCodeLlama-7B on most languages.\n4.Of the large models, StarCoder2-15B does the best on 16/18 programming languages. CodeLlama-13B\noutperforms StarCoder2-15B on Go and Java.\n5.StarCoder2-15B meets or exceeds the performance of CodeLlama-34B on 10/18 programming\nlanguages and DeepSeekCoder-33B on four lower-resource languages (D, Julia, Lua, and Perl).\n7.1.3 DS-1000: Data Science Tasks in Python\nAbout the benchmark DS-1000 (Lai et al., 2023) is a widely studied benchmark with 1,000 data science\ntasks in Python. Unlike the HumanEval and MBPP problems that only use the Python standard library,\nDS-1000 exercises seven widely used libraries, from Matplotlib to TensorFlow. Therefore, here we further\nadopt DS-1000 to evaluate the performance of Code LLMs in completing data science tasks with popular\nlibraries.\nHyperparameters Following Lai et al. (2023), we use temperature 0.2and top-p 0.95to generate 40\nsamples per problem, and report mean pass@1.\nResults Table 11 reports the results on DS-1000. We make the following observations:\n1.StarCoder2-3B overall is the best-performing small model on DS-1000. Except for PyTorch and\nTensorFlow (where it is slightly worse than StableCode-3B), StarCoder2-3B achieves the best\nperformance on all the other popular libraries.\n2.StarCoder2-7B comes in second place out of the medium models, with a performance similar to\nDeepSeekCoder-6.7B.\n3.StarCoder2-15B is the best-performing large model on DS-1000. It substantially outperforms both\nStarCoderBase-15B and CodeLlama-13B by large margins, and approaches the overall performance\nof CodeLlama-34B.\n7.2 Code Fixing and Editing\nWhile the above subsection has studied various code completion tasks, Code LLMs can be used in various\nother ways. In this subsection, we focus on studying their capabilities for fixing bugs or editing existing code.\n7.2.1 HumanEvalFix: Fixing Bugs in Six Programming Languages\nAbout the benchmark HumanEvalFix (Muennighoff et al., 2024a) is a benchmark that tests a model\u2019s\nability to identify and fix bugs in code. The benchmark supports six programming languages shown in\nFigure 12. Since it is not a code completion benchmark, most base models do poorly on HumanEvalFix\nwhereas instruction-tuned (Wei et al., 2022; Sanh et al., 2022; Muennighoff et al., 2022b; 2024b) models\nperform better. Thus, we consider the instruction-tuned variants of DeepSeekCoder and CodeLlama in\nour comparison (Guo et al., 2024; Rozi\u00e8re et al., 2023). We also compare with OctoCoder, which is an\ninstruction-tuned version of the initial StarCoder using the CommitPackFT dataset (Muennighoff et al.,\n2024a; Zhuo et al., 2024; Longpre et al., 2023). We benchmarked the default HumanEvalFixTests subvariant;\nhence, there were no docstrings present to guide the model.\n25 Under review as submission to TMLR\nTable 11: Performance of open-access models on DS-1000. Benchmarks are as follows. All models were\nevaluated at temperature 0.2and top-p 0.95. 
Scores reflect mean pass@1 accuracy averaged over 40 samples.
Format Model Matplotlib NumPy Pandas PyTorch SciPy Scikit-Learn TensorFlow Overall
# problems: 155 220 291 68 106 115 45 1,000
Completion StarCoderBase-3B 32.1 16.8 5.3 9.2 13.2 10.5 17.2 14.2
Completion StableCode-3B 42.5 24.5 16.2 15.4 13.5 20.2 27.7 22.7
Completion DeepSeekCoder-1.3B 36.2 18.8 9.1 10.7 7.9 13.9 13.3 16.2
Completion StarCoder2-3B 45.5 27.7 16.2 12.9 15.8 30.8 22.8 25.0
Completion StarCoderBase-7B 38.0 23.0 8.2 13.1 13.7 24.5 14.6 19.1
Completion DeepSeekCoder-6.7B 52.4 33.0 20.0 13.9 19.8 29.7 27.4 28.9
Completion CodeLlama-7B 46.3 21.6 13.9 12.2 17.5 16.7 20.6 21.5
Completion StarCoder2-7B 53.6 33.3 16.9 16.2 20.6 22.2 31.9 27.8
Completion StarCoderBase-15B 47.0 27.1 10.1 19.5 21.7 27.0 20.5 23.8
Completion CodeLlama-13B 49.0 27.2 17.4 12.9 15.6 24.0 24.8 25.1
Completion StarCoder2-15B 60.3 43.3 23.2 11.0 26.4 26.0 36.0 33.8
Completion DeepSeekCoder-33B 56.1 49.6 25.8 36.8 36.8 40.0 46.7 40.2
Completion CodeLlama-34B 50.3 42.7 23.0 25.0 28.3 33.9 40.0 34.3
Table 12: Pass@1 performance on HumanEvalFix. StarCoder2 and StarCoderBase are not instruction-tuned and are thus at a disadvantage compared to the other models, which are all instruction-tuned.
Model Prompt Python JavaScript Java Go C++ Rust Avg.
StarCoderBase-15B Instruct 12.6 16.8 18.9 12.5 11.2 0.6 12.1
StarCoderBase-15B Commit 25.6 29.4 28.8 28.7 28.2 19.7 26.7
CodeLlama-13B-Instruct Instruct 19.4 18.9 24.1 21.6 10.1 0.4 15.8
CodeLlama-34B-Instruct Instruct 36.5 28.1 36.4 25.7 25.2 18.5 28.4
DeepSeekCoder-6.7B-Instruct Instruct 44.9 55.3 52.2 42.9 37.9 19.5 42.1
DeepSeekCoder-33B-Instruct Instruct 47.5 47.6 46.5 52.0 48.0 10.2 42.1
OctoCoder-15B Instruct 30.4 28.4 30.6 30.2 26.1 16.5 27.0
StarCoder2-15B Instruct 9.7 20.7 24.1 36.3 25.6 15.4 22.0
StarCoder2-15B Issue 48.6 41.6 48.4 48.5 20.7 24.2 38.7
StarCoder2 issues format Although StarCoder2 is a base model, it is pretrained on GitHub issues and StackOverflow discussions using a special format (§5.3). We experiment with prompting the model to fix code bugs in the style of a discussion as follows:
username_0: instruction\n\n```buggy function```\nUpvotes: 100
username_1: Sure, here is the fixed code.\n\n```function start
In this template, "instruction" is the HumanEvalFix instruction telling the model to fix the bug in the code, "buggy function" is the function with a subtle bug, and "function start" is the function header including imports. The generation of the model is stopped as soon as ``` is generated. The evaluation code is available via Ben Allal et al. (2022), and we denote this as the "Issue" prompt. We also benchmark StarCoder2 with the same basic "Instruct" prompt used in Muennighoff et al. (2024a).
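The helper below sketches how such a discussion-style prompt can be assembled and how a completion is truncated at the closing fence. The function names and exact whitespace handling are our own illustrative choices; the reference implementation is the evaluation harness of Ben Allal et al. (2022).

def build_issue_prompt(instruction: str, buggy_function: str, function_start: str) -> str:
    # Discussion-style "Issue" prompt described above (illustrative assembly).
    return (
        f"username_0: {instruction}\n\n```{buggy_function}```\nUpvotes: 100\n"
        f"username_1: Sure, here is the fixed code.\n\n```{function_start}"
    )

def extract_fix(function_start: str, completion: str) -> str:
    # Generation is cut at the first closing ``` fence; prepending the header
    # yields the candidate fixed function that the tests are then run against.
    return function_start + completion.split("```")[0]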
Hyperparameters Following Muennighoff et al. (2024a), we use a temperature of 0.2 to estimate pass@1 with 20 samples.
Results Unlike the previous sections, we only evaluate StarCoder2-15B and primarily compare it to instruction-tuned models. The results are in Table 12 (with best-performing models highlighted in bold and second-best underscored), and we make the following conclusions:
1. The base models (StarCoder2-15B and StarCoderBase-15B) perform very poorly when given an instruction prompt, which motivates using a different prompt format.
2. Using the Issue prompt described above, StarCoder2-15B performs remarkably well as a base model. It outperforms the instruction-tuned CodeLlama models by a significant margin and nearly reaches the performance of the instruction-tuned DeepSeekCoder models.
3. Using the Issue prompt for StarCoder2-15B leads to a larger increase in performance than using the Commit prompt for StarCoderBase-15B. This indicates that pre-training on pull requests (StarCoder2) is a viable alternative to pre-training on commits (StarCoderBase).
4. Using the Issue prompt, StarCoder2-15B also outperforms all other open models presented in Muennighoff et al. (2024a).
5. StarCoder2-15B underperforms on C++ when using the Issue prompt, which hurts its overall performance. Our investigation shows that this is mainly because one-third of the generated code is incomplete, e.g., having an unexpected break immediately after the beginning of a for loop. Additional prompt engineering may be necessary to fix this. Thus, we still see value in instruction tuning StarCoder2 to further improve its usability in handling similar scenarios more effectively without prompt engineering. We leave the instruction tuning or even preference alignment (Christiano et al., 2017; Ethayarajh et al., 2024) of StarCoder2 to future work.
7.2.2 Code Editing
About the benchmark CanItEdit (Cassano et al., 2024) is a hand-crafted benchmark designed to evaluate model performance on Python code editing tasks. Each problem consists of a code snippet accompanied by an instruction of two types: descriptive or lazy. Descriptive instructions are systematic and provide detailed information, whereas lazy instructions are brief, direct, and mimic the typical instructions humans provide to code completion models. The goal is to modify the code according to the instruction; both lazy and descriptive instructions should lead to the same edit. The accuracy of each modification is assessed using a hidden test suite, and pass@1 is reported. The benchmark encompasses a variety of problems, from simple single-function, single-line edits to intricate multi-class problems requiring multiple-line edits in separate locations.
Some tasks demand domain-specific knowledge like mathematics, and successful completion of a problem often requires the model to understand the connections between the components of the program. Listing 3 shows an abbreviated sample problem from CanItEdit with its lazy instruction.27
Listing 3: Abbreviated sample problem from CanItEdit
-class C4(nn.Module):
+class C8(nn.Module):
-    """Represents the C4 class of group theory,
+    """Represents the C8 class of group theory,
     where each element represents a discrete rotation."""
     def __init__(self):
         super().__init__()
     def elements(self):
         """Returns all the elements of this group"""
-        return torch.tensor([0., np.pi/2, np.pi, 3*np.pi/2])
+        d = np.pi / 4
+        return torch.tensor([0., d, d*2, d*3, d*4, d*5, d*6, d*7])
Code Editing Instruction: Edit the C4 class and its methods to represent the C8 group.
27 The original problem includes additional methods to edit in the C4 class and a descriptive instruction.
Table 13: Performance of instructional code editing on the CanItEdit benchmark (Cassano et al., 2024). The results for non-StarCoder2 models are from the benchmark paper.
Model Format Descriptive Instructions (Pass@1) Lazy Instructions (Pass@1)
StarCoderBase-3B Commit 19.62 12.78
StarCoder2-3B Issue 21.68 15.91
DeepSeekCoder-Instruct-1.3B Instruct 25.83 18.33
StarCoder2-7B Issue 35.23 18.55
CodeLlama-Instruct-7B Instruct 33.89 27.04
StarCoderBase-7B Commit 40.64 25.83
DeepSeekCoder-Instruct-6.7B Instruct 33.89 33.61
CodeLlama-Instruct-13B Instruct 28.33 20.19
OctoCoder-15B Instruct 31.46 25.69
StarCoderBase-15B Commit 38.24 26.38
StarCoder2-15B Issue 43.08 38.45
CodeLlama-Instruct-34B Instruct 35.0 26.76
DeepSeekCoder-Instruct-33B Instruct 53.06 43.89
Hyperparameters We evaluate all sizes of StarCoder2 on the CanItEdit benchmark using the Issue prompt format (introduced in §7.2.1) and compare its performance with other models previously assessed on this benchmark. Following Cassano et al. (2024), we employ random sampling with a temperature of 0.2 and a top-p of 0.95, with 100 completions per problem.
Results The results appear in Table 13. As described in §7.2.1, we use an "Issue" prompt and "Commit" prompt for the StarCoder2 and StarCoderBase models since they are not instruction-tuned. For all the other models, we use instruction-tuned versions. From the table, we make the following observations:
1. Of the small models, StarCoder2-3B comes in second place behind DeepSeekCoder-Instruct-1.3B.
2. Of the medium models, StarCoder2-7B and DeepSeekCoder-Instruct-6.7B perform best on descriptive and lazy instructions, respectively.
3. StarCoder2-15B is the best-performing large model by a significant margin.
4. StarCoder2-15B outperforms CodeLlama-Instruct-34B as well.
These results give further evidence that the StarCoder2 "Issue" format is a viable alternative to the StarCoderBase "Commit" format.
7.3 Math Reasoning
About the benchmark We use the widely studied GSM8K benchmark (Cobbe et al., 2021), a set of middle-school math problems, to evaluate the mathematical reasoning capabilities of the models. We use the PAL approach proposed by Gao et al. (2023): the model is prompted to generate a Python program, which is executed to produce the answer to the problem.
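To make the PAL setup concrete, the sketch below shows the final step of such an evaluation: the model's generated program is executed and the value returned by its solution() function is compared with the reference answer. The example problem, the solution() convention, and the use of exec here are illustrative assumptions; the actual few-shot prompts and harness follow Gao et al. (2023).

def run_pal_program(program: str) -> float:
    # Execute a model-generated Python program and call its solution() function.
    # Illustrative only; a real harness should sandbox and time-limit execution.
    namespace: dict = {}
    exec(program, namespace)
    return namespace["solution"]()

generated = '''
def solution():
    # Made-up word problem: 3 boxes of 12 apples, 5 apples given away.
    return 3 * 12 - 5
'''
predicted = run_pal_program(generated)   # 31
correct = abs(predicted - 31) < 1e-6     # compare against the reference answer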
Hyperparameters We evaluate models with greedy decoding in an 8-shot setting following Chowdhery et al. (2023).
Table 14: 8-shot accuracy on the GSM8K math-reasoning benchmark.
Model GSM8K (PAL)
StarCoderBase-3B 8.0
DeepSeekCoder-1.3B 12.6
StableCode-3B 39.7
StarCoder2-3B 27.7
StarCoderBase-7B 14.9
DeepSeekCoder-6.7B 41.9
CodeLlama-7B 27.0
StarCoder2-7B 40.4
StarCoderBase-15B 21.5
CodeLlama-13B 38.1
StarCoder2-15B 65.1
CodeLlama-34B 54.2
DeepSeekCoder-33B 58.7
Results The results on GSM8K with PAL appear in Table 14 and we make the following observations:
1. StableCode-3B is the best-performing small model. StarCoder2-3B is in second place.
2. StarCoder2-7B comes in second place among the medium models. Its performance is very close to the first-place model, which is DeepSeekCoder-6.7B, while substantially outperforming both CodeLlama-7B and StarCoderBase-7B.
3. StarCoder2-15B significantly outperforms all large models, including both CodeLlama-13B and StarCoderBase-15B.
4. In fact, StarCoder2-15B even outperforms CodeLlama-34B and DeepSeekCoder-33B, which are more than twice its size.
7.4 CRUXEval: Code Reasoning, Understanding, and Execution
About the benchmark CRUXEval (Gu et al., 2024) is a two-part benchmark consisting of 800 samples designed to evaluate code reasoning, understanding, and execution. In the first task, CRUXEval-I, the model is asked to predict any input such that executing a given Python function on that input produces a given output. In the second task, CRUXEval-O, the model is asked to simulate the execution of a given function on an input and predict an output. Two samples are shown below in Listings 4 and 5. The functions and inputs of the benchmark were generated by CodeLlama-34B and then filtered to remove complicated functions such as those requiring complex arithmetic or a large number of execution steps.
Listing 4: Sample CRUXEval Problem 1
def f(string):
    string_x = string.rstrip("a")
    string = string_x.rstrip("e")
    return string
# output prediction, CRUXEval-O
assert f("xxxxaaee") == ??
# input prediction, CRUXEval-I
assert f(??) == "xxxxaa"
Listing 5: Sample CRUXEval Problem 2
def f(nums):
    count = len(nums)
    for i in range(-count+1, 0):
        nums.append(nums[i])
    return nums
# output prediction, CRUXEval-O
assert f([2, 6, 1, 3, 1]) == ??
# input prediction, CRUXEval-I
assert f(??) == [2, 6, 1, 3, 1, 6, 3, 6, 6]
Hyperparameters Following Gu et al. (2024), we use temperature 0.2 to report pass@1 and temperature 0.8 to report pass@5, both using 10 samples.
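Scoring both tasks reduces to executing the candidate assertion: for CRUXEval-O the model's predicted output is substituted for ?? and the assert is run, and for CRUXEval-I the predicted input is substituted instead. The helper below is a minimal, unsandboxed sketch of that check using a condensed variant of the function from Listing 4; the names and structure are ours, and the official harness of Gu et al. (2024) is the reference.

def check_prediction(function_src: str, assertion_src: str, prediction: str) -> bool:
    # assertion_src contains a single ?? placeholder, e.g. 'assert f("xxxxaaee") == ??'
    program = function_src + "\n" + assertion_src.replace("??", prediction)
    try:
        exec(program, {})   # illustrative only; real evaluation should sandbox this
        return True
    except Exception:
        return False

func = 'def f(string):\n    return string.rstrip("a").rstrip("e")'
check_prediction(func, 'assert f("xxxxaaee") == ??', '"xxxxaa"')   # True: correct output
check_prediction(func, 'assert f(??) == "xxxxaa"', '"xxxxaaee"')   # True: one valid input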
Results We show the pass@1 and pass@5 scores for both tasks in our benchmark in Table 15. In terms of error and standard deviation, the original paper reports two sources of noise. First, the noise due to sampling from the language model for the given set of 800 candidates is around 0.2% for 10 samples. Second, the precise samples in the benchmark were chosen from a larger set of samples, and the noise from choosing which samples to include in the benchmark when using 800 samples is about 1.5%. We make the following observations:
1. StarCoder2-3B performs competitively with other small models. It slightly underperforms StableCode-3B on CRUXEval-I (but within the noise margin of error) but beats all other small models on CRUXEval-O.
2. For both tasks, StarCoder2-7B performs on par with CodeLlama-7B but lags significantly behind DeepSeekCoder-6.7B.
3. StarCoder2-15B is the best-performing large model. It surpasses CodeLlama-13B and drastically improves upon StarCoderBase-15B on both CRUXEval-I and CRUXEval-O.
4. StarCoder2-15B performs on par with the extra-large models. On CRUXEval-I, it outperforms both CodeLlama-34B and DeepSeekCoder-33B, but within standard deviation. On CRUXEval-O, it significantly outperforms CodeLlama-34B and slightly underperforms DeepSeekCoder-33B.
Table 15: Accuracy on the CRUXEval benchmark.
Model CRUXEval-I (Pass@1 Pass@5) CRUXEval-O (Pass@1 Pass@5)
StarCoderBase-3B 27.1 43.7 27.4 40.9
DeepSeekCoder-1.3B 27.8 44.7 31.0 43.4
StableCode-3B 33.5 53.3 26.7 43.5
StarCoder2-3B 32.7 50.1 34.2 48.4
StarCoderBase-7B 29.7 47.3 32.2 44.9
CodeLlama-7B 35.9 52.9 34.2 48.4
DeepSeekCoder-6.7B 41.9 62.7 43.5 54.8
StarCoder2-7B 34.6 53.5 36.0 52.0
StarCoderBase-15B 31.3 49.2 34.2 47.1
CodeLlama-13B 42.5 62.0 39.7 53.9
StarCoder2-15B 48.1 66.9 47.1 59.5
CodeLlama-34B 47.2 66.6 42.4 55.9
DeepSeekCoder-33B 46.5 64.9 48.6 61.6
7.5 Fill-in-the-Middle
About the benchmark StarCoder2 supports fill-in-the-middle (FIM), which is the ability to complete an arbitrary span of code conditioned on both text before and after the insertion point. We use the benchmark from Ben Allal et al. (2023), which tests the ability of models to fill in a single line of code in Python, JavaScript, and Java solutions to HumanEval.
Hyperparameters Following Ben Allal et al. (2023), we sample 20 completions per example at temperature 0.2 and top-p 0.95 and report the mean exact match.
Results The results appear in Table 16. We observe that StarCoder2-3B performs as well as StarCoderBase-15B on this FIM benchmark. Unfortunately, StarCoder2-15B underperforms on FIM. Due to an implementation bug, the FIM-rate was smaller than intended for most of the training.
Table 16: Exact-match on FIM-task (Ben Allal et al., 2023). Due to an implementation bug, FIM was incorrect for most of the training of StarCoder2-15B. CodeLlama results are from Rozière et al. (2023).
Model Java JavaScript Python
StableCode-3B 63.7 73.3 59.1
StarCoder2-3B 75.0 73.0 59.1
StarCoder2-7B 81.1 77.5 61.1
CodeLlama-13B 80.0 85.0 74.5
StarCoderBase-15B 73 74 62
StarCoder2-15B 60.5 54.7 48.4
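For reference, FIM inference is typically driven by special sentinel tokens that mark the prefix, the suffix, and the point where the missing middle should be generated. The sketch below assembles such a prompt in prefix-suffix-middle order; the sentinel strings shown follow the convention of the earlier StarCoder release and should be treated as an assumption to verify against the released StarCoder2 tokenizer.

# Assumed sentinel tokens (as used by the first StarCoder; verify for StarCoder2).
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model generates the missing middle after the <fim_middle> sentinel,
    # conditioned on the code both before and after the insertion point.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result\n",
)
# A completion such as "result = a + b" is then scored by exact match against
# the single line that was removed from the reference solution.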
7.6 Repository-Level Code Completion Evaluation
Code completion in practice often occurs within the context of a repository rather than in isolated files. Leveraging repository-level context for code completion is thus essential for models to perform well in real-world scenarios. We evaluate models on repository-level code completion with two benchmarks: RepoBench (Liu et al., 2023b) and CrossCodeEval (Ding et al., 2023).
7.6.1 RepoBench
About the benchmark RepoBench (Liu et al., 2023b) is a live benchmark designed for evaluating code completion at the repository level, with a focus on next-line prediction. In this work, we use the latest version (v1.1) of RepoBench,28,29 which sources its data from GitHub repositories created from October 6th to December 31st, 2023, and takes steps to avoid data leakage by removing duplicates against The Stack v2. Our evaluation includes five levels (2k, 4k, 8k, 12k, and 16k) across three settings: cross-file-first, cross-file-random, and in-file, with each setting comprising 5,000 data points (1,000 per level). We report the average edit similarity, exact match, and CodeBLEU (Ren et al., 2020) scores for the three settings.
28 https://huggingface.co/datasets/tianyang/repobench_python_v1.1
29 https://huggingface.co/datasets/tianyang/repobench_java_v1.1
Hyperparameters Following prior work on Code LLMs (Chen et al., 2021), we set the generation temperature to 0.2 and the top-p sampling parameter to 0.95 for all models under evaluation. We constrained the models to generate a maximum of 128 new tokens per prompt, and the first non-empty and non-comment line of the output was selected as the prediction. While StarCoder2 uses special tokens for repository-level training, we ensured uniformity in prompt construction across all models by following the official implementation in line with Liu et al. (2023b). The maximum token count for prompts was set to 15,800 by truncating excess cross-file context, except for StarCoderBase, which was constrained to 7,800 tokens due to its maximum sequence length limit of 8k.
Results Table 17 presents the performance of open-access models on RepoBench v1.1. We observe that:
1. StarCoder2, with repository-level training, consistently outperforms StarCoderBase across all evaluated model sizes.
2. StarCoder2-3B demonstrates notable performance among the smaller models, ranking second behind StableCode-3B.
3. StarCoder2-7B achieves competitive performance closely matching that of CodeLlama-7B among the medium models, with DeepSeekCoder-6.7B achieving the leading performance metrics.
4. StarCoder2-15B not only largely outperforms CodeLlama-13B but also showcases comparable, and in some metrics superior, performance against the significantly larger CodeLlama-34B model.
Table 17: Average exact match (EM), edit similarity (ES), and CodeBLEU (CB) scores for open-access base models on RepoBench v1.1 (Liu et al., 2023b).
Model Python (EM ES CB) Java (EM ES CB)
StarCoderBase-3B 29.99 69.37 36.77 36.01 74.18 45.30
DeepSeekCoder-1.3B 31.02 70.07 37.88 37.75 75.66 46.69
StableCode-3B 34.48 71.79 40.43 40.13 76.56 49.00
StarCoder2-3B 32.47 71.19 39.25 38.46 76.53 47.96
StarCoderBase-7B 32.70 71.08 39.48 37.97 75.66 47.47
CodeLlama-7B 33.85 71.79 40.47 39.61 76.71 48.92
DeepSeekCoder-6.7B 36.79 73.85 42.65 42.87 78.93 51.69
StarCoder2-7B 33.72 72.07 40.34 39.84 77.23 48.96
StarCoderBase-15B 33.51 71.64 40.39 39.34 76.24 48.36
CodeLlama-13B 35.50 72.98 42.02 41.27 77.57 50.26
StarCoder2-15B 36.99 74.08 43.25 42.57 79.05 51.45
CodeLlama-34B 37.22 73.77 43.38 42.35 78.22 50.99
DeepSeekCoder-33B 39.25 75.20 45.21 44.59 79.92 52.70
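As a concrete illustration of the post-processing rule stated in the hyperparameters above (the first non-empty, non-comment line of the generation is taken as the next-line prediction), a minimal helper could look as follows; the set of comment prefixes is an illustrative simplification, and the reference behaviour is the official RepoBench implementation.

def extract_next_line(generation: str, comment_prefixes=("#", "//", "/*", "*")) -> str:
    # Return the first non-empty, non-comment line of a model generation.
    for line in generation.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefixes):
            return stripped
    return ""

sample = "\n# cache lookup\nresult = cache.get(key)\nreturn result\n"
extract_next_line(sample)  # -> "result = cache.get(key)"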
7.6.2 CrossCodeEval
About the benchmark CrossCodeEval (Ding et al., 2023) is a diverse and multilingual benchmark designed for repository-level code completion. It was constructed from a wide range of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. Through careful static analysis methods, CrossCodeEval strictly requires cross-file context for accurate code completion. We report results in both Code Match (Edit Similarity) and Identifier Match (F1 Score), following the definitions in Ding et al. (2023), in all four languages.
Hyperparameters We use a max sequence length of 16k for all models except for StarCoderBase, which only supports 8k. In line with Ding et al. (2023), we use the retrieve-and-generate (RG) method with OpenAI's ada embedding, which was found to perform well in their study. To optimize the usage of the extended 16k context, we retrieve a maximum of 100 code segments, each comprising its file path and 10 lines of code. The maximum cross-file context was set to 12,800 tokens and the maximum generation length to 50 tokens. Consistent with Ding et al. (2023), we use the uniform prompt formatting of the original implementation, with a temperature of 0.2 and top-p of 0.95 for all model generations.
Results Table 18 presents the evaluation results. We found that:
1. Across almost all dimensions, including model sizes, programming languages, and metrics, StarCoder2 consistently outperforms StarCoderBase. This enhancement can likely be attributed to better pre-training with increased context length and repository-level objectives (Section 5.1).
2. StarCoder2-15B achieves state-of-the-art performance compared to models of similar size. For certain languages, like Java and C#, it even outperforms models with twice the capacity.
3. The analysis also reveals significant performance variances across languages for the same model, similar to the findings in MultiPL-E (§7.1.2). While a model can be strong overall, achieving uniformly high performance across all programming languages remains challenging: e.g., StarCoder2-15B lags behind on TypeScript, while StableCode-3B lags in C# and DeepSeekCoder-33B in Java. The disparity calls for future research on building models that can achieve high performance across a diverse range of languages in different settings.
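To make the retrieve-and-generate setup described in the hyperparameters above more concrete, the sketch below shows one plausible way to pack retrieved cross-file segments (each a file path plus roughly ten lines of code) into a commented context block under a token budget before appending the in-file context. This is an illustration only: the tokenizer object is assumed to expose a Hugging Face-style encode method, and the exact prompt formatting used in our evaluation follows the original CrossCodeEval implementation.

def assemble_prompt(segments, in_file_context, tokenizer, budget=12_800):
    # segments: list of (file_path, code) pairs, ordered by retrieval score.
    parts, used = [], 0
    for path, code in segments:
        block = f"# {path}\n" + "\n".join("# " + line for line in code.splitlines()) + "\n"
        n_tokens = len(tokenizer.encode(block))
        if used + n_tokens > budget:
            break                          # stay within the cross-file context budget
        parts.append(block)
        used += n_tokens
    return "".join(parts) + in_file_context   # the model completes from here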
Table 18: CrossCodeEval (Ding et al., 2023) evaluation results. We report Code Match (Edit Similarity) and Identifier Match (F1) results for four languages.
Model Python (Code ES, ID F1) Java (Code ES, ID F1) TypeScript (Code ES, ID F1) C# (Code ES, ID F1)
StarCoderBase-3B 69.47 62.56 66.43 59.77 41.42 35.26 70.11 53.15
DeepSeekCoder-1.3B 72.41 66.76 65.92 59.93 63.59 56.41 70.98 54.84
StableCode-3B 76.00 70.75 73.19 67.93 65.61 59.61 61.70 48.98
StarCoder2-3B 73.01 67.85 66.31 61.06 38.79 35.17 70.86 55.42
StarCoderBase-7B 72.24 65.40 69.91 64.12 44.21 39.77 71.93 55.98
DeepSeekCoder-6.7B 77.43 73.16 70.60 66.28 69.08 63.61 74.84 62.29
CodeLlama-7B 74.52 69.11 71.49 65.99 65.96 59.46 71.41 56.66
StarCoder2-7B 74.52 68.81 70.75 65.27 43.19 38.84 72.73 57.69
StarCoderBase-15B 73.43 66.74 70.58 64.66 45.24 40.47 71.77 55.71
CodeLlama-13B 75.88 70.97 73.08 68.29 67.88 61.46 72.73 59.62
StarCoder2-15B 78.72 74.27 74.92 70.45 48.63 43.78 75.38 62.14
CodeLlama-34B 76.34 71.36 74.30 69.45 68.98 63.19 73.96 60.07
DeepSeekCoder-33B 78.78 74.51 73.41 69.02 70.31 65.14 75.04 63.03
Table 19: Performance on the "Asleep at the Keyboard" benchmark.
Model Valid (↑) Insecure (↓)
StarCoderBase-3B 910/1000 (91.0%) 224/910 (24.6%)
DeepSeekCoder-1.3B 893/1000 (89.3%) 290/893 (32.5%)
StarCoder2-3B 925/1000 (92.5%) 113/900 (12.2%)
StarCoderBase-7B 916/1000 (91.6%) 243/916 (26.5%)
CodeLlama-7B 900/1000 (90.0%) 195/900 (21.7%)
DeepSeekCoder-6.7B 921/1000 (92.1%) 315/921 (34.2%)
StarCoder2-7B 912/1000 (91.2%) 363/926 (39.8%)
StarCoderBase-15B 933/1000 (93.3%) 332/933 (35.6%)
CodeLlama-13B 903/1000 (90.3%) 273/903 (30.2%)
StarCoder2-15B 898/1000 (89.8%) 352/898 (39.2%)
7.7 "Asleep at the Keyboard" Security Benchmark
About the benchmark "Asleep at the Keyboard" is a benchmark designed for assessing security vulnerabilities in code generation (Pearce et al., 2022). Similar to Li et al. (2023), we focus on the subset of tasks amenable to automated evaluation, which is the Diversity of Weakness problems. These cover 18 diverse vulnerability classes from the MITRE Common Weakness Enumeration (CWE) taxonomy, with scenarios drawn from the 2021 CWE Top 25 Most Dangerous Software Weaknesses list published by MITRE. The problems have 23 scenarios in C and 17 scenarios in Python.
Hyperparameters Following Li et al. (2023), we set the temperature to 0.2 and top-p to 0.95. Each model generates 25 samples per scenario, resulting in a total of 1,000 completions.
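The Valid column of Table 19 counts completions that are syntactically well formed. For the Python scenarios, such a check can be approximated with the standard library as sketched here; the C scenarios would instead be run through a compiler. This helper is our own illustration, not the benchmark's official tooling.

import ast

def is_valid_python(completion: str) -> bool:
    # Rough syntactic-validity check for a generated Python solution.
    try:
        ast.parse(completion)
        return True
    except SyntaxError:
        return False

is_valid_python("def get_user(uid):\n    return db.query(uid)")  # True
is_valid_python("def get_user(uid)\n    return db.query(uid)")   # False: missing colon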
Results We report results of selected models in Table 19. Column Valid gives the percentage of solutions that were syntactically valid, and Column Insecure shows the percentage of valid solutions that include the vulnerability the scenario tests for. From the table, we draw the following conclusions:
1. StarCoder2 generates comparable numbers of valid programs to StarCoderBase, CodeLlama, and DeepSeekCoder. Both StarCoderBase and StarCoder2 models achieve around a 90% valid program rate. However, after some manual inspection, we notice that StarCoder2 tends to generate more functionally correct code than StarCoderBase. This observation is aligned with the evaluations in previous sections.
2. Except for StarCoder2-3B, the StarCoder2 models (StarCoder2-7B and StarCoder2-15B) have the highest insecure program rate among models of similar size. The high insecure rate might be a side effect of their higher rate of valid and functionally correct completions. These generated programs are more likely to be exposed to potential vulnerabilities, as suggested by Bhatt et al. (2023). Notably, Li et al. (2023) find that code-cushman-001, the initial model used in commercialized Copilot, has an insecure rate beyond 40%.
7.8 Measuring Harmful Generations
About the benchmarks Bias in Open-ended Language Generation Dataset (BOLD) (Dhamala et al., 2021) is a dataset of 23,679 prompts that measure fairness across Religious Ideology, Profession, Race, Gender, and Political Ideology. We use the Regard classifier by Sheng et al. (2019) and average the classifier scores across each category.
WinoBias (Zhao et al., 2018) measures bias towards gender stereotypes across professions. When given a sentence regarding a profession, the gender is masked, a language model is used to generate the replaced masked token, and the perplexity is used to calculate the bias towards male and female pronouns.
Hurtful Sentence Completion in English Language Models (HONEST) (Nozza et al., 2021) is a template-based corpus to assess the hurtfulness of sentence completions generated by language models. To calculate the HONEST score, we check whether each generated completion has any terms contained in each of the categories within Hurtlex.30
RealToxicityPrompts (Gehman et al., 2020) consists of 100,000 naturally occurring, sentence-level prompts, which are extracted from a large web corpus of English text. They can be used to evaluate the risk of neural toxic degeneration in language models. We use a 10,000-prompt subset to perform the evaluation. We use the classifier by Vidgen et al. (2021) to detect toxicity and report the average probability of detected toxic output as our toxicity score.
Hyperparameters For each prompt in BOLD and RealToxicityPrompts, we generate one completion with up to 50 additional tokens. On HONEST, we generate 5 completions for each sample with up to 50 additional tokens.
Results The results for BOLD, WinoBias, HONEST, and RealToxicityPrompts are presented in Tables 20, 21, 22, and 23, respectively. The tables suggest that the LLMs we consider produce roughly the same amount of harmful content, and, based on Li et al. (2023), LLMs trained primarily on code produce less harmful content than LLMs trained on general web text.
8 Search Index and Attribution Tools
Following the standard set by Li et al. (2023), we build another suite of data inspection, attribution, and search tools. The NLP community has recognized the need for data inspection and has begun producing computational documentation artifacts to complement static data descriptions (Piktus et al., 2023b; Marone & Van Durme, 2023; Piktus et al., 2023a; Akiki et al., 2023, among others). Open science and open data go beyond releasing dumps of datasets.
Membership checking tools This work collects and constructs a dataset 4 times larger than that used in StarCoderBase. Compared to the initial version of The Stack, the version here contains many additional non-code sources (see Table 4). As data sizes increase, it becomes even more important to construct tools that allow for accessible and efficient data inspection.
We update the \u201cAm I in the Stack\u201d tool with repositories in\n30https://github .com/valeriobasile/hurtlex\n34 Under review as submission to TMLR\nTable 20: BOLD evaluations of open source code models.\nModel Category Negative Score Neutral Score Other Score Positive Score\nReligious Ideology 0.16 0.33 0.13 0.38\nProfession 0.07 0.6 0.06 0.27\nStarCoder2-3B Race 0.05 0.5 0.05 0.5\nGender 0.05 0.48 0.05 0.43\nPolitical Ideology 0.3 0.29 0.18 0.23\nReligious Ideology 0.12 0.32 0.12 0.45\nProfession 0.07 0.58 0.06 0.3\nStarCoderBase-3B Race 0.04 0.44 0.05 0.47\nGender 0.04 0.35 0.05 0.55\nPolitical Ideology 0.3 0.27 0.18 0.25\nReligious Ideology 0.18 0.25 0.16 0.41\nProfession 0.08 0.57 0.06 0.28\nStableCode-3B Race 0.07 0.4 0.06 0.46\nGender 0.05 0.36 0.06 0.53\nPolitical Ideology 0.32 0.27 0.18 0.25\nReligious Ideology 0.19 0.81 0.03 0.13\nProfession 0.08 0.52 0.07 0.33\nStarCoder2-7B Race 0.06 0.4 0.07 0.47\nGender 0.06 0.37 0.07 0.5\nPolitical Ideology 0.33 0.22 0.21 0.24\nReligious Ideology 0.16 0.28 0.13 0.43\nProfession 0.07 0.56 0.06 0.31\nStarCoderBase-7B Race 0.05 0.41 0.06 0.48\nGender 0.04 0.33 0.06 0.57\nPolitical Ideology 0.33 0.23 0.19 0.25\nReligious Ideology 0.16 0.27 0.14 0.43\nProfession 0.07 0.58 0.06 0.3\nCodeLlama-7B Race 0.06 0.42 0.06 0.46\nGender 0.05 0.38 0.06 0.5\nPolitical Ideology 0.3 0.28 0.19 0.24\nReligious Ideology 0.15 0.33 0.13 0.39\nProfession 0.07 0.61 0.06 0.27\nDeepSeekCoder-6.7B Race 0.05 0.46 0.05 0.44\nGender 0.04 0.34 0.06 0.56\nPolitical Ideology 0.3 0.28 0.19 0.23\nReligious Ideology 0.21 0.22 0.16 0.42\nProfession 0.09 0.51 0.07 0.33\nStarCoder2-15B Race 0.07 0.39 0.07 0.47\nGender 0.05 0.36 0.07 0.53\nPolitical Ideology 0.25 0.02 0.1 0.09\nReligious Ideology 0.16 0.31 0.13 0.41\nProfession 0.07 0.61 0.06 0.26\nStarCoderBase-15B Race 0.06 0.46 0.06 0.43\nGender 0.04 0.38 0.06 0.53\nPolitical Ideology 0.32 0.28 0.19 0.22\nReligious Ideology 0.17 0.24 0.14 0.45\nProfession 0.07 0.54 0.06 0.33\nCodeLlama-13B Race 0.07 0.36 0.07 0.5\nGender 0.05 0.35 0.06 0.53\nPolitical Ideology 0.3 0.23 0.19 0.28\nnew dataset.31This tool allows for data inspection at the username and repository level. Marone & Van\nDurme (2023) recommend releasing a documentation artifact called a Data Portrait to support lightweight\nmembership inspection. 
31 https://huggingface.co/spaces/bigcode/in-the-stack
32 https://stack-v2.dataportraits.org
33 https://github.com/huggingface/llm-vscode
Table 21: WinoBias evaluations of open source code models.
Model Male Female Average
StarCoder2-3B 0.33 -0.33 0.27
StarCoderBase-3B 0.42 -0.42 0.28
StableCode-3B 0.44 -0.44 0.39
StarCoder2-7B 0.45 -0.45 0.34
StarCoderBase-7B 0.51 -0.51 0.31
CodeLlama-7B 0.37 -0.37 0.38
DeepSeekCoder-6.7B 0.41 -0.41 0.34
StarCoder2-15B 0.36 -0.36 0.38
StarCoderBase-15B 0.55 -0.55 0.35
CodeLlama-13B 0.36 -0.36 0.37
Table 22: HONEST evaluations.
Model Score
StarCoder2-3B 0.11
StarCoderBase-3B 0.11
StableCode-3B 0.09
StarCoder2-7B 0.1
StarCoderBase-7B 0.11
CodeLlama-7B 0.11
DeepSeekCoder-6.7B 0.1
StarCoder2-15B 0.11
StarCoderBase-15B 0.1
CodeLlama-13B 0.1
Table 23: Toxicity score evaluation of open source code models.
Model Toxicity Score
StarCoder2-3B 0.05
StarCoderBase-3B 0.04
StableCode-3B 0.05
StarCoder2-7B 0.08
StarCoderBase-7B 0.04
CodeLlama-7B 0.04
DeepSeekCoder-6.7B 0.05
StarCoder2-15B 0.05
StarCoderBase-15B 0.04
CodeLlama-13B 0.04
We implement one using Bloom filters to enable matching on file contents, crucially including the non-code sources like documentation, textbooks, and papers.32 These prose data sources may describe algorithms or solutions not present elsewhere. Content creators can use our system as a simple "no code" inspection tool to check if their material occurs verbatim in our data. It also enables a rapid first-pass attribution check for coding tools built on our models.33 This system takes about 70GB, substantially smaller than the data, but provides only exact matches for long strings. If necessary, users can use the full search index for additional analysis.
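As a rough illustration of this kind of content-level membership check (not the deployed Data Portrait code), fixed-length chunks of each indexed file can be hashed into a Bloom filter; a query that reproduces a long span of indexed text will then hit, while unseen text will almost always miss. All names, sizes, and the chunking scheme below are our own assumptions.

import hashlib

class BloomFilter:
    # Tiny Bloom filter over text chunks; illustrative, not the production index.
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, chunk: str):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{chunk}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, chunk: str) -> None:
        for pos in self._positions(chunk):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, chunk: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(chunk))

bf = BloomFilter()
bf.add("return torch.tensor([0., d, d*2, d*3])")   # pretend this chunk came from an indexed file
"return torch.tensor([0., d, d*2, d*3])" in bf      # True
"return np.zeros(8)" in bf                          # almost certainly False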
Search index The preceding tools provide lightweight data inspection. However, it may be necessary to perform full-text searches that support fuzzy matching and retrieval. Following StarCoder1 (Li et al., 2023), we build an Elasticsearch index on the source code subset of The Stack v2 and make it available at https://huggingface.co/spaces/bigcode/search-v2.
9 Social Impact and Limitations
Social impact and limitations have already been documented in the BigCode project (Kocetkov et al., 2023; Ben Allal et al., 2023; Li et al., 2023; BigCode collaboration et al., 2023). In the following sections, we cover our project approach towards the responsible development of large language models for code and highlight some more recent advances.
9.1 Project Approach
Open-science StarCoder2 is the output of a community research project. The project is conducted in the spirit of Open Science (Woelfle et al., 2011; Mendez et al., 2020), focused on the responsible development and use of Code LLMs. Through open-governance practices, priority in decision-making has always yielded to the more responsible option, even if this meant introducing limitations that might impact adoption or future research (BigCode collaboration et al., 2023).
Ethical data sourcing Significant efforts from the BigCode community went into the careful curation, validation, decontamination, malware removal, license filtering, opt-out process, PII removal, structuring, packaging, hosting, licensing, and the publishing of a Dataset Card (Project, 2024) for the data used to train StarCoder2. Full transparency has been provided about the data used for training StarCoder2. A significant portion of the training dataset was sourced under license from Software Heritage (Software Heritage, 2024a).
Accelerating research BigCode's open approach to scientific collaboration (BigCode collaboration et al., 2023), open access model distribution and licensing (BigCode Project, 2023a; Malfa et al., 2023), and openness and disclosures of training data, architectures, and development are essential for the research community to have access to powerful, truly open LLMs, helping to accelerate future research (Groeneveld et al., 2024; Xu et al., 2024; Soldaini et al., 2024; Singh et al., 2024; Üstün et al., 2024; Luukkonen et al., 2023; Woelfle et al., 2011).
Open, but responsible The BigCode Open RAIL-M license (BigCode Project, 2023a) contains important use restrictions and is accompanied by an FAQ to help guide the responsible deployment and use of the model by downstream users (BigCode Project, 2023b).
Community of practice BigCode is very much a community of practice, with over 1,200 multi-disciplinary members from more than 60 countries working towards the responsible development of large language models for code (Sholler et al., 2019; Kocetkov et al., 2023; Ben Allal et al., 2023; Li et al., 2023; Muennighoff et al., 2024a; Zhuo et al., 2024). Of these members, 417 were active in the BigCode community collaboration tools within the period 27 October 2023 through 24 February 2024, the period aligning with StarCoder2 development. There has also been considerable downstream adoption of BigCode outputs, with millions of downloads collectively reported via the Hugging Face API (BigCode, 2024).
Auditable The StarCoder2 model, pre-training dataset, and supporting artifacts are easily accessible and available to anyone who wishes to conduct an independent audit (Solaiman, 2023; Mökander et al., 2023; BigCode collaboration et al., 2023).
9.2 Advancements in Code LLMs
Governance Card The BigCode Governance Card (BigCode collaboration et al., 2023) serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public and to serve as an example of intentional governance (Sholler et al., 2019) of an open research project that future endeavors can leverage to shape their own approach. The first section, Project Structure, covers the project organization, its stated goals and values, its internal decision processes, and its funding and resources. The second section, Data and Model Governance, covers decisions relating to the questions of data subject consent, privacy, and model release.
Archival of software metadata: Software metadata is vital for the classification, curation, and sharing of free and open-source software (FOSS). The source code landscape is very diverse. By generating linked data and referencing source code contributions within the Software Heritage archive from the global community of developers and scientists (Heritage, 2024), there is potential to enable a more ethical data supply chain for training LLMs (Cosmo & Zacchiroli, 2017; Abramatic et al., 2018).
Acceptable ML use: On October 19, 2023, Software Heritage published a statement that defines the acceptable machine learning use of the Software Heritage archive.
This is a significant milestone that opens\nthe door for more responsible data sourcing and licensing of AI training data (Software Heritage, 2023).\nSoftWare Hash IDentifiers (SWHID): Software Heritage provides the SWHID unique identifiers,\nintrinsically bound to the software components, and that need no central registry, to ensure that a resilient\nweb of knowledge can be built on top of the Software Heritage archive (The SWHID Specification Project,\n2024). This can also be used by downstream developers to support efforts for those companies that prioritize\na \u201csoftware bill of materials\u201d (SBOM) as a key building block in software security and software supply chain\ntransparency and risk management (Cybersecurity & Infrastructure Security Agency, 2024; Mirakhorli et al.,\n2024), for example by including the SWHIDs in the SBOM, alongside other relevant information such as\ncomponent names, versions, licenses, and source locations.\n9.3 Challenges and Risks\nOpenness and safety risks Solaiman (2023) explains how the degree of openness in the LLM development\nprocess is connected to the potential risks associated with a model release. When systems are developed in a\nfully closed manner, it is more likely for power to become concentrated among high-resourced organizations,\nand the small development team may not fully comprehend the impact and long-term consequences of the\nmodel being deployed. In addition, closed-development systems are often less auditable by external experts\nand can impede scientific progress since researchers cannot build upon each other\u2019s work. On the other hand,\nfully open development allows for community research, democratizes access to the models, and enables audits\nthroughout the whole development process. However, without appropriate guardrails, open LLM development\nposes a higher risk of misuse, as increased model access also increases the likelihood of harm caused by the\nmodel. Even though a released API can be shut down, once the model weights are released, it is nearly\nimpossible to retract them. Discussing and implementing responsible AI practices has, therefore, been front\nand center during the development of our project\u2019s LLMs.\nPrivacy compliant generated code It is difficult to correctly identify and classify the different types of\nPII so that personal data processing, transformations, and flows through code can be evaluated (Tang et al.,\n2023). Where privacy-relevant methods are invoked in generated code, checking for PII leaks to the internet,\nuse of encrypted data and anonymous IDs, will be necessary (Tang & \u00d8stvold, 2024). Downstream users are\nadvised to implement additional PII scanning, filtering, cleansing, and mitigation to ensure compliance with\ntheir intended use cases (Yang et al., 2023; Albalak et al., 2024).\nSecurity As with any open scientific research that provides open access to model weights, hyper-parameters,\ndata processing code, training code, training data, and documentation, any actor can run or fine-tune the\noptimized model with very low computing costs (Governance AI, 2024). Even with the use restrictions set\nforth within the BigCode Open RAIL-M license, this will not prevent bad actors with malicious intent from\nattempting to cause harm (Mozes et al., 2023). 
For example, code LLMs with API access could be used\nto create sophisticated polymorphic malware (CrowdStrike, 2024) that would be highly evasive to security\nproducts that rely on signature-based detection and will be able to bypass measures such as Anti-Malware\nScanning Interface (AMSI) as it eventually executes and runs code (CyberArk, 2024; Gupta et al., 2023).\nSocietal bias As has been previously established in evaluations of coding models, code LLMs can generate\ncode with a structure that reflects stereotypes about gender, race, emotion, class, the structure of names, and\nother characteristics (Chen et al., 2021; Zhuo et al., 2023a). Further evaluation and guardrail mitigations are\nrequired in the context of downstream use cases (Huang et al., 2023; Dong et al., 2024).\nRepresentation bias As discussed in previous sections, there is a lot more data in the training dataset\nfor popular programming languages like Python and Java than for niche languages like Haskell and Fortran.\nAs such, the model performs better on such high-resource languages, which may reinforce the preference of\ndevelopers towards using such languages. Fortunately, there\u2019s much ongoing research on how to improve the\nperformance of Code LLMs on low-resource languages (Cassano et al., 2023a; Zhuo et al., 2023b). Furthermore,\n38 Under review as submission to TMLR\nthe predominant natural language in source code and other datasets used is English although other languages\nare also present. As such, the model can generate code snippets provided some non-English context, but the\ngenerated code is not guaranteed to work as intended or equally as well for all languages. This could limit\nthe model\u2019s fairness and effectiveness across different coding tasks and environments (Alyafeai et al., 2024).\nTraceability Using the SWHID to trace software components is not an easy task and will challenge most\nif not all, downstream developers. Future development and advancement of tools that make it easier to trace\nsoftware components will be necessary to enable more transparent and responsible data supply chains (Cosmo\net al., 2020).\nJob augmentation vs. automation Code LLMs serve as powerful foundation models that can be fine-\ntuned to generate high-quality code, documentation, unit tests, text summaries, automation workflows, and\nmore. Chen et al. (2023) find a positive correlation between occupation exposure and wage levels/experience\npremiums, suggesting higher-paying and experience-intensive jobs may face greater displacement risks from\nLLM-powered software. Goldman Sachs (2024) suggest that AI has the potential to automate 25% of labor\ntasks in advanced economies and 10 \u2013 20% in emerging economies, however, they also state that \"those\nfears should be counterbalanced, since AI has the potential to create new job tasks or categories requiring\nspecialized human expertise\". Autor et al. (2022) reports that \u201cRoughly 60% of employment in 2018 is\nfound in job titles that did not exist in 1940.\u201d and that \"augmentation innovations boost occupational labor\ndemand, while automation innovations erode it\". Results from the task-based analysis in (World Economic\nForum, 2024) reveal that jobs with the highest potential for automation of tasks by LLMs emphasize routine\nand repetitive procedures and do not require a high degree of interpersonal communication. 
Jobs with the highest potential for augmentation by LLMs emphasize critical thinking and complex problem-solving skills, especially those in science, technology, engineering, and mathematics (STEM) fields. Ziegler et al. (2024) report that the benefits of receiving AI suggestions while coding span the full range of typically investigated aspects of productivity, such as task time, product quality, cognitive load, enjoyment, and learning. Peng et al. (2023) describe a two-year collaboration between Google Core and Google Research (Brain Team) and report that, among the 10k+ Google-internal developers using the code completion setup in their IDE, the measured acceptance rate of suggestions was 25-34%. In its coverage of the ServiceNow, Inc. (NYSE: NOW) 2024 Q4 earnings, Yahoo Finance (2024) reported that the ServiceNow platform's Now Assist skills, which use text-to-code (ServiceNow, 2024b) and text-to-workflow (ServiceNow, 2024a) LLMs based on StarCoder, augment developers and increased developer productivity and speed of innovation by 52%.
10 Conclusion
We introduced StarCoder2, a family of LLMs designed for code generation, along with The Stack v2, the largest pre-training corpus for Code LLMs built on the foundations of the Software Heritage archive. The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3 TB (900B+ tokens). Leveraging this new dataset, we trained StarCoder2 models with 3B, 7B, and 15B parameters. Our extensive Code LLM evaluations, assessing code completion, editing, and reasoning capabilities, revealed that StarCoder2-3B and StarCoder2-15B are state-of-the-art models within their respective size classes. By not only releasing the model weights but also ensuring complete transparency regarding the training data, we hope to increase trust in the developed models and empower other engineering teams and scientists to build upon our efforts.
11 Acknowledgements
This work was made possible by Software Heritage, the great library of source code (https://www.softwareheritage.org), and all the developers and scientists who contribute to the open source archives. We thank Joydeep Biswas (UT Austin), Northeastern Research Computing, and NCSA Delta for providing computing resources used for evaluation. Carolyn Jane Anderson and Arjun Guha were partially sponsored by the U.S. National Science Foundation awards SES-2326173 and SES-2326174. Jiawei Liu, Yuxiang Wei, and Lingming Zhang were partially sponsored by the U.S.
National Science Foundation award CCF-2131943.\nFederico Cassano was partly sponsored by Roblox.\nWe thank Jenny Hui, ServiceNow, for her leadership in executing the StarCoder2 Research Collaboration\nAgreement between ServiceNow, Hugging Face, and NVIDIA to enable the training of all 3 models.\nWe thank the extended members of the BigCode community for the ongoing support and for their downstream\ncontributions back to the community.\nWe also thank Hessie Jones and the Privacy Protection Collab that shared insights and lessons learned from\ntheir work in Defining Personal Information and the Remediation Framework during early exploration and\nconsideration of PII redaction.\nEvgenii Zheltonozhskii is supported by the Adams Fellowships Program of the Israel Academy of Sciences\nand Humanities.\n40 Under review as submission to TMLR\nReferences\nJean-Fran\u00e7ois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli. Building the universal archive of source\ncode.Communications of the ACM ,61(10):29\u201331, 2018. doi: 10 .1145/3183558. URL https://cacm .acm.org/\nmagazines/2018/10/231366-building-the-universal-archive-of-source-code/fulltext . (cited on\npp. 3 and 37)\nJoshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai.\nGQA:Traininggeneralizedmulti-querytransformermodelsfrommulti-headcheckpoints. InHoudaBouamor,\nJuan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural\nLanguage Processing , pp. 4895\u20134901, Singapore, December 2023. Association for Computational Linguistics.\ndoi: 10.18653/v1/2023 .emnlp-main .298. URL https://aclanthology .org/2023.emnlp-main .298. (cited\non p. 20)\nChristopher Akiki, Giada Pistilli, Margot Mieskes, Matthias Gall\u00e9, Thomas Wolf, Suzana Ilic, and Yacine\nJernite. BigScience: a case study in the social construction of a multilingual large language model. In\nWorkshop on Broadening Research Collaborations 2022 , 2022. URL https://openreview .net/forum?id=\n2e346l2PPOm . (cited on p. 2)\nChristopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin,\nand Martin Potthast. Spacerini: Plug-and-play search engines with pyserini and Hugging Face. In Yansong\nFeng and Els Lefever (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language\nProcessing: System Demonstrations , pp. 140\u2013148, Singapore, December 2023. Association for Computational\nLinguistics. doi: 10 .18653/v1/2023 .emnlp-demo .12. URL https://aclanthology .org/2023.emnlp-demo .12.\n(cited on p. 34)\nAlon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas\nMuennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto,\nand William Yang Wang. A survey on data selection for language models. arXiv preprint , February 2024.\nURL https://arxiv .org/abs/2402 .16827. (cited on p. 38)\nZaid Alyafeai, Khalid Almubarak, Ahmed Ashraf, Deema Alnuhait, Saied Alshahrani, Gubran A. Q. Ab-\ndulrahman, Gamil Ahmed, Qais Gawah, Zead Saleh, Mustafa Ghaleb, Yousef Ali, and Maged S. Al-\nShaibani. CIDAR: culturally relevant instruction dataset for Arabic. arXiv preprint , February 2024. URL\nhttps://arxiv .org/abs/2402 .03177. (cited on p. 
A Data Curation
A.1 Excluded Extensions
AL (al), AngelScript (as), AsciiDoc (asc), AspectJ (aj), Bison (bison), Boogie (bpl), C++ (), Cabal Config (project), ChucK (ck), CODEOWNERS (), Common Lisp (l, sexp), Common Workflow Language (cwl), CoNLL-U (conll, conllu), Cue Sheet (cue), CWeb (w), desktop (desktop, in, service), DIGITAL Command Language (com), DTrace (d), edn (edn), Elixir (lock), Factor (factor), GAP (g, gd), Gemfile.lock (lock), Gettext Catalog (pot), Git Config (gitmodules), GLSL (geo), Glyph Bitmap Distribution Format (bdf), GN (gn), Ignore List (dockerignore, eslintignore, gitignore, npmignore), INI (cfg, prefs, url), JAR Manifest (mf), Java Properties (properties), Jest Snapshot (snap), JetBrains MPS (mps), JSONLD (jsonld), LiveScript (ls), Makefile (d, make), Mathematica (cdf, nb), MAXScript (ms), mIRC Script (mrc), NASL (inc), nesC (nc), Nunjucks (njk), OpenEdge ABL (p, w), Pascal (, dpr, inc, pp), Perl (al, ph), PLSQL (pck, pls, tps, trg, vw), Protocol Buffer Text Format (pbt), Puppet (), PureBasic (pb), Racket (rkt, rktd), ReScript (res), reStructuredText (rest), Rich Text Format (rtf), Roff (, 1, 1d, 2, 5, 7, 8, 9, in), Roff Manpage (, 1d, 2, 3d, 4, 6, 9, man), Scala (sc), Scilab (tst), SELinux Policy (te), Shell (env), Slash (sl), Smalltalk (cs), SmPL (cocci), SQL (tab), Standard ML (sig), Stata (ihlp, sthlp), SuperCollider (sc), SWIG (i), TeX (aux, ltx, toc), TOML (lock), Turtle (ttl), VBA (frm, frx), Vim Snippet (snippet), Wavefront Material (mtl), Wikitext (wikitext), Windows Registry Entries (reg), wisp (w), World of Warcraft Addon Data (toc), X BitMap (xbm), XML (kml, pt, resx, rss), XML Property List (plist, tmcommand, tmlanguage, tmsnippet, tmtheme), Yacc (yy).
A.2 Excluded Programming Languages
2-Dimensional Array, AGS Script, Bicep, Checksums, COLLADA, CSV, Diff, DirectX 3D File, E-mail, G-code, Gerber Image, Git Revision List, Gnuplot, Go Checksums, IRC log, Jupyter Notebook, KiCad Layout, KiCad Legacy Layout, KiCad Schematic, Lasso, Linux Kernel Module, Max, Microsoft Developer Studio Project, Microsoft Visual Studio Solution, Pickle, PostScript, POV-Ray SDL, Public Key, Pure Data, Raw token data, robots.txt, STL, SubRip Text, SVG, TSV, Unity3D Asset, Wavefront Object, WebVTT, X PixMap
A.3 License detection
import re

license_file_names = [
"li[cs]en[cs]e(s?)",
"legal",
"copy(left|right|ing)",
"unlicense",
"[al]?gpl([-_ v]?)(\d\.?\d?)?", # AGPLv3
"bsd(l?)", # BSDL
"mit(x?)", # MITX
"apache",
"artistic", # Artistic.txt
"copying(v?)(\d?)", # COPYING3, COPYINGv3
"disclaimer",
"eupl",
"gfdl",
"[cm]pl",
"cc0",
"al([-_ v]?)(\d\.?\d)?", # AL2.0
"about",
"notice",
"readme",
"guidelines",
]
license_file_re = re.compile(
rf"^(|.*[-_. ])({'|'.join(license_file_names)})(|[-_. 
].*)$\", re.IGNORECASE\n)\nA.4 Permissive licenses\nSPDX-recognized license IDs 0BSD, AAL, Abstyles, AdaCore-doc, Adobe-2006, Adobe-Glyph, ADSL,\nAFL-1.1, AFL-1.2, AFL-2.0, AFL-2.1, AFL-3.0, Afmparse, AMDPLPA, AML, AMPAS, ANTLR-PD, Apache-\n1.0, Apache-1.1, Apache-2.0, APAFML, App-s2p, Artistic-1.0, Artistic-1.0-cl8, Artistic-1.0-Perl, Artistic-2.0,\nBaekmuk, Bahyph, Barr, Beerware, Bitstream-Charter, Bitstream-Vera, BlueOak-1.0.0, Boehm-GC, Borceux,\nBrian-Gladman-3-Clause, BSD-1-Clause, BSD-2-Clause, BSD-2-Clause-Patent, BSD-2-Clause-Views, BSD-3-\nClause, BSD-3-Clause-Attribution, BSD-3-Clause-Clear, BSD-3-Clause-LBNL, BSD-3-Clause-Modification,\nBSD-3-Clause-No-Nuclear-License-2014, BSD-3-Clause-No-Nuclear-Warranty, BSD-3-Clause-Open-MPI, BSD-\n4-Clause, BSD-4-Clause-Shortened, BSD-4-Clause-UC, BSD-4.3RENO, BSD-4.3TAHOE, BSD-Advertising-\nAcknowledgement, BSD-Attribution-HPND-disclaimer, BSD-Source-Code, BSL-1.0, bzip2-1.0.6, Caldera,\nCC-BY-1.0, CC-BY-2.0, CC-BY-2.5, CC-BY-2.5-AU,CC-BY-3.0, CC-BY-3.0-AT,CC-BY-3.0-DE,CC-BY-3.0-\nNL, CC-BY-3.0-US, CC-BY-4.0, CDLA-Permissive-1.0, CDLA-Permissive-2.0, CECILL-B, CERN-OHL-1.1,\nCERN-OHL-1.2, CERN-OHL-P-2.0, CFITSIO, checkmk, ClArtistic, Clips, CMU-Mach, CNRI-Jython, CNRI-\nPython, CNRI-Python-GPL-Compatible, COIL-1.0, Community-Spec-1.0, Condor-1.1, Cornell-Lossless-JPEG,\nCrossword, CrystalStacker, Cube, curl, DL-DE-BY-2.0, DOC, Dotseqn, DRL-1.0, DSDP, dtoa, dvipdfm,\nECL-1.0, ECL-2.0, EFL-1.0, EFL-2.0, eGenix, Entessa, EPICS, etalab-2.0, EUDatagrid, Fair, FreeBSD-DOC,\nFSFAP, FSFULLR, FSFULLRWD, FTL, GD, Giftware, Glulxe, GLWTPL, Graphics-Gems, GStreamer-\nexception-2005, HaskellReport, HP-1986, HPND, HPND-Markus-Kuhn, HPND-sell-variant, HPND-sell-\nvariant-MIT-disclaimer, HTMLTIDY, IBM-pibs, ICU, IJG, IJG-short, ImageMagick, iMatix, Info-ZIP, Intel,\nIntel-ACPI, ISC, Jam, JasPer-2.0, JPNIC, JSON, Kazlib, Knuth-CTAN, Latex2e, Latex2e-translated-notice,\nLeptonica, Libpng, libpng-2.0, libtiff, Linux-OpenIB, LLVM-exception, LOOP, LPL-1.0, LPL-1.02, LPPL-1.3c,\nMartin-Birgmeier, metamail, Minpack, MirOS, MIT, MIT-0, MIT-advertising, MIT-CMU, MIT-enna, MIT-\nfeh, MIT-Festival, MIT-Modern-Variant, MIT-open-group, MIT-Wu, MITNFA, mpich2, mplus, MS-LPL,\nMS-PL, MTLL, MulanPSL-1.0, MulanPSL-2.0, Multics, Mup, NAIST-2003, NASA-1.3, Naumen, NBPL-1.0,\nNCSA, Net-SNMP, NetCDF, Newsletr, NICTA-1.0, NIST-PD-fallback, NIST-Software, NLOD-1.0, NLOD-2.0,\nNRL, NTP, NTP-0, O-UDA-1.0, ODC-By-1.0, OFFIS, OFL-1.0, OFL-1.0-no-RFN, OFL-1.0-RFN, OFL-\n1.1-no-RFN, OFL-1.1-RFN, OGC-1.0, OGDL-Taiwan-1.0, OGL-Canada-2.0, OGL-UK-1.0, OGL-UK-2.0,\nOGL-UK-3.0, OGTSL, OLDAP-1.1, OLDAP-1.2, OLDAP-1.3, OLDAP-1.4, OLDAP-2.0, OLDAP-2.0.1,\nOLDAP-2.1, OLDAP-2.2, OLDAP-2.2.1, OLDAP-2.2.2, OLDAP-2.3, OLDAP-2.4, OLDAP-2.5, OLDAP-2.6,\nOLDAP-2.7, OLDAP-2.8, OML, OpenSSL, OPUBL-1.0, PHP-3.0, PHP-3.01, Plexus, PostgreSQL, PSF-2.0,\npsfrag, psutils, Python-2.0, Python-2.0.1, Qhull, Rdisc, RSA-MD, Ruby, Saxpath, SCEA, SchemeReport,\nSendmail, SGI-B-1.1, SGI-B-2.0, SGP4, SHL-0.5, SHL-0.51, SHL-2.0, SHL-2.1, SMLNJ, snprintf, Spencer-86,\nSpencer-94, Spencer-99, SSH-OpenSSH, SSH-short, SunPro, Swift-exception, SWL, TCL, TCP-wrappers,\nTermReadKey, TPDL, TTWL, TU-Berlin-1.0, TU-Berlin-2.0, UCAR, Unicode-DFS-2015, Unicode-DFS-2016,\nUnixCrypt, UPL-1.0, Vim, VSL-1.0, W3C, W3C-19980720, W3C-20150513, w3m, Widget-Workshop, Wsuipa,\nX11, X11-distribute-modifications-variant, Xdebug-1.03, Xerox, Xfig, XFree86-1.1, xinetd, xlock, Xnet, xpp,\nXSkat, Zed, Zend-2.0, Zlib, zlib-acknowledgement, 
ZPL-1.1, ZPL-2.0, ZPL-2.1\nScanCode-specific license IDs LicenseRef-scancode-{3com-microcode, 3dslicer-1.0, 4suite-1.1, accellera-\nsystemc, adi-bsd, adrian, agere-bsd, alexisisaac-freeware, amd-historical, ams-fonts, anu-license, apache-patent-\nexception, apple-attribution, apple-attribution-1997, apple-excl, apple-sscl, aravindan-premkumar, argouml,\narm-llvm-sga, array-input-method-pl, asmus, asn1, atkinson-hyperlegible-font, bakoma-fonts-1995, bea-2.1,\nbeal-screamer, beri-hw-sw-1.0, bigdigits, bigelow-holmes, biopython, bitzi-pd, blas-2017, bohl-0.2, boost-\noriginal, boutell-libgd-2021, bpmn-io, brent-corkum, brian-clapper, brian-gladman, brian-gladman-3-clause,\nbroadcom-cfe, broadcom-linux-timer, brocade-firmware, bruno-podetti, bsd-1-clause-build, bsd-1988, bsd-2-\n56 Under review as submission to TMLR\nclause-plus-advertizing, bsd-3-clause-devine, bsd-3-clause-fda, bsd-3-clause-jtag, bsd-3-clause-no-change, bsd-\n3-clause-no-trademark, bsd-3-clause-sun, bsd-ack-carrot2, bsd-artwork, bsd-atmel, bsd-axis-nomod, bsd-credit,\nbsd-dpt, bsd-export, bsd-innosys, bsd-mylex, bsd-new-derivative, bsd-new-nomod, bsd-new-tcpdump, bsd-no-\ndisclaimer, bsd-no-disclaimer-unmodified, bsd-original-muscle, bsd-original-voices, bsd-plus-mod-notice, bsd-\nsimplified-darwin, bsd-simplified-intel, bsd-simplified-source, bsd-top, bsd-top-gpl-addition, bsd-unchanged,\nbsd-unmodified, bsd-x11, bsla-no-advert, bytemark, can-ogl-alberta-2.1, can-ogl-british-columbia-2.0, can-ogl-\nnova-scotia-1.0, can-ogl-ontario-1.0, can-ogl-toronto-1.0, careware, carnegie-mellon, cavium-malloc, cc-by-2.0-\nuk, cecill-b-en, cern-attribution-1995, cgic, chicken-dl-0.2, chris-maunder, chris-stoy, classic-vb, clear-bsd-1-\nclause, click-license, cmu-mit, cmu-simple, cmu-template, code-credit-license-1.0.1, code-credit-license-1.1.0,\ncodeguru-permissions, codesourcery-2004, commonj-timer, compass, componentace-jcraft, compuphase-linking-\nexception, cosl, cpm-2022, cpp-core-guidelines, crcalc, cryptopp, csprng, cve-tou, cwe-tou, cximage, d-zlib,\ndamail, dante-treglia, dbad-1.1, delorie-historical, dhtmlab-public, dl-de-by-1-0-de, dl-de-by-1-0-en, dl-de-by-\n2-0-en, dmalloc, dmtf-2017, docbook, douglas-young, drl-1.1, dropbear, dropbear-2016, dtree, dwtfnmfpl-3.0,\ndynamic-drive-tou, ecfonts-1.0, egenix-1.0.0, ellis-lab, emit, emx-library, energyplus-bsd, epaperpress, eric-glass,\nerrbot-exception, etalab-2.0-en, fabien-tassin, far-manager-exception, fastbuild-2012-2020, fatfs, fftpack-2004,\nfilament-group-mit, flex-2.5, flora-1.1, font-alias, fpl, fplot, fraunhofer-iso-14496-10, free-art-1.3, freebsd-\nboot, freebsd-first, freemarker, fsf-notice, fujion-exception-to-apache-2.0, gareth-mccaughan, gary-s-brown,\ngdcl, geoff-kuenning-1993, ghostpdl-permissive, glut, good-boy, greg-roelofs, gregory-pietsch, gtpl-v1, gtpl-\nv2, gtpl-v3, happy-bunny, hdf4, hdf5, hdparm, hidapi, historical-ntp, homebrewed, hp-snmp-pp, html5,\nhttpget, ian-kaplan, ian-piumarta, ibm-as-is, ibm-dhcp, ibm-icu, ibm-nwsc, ibm-sample, ibpp, icot-free,\nidt-notice, ietf, ietf-trust, ilmid, indiana-extreme, infineon-free, info-zip-1997-10, info-zip-2001-01, info-zip-\n2002-02, info-zip-2003-05, info-zip-2004-05, info-zip-2005-02, info-zip-2007-03, info-zip-2009-01, inno-setup,\nintel-bsd, intel-bsd-2-clause, intel-osl-1989, intel-osl-1993, intel-royalty-free, iso-14496-10, iso-8879, itu, ja-sig,\njason-mayes, jasper-1.0, java-app-stub, jdbm-1.00, jdom, jetty, jgraph, jpnic-mdnkit, jpython-1.1, jscheme,\njsfromhell, jython, kalle-kaukonen, keith-rule, 
kerberos, kevan-stannard, kevlin-henney, khronos, kumar-\nrobotics, lcs-telegraphics, ldap-sdk-free-use, libgeotiff, libmib, libmng-2007, libsrv-1.0.2, lil-1, lilo, linux-device-\ndrivers, linuxbios, linuxhowtos, llnl, logica-1.0, lucre, make-human-exception, matt-gallagher-attribution,\nmatthew-kwan, mattkruse, mediainfo-lib, mgopen-font-license, michael-barr, michigan-disclaimer, mit-1995,\nmit-license-1998, mit-modification-obligations, mit-nagy, mit-no-advert-export-control, mit-no-trademarks,\nmit-old-style, mit-old-style-sparse, mit-readme, mit-specification-disclaimer, mit-synopsys, mit-taylor-variant,\nmit-veillard-variant, mod-dav-1.0, motorola, mpeg-iso, mpeg-ssg, ms-sspl, ms-ws-routing-spec, msj-sample-\ncode, mulanpsl-1.0-en, mulanpsl-2.0-en, mulle-kybernetik, musl-exception, mx4j, netcat, netcomponents,\nnetron, newlib-historical, newran, nice, niels-ferguson, nilsson-historical, nist-srd, node-js, nonexclusive,\nnortel-dasa, notre-dame, nrl-permission, ntlm, ntpl-origin, nvidia, nvidia-2002, nvidia-gov, nwhm, nysl-\n0.9982, nysl-0.9982-jp, o-young-jong, oasis-ws-security-spec, object-form-exception-to-mit, odl, odmg, ogc,\nogl-1.0a, ogl-canada-2.0-fr, ogl-wpd-3.0, openmarket-fastcgi, openorb-1.0, opensaml-1.0, openssl, opml-1.0,\nopnl-1.0, opnl-2.0, oreilly-notice, oswego-concurrent, other-permissive, owtchart, ozplb-1.0, ozplb-1.1, paolo-\nmessina-2000, paraview-1.2, patent-disclaimer, paul-mackerras, paul-mackerras-binary, paul-mackerras-new,\npaul-mackerras-simplified, paulo-soares, paypal-sdk-2013-2016, pcre, pd-mit, pd-programming, perl-1.0, peter-\ndeutsch-document, philippe-de-muyter, phorum-2.0, php-2.0.2, pine, pngsuite, politepix-pl-1.0, ppp, protobuf,\npsf-3.7.2, psytec-freesoft, purdue-bsd, pybench, pycrypto, pygres-2.2, python-cwi, qlogic-microcode, qpopper,\nqualcomm-turing, quirksmode, radvd, red-hat-attribution, red-hat-bsd-simplified, reportbug, ricebsd, richard-\nblack, robert-hubley, rsa-1990, rsa-cryptoki, rsa-demo, rsa-md4, rtools-util, rute, ryszard-szopa, saas-mit, saf,\nsash, sata, sbia-b, scancode-acknowledgment, scanlogd-license, scansoft-1.2, scintilla, scribbles, script-asylum,\nsecret-labs-2011, service-comp-arch, sgi-cid-1.0, sgi-glx-1.0, sglib, shital-shah, simpl-1.1, softfloat, softfloat-2.0,\nsoftsurfer, sparky, speechworks-1.1, ssleay, ssleay-windows, stanford-pvrg, stlport-2000, stlport-4.5, stream-\nbenchmark, stu-nicholls, sun-rpc, sun-source, sunsoft, supervisor, svndiff, swig, symphonysoft, synopsys-mit,\nsynthesis-toolkit, takao-abe, takuya-ooura, tcg-spec-license-v1, tekhvc, tested-software, tex-live, things-i-\nmade-public-license, tiger-crypto, tigra-calendar-3.2, tigra-calendar-4.0, tim-janik-2003, timestamp-picker,\ntso-license, ttcl, ttyp0, tumbolia, twisted-snmp, ubc, unicode, unicode-icu-58, unicode-mappings, unlimited-\nbinary-use-exception, unpbook, us-govt-unlimited-rights, usrobotics-permissive, utopia, vcalendar, vince,\nvisual-idiot, visual-numerics, vixie-cron, w3c-03-bsd-license, westhawk, whistle, whitecat, wide-license, william-\nalexander, wingo, wol, wordnet, wrox, ws-addressing-spec, ws-policy-specification, ws-trust-specification,\nwtfnmfpl-1.0, wxwidgets, wxwindows-u-3.0, x11-acer, x11-adobe, x11-adobe-dec, x11-dec1, x11-dec2, x11-doc,\n57 Under review as submission to TMLR\nx11-dsc, x11-hanson, x11-lucent-variant, x11-oar, x11-opengl, x11-quarterdeck, x11-realmode, x11-sg, x11-\nstanford, x11-tektronix, x11-x11r5, x11-xconsortium-veillard, xfree86-1.0, xmldb-1.0, xxd, yale-cas, yensdesign,\nzeusbench, zpl-1.0, 
zsh, zuora-software, zveno-research}\nNon-licenses The following contributor license agreements, warranty disclaimers, and other license amendments were not considered during license labeling: LicenseRef-scancode-{dco-1.1, generic-cla, google-cla, jetty-ccla-1.1, newton-king-cla, generic-exception, generic-export-compliance, generic-tos, generic-trademark, warranty-disclaimer}\nA.5 Pull Requests\nTable 24 shows the volume of PR renderings for various sequence lengths (measured in characters). We list the volume of the base files for the top 20 languages in Table 25.\nA.6 StackOverflow\nWe used the following prompt to score the answers:\nBelow is an instruction from a user and a candidate\u2019s answer. Evaluate whether or not the answer is a good example of how AI Assistant should respond to the user\u2019s instruction. Please assign a score using the following 10-point scale:\n1: The response is entirely off-topic, contains significant inaccuracies, or is incomprehensible. It fails to address the user\u2019s query in any meaningful way.\n2: The answer is largely irrelevant, vague, or controversial. It contains some elements that relate to the topic but misses the core of the user\u2019s question or includes substantial misinformation.\n3: The response is somewhat relevant but remains incomplete or contains elements that are off-topic or controversial. Key aspects of the user\u2019s query are left unaddressed.\n4: The answer addresses the user\u2019s question to some extent but lacks depth or clarity. It may be somewhat helpful but is not comprehensive or detailed.\n5: The response is relevant and offers a basic answer to the user\u2019s question but lacks detail or specificity. It\u2019s helpful but not fully developed or insightful.\n6: The answer is moderately helpful and addresses most aspects of the user\u2019s question. It might lack some depth or contain minor inaccuracies or irrelevant information.\n7: The response is quite helpful and addresses the user\u2019s query well, but it might not be from an AI Assistant\u2019s perspective. It could resemble content from other sources like blog posts or web pages.\n8: The answer is comprehensive and relevant, written from an AI assistant\u2019s perspective. It addresses the user\u2019s query effectively but may have minor areas for improvement in focus, conciseness, or organization.\n9: The response is almost perfect, providing a clear, comprehensive, and well-organized answer from an AI assistant\u2019s perspective. It might have very minor areas for improvement in terms of engagement or insight.\n10: The answer is exemplary, perfectly addressing the user\u2019s query from an AI Assistant\u2019s perspective. It is highly informative, expertly written, engaging, and insightful, with no discernible areas for improvement.\nPlease write \"Score: \" in the last line, and then provide a brief reasoning you used to derive the rating score.
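For illustration only, here is a minimal sketch (not the paper's released pipeline code) of how a grading prompt like the one above could be applied and its reply parsed; the query_llm callable, the prompt formatting, and the min_score threshold are hypothetical placeholders:

import re

SCORE_RE = re.compile(r"Score:\s*(\d+)")

def parse_score(reply: str) -> int | None:
    # The prompt asks the judge model to emit a line of the form "Score: <n>".
    match = SCORE_RE.search(reply)
    return int(match.group(1)) if match else None

def keep_answer(grading_prompt: str, instruction: str, answer: str, query_llm, min_score: int = 8) -> bool:
    # query_llm is a stand-in for whatever client sends a prompt to the judge LLM and
    # returns its text reply; the concatenation below is illustrative formatting only.
    reply = query_llm(f"{grading_prompt}\n\nInstruction:\n{instruction}\n\nAnswer:\n{answer}")
    score = parse_score(reply)
    return score is not None and score >= min_score

In a setup like this, only answers whose parsed score clears the chosen threshold would be retained.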
Table 24: Volume of the pull requests dataset when we restrict the sequence length.\nSeqlen (characters) Volume (GB)\n25000 19.6\n50000 38.7\n75000 54.34\n100000 67.31\n200000 103.52\n300000 126.8\n400000 143.65\n500000 156.76\n600000 167.21\n700000 175.94\n800000 183.18\n900000 189.32\n1000000 194.58\nTable 25: Size of the base files in the range of changes for the top 20 languages in Pull Requests.\nLanguage Volume (GB)\nPython 13.46\nJavaScript 9.55\nJava 8.37\nMarkdown 7.34\nC++ 5.89\nGo 5.59\nJSON 4.13\nTypeScript 3.96\nC# 3.76\nYAML 3.1\nXML 2.55\nC 2.34\nHTML 2.31\nRust 2.27\nPHP 2.09\nRuby 1.73\nproject.pbxproj 1.51\nScala 1.25\nTSX 1.2\nSwift 0.9\nA.7 Kaggle Notebooks templates\nWe remove the following templates if they appear at the beginning of a Kaggle notebook:\nTEMPLATE_1 = \u2019# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\nimport os\nfor dirname, _, filenames in os.walk(\"/kaggle/input\"):\n    for filename in filenames:\n        print(os.path.join(dirname, filename))\n# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\"\n# You can also write temporary files to /kaggle/temp/, but they won\u2019t be saved outside of the current session\u2019\nTEMPLATE_2 = \u2019# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\\n\u2019
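As a minimal illustrative sketch (not the authors' released preprocessing code) of the template removal just described, assuming TEMPLATE_1 and TEMPLATE_2 are the strings listed above:

KAGGLE_TEMPLATES = [TEMPLATE_1, TEMPLATE_2]  # the boilerplate strings defined above

def strip_kaggle_template(notebook_source: str) -> str:
    # Drop a known Kaggle starter template if the notebook's code starts with it,
    # then trim any leading blank lines left behind.
    for template in KAGGLE_TEMPLATES:
        if notebook_source.startswith(template):
            return notebook_source[len(template):].lstrip("\n")
    return notebook_source

A check of this kind could be applied to the flattened notebook text before any further filtering.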
Table 26: Top 10 detected malware signatures.\nSignature Count\nSanesecurity.Malware.28845.BadVBS 11876\nwinnow.compromised.ts.jsexploit.5 2251\nSanesecurity.Malware.26492.JsHeur 2247\nSanesecurity.Spam.8879 1597\nSanesecurity.Malware.25834.JsHeur 1560\nSanesecurity.Malware.27112.JsHeur 1258\nSanesecurity.Malware.26222.JsHeur 888\nPorcupine.Malware.52833 814\nSanesecurity.SpamL.8887 792\nSanesecurity.Malware.26557.JsHeur 728\nTable 27: Top 10 languages by the number of potentially malicious files.\nLanguage Count\nText 13281\nHTML 11336\nJavaScript 10210\nVBScript 7947\nLogos 3283\nMarkdown 2736\nLinker Script 1390\nXML 1260\nVBA 990\nJSON 547\nB Processing Pipeline\nB.1 Malware removal\nWe show the top-10 detected malware signatures in Table 26 and the top-10 languages by potentially malicious files in Table 27.
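The signature names in Table 26 follow ClamAV's naming conventions, but the paper does not spell out the exact scanning setup, so the following is only an illustrative sketch of counting detections with the clamscan CLI; it assumes ClamAV plus third-party signature databases (e.g., Sanesecurity) are installed:

import subprocess
from collections import Counter

def count_malware_signatures(root_dir: str) -> Counter:
    # Recursively scan root_dir; clamscan prints one "<path>: <signature> FOUND" line per
    # detection and exits with status 1 when anything is found, hence check=False.
    result = subprocess.run(
        ["clamscan", "-r", "--infected", "--no-summary", root_dir],
        capture_output=True, text=True, check=False,
    )
    hits = Counter()
    for line in result.stdout.splitlines():
        if line.endswith(" FOUND"):
            hits[line[: -len(" FOUND")].rsplit(": ", 1)[-1]] += 1
    return hits

Files with any detection could then be dropped from the dataset, which is the effect the two tables above summarize.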
C Data Composition\nC.1 TheStackV2-train-smol\n\u2022Configuration languages\n\u2013Ant Build System\n\u2013CMake\n\u2013Dockerfile\n\u2013Go Module\n\u2013Gradle\n\u2013INI\n\u2013Java Properties\n\u2013Makefile\n\u2013Maven POM\n\u2013TOML\n\u2022Configuration files:\n\u2013CMakeLists.txt\n\u2013Cargo.toml\n\u2013DESCRIPTION\n\u2013Gemfile\n\u2013Makefile\n\u2013Makefile.am\n\u2013NAMESPACE\n\u2013Package.swift\n\u2013Pipfile\n\u2013build.gradle\n\u2013build.gradle.kts\n\u2013composer.json\n\u2013conda.yml\n\u2013configure.ac\n\u2013docker-compose.yaml\n\u2013docker-compose.yml\n\u2013go.mod\n\u2013package.json\n\u2013pom.xml\n\u2013pyproject.toml\n\u2013requirements-dev.txt\n\u2013requirements-prod.txt\n\u2013requirements.in\n\u2013requirements.test.txt\n\u2013requirements.txt\n\u2013setup.cfg\n\u2013tsconfig.json\n\u2013yarn.lock\nC.2 TheStackV2-train-full\nIn Table 28, we summarize the data volume for the subsampled languages.\nTable 28: Subsampling volumes for languages in the Stack v2 dataset.\nFinal volume Languages\n200GB Java, JavaScript\n100GB HTML\n8GB CSS, Java Server Pages, JSON, SCSS, Smali, XML, YAML\n1GB BibTeX, Gettext Catalog, Graphviz (DOT), Java Properties, Roff, Roff Manpage, Web Ontology Language", "0_question": "What is the main focus of the BigCode project?", "0_answers:eli5": "The BigCode project is focused on developing and using large language models (LLMs) for code.", "0_answers:expert": "The BigCode project is focused on the responsible development of Large Language Models for Code (Code LLMs).", "0_additional_depth_q:follow up question": "When does the DeepSeekCoder-33B model outperform other models of comparable size?", "0_additional_depth_q:answers:eli5": "DeepSeekCoder-33B is the best model for code completion for high-resource languages.", "0_additional_depth_q:answers:expert": "DeepSeekCoder-33B performs better than other models of comparable size in code completion for high-resource languages.", "0_additional_breath_q:follow up question": "How many programming languages are covered in the Stack v2?", "0_additional_breath_q:answers:eli5": "The Stack v2 covers more than 600 programming languages.", "0_additional_breath_q:answers:expert": "The Stack v2 covers 619 programming languages.", "1_question": "How is StarCoder2 different from StarCoderBase?", "1_answers:eli5": "StarCoder2 is trained on a larger dataset and uses a more advanced training process than StarCoderBase.", "1_answers:expert": "StarCoder2 is trained on a dataset that is 4\u00d7 larger than the first StarCoder dataset and uses a two-stage training process.", "1_additional_depth_q:follow up question": "How is StarCoder2 different from StarCoderBase?", "1_additional_depth_q:answers:eli5": "StarCoder2 is trained on a dataset that is 4\u00d7 larger than the first StarCoder dataset and uses a two-stage training process.", "1_additional_depth_q:answers:expert": "StarCoder2 is trained on a dataset that is 4\u00d7 larger than the first StarCoder dataset and uses a two-stage training process. The first stage uses a 4k context window, and the second stage uses a 16k context window. StarCoder2 models are released at 3B, 7B, and 15B parameters; the largest matches the 15B size of StarCoderBase.", "1_additional_breath_q:follow up question": "How is StarCoder2 different from StarCoderBase in terms of training data?", "1_additional_breath_q:answers:eli5": "StarCoder2 was trained on a dataset that is 4 times larger than StarCoderBase.", "1_additional_breath_q:answers:expert": "StarCoder2 is trained on a dataset that is 4\u00d7 larger than the StarCoderBase dataset, using a two-stage training process.", "2_question": "What is The Stack v2?", "2_answers:eli5": "The Stack v2 is a dataset of permissively licensed source code that is used to train StarCoder2.", "2_answers:expert": "The Stack v2 is a dataset of 900B+ unique tokens, 4\u00d7 larger than the first StarCoder dataset, that includes code repositories, GitHub issues, pull requests, Kaggle and Jupyter notebooks, code documentation, and other natural language datasets related to math, coding, and reasoning.", "2_additional_depth_q:follow up question": "What is the definition of 'near duplicates' of notebooks?", "2_additional_depth_q:answers:eli5": "Two notebooks are considered 'near duplicates' if they share the same code structure and the majority of their code. This means that they might have some minor differences, like different variable names or comments, but overall, they do the same thing.", "2_additional_depth_q:answers:expert": "In our context, 'near duplicates' refers to notebooks that share a high degree of similarity in their code structure and content. We determined this similarity using a combination of textual and structural analysis. Specifically, we used a sequence matching algorithm to compare the code sequences of different notebooks. We also considered the structural similarity of the notebooks, such as the order and nesting of code blocks and Markdown cells.", "2_additional_breath_q:follow up question": "What is the process of formatting pull requests in StarCoder2?", "2_additional_breath_q:answers:eli5": "Pull requests are requests to merge particular code changes from one branch into another on GitHub. Typically, they involve multiple rounds of code review discussions and additional cycles of code changes before they get merged into the target branch.\\n\\nWhen formatting the pull requests in StarCoder2, the engineers first remove any changes to the base, those that are not approved or merged, and those that lack initial diffs. Then, they remove any titles less than 10 characters or that contain specific words related to dependencies or releases. They also remove descriptions less than 20 characters or that contain a specific word.\\n\\nNext, they truncate the titles to 500 characters and the descriptions to 80 lines, only displaying the first 60 and the last 20 lines. If the description length still exceeds 1000 characters, it is truncated.\\n\\nFinally, they remove comments from bots and strip auto-generated text when users post via email reply.
They anonymize the usernames of authors and remove comments from PRs with less than 20 characters unless they are PR review comments.", "2_additional_breath_q:answers:expert": "1) Remove any changes to the base, those that are not approved or merged, and those that lack initial diffs.\\n\n2) Remove any titles less than 10 characters or that contain specific words related to dependencies or releases.\\n\\n3) Remove descriptions less than 20 characters or that contain a specific word.\\n\n4) Truncate the titles to 500 characters and the descriptions to 80 lines, only displaying the first 60 and the last 20 lines. If the description length still exceeds 1000 characters, it is truncated.\\n\\n5) Remove comments from bots and strip auto-generated text when users post via email reply.\\n\\n6) Anonymize the usernames of authors.\\n\\n7) Remove comments from PRs with less than 20 characters unless they are PR review comments.", "3_question": "What are the key findings of the evaluation of StarCoder2?", "3_answers:eli5": "StarCoder2 outperforms other models of similar size and even outperforms CodeLlama-34B, a model more than twice its size.", "3_answers:expert": "The StarCoder2-3B model outperforms other Code LLMs of similar size (StableCode-3B and DeepSeekCoder-1.3B) on most benchmarks. Moreover, it matches or surpasses the performance of StarCoderBase-15B. The StarCoder2-15B model significantly outperforms other models of comparable size (CodeLlama-13B), and matches or outperforms CodeLlama-34B. DeepSeekCoder-33B is the best model at code completion benchmarks for high-resource languages. However, StarCoder2-15B matches or outperforms DeepSeekCoder-33B on low-resource programming languages (e.g., D, Julia, Lua, and Perl). Moreover, when we consider benchmarks that require models to reason about code execution (Gu et al., 2024) or mathematics (Cobbe et al., 2021), we find that StarCoder2-15B outperforms DeepSeekCoder-33B.", "3_additional_depth_q:follow up question": "How does the evaluation of StarCoder2 compare to that of previous models?", "3_additional_depth_q:answers:eli5": "StarCoder2 performs significantly better than previous Code LLMs of comparable size on most benchmarks, showing a large improvement over existing models.", "3_additional_depth_q:answers:expert": "The evaluation of StarCoder2 demonstrates its superior performance compared to previous Code LLMs of similar size across a comprehensive set of benchmarks. StarCoder2-3B outperforms other Code LLMs of comparable size (StableCode-3B and DeepSeekCoder-1.3B) on most benchmarks and matches or surpasses the performance of StarCoderBase-15B. StarCoder2-15B significantly outperforms other models of comparable size (CodeLlama-13B) and matches or outperforms CodeLlama-34B. While DeepSeekCoder-33B exhibits the best performance at code completion benchmarks for high-resource languages, StarCoder2-15B matches or outperforms DeepSeekCoder-33B on low-resource programming languages (e.g., D, Julia, Lua, and Perl). 
Additionally, StarCoder2-15B surpasses DeepSeekCoder-33B on benchmarks requiring models to reason about code execution or mathematics.", "3_additional_breath_q:follow up question": "What are the differences among the training details of StarCoder2-3b, StarCoder2-7B, and StarCoder2-15B?", "3_additional_breath_q:answers:eli5": "Sorry, I do not have access to the full context of the paper where this information might be described, so I can not answer this question.", "3_additional_breath_q:answers:expert": "Sorry, I do not have access to the full context of the paper where this information might be described, so I can not answer this question.", "4_question": "Is StarCoder2 available to the public?", "4_answers:eli5": "I don't know if StarCoder2 is available to the public.", "4_answers:expert": "This paper does not mention whether StarCoder2 is available to the public.", "4_additional_depth_q:follow up question": "How is The Stack v2 different from The Stack v1?", "4_additional_depth_q:answers:eli5": "The Stack v2 is like a bigger and better library of code that the AI can learn from. It includes even more code from different programming languages, as well as other types of code-related information like GitHub issues and pull requests.", "4_additional_depth_q:answers:expert": "The Stack v2 is an expanded version of The Stack v1, a dataset of permissively licensed source code. The Stack v2 includes code from 619 programming languages, compared to 384 in The Stack v1. It also includes additional data sources, such as GitHub issues, pull requests, Kaggle notebooks, code documentation, and other natural language datasets related to math, coding, and reasoning.", "4_additional_breath_q:follow up question": "Does StarCoder2 have a free version available to the public?", "4_additional_breath_q:answers:eli5": "This paper does not mention whether or not StarCoder2 has a free version available to the public.", "4_additional_breath_q:answers:expert": "This paper does not mention whether or not StarCoder2 has a free version available to the public.", "5_question": "What are the potential applications of StarCoder2?", "5_answers:eli5": "StarCoder2 can be used to help developers write and edit code, as well as to improve the quality and security of software.", "5_answers:expert": "StarCoder2 has the potential to enhance all phases of the software development cycle, including speeding up the implementation of new projects, improving quality assurance for developed software, helping detect and fix bugs, simplifying maintenance tasks, and easing migration to newer software.", "5_additional_depth_q:follow up question": "What are potential applications of StarCoder2 for the software development cycle?", "5_additional_depth_q:answers:eli5": "StarCoder2 can be applied to various phases of the software development cycle, including the initiation of new projects, the improvement of the quality of developed software, the detection and correction of bugs, the simplification of maintenance tasks, and the facilitation of migration to newer software.", "5_additional_depth_q:answers:expert": "StarCoder2 has the potential to enhance all phases of the software development cycle, including speeding up the implementation of new projects, improving quality assurance for developed software, helping detect and fix bugs, simplifying maintenance tasks, and easing migration to newer software.", "5_additional_breath_q:follow up question": "How can StarCoder2 be used in the software development industry?", 
"5_additional_breath_q:answers:eli5": "StarCoder2 can be used in the software development industry to help developers do their jobs more quickly and efficiently. For example, StarCoder2 can be used to generate code snippets, debug code, and find documentation. This can save developers time and effort, allowing them to focus on more creative tasks.", "5_additional_breath_q:answers:expert": "StarCoder2 has the potential to enhance all phases of the software development cycle, including speeding up the implementation of new projects, improving quality assurance for developed software, helping detect and fix bugs, simplifying maintenance tasks, and easing migration to newer software.", "6_question": "What are the ethical considerations related to the development of Code LLMs?", "6_answers:eli5": "I don't know about the ethical considerations related to the development of Code LLMs.", "6_answers:expert": "This paper does not discuss the ethical considerations related to the development of Code LLMs.", "6_additional_depth_q:follow up question": "What are some of the ethical considerations that should be taken into account when developing Code LLMs?", "6_additional_depth_q:answers:eli5": "Ethical considerations are important during development to ensure that the LLM is used for good purposes and not for harm.", "6_additional_depth_q:answers:expert": "Some of the ethical considerations that should be taken into account when developing Code LLMs include:\n1. **Bias and Discrimination**: Code LLMs can perpetuate biases and discrimination present in the training data. Developers should take steps to mitigate these biases by using diverse datasets and techniques to reduce bias.\n2. **Malicious Use**: Code LLMs can be used to create malware, spam, and other malicious content. Developers should include safeguards to prevent the LLM from being used for these purposes.\n3. **Job Displacement**: Code LLMs can automate tasks that are currently performed by humans. Developers should consider the potential impact on employment and develop strategies to mitigate job displacement.\n4. **Privacy**: Code LLMs can generate code that accesses or leaks sensitive information. Developers should include safeguards to protect user privacy.", "6_additional_breath_q:follow up question": "How does the output of this code LLM compare to the output of GPT-4?", "6_additional_breath_q:answers:eli5": "This report does not cover the output of GPT-4, so I cannot compare the two.", "6_additional_breath_q:answers:expert": "This report does not cover the output of GPT-4, so I cannot compare the two.", "7_question": "What are the future directions for research on Code LLMs?", "7_answers:eli5": "I don't know about the future directions for research on Code LLMs.", "7_answers:expert": "This paper does not discuss the future directions for research on Code LLMs.", "7_additional_depth_q:follow up question": "What are the future research directions for exploring Code LLMs further?", "7_additional_depth_q:answers:eli5": "This research paper focuses on improving the existing StarCoder LLM model for code by training it on a larger and more comprehensive dataset. 
It does not discuss future research directions for exploring Code LLMs further.", "7_additional_depth_q:answers:expert": "This research paper does not discuss future research directions for exploring Code LLMs further.", "7_additional_breath_q:follow up question": "How can Code LLMs enhance the software development cycle?", "7_additional_breath_q:answers:eli5": "Code LLMs can help software developers become more productive and satisfied with their work by speeding up the implementation of new projects, improving quality assurance, detecting and fixing bugs, simplifying maintenance tasks, and easing migration to newer software.", "7_additional_breath_q:answers:expert": "Code LLMs can enhance the software development cycle in a variety of ways. For example, they can help to speed up the implementation of new projects by generating code snippets and templates. They can also improve quality assurance by identifying potential bugs and suggesting fixes. Additionally, Code LLMs can help to detect and fix bugs by analyzing code and identifying errors. They can also simplify maintenance tasks by automating repetitive tasks, such as refactoring code. Finally, Code LLMs can ease migration to newer software by generating code that is compatible with the new software.", "8_question": "What are the limitations of StarCoder2?", "8_answers:eli5": "I don't know about the limitations of StarCoder2.", "8_answers:expert": "This paper does not discuss the limitations of StarCoder2.", "8_additional_depth_q:follow up question": "What is the main difference between the Stack v1 and the Stack v2?", "8_additional_depth_q:answers:eli5": "The Stack v2 is built on top of the Software Heritage archive, which spans over 600 programming languages, while the Stack v1 only included 384. The Stack v2 also includes other high-quality open data sources, such as Github issues, pull requests, Kaggle and Jupyter notebooks, code documentation, and other natural language datasets related to math, coding, and reasoning.", "8_additional_depth_q:answers:expert": "The main difference between the Stack v1 and the Stack v2 is that the Stack v2 includes both permissively licensed and unlicensed files, while the Stack v1 only included permissively licensed files. The Stack v2 also includes a number of other improvements, such as a larger dataset size, a more comprehensive set of programming languages, and a more sophisticated data cleaning process.", "8_additional_breath_q:follow up question": "What are the limitations of StarCoder2?", "8_additional_breath_q:answers:eli5": "This question can not be answered because the provided document does not specify any limitations of StarCoder2.", "8_additional_breath_q:answers:expert": "This question can not be answered because the provided document does not specify any limitations of StarCoder2.", "9_question": "How can I learn more about StarCoder2?", "9_answers:eli5": "I don't know how you can learn more about StarCoder2.", "9_answers:expert": "This paper provides a comprehensive overview of StarCoder2. You can also visit the BigCode project website for more information.", "9_additional_depth_q:follow up question": "How does the preprocessing pipeline of the Jupyter notebook cleaning differ from that of the Kaggle notebook cleaning?", "9_additional_depth_q:answers:eli5": "The preprocessing pipeline for cleaning Jupyter and Kaggle notebooks is slightly different. 
For Jupyter notebooks, the goal is to preserve the markdown structure of the text blocks, while for Kaggle notebooks, the output cells are not included. Additionally, Kaggle notebooks undergo a templated text removal step at the beginning of the notebooks.", "9_additional_depth_q:answers:expert": "The preprocessing pipeline for cleaning Jupyter and Kaggle notebooks differs in two main ways. First, for Jupyter notebooks, the markdown structure of the text blocks is preserved, while for Kaggle notebooks, the output cells are not included. Second, Kaggle notebooks undergo an additional templated text removal step at the beginning of the notebooks.", "9_additional_breath_q:follow up question": "Can you explain why StarCoder2-7B does not perform as well as StarCoder2-3B and StarCoder2-15B?", "9_additional_breath_q:answers:eli5": "The reason why StarCoder2-7B does not perform as well as StarCoder2-3B and StarCoder2-15B is not entirely clear to the authors of this document.", "9_additional_breath_q:answers:expert": "The reason why StarCoder2-7B does not perform as well as StarCoder2-3B and StarCoder2-15B is unclear to the authors of this report." }