---
title: README
emoji: ✨
colorFrom: gray
colorTo: red
sdk: static
pinned: false
---

<img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
<style type="text/css">
    #bclogo {
        display: block;
        margin-left: auto;
        margin-right: auto }
</style>

# BigCode

BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artefacts of this collaboration: **StarCoder 2**, a state-of-the-art language model for code, and the previous **StarCoder** family of models; **The Stack**, the largest available pretraining dataset of permissively licensed code; **Astraios**, a study of scaling instruction-tuned language models for code via diverse fine-tuning methods; **OctoPack**, artifacts for instruction tuning large code models; and **SantaCoder**, a 1.1B-parameter model for code.

---
<details>
  <summary>
    <b><font size="+1">💫StarCoder 2</font></b>
  </summary>
  StarCoder2 is a series of 3B, 7B, and 15B models trained on 3.3 to 4.3 trillion tokens of code from The Stack v2, a dataset covering over 600 programming languages. The models use grouped-query attention (GQA), have a context window of 16,384 tokens with sliding-window attention over 4,096 tokens, and were trained with the Fill-in-the-Middle objective.
  
  ### Models
  - [Paper](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view): A technical report about StarCoder2.
  - [GitHub](https://github.com/bigcode-project/starcoder2): All you need to know about using or fine-tuning StarCoder2.
  - [StarCoder2-15B](https://huggingface.co/bigcode/starcoder2-15b): 15B model trained on 600+ programming languages and 4.3T tokens.
  - [StarCoder2-7B](https://huggingface.co/bigcode/starcoder2-7b): 7B model trained on 17 programming languages for 3.7T tokens.
  - [StarCoder2-3B](https://huggingface.co/bigcode/starcoder2-3b): 3B model trained on 17 programming languages for 3.3T tokens.
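
The context and window sizes in the description above can be made concrete with a small sketch. It assumes causal attention in which the 4,096-token window counts the current token itself; the exact convention depends on the implementation, and `attended_span` is a hypothetical helper written for illustration, not part of any StarCoder2 code.

```python
CONTEXT_LENGTH = 16_384  # context window stated in the description
SLIDING_WINDOW = 4_096   # sliding-window size stated in the description


def attended_span(position: int, window: int = SLIDING_WINDOW) -> range:
    """Return the 0-based positions visible to the token at `position`.

    Assumption: causal attention, with the window including the token itself.
    """
    if not 0 <= position < CONTEXT_LENGTH:
        raise ValueError(f"position must be in [0, {CONTEXT_LENGTH})")
    start = max(0, position - window + 1)
    return range(start, position + 1)


for pos in (0, 4_095, 16_383):
    span = attended_span(pos)
    print(f"token {pos}: attends to {span.start}..{span.stop - 1} ({len(span)} positions)")
```

Under this assumption, early tokens see everything before them, while tokens past position 4,095 see only the most recent 4,096 positions.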

  ### Data & Governance
  - [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
  - [StarCoder2 License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
  - [The Stack train smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids): The Software Heritage identifiers for the training dataset of StarCoder2 3B and 7B with 600B+ unique tokens.
  - [The Stack train full](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids): The Software Heritage identifiers for the training dataset of StarCoder2 15B with 900B+ unique tokens.
  - [StarCoder2 Search](https://huggingface.co/spaces/bigcode/search-v2): Full-text code search over the pretraining dataset.
  - [StarCoder2 Membership Test](https://stack-v2.dataportraits.org/): Fast test of whether code was present in the pretraining dataset.
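
As a rough worked example, the token counts listed in this section imply that each model saw its training data several times. The sketch below only divides the stated training tokens by the stated unique tokens; because the unique-token figures are lower bounds ("600B+", "900B+"), the results are upper bounds on the number of passes, not reported values.

```python
# (training tokens, unique tokens) taken from the figures in this section.
RUNS = {
    "StarCoder2-3B": (3.3e12, 600e9),
    "StarCoder2-7B": (3.7e12, 600e9),
    "StarCoder2-15B": (4.3e12, 900e9),
}


def max_passes(trained_tokens: float, unique_tokens: float) -> float:
    """Upper bound on passes over the data, given a lower bound on unique tokens."""
    return trained_tokens / unique_tokens


for name, (trained, unique) in RUNS.items():
    print(f"{name}: at most ~{max_passes(trained, unique):.1f} passes")
```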
</details>
---
<details>
  <summary>
    <b><font size="+1">📑The Stack v2</font></b>
  </summary>
  The Stack v2 is a 67.5TB dataset of source code in over 600 programming languages with permissive licenses or no license.

  - [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2): Exact deduplicated version of The Stack v2.
  - [The Stack v2 dedup](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup): Near deduplicated version of The Stack v2 (recommended for training).
  - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
</details>
---
<details>
  <summary>
    <b><font size="+1">💫StarCoder</font></b>
  </summary>
  StarCoder is a 15.5B-parameter language model for code trained on 1T tokens across 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
  
  ### Models
  - [Paper](https://arxiv.org/abs/2305.06161): A technical report about StarCoder.
  - [GitHub](https://github.com/bigcode-project/starcoder/tree/main): All you need to know about using or fine-tuning StarCoder.
  - [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
  - [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
  - [StarCoder+](https://huggingface.co/bigcode/starcoderplus): StarCoderBase further trained on English web data.
  - [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on The Stack.
  - [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder-based PII detector.
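
Fill-in-the-middle prompting can be sketched as plain string assembly: the code before and after the gap is wrapped in sentinel tokens, and the model is asked to generate what belongs in between. The sentinel strings below are the ones commonly used with the StarCoder tokenizer, but treat them as an assumption and check the tokenizer of the checkpoint you use; `build_fim_prompt` is a hypothetical helper for illustration.

```python
# Assumed sentinel tokens; verify them against the model's tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the code around a gap in prefix, suffix, middle order."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The resulting string is what would be passed to the model as input; the generated continuation is the text for the gap.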
  
  ### Tools & Demos
  - [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder Models!
  - [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
  - [StarChat](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground): Chat with StarCoder!
  - [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into a tech assistant.
  - [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-editor): Edit with StarCoder!
  
  ### Data & Governance
  - [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
  - [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
  - [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
  - [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text code search over the pretraining dataset.
  - [StarCoder Membership Test](https://stack.dataportraits.org/): Fast test of whether code was present in the pretraining dataset.
</details>
---
<details>
  <summary>
    <b><font size="+1">📑The Stack</font></b>
  </summary>
  The Stack v1 is a 6.4TB dataset of permissively licensed source code in 358 programming languages.
  
  - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact deduplicated version of The Stack.
  - [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near deduplicated version of The Stack (recommended for training).
  - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
</details>
---
<details>
  <summary>
    <b><font size="+1">🐙OctoPack</font></b>
  </summary>
  
  OctoPack consists of data, evals & models relating to Code LLMs that follow human instructions.
  
  - [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
  - [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
  - [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
  - [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
  - [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
  - [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for Code Fixing/Explaining/Synthesizing across Python/JavaScript/Java/Go/C++/Rust.
  - [OctoCoder](https://huggingface.co/bigcode/octocoder): StarCoder instruction-tuned on CommitPackFT.
  - [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
  - [OctoGeeX](https://huggingface.co/bigcode/octogeex): CodeGeeX2 instruction-tuned on CommitPackFT.
</details>
---
<details>
  <summary>
    <b><font size="+1">✨Astraios</font></b>
  </summary>
  
  Astraios is a suite of 28 instruction-tuned language models for code, covering 7 fine-tuning methods at 4 model scales (1B, 3B, 7B, and 15B).
  
  - [Paper](https://arxiv.org/abs/2401.00788): Research paper with details about all components of Astraios.
  - [GitHub](https://github.com/bigcode-project/astraios): All code used for the creation of Astraios.
  - [Astraios-1B](https://huggingface.co/collections/bigcode/astraios-1b-6576ff1b8e449026ae327c1c): Collection of StarCoderBase-1B models instruction-tuned on CommitPackFT + OASST with 7 methods.
  - [Astraios-3B](https://huggingface.co/collections/bigcode/astraios-3b-6577127317ee44ff547252d3): Collection of StarCoderBase-3B models instruction-tuned on CommitPackFT + OASST with 7 methods.
  - [Astraios-7B](https://huggingface.co/collections/bigcode/astraios-7b-65788b509c5c26f96c08d576): Collection of StarCoderBase-7B models instruction-tuned on CommitPackFT + OASST with 7 methods.
  - [Astraios-15B](https://huggingface.co/collections/bigcode/astraios-15b-65788b7476b6de79781054cc): Collection of StarCoderBase-15B models instruction-tuned on CommitPackFT + OASST with 7 methods.
</details>
---
<details>
  <summary>
    <b><font size="+1">🎅SantaCoder</font></b>
  </summary>
  SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.
  
  - [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
  - [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
  - [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
  - [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
</details>
---