---
title: README
emoji: 
colorFrom: gray
colorTo: red
sdk: static
pinned: false
---

<img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
<style type="text/css">
    #bclogo {
        display: block;
        margin-left: auto;
        margin-right: auto }
</style>

# BigCode

BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artifacts of this collaboration: **StarCoder 2**, a state-of-the-art language model for code, and the previous **StarCoder** family of models; **The Stack**, the largest available pretraining dataset of permissively licensed code; **Astraios**, a study of scaling instruction-tuned language models for code via diverse fine-tuning methods; **OctoPack**, artifacts for instruction tuning large code models; and **SantaCoder**, a 1.1B parameter model for code.

---
<details>
  <summary>
    <b><font size="+1">💫StarCoder 2</font></b>
  </summary>
  StarCoder2 is a series of 3B, 7B, and 15B models trained on 3.3 to 4.3 trillion tokens of code from The Stack v2 dataset, which covers over 600 programming languages. The models use grouped-query attention (GQA) and a context window of 16,384 tokens with sliding window attention of 4,096 tokens, and were trained with the Fill-in-the-Middle objective.
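
As a rough illustration of what sliding window attention means, the sketch below builds a causal attention mask in plain Python. It is not taken from the StarCoder2 code base: the function name is hypothetical, and it assumes the convention that a window of `w` lets each token attend to itself and the `w - 1` tokens before it (implementations can differ by one position).

```python
def sliding_window_causal_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j.
    Assumed convention: position i sees itself and the previous window - 1 tokens.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Tiny example: 6 tokens with a window of 3 (StarCoder2 uses 16,384 and 4,096).
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row is one query position. Under this convention no row has more than `window` allowed positions, which is what bounds the attention cost per token even when the context is much longer than the window.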
  
  ### Models
  - [Paper](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view): A technical report about StarCoder2.
  - [GitHub](https://github.com/bigcode-project/starcoder2): All you need to know about using or fine-tuning StarCoder2.
  - [StarCoder2-15B](https://huggingface.co/bigcode/starcoder2-15b): 15B model trained on 600+ programming languages for 4.3T tokens.
  - [StarCoder2-7B](https://huggingface.co/bigcode/starcoder2-7b): 7B model trained on 17 programming languages for 3.7T tokens.
  - [StarCoder2-3B](https://huggingface.co/bigcode/starcoder2-3b): 3B model trained on 17 programming languages for 3.3T tokens.

  ### Data & Governance
  - [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
  - [StarCoder2 License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
  - [The Stack train smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids): The Software Heritage identifiers for the training dataset of StarCoder2 3B and 7B with 600B+ unique tokens.
  - [The Stack train full](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids): The Software Heritage identifiers for the training dataset of StarCoder2 15B with 900B+ unique tokens.
  - [StarCoder2 Search](https://huggingface.co/spaces/bigcode/search-v2): Full-text code search over the pretraining dataset.
  - [StarCoder2 Membership Test](https://stack-v2.dataportraits.org/): Blazing-fast test of whether code was present in the pretraining dataset.
</details>
---
<details>
  <summary>
    <b><font size="+1">📑The Stack v2</font></b>
  </summary>
  The Stack v2 is a 67.5TB dataset of source code in over 600 programming languages with permissive licenses or no license.

  - [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2): Exact deduplicated version of The Stack v2.
  - [The Stack v2 dedup](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup): Near deduplicated version of The Stack v2 (recommended for training).
  - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
</details>
---
<details>
  <summary>
    <b><font size="+1">💫StarCoder</font></b>
  </summary>
  StarCoder is a 15.5B-parameter language model for code trained on 1T tokens covering 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
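
To make fill-in-the-middle concrete, here is a minimal sketch of how such a prompt can be assembled. The helper name `build_fim_prompt` is hypothetical, and the sentinel strings are assumed to match the ones shown on the StarCoder model card; check them against the tokenizer of the model you actually use, since other models may spell them differently.

```python
# Sentinel tokens, assumed to match the StarCoder tokenizer (verify before use).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after the gap so the model generates the gap."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The model is expected to continue after <fim_middle> with the missing body.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The resulting string can be passed to the model like any other prompt; the text generated after the final sentinel is the proposed infill.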
  
  ### Models
  - [Paper](https://arxiv.org/abs/2305.06161): A technical report about StarCoder.
  - [GitHub](https://github.com/bigcode-project/starcoder/tree/main): All you need to know about using or fine-tuning StarCoder.
  - [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
  - [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
  - [StarCoder+](https://huggingface.co/bigcode/starcoderplus): StarCoderBase further trained on English web data.
  - [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on The Stack.
  - [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder-based PII detector.
  
  ### Tools & Demos
  - [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder Models!
  - [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
  - [StarChat](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground): Chat with StarCoder!
  - [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into a tech assistant.
  - [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-editor): Edit with StarCoder!
  
  ### Data & Governance
  - [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
  - [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
  - [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
  - [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text code search over the pretraining dataset.
  - [StarCoder Membership Test](https://stack.dataportraits.org/): Blazing-fast test of whether code was present in the pretraining dataset.
</details>
---
<details>
  <summary>
    <b><font size="+1">📑The Stack</font></b>
  </summary>
  The Stack v1 is a 6.4TB dataset of source code in 358 programming languages under permissive licenses.
  
  - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact deduplicated version of The Stack.
  - [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near deduplicated version of The Stack (recommended for training).
  - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
</details>
---
<details>
  <summary>
    <b><font size="+1">🐙OctoPack</font></b>
  </summary>
  
  OctoPack consists of data, evals & models relating to Code LLMs that follow human instructions.
  
  - [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
  - [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
  - [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
  - [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
  - [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
  - [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for Code Fixing/Explaining/Synthesizing across Python/JavaScript/Java/Go/C++/Rust.
  - [OctoCoder](https://huggingface.co/bigcode/octocoder): Instruction-tuned version of StarCoder, trained on CommitPackFT.
  - [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
  - [OctoGeeX](https://huggingface.co/bigcode/octogeex): Instruction-tuned version of CodeGeeX2, trained on CommitPackFT.
</details>
---
<details>
  <summary>
    <b><font size="+1">✨Astraios</font></b>
  </summary>
  
  Astraios is a suite of 28 instruction-tuned language models for code, covering 7 tuning methods at 4 model scales.
  
  - [Paper](https://arxiv.org/abs/2401.00788): Research paper with details about all components of Astraios.
  - [GitHub](https://github.com/bigcode-project/astraios): All code used for the creation of Astraios.
  - [Astraios-1B](https://huggingface.co/collections/bigcode/astraios-1b-6576ff1b8e449026ae327c1c): Collection of StarCoderBase-1B models instruction tuned on CommitPackFT + OASST with 7 methods.
  - [Astraios-3B](https://huggingface.co/collections/bigcode/astraios-3b-6577127317ee44ff547252d3): Collection of StarCoderBase-3B models instruction tuned on CommitPackFT + OASST with 7 methods.
  - [Astraios-7B](https://huggingface.co/collections/bigcode/astraios-7b-65788b509c5c26f96c08d576): Collection of StarCoderBase-7B models instruction tuned on CommitPackFT + OASST with 7 methods.
  - [Astraios-15B](https://huggingface.co/collections/bigcode/astraios-15b-65788b7476b6de79781054cc): Collection of StarCoderBase-15B models instruction tuned on CommitPackFT + OASST with 7 methods.
</details>
---
<details>
  <summary>
    <b><font size="+1">🎅SantaCoder</font></b>
  </summary>
  SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.
  
  - [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
  - [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
  - [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
  - [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
</details>
---