loubnabnl HF staff committed on
Commit
0faf2d6
1 Parent(s): 2475504

add SC2 and stack v2

Files changed (1)
  1. README.md +27 -4
README.md CHANGED
@@ -19,6 +19,28 @@ pinned: false
 
 BigCode is an open scientific collaboration working on responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artifacts of this collaboration: **StarCoder**, a state-of-the-art language model for code, **OctoPack**, artifacts for instruction tuning large code models, **The Stack**, the largest available pretraining dataset of permissively licensed code, and **SantaCoder**, a 1.1B parameter model for code.
 
+---
+<details>
+<summary>
+<b><font size="+1">💫StarCoder 2</font></b>
+</summary>
+StarCoder2 models are a series of 3B, 7B, and 15B models trained on 3.3 to 4.3 trillion tokens of code from The Stack v2 dataset, covering over 600 programming languages. The models use grouped-query attention (GQA) and a context window of 16,384 tokens with sliding-window attention of 4,096 tokens, and were trained with the fill-in-the-middle (FIM) objective.
+
+### Models
+- [Paper](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view): A technical report about StarCoder2.
+- [GitHub](https://github.com/bigcode-project/starcoder2): All you need to know about using or fine-tuning StarCoder2.
+- [StarCoder2-15B](https://huggingface.co/bigcode/starcoder2-15b): 15B model trained on 600+ programming languages for 4.3T tokens.
+- [StarCoder2-7B](https://huggingface.co/bigcode/starcoder2-7b): 7B model trained on 17 programming languages for 3.7T tokens.
+- [StarCoder2-3B](https://huggingface.co/bigcode/starcoder2-3b): 3B model trained on 17 programming languages for 3.3T tokens.
+
+### Data & Governance
+- [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
+- [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
+- [The Stack train smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids): The Software Heritage identifiers for the training dataset of StarCoder2-3B and StarCoder2-7B, with 600B+ unique tokens.
+- [The Stack train full](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids): The Software Heritage identifiers for the training dataset of StarCoder2-15B, with 900B+ unique tokens.
+- [StarCoder2 Search](https://huggingface.co/spaces/bigcode/search-v2): Full-text search over code in the pretraining dataset.
+- [StarCoder2 Membership Test](): Blazing-fast test of whether code was present in the pretraining dataset (TODO).
+</details>
 ---
 <details>
 <summary>
@@ -72,12 +94,13 @@ BigCode is an open scientific collaboration working on responsible training of l
 <summary>
 <b><font size="+1">📑The Stack</font></b>
 </summary>
-The Stack is 6.4TB of source code in 358 programming languages under permissive licenses.
-
+The Stack v1 is 6.4TB of source code in 358 programming languages under permissive licenses.<br>
+The Stack v2 is 67.5TB of source code in over 600 programming languages, under permissive licenses or no license.
+
 - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
+- [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2): Exact-deduplicated version of The Stack v2.
 - [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
+- [The Stack v2 dedup](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup): Near-deduplicated version of The Stack v2 (recommended for training).
-- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
-- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
 - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
 </details>
 ---
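
As a quick orientation for the StarCoder2 models added above, here is a minimal sketch of loading and sampling one of them with the Hugging Face transformers library. The checkpoint name comes from the model list in the diff; the prompt, the generation settings, and the FIM special tokens (which follow the original StarCoder convention and are assumed to carry over to StarCoder2) are illustrative assumptions, not anything specified in this commit.

```python
# A minimal sketch, assuming a recent transformers release with StarCoder2
# support. The checkpoint name is from the model list above; prompt and
# generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # smallest of the 3B/7B/15B series
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Plain left-to-right completion of a code prefix.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))

# Because the models were trained with the fill-in-the-middle objective,
# they can also complete a gap between a known prefix and suffix. These
# special tokens follow the original StarCoder convention and are assumed
# (not confirmed by this commit) to apply to StarCoder2 as well.
fim_prompt = (
    "<fim_prefix>def add(a, b):\n    <fim_suffix>\n    return result<fim_middle>"
)
fim_inputs = tokenizer(fim_prompt, return_tensors="pt")
fim_outputs = model.generate(**fim_inputs, max_new_tokens=32)
print(tokenizer.decode(fim_outputs[0]))
```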
 
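For the datasets listed in the diff, here is a hedged sketch of streaming a single language subset of The Stack v1 with the datasets library. The `data/<language>` directory layout and the `content` column follow The Stack's dataset card, and access assumes you have accepted the dataset's terms and logged in to the Hub. Note that, per the descriptions above, The Stack v2 training datasets ship Software Heritage identifiers rather than file contents, so they cannot be read this way.

```python
# A minimal sketch, assuming the "data/<language>" layout and the "content"
# column described on The Stack's dataset card, plus a Hub login with the
# dataset terms accepted. Streaming avoids downloading the full 6.4TB.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack-dedup",  # near-deduplicated v1, recommended for training
    data_dir="data/python",     # one subdirectory per programming language
    split="train",
    streaming=True,
)
for example in ds.take(3):
    print(example["content"][:200])
```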