lvwerra HF staff committed on
Commit
181a039
1 Parent(s): f137b12

Update README.md

Files changed (1)
  1. README.md +40 -34
README.md CHANGED
@@ -20,7 +20,6 @@ pinned: false
  BigCode is an open scientific collaboration working on responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow Big Code on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artefacts of this collaboration: **StarCoder**, a state-of-the-art language model for code, **OctoPack**, artifacts for instruction tuning large code models, **The Stack**, the largest available pretraining dataset with permissive code, and **SantaCoder**, a 1.1B parameter model for code.
 
  ---
-
  <details>
  <summary>
  <b><font size="+1">💫StarCoder</font></b>
@@ -50,39 +49,46 @@ BigCode is an open scientific collaboration working on responsible training of l
  - [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text search code in the pretraining dataset.
  - [StarCoder Membership Test](https://stack.dataportraits.org/): Blazing fast test if code was present in pretraining dataset.
  </details>
-
  ---
-
- ## 🐙OctoPack
- OctoPack consists of data, evals & models relating to Code LLMs that follow human instructions.
-
- - [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
- - [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
- - [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
- - [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
- - [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
- - [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for Code Fixing/Explaining/Synthesizing across Python/JavaScript/Java/Go/C++/Rust.
- - [OctoCoder](https://huggingface.co/bigcode/octocoder): Instruction tuned model of StarCoder by training on CommitPackFT.
- - [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
- - [OctoGeeX](https://huggingface.co/bigcode/octogeex): Instruction tuned model of CodeGeeX2 by training on CommitPackFT.
-
+ <details>
+ <summary>
+ <b><font size="+1">🐙OctoPack</font></b>
+ </summary>
+
+ OctoPack consists of data, evals & models relating to Code LLMs that follow human instructions.
+
+ - [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
+ - [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
+ - [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
+ - [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
+ - [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
+ - [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for Code Fixing/Explaining/Synthesizing across Python/JavaScript/Java/Go/C++/Rust.
+ - [OctoCoder](https://huggingface.co/bigcode/octocoder): Instruction tuned model of StarCoder by training on CommitPackFT.
+ - [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
+ - [OctoGeeX](https://huggingface.co/bigcode/octogeex): Instruction tuned model of CodeGeeX2 by training on CommitPackFT.
+ </details>
  ---
-
- ## 📑The Stack
- The Stack is 6.4TB of source code in 358 programming languages from permissive licenses.
-
- - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact deduplicated version of The Stack.
- - [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near deduplicated version of The Stack (recommended for training).
- - [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
- - [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
- - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
-
+ <details>
+ <summary>
+ <b><font size="+1">📑The Stack</font></b>
+ </summary>
+ The Stack is 6.4TB of source code in 358 programming languages from permissive licenses.
+
+ - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact deduplicated version of The Stack.
+ - [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near deduplicated version of The Stack (recommended for training).
+ - [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
+ - [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
+ - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
+ </details>
  ---
-
- ## 🎅SantaCoder
- SantaCoder aka smol StarCoder: same architecture but only trained on Python, Java, JavaScript.
-
- - [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
- - [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
- - [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
- - [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
+ <details>
+ <summary>
+ <b><font size="+1">🎅SantaCoder</font></b>
+ </summary>
+ SantaCoder aka smol StarCoder: same architecture but only trained on Python, Java, JavaScript.
+
+ - [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
+ - [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
+ - [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
+ - [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
+ </details>
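
The pattern this commit applies, replacing each `##` heading section with a collapsible `<details>` block, can be sketched as a small helper. This is only an illustration and not part of the commit: the function name `to_details` and its parameters are hypothetical, and it always emits a blank line after `</summary>` (as the OctoPack block does), on the assumption that CommonMark-style renderers need it to parse the body as Markdown.

```python
def to_details(title: str, body_lines: list[str], font_size: str = "+1") -> str:
    """Wrap a section title and its Markdown body in a collapsible <details> block.

    Hypothetical helper: mirrors the markup added in this commit.
    """
    lines = [
        "<details>",
        "<summary>",
        f'<b><font size="{font_size}">{title}</font></b>',
        "</summary>",
        "",  # blank line after </summary>, as in the OctoPack block
        *body_lines,
        "</details>",
    ]
    return "\n".join(lines)


# Rebuild a shortened version of the SantaCoder section from the diff.
section = to_details(
    "🎅SantaCoder",
    [
        "SantaCoder aka smol StarCoder: same architecture but only trained on Python, Java, JavaScript.",
        "",
        "- [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.",
    ],
)
print(section)
```

Keeping the title inside `<summary>` means the section name stays visible while the link list is collapsed, which matches the existing StarCoder block.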