lvwerra HF staff committed on
Commit
94f2d5b
1 Parent(s): 315312f

Update Org Card.

Files changed (1)
  1. README.md +55 -9
README.md CHANGED
@@ -6,15 +6,61 @@ colorTo: red
  sdk: static
  pinned: false
  ---
- <p>
- <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
- </p>

- <p>
- BigCode is an open scientific collaboration working on responsible training of large language models for coding applications.

- You can find more information on the main website at <a href="https://www.bigcode-project.org/" class="underline">https://www.bigcode-project.org</a>. You can also follow Big Code on Twitter at <a href="https://twitter.com/BigCodeProject" class="underline">https://twitter.com/BigCodeProject</a>.

- In this organization, you can find <a href="https://huggingface.co/datasets/bigcode/the-stack" class="underline">The Stack</a>, a 6.4TB of source code in 358 programming languages from permissive licenses.
- You can also find <a href="https://huggingface.co/bigcode/santacoder" class="underline">SantaCoder</a>, a strong 1.1B code generation model trained on Java, JavaScript & Python. In addition to some data governance tools.
- </p>
+ <img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
+ <style type="text/css">
+ #bclogo {
+ display: block;
+ margin-left: auto;
+ margin-right: auto }
+ </style>

+ # BigCode

+ BigCode is an open scientific collaboration working on responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artefacts of this collaboration: StarCoder, a state-of-the-art language model for code; The Stack, the largest available pretraining dataset of permissively licensed code; and SantaCoder, a 1.1B parameter model for code.
+
+ ---
+
+ ## 💫StarCoder
+ StarCoder is a 15.5B parameter language model for code, trained on 1T tokens from 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
+
+ ### Models
+ - [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
+ - [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
+ - [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on The Stack.
+ - [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder-based PII detector.
+
+ ### Tools & Demos
+ - [StarCoder Chat](https://hf.co/chat/starcoder): Chat with StarCoder!
+ - [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
+ - [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder!
+ - [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-playground): Edit with StarCoder!
+
+ ### Data & Governance
+ - [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
+ - [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into a tech assistant.
+ - [Governance Card](): A card outlining the governance of the model.
+ - [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
+ - [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text code search over the pretraining dataset.
+ - [StarCoder Membership Test](https://stack.dataportraits.org): Blazing-fast test of whether code was present in the pretraining dataset.
+
+ ---
+
+ ## 📑The Stack
+ The Stack is a 6.4TB dataset of permissively licensed source code in 358 programming languages.
+
+ - [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
+ - [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
+ - [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-issues): Collection of GitHub issues.
+ - [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
+ - [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
+
+ ---
+
+ ## 🎅SantaCoder
+ SantaCoder, a.k.a. little StarCoder: same architecture, but trained only on Python, Java, and JavaScript.
+
+ - [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
+ - [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
+ - [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
+ - [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
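
The StarCoder entry above lists fill-in-the-middle among the model's capabilities. A minimal sketch of how such a prompt might be assembled follows. The sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are an assumption not stated in this card, so check them against the special tokens of the tokenizer you load. The helper `build_fim_prompt` is hypothetical and only illustrates the prefix-suffix-middle ordering.

```python
# Sentinel strings are an ASSUMPTION (not given in the org card); confirm them
# against the special tokens of the tokenizer you actually use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: place the code before and after a gap in
    prefix-suffix-middle order, ending with the middle sentinel."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code surrounding the gap we would like a model to fill in.
code_before = "def add(a, b):\n    return "
code_after = "\n\nprint(add(1, 2))\n"

prompt = build_fim_prompt(code_before, code_after)
print(prompt)
```

A model trained with this layout would be asked to continue from the final sentinel, so that its output is the code belonging between `code_before` and `code_after`.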