---
title: README
emoji: 
colorFrom: gray
colorTo: red
sdk: static
pinned: false
---

<img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
<style type="text/css">
    #bclogo {
        display: block;
        margin-left: auto;
        margin-right: auto }
</style>

# BigCode

BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artefacts of this collaboration: StarCoder, a state-of-the-art language model for code; The Stack, the largest available pretraining dataset of permissively licensed code; and SantaCoder, a 1.1B parameter model for code.

---

## 💫StarCoder
StarCoder is a 15.5B parameter language model for code, trained on 1 trillion tokens from 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has an 8,192-token context window, and supports fill-in-the-middle (FIM) completion.
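
A minimal usage sketch (assuming the `transformers` library, the `accelerate` package for `device_map="auto"`, and enough GPU memory; the FIM sentinel tokens below are taken from the StarCoder tokenizer and may differ for other checkpoints):

```python
# Minimal sketch: plain completion and fill-in-the-middle with StarCoder.
# Note: bigcode/starcoder is a gated model; accept the license on the Hub
# and log in with `huggingface-cli login` before downloading.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Plain left-to-right completion
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))

# Fill-in-the-middle: the model predicts the span between prefix and suffix
fim_prompt = "<fim_prefix>def greet(name):\n    <fim_suffix>\n    return message<fim_middle>"
inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```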

### Models
- [Paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view): A technical report about StarCoder.
- [GitHub](https://github.com/bigcode-project/starcoder/tree/main): All you need to know about using or fine-tuning StarCoder.
- [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
- [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
- [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on The Stack.
- [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder-based PII detector.

### Tools & Demos
- [StarCoder Chat](https://hf.co/chat/starcoder): Chat with StarCoder!
- [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
- [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder!
- [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-playground): Edit with StarCoder!

### Data & Governance
- [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
- [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into a tech assistant.
- [Governance Card](https://huggingface.co/spaces/bigcode/governance-card): A card outlining the governance of the model.
- [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
- [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text search over code in the pretraining dataset.
- [StarCoder Membership Test](https://stack.dataportraits.org): Blazing-fast check of whether code was present in the pretraining dataset.

---

## 📑The Stack
The Stack is 6.4TB of permissively licensed source code in 358 programming languages.

- [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
- [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-issues): Collection of GitHub issues.
- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
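
If you just want to peek at the data, the sketch below streams a single-language slice of the near-deduplicated version with 🤗 Datasets instead of downloading the full 6.4TB; the `data_dir="data/python"` layout is an assumption based on the dataset card, and the dataset is gated, so accept the terms and log in first:

```python
# Minimal sketch: stream a Python-only slice of The Stack (near-deduplicated)
# rather than downloading the whole dataset. The "data/python" directory layout
# is assumed from the dataset card; other languages live under data/<language>.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",
    split="train",
    streaming=True,
)

for i, example in enumerate(ds):
    print(example["content"][:200])  # "content" holds the raw source file text
    if i == 2:
        break
```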

---

## 🎅SantaCoder
SantaCoder, a.k.a. smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.

- [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
- [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
- [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
- [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
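
A minimal completion sketch with SantaCoder (the `trust_remote_code=True` flag is assumed to be needed because the checkpoint ships its own modeling code on the Hub):

```python
# Minimal sketch: code completion with SantaCoder (1.1B parameters).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

inputs = tokenizer("def print_hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```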