metadata

title: README
emoji: ✨
colorFrom: gray
colorTo: red
sdk: static
pinned: false

BigCode

BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main website or follow BigCode on Twitter. In this organization you can find the artefacts of this collaboration: StarCoder, a state-of-the-art language model for code; The Stack, the largest available pretraining dataset of permissively licensed code; and SantaCoder, a 1.1B-parameter model for code.

💫StarCoder

StarCoder is a 15.5B-parameter language model for code trained on 1T tokens covering 80+ programming languages. It uses Multi-Query Attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
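To make the "efficient generation" point concrete, here is a rough back-of-the-envelope sketch of why MQA helps: all query heads share a single key/value head, so the key/value cache kept during generation shrinks by a factor equal to the number of query heads. Only the 8,192-token context comes from the description above; the layer count, head count, head dimension, and 16-bit cache precision are illustrative assumptions, not figures taken from this README.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence, in bytes."""
    # The leading 2 accounts for storing one key and one value vector per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed model shape (hypothetical values, for illustration only).
N_LAYERS = 40
N_QUERY_HEADS = 48
HEAD_DIM = 128
CONTEXT = 8192  # context window stated above

# Standard multi-head attention: one key/value head per query head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, CONTEXT)
# Multi-query attention: a single key/value head shared by all query heads.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, CONTEXT)

print(f"multi-head cache:  {mha / 2**20:.0f} MiB")
print(f"multi-query cache: {mqa / 2**20:.0f} MiB")
print(f"reduction factor:  {mha // mqa}x")
```

Under these assumptions, the cache for a full-length sequence drops from several GiB to a few hundred MiB, which is what leaves room for larger batches or longer prompts at inference time.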

Models

StarCoder: StarCoderBase further trained on Python.
StarCoderBase: Trained on 80+ languages from The Stack.
StarEncoder: Encoder model trained on The Stack.
StarPii: StarEncoder-based PII detector.
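As a minimal sketch of the fill-in-the-middle capability mentioned above, the snippet below builds an infilling prompt and, optionally, runs it through the Hugging Face `transformers` library. The sentinel token strings and the `bigcode/starcoder` checkpoint name are assumptions that should be checked against the model card; `build_fim_prompt` and `complete_middle` are hypothetical helper names. Downloading the checkpoint may also require accepting the license agreement and authenticating first.

```python
# Assumed sentinel tokens; verify them against the tokenizer of the checkpoint you use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after the gap so the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def complete_middle(prefix, suffix, checkpoint="bigcode/starcoder", max_new_tokens=32):
    """Generate the missing middle section (downloads a large model when called)."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    inputs = tokenizer(build_fim_prompt(prefix, suffix), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, i.e. the proposed middle.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_fim_prompt("def add(a, b):\n    ", "\n    return result\n")
print(prompt)
```

Calling `complete_middle` with the same prefix and suffix would ask the model to propose the body between them; it is left uncalled here because it needs the full model weights.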

Tools & Demos

StarCoder Chat: Chat with StarCoder!
VSCode Extension: Code with StarCoder!
StarCoder Playground: Write with StarCoder!
StarCoder Editor: Edit with StarCoder!

Data & Governance

StarCoderData: Pretraining dataset of StarCoder.
Tech Assistant Prompt: With this prompt you can turn StarCoder into a technical assistant.
Governance Card: A card outlining the governance of the model.
StarCoder License Agreement: The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
StarCoder Search: Full-text search over the code in the pretraining dataset.
StarCoder Membership Test: Blazing-fast test of whether code was present in the pretraining dataset.

📑The Stack

The Stack is a 6.4TB dataset of permissively licensed source code in 358 programming languages.

The Stack: Exact-deduplicated version of The Stack.
The Stack dedup: Near-deduplicated version of The Stack (recommended for training).
The Stack issues: Collection of GitHub issues.
The Stack Metadata: Metadata of the repositories in The Stack.
Am I in the Stack: Check if your data is in The Stack and request opt-out.
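Since the full dataset is several terabytes, it is usually more practical to stream a single language than to download everything. The sketch below uses the Hugging Face `datasets` library in streaming mode; the repository id `bigcode/the-stack-dedup`, the `data/<language>` directory layout, the lowercase directory names, and the `content` field are assumptions to verify against the dataset card, and `stack_data_dir` and `stream_stack` are hypothetical helper names. Access may require accepting the dataset terms and authenticating.

```python
from itertools import islice


def stack_data_dir(language: str) -> str:
    """Map a language name to its assumed sub-directory in the dataset repository."""
    # Assumption: one lowercase directory per language under data/.
    return f"data/{language.strip().lower()}"


def stream_stack(language="python", repo="bigcode/the-stack-dedup", limit=3):
    """Yield the source text of the first `limit` files for one language."""
    # Imported lazily so the path helper works without datasets installed.
    from datasets import load_dataset

    ds = load_dataset(
        repo,
        data_dir=stack_data_dir(language),
        split="train",
        streaming=True,  # iterate over the data without downloading it all
    )
    for row in islice(ds, limit):
        yield row["content"]


print(stack_data_dir("Python"))
```

Iterating over `stream_stack("python")` would then yield a few file contents lazily; pointing `repo` at the exact-deduplicated dataset instead is a one-argument change.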

🎅SantaCoder

SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.

SantaCoder: SantaCoder Model.
SantaCoder Demo: Write with SantaCoder.
SantaCoder Search: Search code in the pretraining dataset.
SantaCoder License: The OpenRAIL license for SantaCoder.