Team members 315

Organization Card


BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main website or follow BigCode on Twitter. In this organization you can find the artefacts of this collaboration: StarCoder, a state-of-the-art language model for code; The Stack, the largest available pretraining dataset of permissively licensed code; and SantaCoder, a 1.1B-parameter model for code.


StarCoder is a 15.5B-parameter language model for code, trained on 1T tokens covering 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
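To make fill-in-the-middle concrete, here is a minimal sketch of how such a prompt might be assembled: the code before and after the gap is wrapped in sentinel tokens, and the model is asked to generate the missing middle. The sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are assumed here to match the StarCoder tokenizer and should be checked against the model card. The helper `build_fim_prompt` is a hypothetical name used for illustration only.

```python
# Sentinel tokens assumed to match the StarCoder tokenizer
# (an assumption; verify against the model card before relying on them).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: arrange prefix and suffix so that the model's
    continuation is the code that belongs between them."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before the gap and code after the gap.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"

prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

The resulting string would then be tokenized and passed to the model like any other prompt. Whatever the model generates after the final sentinel is the proposed infill.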


  • Paper: A technical report about StarCoder.
  • GitHub: All you need to know about using or fine-tuning StarCoder.
  • StarCoder: StarCoderBase further trained on Python.
  • StarCoderBase: Trained on 80+ languages from The Stack.
  • StarEncoder: Encoder model trained on The Stack.
  • StarPii: StarEncoder-based PII detector.

Tools & Demos

Data & Governance

📑The Stack

The Stack is a 6.4TB dataset of permissively licensed source code in 358 programming languages.


SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.