---
title: README
emoji: ✨
colorFrom: gray
colorTo: red
sdk: static
pinned: false
---
<img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
<style type="text/css">
#bclogo {
display: block;
margin-left: auto;
margin-right: auto }
</style>
# BigCode

BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artifacts of this collaboration: **StarCoder**, a state-of-the-art language model for code; **OctoPack**, artifacts for instruction tuning large code models; **The Stack**, the largest available pretraining dataset of permissively licensed code; and **SantaCoder**, a 1.1B-parameter model for code.

---
<details>
<summary>
<h2>
💫StarCoder
</h2>
</summary>

StarCoder is a 15.5B-parameter language model for code, trained on 1T tokens across 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has an 8,192-token context window, and supports fill-in-the-middle.
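
As a rough illustration of fill-in-the-middle, the sketch below assembles a prompt in prefix-suffix-middle order. The sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are assumed here to match StarCoder's tokenizer, and `build_fim_prompt` is a hypothetical helper, not part of any BigCode library; check the model card before relying on either.

```python
# Sentinel tokens assumed to match StarCoder's tokenizer; verify against the model card.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: place the code before and after the gap around the sentinels.

    The model is then expected to generate the missing middle after FIM_MIDDLE.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Ask the model to fill in the body between the signature and the return statement.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result\n",
)
print(prompt)
```

The resulting string would be passed to the model as an ordinary generation prompt; the text produced after the final sentinel is the proposed infill.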
### Models
- [Paper](https://arxiv.org/abs/2305.06161): A technical report about StarCoder.
- [GitHub](https://github.com/bigcode-project/starcoder/tree/main): All you need to know about using or fine-tuning StarCoder.
- [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
- [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
- [StarCoder+](https://huggingface.co/bigcode/starcoderplus): StarCoderBase further trained on English web data.
- [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on The Stack.
- [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder-based PII detector.
### Tools & Demos
- [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder models!
- [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
- [StarChat](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground): Chat with StarCoder!
- [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into a tech assistant.
- [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-editor): Edit with StarCoder!
### Data & Governance
- [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
- [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
- [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
- [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text search over the code in the pretraining dataset.
- [StarCoder Membership Test](https://stack.dataportraits.org/): Blazing-fast test of whether code was present in the pretraining dataset.
</details>

---

## 🐙OctoPack
OctoPack consists of data, evals, and models relating to code LLMs that follow human instructions.
- [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
- [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
- [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
- [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
- [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
- [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for code fixing, explaining, and synthesizing across Python, JavaScript, Java, Go, C++, and Rust.
- [OctoCoder](https://huggingface.co/bigcode/octocoder): StarCoder instruction-tuned on CommitPackFT.
- [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
- [OctoGeeX](https://huggingface.co/bigcode/octogeex): CodeGeeX2 instruction-tuned on CommitPackFT.
---

## 📑The Stack
The Stack is a 6.4TB dataset of permissively licensed source code in 358 programming languages.
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
- [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
---

## 🎅SantaCoder
SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.
- [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
- [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
- [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
- [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder. |