StarEncoder

Model Summary
Training
Use
Limitations
License

Model Summary

This is an encoder-only model (i.e., a bidirectional self-attention Transformer) trained on The Stack dataset.

Project Website: bigcode-project.org
Point of Contact: contact@bigcode-project.org
Languages: 80+ Programming languages

We leveraged the following training objectives from BERT:

Masked Language Modelling (MLM): predict masked-out tokens from an input sentence.
Next Sentence Prediction (NSP): predict whether a pair of sentences occur as neighbors in a document.
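The two objectives can be illustrated with a minimal sketch of how training examples might be built. This is not the StarEncoder training code: the 15% masking rate and the 80/10/10 replacement split are the defaults from the BERT paper and are assumed here, and `MASK_TOKEN`, `mask_tokens`, and `make_nsp_pair` are hypothetical names chosen for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder, not necessarily the real special token


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption (assumed rates: 15% selected, 80/10/10 split).

    Returns (inputs, labels). labels[i] is the original token at selected
    positions and None elsewhere, so the loss only covers selected positions.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)         # replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # replace with a random token
            else:
                inputs.append(tok)                # keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def make_nsp_pair(segments, i, rng):
    """Build one NSP example starting at segments[i].

    Needs i + 1 < len(segments) and at least three segments. With probability
    0.5 the true neighbor is returned (is_next=True); otherwise a segment that
    is neither segments[i] nor its neighbor is sampled (is_next=False).
    """
    first = segments[i]
    if rng.random() < 0.5:
        return first, segments[i + 1], True
    candidates = [j for j in range(len(segments)) if j not in (i, i + 1)]
    return first, segments[rng.choice(candidates)], False
```

Passing a seeded `random.Random` as `rng` makes the corruption reproducible, which helps when inspecting individual examples.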

Training

We trained for 100,000 steps with a global batch size of 4,096 sequences, each with a maximum length of 1,024 tokens, so that approximately 400B tokens were observed. This took roughly two days on 64 NVIDIA A100 GPUs. Details about the model architecture are reported in the table below.
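As a quick sanity check on the token budget, the arithmetic can be reproduced directly from the numbers above. The result is an upper bound, since it assumes every sequence is filled to the maximum length; the GPU-hour figure takes "roughly two days" as exactly 48 hours, which is an assumption.

```python
STEPS = 100_000
GLOBAL_BATCH = 4_096   # sequences per step
MAX_SEQ_LEN = 1_024    # tokens per sequence (maximum)
NUM_GPUS = 64
TRAINING_HOURS = 48    # assumption: "roughly two days" taken as exactly 48 h

# Upper bound on tokens observed: every sequence filled to the maximum length.
tokens_upper_bound = STEPS * GLOBAL_BATCH * MAX_SEQ_LEN
gpu_hours = NUM_GPUS * TRAINING_HOURS

print(f"tokens observed (upper bound): {tokens_upper_bound:,}")  # 419,430,400,000
print(f"approximate GPU-hours: {gpu_hours:,}")                   # 3,072
```

The upper bound of about 419B is consistent with the "approximately 400B" figure quoted above.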

Hyperparameter	Value
Hidden size	768
Intermediate size	3072
Max. position embeddings	1024
Num. of attention heads	12
Num. of hidden layers	12
Attention	Multi-head
Num. of parameters	≈125M
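The ≈125M figure can be roughly cross-checked from the hyperparameters in the table. The sketch below counts weights for a standard BERT-style encoder. The vocabulary size and the two segment-type embeddings are assumptions, since neither is listed in the table, and the pooler and pre-training heads are left out, so the total is only expected to land near the reported value.

```python
HIDDEN = 768
INTERMEDIATE = 3072
MAX_POSITIONS = 1024
NUM_LAYERS = 12
VOCAB_SIZE = 49_152  # ASSUMPTION: not stated in the table
TYPE_VOCAB = 2       # ASSUMPTION: BERT-style segment embeddings


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a layer norm."""
    return 2 * n


def encoder_layer():
    attention = 4 * linear(HIDDEN, HIDDEN)  # query, key, value, output projections
    ffn = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
    return attention + ffn + 2 * layer_norm(HIDDEN)


def embeddings():
    tables = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN
    return tables + layer_norm(HIDDEN)


total = embeddings() + NUM_LAYERS * encoder_layer()
print(f"estimated parameters: {total / 1e6:.1f}M")
```

The number of attention heads does not appear in the count because the heads split the 768-dimensional projections without adding weights.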

Use

This model was trained on 86 programming languages from GitHub code, including GitHub issues and Git commits, and can be efficiently fine-tuned for both code- and text-related tasks. We fine-tuned it on a token classification task to detect PII and have released the resulting StarPII model.
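Before fine-tuning, the encoder can also serve as a feature extractor. The following is a minimal sketch, not an official usage recipe. It assumes the checkpoint loads through the `transformers` Auto classes, that `torch` is installed, and that access to the `bigcode/starencoder` repository has been granted and the environment is authenticated. The function name `embed_snippet` and the mean pooling are illustrative choices, not something the model card prescribes.

```python
def embed_snippet(code, model_id="bigcode/starencoder"):
    """Return one mean-pooled vector for a single code string.

    Sketch under assumptions: the checkpoint is loadable via AutoModel and
    AutoTokenizer, and mean pooling over tokens is an acceptable summary.
    """
    # Imported lazily so that defining the function needs no heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # One string at a time avoids padding; truncate to the 1,024-token limit.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs)

    # Average the final hidden states over the sequence dimension.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```

Calling `embed_snippet("def add(a, b):\n    return a + b")` should yield a vector whose length equals the hidden size in the table above. For repeated calls, loading the tokenizer and model once outside the function would be more efficient.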

Limitations

There are limitations to consider when using StarEncoder. It is an encoder-only model, which limits its flexibility in certain code generation or completion tasks, and it was trained on data containing PII, which could pose privacy concerns. Performance may vary across the 80+ supported programming languages, particularly for less common ones, and the model might struggle with understanding domains outside programming languages.

License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here.

Paper for bigcode/starencoder

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805, published Oct 11, 2018)
