
Model Formats

The model is pretrained on code. In addition to the pure code data, the training data is formatted with special tokens,
such as prefixes specifying the source of a file or tokens separating code from a commit message.
Use these templates to explore the model's capabilities:

1. Prefixes 🏷️

For pure code files, use any combination of the following prefixes:

<reponame>REPONAME<filename>FILENAME<gh_stars>STARS\ncode<|endoftext|>

STARS can be one of: 0, 1-10, 10-100, 100-1000, 1000+
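
As a minimal sketch of how such a prompt could be assembled, the helper below joins whichever prefixes are supplied and checks the star bucket against the values listed above. The function name `build_prefix_prompt` and the sample repository are hypothetical. The sketch leaves off `<|endoftext|>`, on the assumption that you want the model to continue the code rather than end the file.

```python
# Star buckets as listed in the format description.
STAR_BUCKETS = ("0", "1-10", "10-100", "100-1000", "1000+")


def build_prefix_prompt(code, reponame=None, filename=None, stars=None):
    """Hypothetical helper: prepend any combination of metadata prefixes to code."""
    if stars is not None and stars not in STAR_BUCKETS:
        raise ValueError(f"stars must be one of {STAR_BUCKETS}, got {stars!r}")

    parts = []
    if reponame is not None:
        parts.append(f"<reponame>{reponame}")
    if filename is not None:
        parts.append(f"<filename>{filename}")
    if stars is not None:
        parts.append(f"<gh_stars>{stars}")

    header = "".join(parts)
    # The template puts a newline between the prefixes and the code.
    return f"{header}\n{code}" if header else code


prompt = build_prefix_prompt(
    "def fibonacci(n):",
    reponame="octo/demo",       # made-up repository name
    filename="fib.py",
    stars="100-1000",
)
print(prompt)
```

Since any combination is allowed, passing only `filename` (or no prefix at all) is also valid input for this helper.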

2. Commits 💾

The commits data is formatted as follows:

<commit_before>code<commit_msg>text<commit_after>code<|endoftext|>
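
One way to use this template, sketched under the assumption that you want the model to produce the edited code, is to supply the original code and a commit message and stop right after `<commit_after>`. The helper name `build_commit_prompt` is hypothetical.

```python
def build_commit_prompt(code_before, message):
    """Hypothetical helper: ask the model for the code after a commit.

    The prompt ends at <commit_after> so that the model's continuation
    is the edited code.
    """
    return f"<commit_before>{code_before}<commit_msg>{message}<commit_after>"


prompt = build_commit_prompt(
    "def add(a, b):\n    return a - b\n",
    "Fix sign error in add",
)
print(prompt)
```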

3. Jupyter Notebooks 📓

The model is trained on Jupyter notebooks, both converted to Python scripts and in a structured format like:

<start_jupyter><jupyter_text>text<jupyter_code>code<jupyter_output>output<jupyter_text>
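
The structured format can be generated from a list of cells. The sketch below is illustrative only: `build_jupyter_prompt` and its `next_cell` parameter are hypothetical names, and ending the prompt with an open cell tag is an assumption about how one might steer what the model writes next.

```python
CELL_KINDS = ("jupyter_text", "jupyter_code", "jupyter_output")


def build_jupyter_prompt(cells, next_cell="jupyter_code"):
    """Hypothetical helper: serialize (kind, content) cells into the notebook format.

    The prompt ends with an open tag for `next_cell`, so the model's
    continuation is the content of that cell.
    """
    if next_cell not in CELL_KINDS:
        raise ValueError(f"unknown cell kind: {next_cell!r}")

    parts = ["<start_jupyter>"]
    for kind, content in cells:
        if kind not in CELL_KINDS:
            raise ValueError(f"unknown cell kind: {kind!r}")
        parts.append(f"<{kind}>{content}")
    parts.append(f"<{next_cell}>")
    return "".join(parts)


prompt = build_jupyter_prompt(
    [
        ("jupyter_text", "Load the dataset and show the first rows."),
        ("jupyter_code", "import pandas as pd\ndf = pd.read_csv('data.csv')\ndf.head()"),
    ],
    next_cell="jupyter_output",
)
print(prompt)
```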

4. Issues 🐛

We also trained on GitHub issues using the following formatting:

<issue_start><issue_comment>text<issue_comment>...<issue_closed>
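
An issue thread can be assembled the same way. In this sketch the helper name `build_issue_prompt` and the `closed` flag are hypothetical. Leaving the thread open with a trailing `<issue_comment>` is an assumption about how to invite the model to write the next comment.

```python
def build_issue_prompt(comments, closed=False):
    """Hypothetical helper: serialize issue comments into the issue format.

    With closed=False the prompt ends with an open <issue_comment> tag so the
    model can continue with the next comment; with closed=True the thread is
    terminated with <issue_closed>.
    """
    body = "".join(f"<issue_comment>{text}" for text in comments)
    tail = "<issue_closed>" if closed else "<issue_comment>"
    return f"<issue_start>{body}{tail}"


prompt = build_issue_prompt(
    [
        "Calling parse() on an empty string raises IndexError.",
        "Can you share a minimal example that reproduces this?",
    ]
)
print(prompt)
```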

5. Fill-in-the-middle 🧩

Fill-in-the-middle requires rearranging the model inputs. The playground handles this for you; all you need to do is mark where to fill:

code before<FILL_HERE>code after
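
For readers calling the model outside the playground, the rearrangement might look like the sketch below. This page does not name the underlying tokens; the `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` names used here are an assumption based on the StarCoder tokenizer, and `to_fim_prompt` is a hypothetical helper. The idea is to split the input at the marker and present the prefix and suffix first, so that the model generates the middle.

```python
FILL_MARKER = "<FILL_HERE>"

# Assumed token names (not stated on this page); check your tokenizer's special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def to_fim_prompt(text):
    """Hypothetical helper: turn 'before<FILL_HERE>after' into a FIM prompt."""
    if text.count(FILL_MARKER) != 1:
        raise ValueError(f"expected exactly one {FILL_MARKER} marker")
    prefix, suffix = text.split(FILL_MARKER)
    # Prefix and suffix come first; the model continues after the middle token.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = to_fim_prompt("def add(a, b):\n    <FILL_HERE>\n")
print(prompt)
```

To rebuild the full file, place the generated text between the original prefix and suffix.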