fim_tokens: what are they used for?

#10
by NickyNicky - opened

Hello, I hope everything goes well.

https://huggingface.co/Salesforce/xgen-7b-8k-inst/blob/main/tokenization_xgen.py

fim_tokens = [
            "<fim_prefix>",
            "<fim_middle>",
            "<fim_suffix>",
            "<fim_pad>",
            "<filename>",
            "<gh_stars>",
            "<issue_start>",
            "<issue_comment>",
            "<issue_closed>",
            "<jupyter_start>",
            "<jupyter_text>",
            "<jupyter_code>",
            "<jupyter_output>",
            "<empty_output>",
            "<commit_before>",
            "<commit_msg>",
            "<commit_after>",
            "<reponame>"
        ]

Could you explain how these special tokens are used? Thanks.

Salesforce org

The following tokens appear in StarCoderData, the code data we used to train the model:

            "<filename>",
            "<gh_stars>",
            "<issue_start>",
            "<issue_comment>",
            "<issue_closed>",
            "<jupyter_start>",
            "<jupyter_text>",
            "<jupyter_code>",
            "<jupyter_output>",
            "<empty_output>",
            "<commit_before>",
            "<commit_msg>",
            "<commit_after>",
            "<reponame>"

Please refer to the StarCoder paper for more details. You could, for example, condition the generation using these special tokens to bias the model prediction.
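To make the conditioning idea concrete, here is a minimal sketch that only builds a prompt string; it does not load or call the model. The helper name `build_conditioned_prompt`, the token order (`<reponame>`, `<filename>`, `<gh_stars>`, then a newline and the code), and the example values are assumptions based on the metadata layout described in the StarCoder paper, so please check them against the paper before relying on them.

```python
from typing import Optional


def build_conditioned_prompt(
    code_prefix: str,
    reponame: Optional[str] = None,
    filename: Optional[str] = None,
    gh_stars: Optional[str] = None,
) -> str:
    """Prepend metadata special tokens to a code prefix.

    Hypothetical helper: the token order and the newline separator are
    assumptions, not a confirmed format for this model.
    """
    parts = []
    if reponame is not None:
        parts.append(f"<reponame>{reponame}")
    if filename is not None:
        parts.append(f"<filename>{filename}")
    if gh_stars is not None:
        # The star value is passed through as a string; which values the
        # training data actually uses is not confirmed here.
        parts.append(f"<gh_stars>{gh_stars}")

    header = "".join(parts)
    # With no metadata, return the code prefix unchanged.
    return f"{header}\n{code_prefix}" if header else code_prefix


prompt = build_conditioned_prompt(
    "def fibonacci(n):",
    reponame="example-user/example-repo",  # hypothetical repository name
    filename="fib.py",                     # hypothetical file name
    gh_stars="100-1000",                   # hypothetical star bucket
)
print(prompt)
```

The resulting string can then be tokenized and passed to the model as usual. Whether the conditioning actually shifts the output depends on how often these tokens appeared in the training data.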

The remaining tokens, listed below, are the special tokens StarCoder used for its fill-in-the-middle (FIM) training. We did not use them, so you can ignore them:

            "<fim_prefix>",
            "<fim_middle>",
            "<fim_suffix>",
            "<fim_pad>",
rooa changed discussion status to closed
