main-horse/mpt-7b-tokenizer

This is a modified version of the MPT-7B tokenizer with the following special tokens substituted in:

    {
      "id": 50277,
      "content": "<info>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 50278,
      "content": "</info>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 50279,
      "content": "<im_start>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 50280,
      "content": "<im_end>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }

This works because MPT was designed to have extra unused tokens for this purpose:

The tokenizer has a vocabulary size of 50257, but we set the model vocabulary size to 50432.
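As a quick sanity check on the numbers above, the arithmetic can be sketched in a few lines of Python. The constant names are made up for illustration; the values are the ones quoted in this README.

```python
# Figures taken from the text above; the names are illustrative only.
TOKENIZER_VOCAB_SIZE = 50257
MODEL_VOCAB_SIZE = 50432
NEW_TOKEN_IDS = [50277, 50278, 50279, 50280]

# Number of ids the model has room for beyond the quoted tokenizer vocabulary.
spare_ids = MODEL_VOCAB_SIZE - TOKENIZER_VOCAB_SIZE

# Every substituted token id must stay below the model vocabulary size.
all_fit = all(token_id < MODEL_VOCAB_SIZE for token_id in NEW_TOKEN_IDS)

print(spare_ids, all_fit)  # 175 True
```

So the four ids used here sit comfortably inside the range the model already allocates.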

If there is an easier way to create something like this programmatically, without git cloning and manually editing the JSON files, please tell me.
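One possible direction, as a rough sketch only: generate the entries shown above with plain Python instead of typing them by hand. This assumes the tokenizer file is a JSON object with a top-level `added_tokens` list in the same shape as the entries in this README. The helper names `make_special_token` and `add_special_tokens` are hypothetical, and whether other files in the repo also need matching changes is not covered here.

```python
import json


def make_special_token(token_id, content):
    """Build one entry in the same shape as the entries listed in this README."""
    return {
        "id": token_id,
        "content": content,
        "single_word": False,
        "lstrip": False,
        "rstrip": False,
        "normalized": False,
        "special": True,
    }


def add_special_tokens(tokenizer_config, contents, first_id):
    """Return a copy of the config with new entries appended to `added_tokens`."""
    updated = dict(tokenizer_config)
    entries = list(updated.get("added_tokens", []))
    used_ids = {entry["id"] for entry in entries}
    for offset, content in enumerate(contents):
        token_id = first_id + offset
        if token_id in used_ids:
            raise ValueError(f"id {token_id} is already taken")
        entries.append(make_special_token(token_id, content))
    updated["added_tokens"] = entries
    return updated


# Hypothetical stand-in for the dict you would get from json.load() on the real file.
config = {"added_tokens": []}
patched = add_special_tokens(
    config, ["<info>", "</info>", "<im_start>", "<im_end>"], first_id=50277
)
print(json.dumps(patched["added_tokens"], indent=2))
```

With a real file, the `config` dict would come from `json.load` and the result would be written back with `json.dump`; the input dict itself is left unmodified.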