How to convert without hitting tokeniser errors?

#2
by anujn - opened

Hey Michael, first of all, thank you for all your converted models and the HF Hub tool. Both are super super helpful!!

I have been trying to convert the following model to CT2 for inference: "ehartford/WizardLM-13B-V1.0-Uncensored"

I keep hitting vocab size mismatch errors. I think it may be because CT2 is adding an unk token? Not sure, I'm very new to CT2!
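
In case it helps, this is roughly how I'm seeing the mismatch (just a quick check with transformers, nothing CT2-specific):

    from transformers import AutoConfig, AutoTokenizer

    model_id = "ehartford/WizardLM-13B-V1.0-Uncensored"
    config = AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The CT2 converter expects these two numbers to match exactly.
    print("model vocab size:    ", config.vocab_size)
    print("tokenizer vocab size:", len(tokenizer.get_vocab()))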

Just wondered if you could point me in the right direction or give me the steps to convert this particular model to CT2 for inference.

Thanks again!

Anuj

Please open an issue on the CTranslate2 repo :) https://github.com/OpenNMT/CTranslate2/blob/9885fad95f8ce24809d1ab64b418ac9f99c75562/python/ctranslate2/converters/transformers.py#L1182

You might be able to add:

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
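        # CTranslate2 needs the token list to be exactly model.config.vocab_size long,
        # so pad it with placeholder tokens for any ids the tokenizer does not cover.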
        for i in range(extra_ids):
            # fix for additional vocab, see GPTNeoX Converter
            tokens.append("<extra_id_%d>" % i)

        return tokens
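
Once that override is in place, running the conversion would look roughly like this (untested sketch; the output directory name and int8 quantization are just examples):

    import ctranslate2

    converter = ctranslate2.converters.TransformersConverter(
        "ehartford/WizardLM-13B-V1.0-Uncensored"
    )
    # Output directory and quantization are placeholders, adjust as needed.
    converter.convert("wizardlm-13b-ct2", quantization="int8", force=True)
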
michaelfeil changed discussion status to closed
