TGI "Fast Tokenizer" support?

#1
by rocca - opened

If I use `ghcr.io/huggingface/text-generation-inference:1.1.1` with these options: `--quantize gptq --model-id TheBloke/LLaMA2-13B-Tiefighter-GPTQ --revision 2a797d26cdafd37714f033503fa085bb58fcd4bc --max-total-tokens 4096 --port 3000`, it complains that a fast tokenizer wasn't found. That means the server can't dynamically compute `max_new_tokens` (as `--max-total-tokens` minus the length of the input text), so it has to be computed manually on the client side and sent along with the API request.
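
For now I'm working around it by computing `max_new_tokens` client-side, roughly like this (a minimal sketch; the prompt token count below is a placeholder that would really come from running a local tokenizer over the prompt):

```python
import requests

# Without a fast tokenizer, TGI can't derive max_new_tokens from
# --max-total-tokens, so the client has to send it explicitly.
prompt = "Once upon a time"
prompt_tokens = 5          # assumption: computed with a local tokenizer
max_total_tokens = 4096    # must match the server's --max-total-tokens

resp = requests.post(
    "http://localhost:3000/generate",
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_total_tokens - prompt_tokens},
    },
)
print(resp.json()["generated_text"])
```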

If I understand correctly, fixing this would just require uploading the correct `tokenizer.json` file to this repo? I'm guessing it would be as easy as copying it from another Llama-2 repo (something like the sketch below)? If so, would it be possible to add that to this and future repos? 🙏 (Or is this perhaps a bug with TGI? If so, I can file an issue.)
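
For context, grabbing `tokenizer.json` from a compatible repo would look something like this (the donor repo id is just a guess on my part, and whether its tokenizer actually matches this model would need to be verified before uploading):

```python
from huggingface_hub import hf_hub_download

# Hypothetical donor repo; meta-llama repos are gated, so any compatible
# Llama-2 repo that ships a fast tokenizer.json would do.
path = hf_hub_download(
    repo_id="meta-llama/Llama-2-13b-hf",
    filename="tokenizer.json",
)
print(path)  # this file could then be uploaded to the GPTQ repo
```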
