model_max_length in tokenizer_config.json is incorrect

#10
by dfurman - opened

The model_max_length in tokenizer_config.json should be "2048", not "1000000000000000019884624838656".

Databricks org

It seems that this must be set automatically during the checkpointing process. We did not explicitly add this setting. Did you notice an issue associated with this? Googling around, I found this thread: https://discuss.huggingface.co/t/tokenizers-what-this-max-length-number/28484

Yes - the model should throw a warning when you feed in a chunk that is larger than the window size, but instead it just errors out. You should explicitly set that variable to the max window size (it seems the Dolly-v1 model did have this correct).

@matthayes This is still broken and should be hard-coded to whatever the window size is for the model (I think it is 2048 based on initial testing). It matters because people rely on the truncation functionality when tokenizing - so that when you feed in a chunk larger than the window size, the model doesn't just error out.
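For anyone reproducing this, here is a minimal sketch of the symptom; the repo id below is assumed to be the dolly-v2 checkpoint under discussion:

```python
# Minimal sketch of the symptom, assuming the dolly-v2-12b checkpoint is the
# one under discussion: with the huge default, truncation=True is a no-op.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")
print(tokenizer.model_max_length)  # 1000000000000000019884624838656 as shipped

long_text = "word " * 5000
ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(ids))  # well over 2048 -- nothing was truncated, so the model errors out
```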

Databricks org

@matthayes it seems safe to just set the tokenizer max length to 2048, as that is certainly an upper bound on the max possible input size. I can just try that.

I am trying it too haha, will report back
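In the meantime, the value can also be overridden at load time without touching the Hub config (a sketch, with the repo id assumed):

```python
# Local workaround sketch while the Hub config is unchanged: override
# model_max_length at load time. Repo id assumed to be dolly-v2-12b.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "databricks/dolly-v2-12b",
    model_max_length=2048,  # overrides the value from tokenizer_config.json
)

ids = tokenizer("word " * 5000, truncation=True)["input_ids"]
print(len(ids))  # 2048 -- truncation now actually applies
```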

Databricks org

I did some investigation a couple days ago and commented here: https://github.com/databrickslabs/dolly/issues/102

Databricks org

I believe that setting the max length to 2048 is still going to lead to errors. The model won't be able to generate any new tokens, since the sequence will exceed the max length as soon as the first token is generated. Truncation will also cut off the end of the prompt, so I suspect that the missing ### Response: could cause poor performance.
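A rough illustration of both failure modes (the prompt template below is a simplified stand-in for the real one, and the repo id is an assumption):

```python
# Rough illustration only: PROMPT_TEMPLATE is a simplified stand-in for the
# real generation prompt, and the repo id is assumed.
from transformers import AutoTokenizer

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", model_max_length=2048)

prompt = PROMPT_TEMPLATE.format(instruction="word " * 5000)
ids = tokenizer(prompt, truncation=True)["input_ids"]

print(len(ids))                     # 2048 -- the window is already full, no room to generate
print(tokenizer.decode(ids[-10:]))  # the trailing "### Response:" marker has been truncated away
```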

Awesome, thanks @matthayes - so we should basically set it to 1769? Yes, I see the issue with truncating away the necessary special tokens, hmm, not an easy fix then.

Databricks org

If you look at the source for TextGenerationPipeline (which our pipeline is somewhat derived from), it has an option handle_long_generation:

            handle_long_generation (`str`, *optional*):
                By default, this pipelines does not handle long generation (ones that exceed in one form or the other
                the model maximum length). There is no perfect way to adress this (more info
                :https://github.com/huggingface/transformers/issues/14033#issuecomment-948385227). This provides common
                strategies to work around that problem depending on your use case.

                - `None` : default strategy where nothing in particular happens
                - `"hole"`: Truncates left of input, and leaves a gap wide enough to let generation happen (might
                  truncate a lot of the prompt and not suitable when generation exceed the model capacity)

We could do something similar, where basically we truncate the instruction before it is formatted into the full prompt. This isn't a perfect solution but could be an okay workaround.
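For reference, the stock option looks roughly like this when used with a plain text-generation pipeline (a sketch only; the repo id is assumed, and if I'm reading the source right, "hole" compares against tokenizer.model_max_length, so it only helps once that value reflects the real 2048 window):

```python
# Sketch of the stock handle_long_generation="hole" behavior for comparison.
# Repo id assumed; model_max_length is set manually so the check is meaningful.
from transformers import pipeline

generator = pipeline("text-generation", model="databricks/dolly-v2-12b")
generator.tokenizer.model_max_length = 2048  # see discussion above

out = generator(
    "word " * 5000,                  # deliberately over-long input
    handle_long_generation="hole",   # left-truncate, leaving room for new tokens
    max_new_tokens=64,
)
print(out[0]["generated_text"][-200:])
```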

Databricks org

Yeah 2048 is still "too large" but less "too large" than the Very Large Integer it is now. It would at least fail more cleanly I think for most of the cases that trigger it.

Databricks org

so we should basically set it to 1769?

At the moment, if you are using InstructionTextGenerationPipeline as is, I think you should truncate the text before passing it into the pipeline. Unfortunately it is a bit tricky to know what to truncate it to, since you don't know the length until you tokenize it. You could tokenize, truncate, decode, and pass the result into the pipeline. If you are customizing InstructionTextGenerationPipeline, then I suggest truncating the instruction in preprocess before it is formatted into the full prompt, and deriving the max length from 2048 - length of prompt - max_new_tokens. You could get the prompt length by computing the length of PROMPT_FOR_GENERATION_FORMAT.format(instruction="") for example.
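For anyone who wants to try the tokenize / truncate / decode route, a sketch under some assumptions (the repo id, and that PROMPT_FOR_GENERATION_FORMAT can be imported from the repo's instruct_pipeline module; adjust to your setup):

```python
# Sketch of the tokenize / truncate / decode workaround described above.
# Assumptions: repo id is databricks/dolly-v2-12b and PROMPT_FOR_GENERATION_FORMAT
# is importable from the dolly repo's instruct_pipeline module.
from transformers import AutoTokenizer

from instruct_pipeline import PROMPT_FOR_GENERATION_FORMAT  # import path is illustrative

MODEL_WINDOW = 2048
MAX_NEW_TOKENS = 256

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

# Tokens consumed by the prompt template itself (instruction left empty).
prompt_overhead = len(
    tokenizer(PROMPT_FOR_GENERATION_FORMAT.format(instruction=""))["input_ids"]
)

# Budget left for the instruction text.
instruction_budget = MODEL_WINDOW - prompt_overhead - MAX_NEW_TOKENS

instruction = "word " * 5000  # over-long instruction
ids = tokenizer(instruction, truncation=True, max_length=instruction_budget)["input_ids"]
safe_instruction = tokenizer.decode(ids, skip_special_tokens=True)

# safe_instruction can now be passed to the pipeline without blowing the window.
```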

It is pretty odd that a lot of the new models don't seem to set model_max_length (the new stabilityai models don't either; it is that same large int), but other models (like flan-t5/flan-ul2) do have it in there. I wonder if this is something the Hugging Face team should check out - it seems odd to default to such a huge integer...

Thanks @matthayes , that makes sense!

Maybe for now I will just hard-code a raise Exception within InstructionTextGenerationPipeline directly before the model call (to error out if the number of token ids exceeds 2048).
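Something like this, as a sketch of that guard (not the actual dolly code; the function name and window constant are placeholders):

```python
# Sketch of the hard guard described above; intended to live in the pipeline's
# preprocess step, right before the model call.
MODEL_WINDOW = 2048

def check_input_length(input_ids):
    """Fail fast with a clear message instead of a cryptic downstream error."""
    n_tokens = input_ids.shape[-1]
    if n_tokens > MODEL_WINDOW:
        raise ValueError(
            f"Prompt is {n_tokens} tokens, which exceeds the model's "
            f"{MODEL_WINDOW}-token context window. Please truncate the instruction."
        )
```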
