model_max_length in tokenizer_config.json is incorrect

#10
by dfurman - opened

The model_max_length in tokenizer_config.json should be "2048", not "1000000000000000019884624838656".

Databricks org

It seems that this must be set automatically during the checkpointing process. We did not explicitly add this setting. Did you notice an issue associated with this? Googling around, I found this thread: https://discuss.huggingface.co/t/tokenizers-what-this-max-length-number/28484

Yes - the model should throw a warning when you feed in a chunk that is larger than the window size, but instead it just errors out. You should explicitly set that variable to the max window size (it seems the Dolly-v1 model did have this correct).

@matthayes This is still broken and should be hard-coded to whatever the window size is for the model (I think it is 2048 based on initial testing). It matters because people rely on the truncation functionality when tokenizing - so that when you feed in a chunk larger than the window size, the model doesn't just error out.
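For anyone reproducing this, here is a minimal sketch of the symptom; the repo id below is assumed to be the dolly-v2 checkpoint under discussion:

```python
# Minimal sketch of the symptom, assuming the dolly-v2-12b checkpoint is the
# one under discussion: with the huge default, truncation=True is a no-op.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")
print(tokenizer.model_max_length)  # 1000000000000000019884624838656 as shipped

long_text = "word " * 5000
ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(ids))  # well over 2048 -- nothing was truncated, so the model errors out
```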

Databricks org

@matthayes it seems safe to just set the tokenizer max length to 2048, as that is certainly an upper bound on the max possible input size. I can just try that.

I am trying it too haha, will report back
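In the meantime, the value can also be overridden at load time without touching the Hub config (a sketch, with the repo id assumed):

```python
# Local workaround sketch while the Hub config is unchanged: override
# model_max_length at load time. Repo id assumed to be dolly-v2-12b.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "databricks/dolly-v2-12b",
    model_max_length=2048,  # overrides the value from tokenizer_config.json
)

ids = tokenizer("word " * 5000, truncation=True)["input_ids"]
print(len(ids))  # 2048 -- truncation now actually applies
```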

Databricks org

I did some investigation a couple days ago and commented here: https://github.com/databrickslabs/dolly/issues/102

Databricks org

I believe that setting the max length to 2048 is still going to lead to errors. The model won't be able to generate any new tokens, since the sequence will exceed the max length as soon as the first token is generated. Truncation will also cut off the end of the prompt, so I suspect that the missing ### Response: could cause poor performance.
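A rough illustration of both failure modes (the prompt template below is a simplified stand-in for the real one, and the repo id is an assumption):

```python
# Rough illustration only: PROMPT_TEMPLATE is a simplified stand-in for the
# real generation prompt, and the repo id is assumed.
from transformers import AutoTokenizer

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", model_max_length=2048)

prompt = PROMPT_TEMPLATE.format(instruction="word " * 5000)
ids = tokenizer(prompt, truncation=True)["input_ids"]

print(len(ids))                     # 2048 -- the window is already full, no room to generate
print(tokenizer.decode(ids[-10:]))  # the trailing "### Response:" marker has been truncated away
```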

Awesome, thanks @matthayes - so we should basically set it to 1769? Yes, I see the issue with truncating away the necessary special tokens, hmm, not an easy fix then.

Databricks org

If you look at the source for TextGenerationPipeline (which our pipeline is somewhat derived from), it has an option handle_long_generation:

            handle_long_generation (`str`, *optional*):
                By default, this pipelines does not handle long generation (ones that exceed in one form or the other
                the model maximum length). There is no perfect way to adress this (more info
                :https://github.com/huggingface/transformers/issues/14033#issuecomment-948385227). This provides common
                strategies to work around that problem depending on your use case.

                - `None` : default strategy where nothing in particular happens
                - `"hole"`: Truncates left of input, and leaves a gap wide enough to let generation happen (might
                  truncate a lot of the prompt and not suitable when generation exceed the model capacity)

We could do something similar, where basically we truncate the instruction before it is formatted into the full prompt. This isn't a perfect solution but could be an okay workaround.
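For reference, the stock option looks roughly like this when used with a plain text-generation pipeline (a sketch only; the repo id is assumed, and if I'm reading the source right, "hole" compares against tokenizer.model_max_length, so it only helps once that value reflects the real 2048 window):

```python
# Sketch of the stock handle_long_generation="hole" behavior for comparison.
# Repo id assumed; model_max_length is set manually so the check is meaningful.
from transformers import pipeline

generator = pipeline("text-generation", model="databricks/dolly-v2-12b")
generator.tokenizer.model_max_length = 2048  # see discussion above

out = generator(
    "word " * 5000,                  # deliberately over-long input
    handle_long_generation="hole",   # left-truncate, leaving room for new tokens
    max_new_tokens=64,
)
print(out[0]["generated_text"][-200:])
```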

Databricks org

Yeah 2048 is still "too large" but less "too large" than the Very Large Integer it is now. It would at least fail more cleanly I think for most of the cases that trigger it.

Databricks org

so we should basically set it to 1769?

At the moment, if you are using InstructionTextGenerationPipeline as is, I think you should truncate the text before passing it into the pipeline. Unfortunately it is a bit tricky to know what to truncate it to, since you don't know the length until you tokenize it. You could tokenize, truncate, decode, and pass the result into the pipeline. If you are customizing InstructionTextGenerationPipeline, then I suggest truncating the instruction in preprocess before it is formatted into the full prompt, and deriving the max length from 2048 - length of prompt - max_new_tokens. You could get the prompt length by computing the length of PROMPT_FOR_GENERATION_FORMAT.format(instruction="") for example.
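For anyone who wants to try the tokenize / truncate / decode route, a sketch under some assumptions (the repo id, and that PROMPT_FOR_GENERATION_FORMAT can be imported from the repo's instruct_pipeline module; adjust to your setup):

```python
# Sketch of the tokenize / truncate / decode workaround described above.
# Assumptions: repo id is databricks/dolly-v2-12b and PROMPT_FOR_GENERATION_FORMAT
# is importable from the dolly repo's instruct_pipeline module.
from transformers import AutoTokenizer

from instruct_pipeline import PROMPT_FOR_GENERATION_FORMAT  # import path is illustrative

MODEL_WINDOW = 2048
MAX_NEW_TOKENS = 256

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

# Tokens consumed by the prompt template itself (instruction left empty).
prompt_overhead = len(
    tokenizer(PROMPT_FOR_GENERATION_FORMAT.format(instruction=""))["input_ids"]
)

# Budget left for the instruction text.
instruction_budget = MODEL_WINDOW - prompt_overhead - MAX_NEW_TOKENS

instruction = "word " * 5000  # over-long instruction
ids = tokenizer(instruction, truncation=True, max_length=instruction_budget)["input_ids"]
safe_instruction = tokenizer.decode(ids, skip_special_tokens=True)

# safe_instruction can now be passed to the pipeline without blowing the window.
```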

It is pretty odd that a lot of the new models don't seem to set model_max_length (the new stabilityai models don't either; it is that same large int), but other models (like flan-t5/flan-ul2) do have it in there. I wonder if this is something the Hugging Face team should check out - it seems odd to default to such a huge integer...

Thanks @matthayes , that makes sense!

Maybe for now I will just hard-code a raise Exception within InstructionTextGenerationPipeline directly before the model call (to error out if the number of token ids exceeds 2048).
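Something like this, as a sketch of that guard (not the actual dolly code; the function name and window constant are placeholders):

```python
# Sketch of the hard guard described above; intended to live in the pipeline's
# preprocess step, right before the model call.
MODEL_WINDOW = 2048

def check_input_length(input_ids):
    """Fail fast with a clear message instead of a cryptic downstream error."""
    n_tokens = input_ids.shape[-1]
    if n_tokens > MODEL_WINDOW:
        raise ValueError(
            f"Prompt is {n_tokens} tokens, which exceeds the model's "
            f"{MODEL_WINDOW}-token context window. Please truncate the instruction."
        )
```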
