Update/Fix incorrect model_max_length to 1024 tokens

#8

Currently, the field model_max_length is set to 1000000000000000019884624838656 tokens, which is incorrect. When this model is used in a pipeline, it therefore either cannot automatically truncate inputs that exceed the real maximum length, throwing an error like RuntimeError: The expanded size of the tensor (<SOME NUMBER LARGER THAN 1024>) must match the existing size (1024) at non-singleton dimension 1. Target sizes: [1, <SOME NUMBER LARGER THAN 1024>]. Tensor sizes: [1, 1024], or it cannot use the stride option, which also relies on a correct model_max_length being provided.
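For context, that oversized default is Transformers' VERY_LARGE_INTEGER sentinel (int(1e30)), which tokenizers fall back to when no model_max_length is configured. Until this fix is merged, a minimal workaround sketch, where the repo id below is a placeholder for this model:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the model this PR targets.
tokenizer = AutoTokenizer.from_pretrained("org/model")

# Before this fix, the missing-value sentinel int(1e30) is reported.
print(tokenizer.model_max_length)  # 1000000000000000019884624838656

# Workaround: set the real limit explicitly so pipelines can
# truncate (and stride) correctly.
tokenizer.model_max_length = 1024
```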
Description of stride option in a token classification pipeline:
If stride is provided, the pipeline is applied on all the text. The text is split into chunks of size model_max_length. Works only with fast tokenizers and aggregation_strategy different from NONE. The value of this argument defines the number of overlapping tokens between chunks. In other words, the model will shift forward by tokenizer.model_max_length - stride tokens each step.
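For illustration, once model_max_length is correctly 1024 (after this fix or the workaround above), the stride option works as described. A sketch, again with a placeholder repo id and an arbitrary stride of 128:

```python
from transformers import pipeline

# Placeholder repo id and arbitrary stride; aggregation_strategy must
# not be "none" for stride to apply.
ner = pipeline(
    "token-classification",
    model="org/model",
    aggregation_strategy="simple",
    stride=128,
)

long_text = " ".join(["token"] * 5000)  # well over 1024 tokens
entities = ner(long_text)  # each chunk shifts by 1024 - 128 = 896 tokens
```

With these settings the input is split into overlapping 1024-token chunks, so long inputs no longer trigger the RuntimeError above.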

davidxmle changed pull request title from "Update the model_max_length to 1024 tokens" to "Update/Fix incorrect model_max_length to 1024 tokens"
