Can the Performance of the Model be Maintained by Shortening the Max Input Length?

#16
by Boman - opened

The default max input length for instructor-xl is 512. However, when analyzing my own dataset (approximately 10 million entries), I found that the average length of my strings is around 160 tokens. So when I vectorize my dataset, a significant portion of GPU memory is wasted on padding. Is my understanding correct?
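For concreteness, here is a minimal sketch of how such a length audit could look. `my_strings` is a placeholder for (a sample of) the dataset, and it assumes the tokenizer can be loaded directly from the model repo:

```python
from transformers import AutoTokenizer

# Assumption: the hkunlp/instructor-xl repo exposes its tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("hkunlp/instructor-xl")

my_strings = ["first example entry", "second example entry"]  # placeholder sample
lengths = [len(tokenizer.encode(s)) for s in my_strings]

print(f"average length: {sum(lengths) / len(lengths):.1f} tokens")
print(f"share over 256 tokens: {sum(l > 256 for l in lengths) / len(lengths):.2%}")
```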

To improve GPU utilization, if I reduce the model's input length to 256, in theory I should be able to use a larger batch size and thereby increase the overall throughput of vectorization, as in the sketch below.
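A minimal sketch of that change, using the InstructorEmbedding package (INSTRUCTOR extends sentence-transformers, so `max_seq_length` and `batch_size` behave as shown); the instruction string and batch size here are illustrative, not tuned values:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")
model.max_seq_length = 256  # truncate at 256 tokens instead of the default 512

# INSTRUCTOR expects [instruction, text] pairs; both values are illustrative.
pairs = [["Represent the document for retrieval:", "some text to embed"]]
embeddings = model.encode(pairs, batch_size=64)  # larger batches now fit in memory
print(embeddings.shape)
```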

However, this raises an inevitable question:
Will reducing the model's input length by half degrade the quality of the output vectors? In other words, may I ask what the average string length in your training data is?

Thank you

NLP Group of The University of Hong Kong org

Hi, thanks a lot for your interest in INSTRUCTOR!

Since we trained the model with a maximum length of 512, including many long retrieval tasks, it may be best to keep the settings consistent at inference time. That said, it is reasonable to halve the input length, and I would guess there should not be a significant performance drop.
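One rough way to check this on your own data is to embed the same sample at both lengths and compare the cosine similarity of the results. A minimal sketch, where `sample_pairs` is a placeholder for a representative sample (ideally including texts longer than 256 tokens, so truncation actually occurs):

```python
import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")
# Placeholder; substitute a representative sample of your own data.
sample_pairs = [["Represent the document for retrieval:", "some text to embed"]]

model.max_seq_length = 512
emb_512 = model.encode(sample_pairs)
model.max_seq_length = 256
emb_256 = model.encode(sample_pairs)

# Normalize rows, then take the row-wise cosine similarity;
# values close to 1.0 suggest the shorter length changes little.
a = emb_512 / np.linalg.norm(emb_512, axis=1, keepdims=True)
b = emb_256 / np.linalg.norm(emb_256, axis=1, keepdims=True)
print(f"mean cosine similarity: {(a * b).sum(axis=1).mean():.4f}")
```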

Feel free to add any further questions or comments!
