Can the Performance of the Model be Maintained by Shortening the Max Input Length?

#16
by Boman - opened

The default max input length for instructor-xl is 512. However, when analyzing my own dataset (approximately 10 million entries), I found that the average length of my strings is around 160 tokens. So when I vectorize my dataset, a significant portion of GPU memory is wasted on padding. Is my understanding correct?
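For concreteness, here is a minimal sketch of how such a length audit could look. `my_strings` is a placeholder for (a sample of) the dataset, and it assumes the tokenizer can be loaded directly from the model repo:

```python
from transformers import AutoTokenizer

# Assumption: the hkunlp/instructor-xl repo exposes its tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("hkunlp/instructor-xl")

my_strings = ["first example entry", "second example entry"]  # placeholder sample
lengths = [len(tokenizer.encode(s)) for s in my_strings]

print(f"average length: {sum(lengths) / len(lengths):.1f} tokens")
print(f"share over 256 tokens: {sum(l > 256 for l in lengths) / len(lengths):.2%}")
```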

To improve GPU utilization, if I reduce the model's input length to 256, in theory I should be able to use a larger batch size and thereby increase the overall throughput of vectorization, as in the sketch below.
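A minimal sketch of that change, using the InstructorEmbedding package (INSTRUCTOR extends sentence-transformers, so `max_seq_length` and `batch_size` behave as shown); the instruction string and batch size here are illustrative, not tuned values:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")
model.max_seq_length = 256  # truncate at 256 tokens instead of the default 512

# INSTRUCTOR expects [instruction, text] pairs; both values are illustrative.
pairs = [["Represent the document for retrieval:", "some text to embed"]]
embeddings = model.encode(pairs, batch_size=64)  # larger batches now fit in memory
print(embeddings.shape)
```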

However, this raises an inevitable question:
Will reducing the model's input length by half degrade the quality of the output vectors? In other words, may I ask what the average string length in your training data is?

Thank you

NLP Group of The University of Hong Kong org

Hi, thanks a lot for your interest in INSTRUCTOR!

Since we trained the model with a maximum length of 512, including many long retrieval tasks, it may be best to keep the settings consistent at inference time. That said, it is reasonable to halve the input length, and I would guess there should not be a significant performance drop.
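One rough way to check this on your own data is to embed the same sample at both lengths and compare the cosine similarity of the results. A minimal sketch, where `sample_pairs` is a placeholder for a representative sample (ideally including texts longer than 256 tokens, so truncation actually occurs):

```python
import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")
# Placeholder; substitute a representative sample of your own data.
sample_pairs = [["Represent the document for retrieval:", "some text to embed"]]

model.max_seq_length = 512
emb_512 = model.encode(sample_pairs)
model.max_seq_length = 256
emb_256 = model.encode(sample_pairs)

# Normalize rows, then take the row-wise cosine similarity;
# values close to 1.0 suggest the shorter length changes little.
a = emb_512 / np.linalg.norm(emb_512, axis=1, keepdims=True)
b = emb_256 / np.linalg.norm(emb_256, axis=1, keepdims=True)
print(f"mean cosine similarity: {(a * b).sum(axis=1).mean():.4f}")
```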

Feel free to add any further questions or comments!
