Maximum chunk length

#1
by vlbosch - opened

Hi! I am currently trying multilingual models for a RAG implementation with mostly Dutch legal texts. It appears your model does best in this application so far, so thanks for your hard work!

However, I was wondering what the model's strategy is when the maximum context length is exceeded. Is the prompt truncated? Or, when going over the maximum context length of 514, is the embedding computed over the whole prompt just as it would be for a shorter one? I am asking because performance doesn't seem to decrease with a chunk length of 1024 in AnythingLLM.

Thanks in advance!
Vincent

Hi Vincent, thanks for the kind words! The model was trained with a maximum document length of 128 tokens, but ColBERT models are known to generalize well to longer sequences thanks to the MaxSim operation. I haven't personally tested or looked into the AnythingLLM embedder, but I assume that if you set the chunk length to something longer than the model's maximum input length (i.e., 512 tokens), it will simply truncate the document to that maximum. It's good to hear that performance holds up for long text sequences when using a chunking strategy.
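
For anyone wanting to verify this themselves, here is a minimal sketch (not the AnythingLLM implementation) of what truncation typically looks like with a Hugging Face tokenizer, plus a simple token-bounded chunker. The checkpoint name `"your-colbert-model"` and the helper `chunk_by_tokens` are placeholders for illustration; substitute the actual model you load.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint name; replace with the model you actually use.
tokenizer = AutoTokenizer.from_pretrained("your-colbert-model")

long_doc = " ".join(["wetboek"] * 2000)  # toy text well over 512 tokens

# Without truncation, you see the document's real token count.
print(len(tokenizer(long_doc)["input_ids"]))

# With truncation enabled (the usual default in embedding pipelines),
# everything past the model's maximum input length is silently dropped.
encoded = tokenizer(long_doc, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # -> 512

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 32):
    """Split a long document into token-bounded chunks so nothing is lost to truncation."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks

# Each chunk now stays within the model's input limit before embedding.
for chunk in chunk_by_tokens(long_doc):
    print(len(tokenizer.encode(chunk)))
```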
