128 Token Limit for Chunk Size

#5 opened by ClassicRob

Is there a version of ColBERT that can generate embeddings for chunks larger than 128 tokens?

Hey!

ColBERT can technically take in chunks of up to 512 tokens (not a ColBERT limitation, but a base model one: it's initialized from bert-base).

ColBERTv2 by default truncates documents to 180 tokens, but you can update the config at indexing (or encoding) time to increase that limit and it'll run just fine. It scales easily to longer lengths: 256 and 300 are well-explored and work very well, and up to 512 also seems to work, though there are fewer examples.
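For reference, here's a minimal sketch of what that config change looks like with the stanford-futuredata/ColBERT library, assuming you're indexing with the colbert-ir/colbertv2.0 checkpoint (the experiment name, index name, and collection path are placeholders):

```python
# Minimal sketch: raising the document length limit at indexing time.
# Assumes the stanford-futuredata/ColBERT package (pip install colbert-ai)
# and a TSV collection of "pid \t passage" lines (path is a placeholder).
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

with Run().context(RunConfig(nranks=1, experiment="longer-docs")):
    config = ColBERTConfig(
        doc_maxlen=300,  # default truncates earlier; anything up to 512 fits bert-base
        nbits=2,         # residual compression setting, unrelated to length
    )
    indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
    indexer.index(name="longer-docs.index", collection="/path/to/collection.tsv")
```

The same doc_maxlen setting should apply if you're encoding passages directly (e.g. via a Checkpoint) rather than building an index, since it's read from the same config.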
