128 Token Limit for Chunk Size

#5 opened by ClassicRob

Is there a version of ColBERT that can generate embeddings for chunks larger than 128 tokens?

Hey!

ColBERT can technically take in chunks of up to 512 tokens (not a ColBERT limitation, but a base model one: it's initialized from bert-base).

ColBERTv2 by default truncates documents to 180 tokens, but you can update the config at indexing (or encoding) time to increase that limit and it'll run just fine. It scales easily to longer lengths: 256 and 300 are well-explored and work very well, and up to 512 also seems to work, though there are fewer examples.
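For reference, here's a minimal sketch of what that config change looks like with the stanford-futuredata/ColBERT library, assuming you're indexing with the colbert-ir/colbertv2.0 checkpoint (the experiment name, index name, and collection path are placeholders):

```python
# Minimal sketch: raising the document length limit at indexing time.
# Assumes the stanford-futuredata/ColBERT package (pip install colbert-ai)
# and a TSV collection of "pid \t passage" lines (path is a placeholder).
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

with Run().context(RunConfig(nranks=1, experiment="longer-docs")):
    config = ColBERTConfig(
        doc_maxlen=300,  # default truncates earlier; anything up to 512 fits bert-base
        nbits=2,         # residual compression setting, unrelated to length
    )
    indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
    indexer.index(name="longer-docs.index", collection="/path/to/collection.tsv")
```

The same doc_maxlen setting should apply if you're encoding passages directly (e.g. via a Checkpoint) rather than building an index, since it's read from the same config.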
