Do zero-shot classification models have a maximum token length?

#34
by stathacker - opened

I have a database that consists of very long strings. Do different zero-shot NLP models have different maximum token lengths, and if so, how can I find that out for each one?
If there is a limit, can I break my text up into smaller sentences and average the scores of all the sentences to get a single score for the larger text? Something like the rough sketch below is what I have in mind.
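(Rough sketch only, assuming the standard transformers zero-shot-classification pipeline and a simple per-label average across chunks:)

```python
from transformers import pipeline

# Example checkpoint; any zero-shot classification model should work the same way.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_long_text(text, labels, chunk_size=200):
    # Naive split into chunks of roughly `chunk_size` words each.
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    if not chunks:
        return {}

    # Score every chunk, then average each label's score across chunks.
    totals = {label: 0.0 for label in labels}
    for chunk in chunks:
        result = classifier(chunk, candidate_labels=labels)
        for label, score in zip(result["labels"], result["scores"]):
            totals[label] += score
    return {label: total / len(chunks) for label, total in totals.items()}

# Usage: scores = classify_long_text(long_document, ["sports", "politics", "finance"])
```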

I think the limit corresponds to the "max_position_embeddings": 1024 config parameter. I suggest tokenizing your text string before you try to embed it; I think trying to embed a string longer than 1024 tokens will cause a crash.
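For a given checkpoint you can read the limit straight from the config and count the tokens before classifying. A minimal sketch, using facebook/bart-large-mnli as an example (swap in whichever model you use):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # example checkpoint; use your own

# The positional limit lives in the model config.
config = AutoConfig.from_pretrained(model_name)
print("max_position_embeddings:", config.max_position_embeddings)

# The tokenizer usually exposes the same limit as model_max_length.
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("model_max_length:", tokenizer.model_max_length)

# Count the tokens in a string before sending it to the model.
text = "your very long string here"
print("token count:", len(tokenizer(text)["input_ids"]))
```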

If you want to squeeze as much information as possible into the 1024-token segments, you could remove stopwords from your strings before embedding them.
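A quick sketch of the stopword idea with NLTK (assuming the English stopword list is what you want):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stopword lists
stop_words = set(stopwords.words("english"))

def strip_stopwords(text):
    # Keep only the words that are not in the English stopword list.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(strip_stopwords("This is a very long string that we want to squeeze into fewer tokens"))
```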
