Text Classification of documents with high variance in lengths

by hanshupe - opened

I want to categorize text documents that range from 25 to 3,000 words in length. Language models like BERT support at most 512 tokens, and it looks like in some cases all 3,000 words are semantically relevant to the classification.

I wonder, if I use a model like Longformer, whether such high variance in document lengths causes any problems. Would it be better to train two separate classifiers for different length ranges?
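For concreteness, here is a rough sketch of the kind of setup I have in mind, using the Hugging Face transformers API (the model name, label count, and example documents are just placeholders):

```python
# Rough sketch, not working training code: shows how a Longformer
# classifier would ingest documents of very different lengths.
# num_labels and the example docs are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "allenai/longformer-base-4096"  # accepts up to 4096 tokens
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

docs = ["a 25-word document ...", "a 3000-word document ..."]

# Dynamic padding: each batch is padded only to its longest member,
# and the attention mask tells the model to ignore the padding, so
# short and long documents can share one model mechanically.
batch = tokenizer(docs, padding="longest", truncation=True,
                  max_length=4096, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted class per document
```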
