Vectorization of long text

#7
by marcelcramer - opened

Dear aari1995,

thanks again for the great model :)

I would like to know whether you have experience with creating embeddings for long text documents. In my use case, I have multiple documents/texts where each document is longer than 500 words. This exceeds the recommended 200-300 word length for BERT models when vectorizing the tokens, so my documents are being truncated heavily and a lot of information is lost. I need to find a way to create one embedding vector for each document without losing information through truncation.

Unfortunately, shortening the documents with pre-processing steps is not possible, since I have already done that as far as it goes. Using a summarization model/algorithm to condense each document to a suitable length before creating the embeddings is also not an option for me. Therefore, I am currently doing the following steps (a rough code sketch follows the list):

  1. Converting the document into tokens
  2. Splitting the tokens into chunks of 512 tokens
  3. Vectorizing each chunk to get one embedding vector per chunk (so I now have multiple vectors per document)
  4. Averaging (mean-pooling) the chunk vectors into one document vector
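
In code, the pipeline looks roughly like this. This is only a sketch: the model ID is a placeholder for the actual checkpoint, I assume a BERT-style tokenizer with `[CLS]`/`[SEP]` tokens, and I use mean pooling over the token embeddings, which may differ from the pooling your model card prescribes:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "your-checkpoint-here"  # placeholder, not the actual model ID
MAX_LEN = 512

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed_document(text: str) -> torch.Tensor:
    # Step 1: convert the document into token ids (no truncation yet).
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]

    # Step 2: split into chunks, leaving room for the [CLS] and [SEP] tokens.
    chunk_size = MAX_LEN - 2
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    chunk_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            attention_mask = torch.ones_like(input_ids)
            out = model(input_ids=input_ids, attention_mask=attention_mask)
            # Step 3: one vector per chunk (mean pooling over token embeddings
            # is an assumption here; swap in the pooling your model expects).
            chunk_vectors.append(out.last_hidden_state[0].mean(dim=0))

    # Step 4: average the chunk vectors into a single document vector.
    return torch.stack(chunk_vectors).mean(dim=0)
```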

Unfortunately, the averaged embedding vector is not a good representation of each document's content. I know this because my next step, clustering the embedding vectors using k-means, gives poor results.
I suspect the averaging of the chunk vectors is the problem, since it probably loses a lot of information.
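
For context, the clustering step looks roughly like this (again only a sketch: `documents` stands for my list of raw document strings, and the number of clusters here is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# `documents` is a placeholder for the list of raw document strings;
# embed_document() is the function from the sketch above.
doc_vectors = np.stack([embed_document(doc).numpy() for doc in documents])

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
labels = kmeans.fit_predict(doc_vectors)
```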

Do you have any experience with this topic? Could you recommend any other ways to create meaningful embeddings for my (long) documents?

Thank you and best regards,
Marcel

Any thoughts on this topic would be very helpful :)
