Embeddings: mean pooling
Model: jinaai/jina-embeddings-v2-base-de
Objective: classify whether two pages (or paragraphs) of text are similar.
My training data consists of text pairs, each labeled 0 or 1 depending on whether the two texts are similar:
- 30%: German text 1, German text 2
- 70%: English text 1, English text 2
Text 1 can fit within the 1024 limit, but text 2 is longer than 1024 about 90% of the time, so I truncate both at the 1024 limit (after cleaning up special characters). I pass the input_ids and attention masks for both texts through the model and take the mean of the last hidden state. I then pass these embeddings through a CosineEmbeddingLoss function. I have noticed two things:
1. Although the batch size is 8 and the training set has only about 430 samples, 10 epochs took 10 hours (roughly 54 steps per epoch, so more than a minute per step).
2. Inference on the validation set was really bad.
Any suggestions on possible improvements to my approach?
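
For reference, the pooling and loss steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact training code: random tensors stand in for the model's last hidden state, the helper names mean_pool and pair_loss are hypothetical, and the 0/1 labels are mapped to -1/+1 because torch.nn.CosineEmbeddingLoss expects targets of 1 or -1.

```python
import torch
import torch.nn as nn


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


def pair_loss(emb1, emb2, labels01, margin=0.0):
    """Cosine embedding loss for pairs labeled 0 (dissimilar) or 1 (similar)."""
    # CosineEmbeddingLoss expects targets in {-1, 1}, so map 0 -> -1 and 1 -> 1.
    targets = labels01.float() * 2 - 1
    return nn.CosineEmbeddingLoss(margin=margin)(emb1, emb2, targets)


# Stand-in tensors (assumption): batch of 2 pairs, 5 tokens, hidden size 8.
torch.manual_seed(0)
hidden1 = torch.randn(2, 5, 8)
hidden2 = torch.randn(2, 5, 8)
mask1 = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
mask2 = torch.tensor([[1, 1, 1, 1, 0], [1, 1, 0, 0, 0]])
labels = torch.tensor([1, 0])

loss = pair_loss(mean_pool(hidden1, mask1), mean_pool(hidden2, mask2), labels)
print(loss.item())
```

In a real run, hidden1 and hidden2 would be the last_hidden_state outputs of the model for text 1 and text 2, and mask1 and mask2 the attention masks returned by the tokenizer.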