Embeddings: Mean pool

#13
by sr33kant - opened

jinaai/jina-embeddings-v2-base-de
Objective: classify whether two pages (or paragraphs) of text are similar.

In my training data, 30% of the pairs are German (German text 1, German text 2, with label 0 or 1 depending on whether they are similar) and 70% are English (English text 1, English text 2, labeled the same way).

Text 1 usually fits within 1024 tokens, but text 2 exceeds that limit 90% of the time, so I truncate both at 1024 tokens (after cleaning up special characters). I pass the input_ids and attention masks for both texts through the model, take the mean of the last hidden state, and feed the resulting embeddings into a CosineEmbeddingLoss. I have noticed two things:

1. Although the batch size is 8 and I have only about 430 training samples, 10 epochs took 10 hours.
2. Inference on the validation set was really bad.

Any suggestions on possible improvements to my approach?
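For reference, here is a minimal sketch of the pipeline described above (attention-masked mean pooling over the last hidden state, fed into `CosineEmbeddingLoss`). Random tensors stand in for the actual encoder outputs so the snippet is self-contained; the shapes and pooling logic are the point. One pitfall worth double-checking: PyTorch's `CosineEmbeddingLoss` expects targets of +1/-1, not the 0/1 labels described above, so the labels must be remapped.

```python
import torch
import torch.nn as nn

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (B, 1)
    return summed / counts

# Dummy tensors standing in for the model's outputs on two truncated texts.
B, T, H = 8, 1024, 768
hidden1 = torch.randn(B, T, H)
hidden2 = torch.randn(B, T, H)
mask1 = torch.ones(B, T, dtype=torch.long)
mask2 = torch.ones(B, T, dtype=torch.long)

emb1 = mean_pool(hidden1, mask1)
emb2 = mean_pool(hidden2, mask2)

# CosineEmbeddingLoss targets must be +1 (similar) / -1 (dissimilar).
labels01 = torch.randint(0, 2, (B,))
targets = (labels01 * 2 - 1).float()                 # map {0,1} -> {-1,+1}
loss = nn.CosineEmbeddingLoss(margin=0.5)(emb1, emb2, targets)
```

If the 0/1 labels are passed in directly, the "dissimilar" class (target 0) contributes no meaningful gradient signal, which alone can explain poor validation behavior.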

Jina AI org

@sr33kant can you try:

1. taking the mean of the word embeddings, instead of the last hidden state
2. freezing the embedding model and training only a classification layer
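A sketch of what those two suggestions could look like. A plain `nn.Embedding` stands in for the encoder's embedding layer so the example is self-contained (with transformers you would get it via `model.get_input_embeddings()` and freeze with `model.requires_grad_(False)`); the `[u, v, |u - v|]` feature concatenation is one common pair-classification setup, not something prescribed in the thread.

```python
import torch
import torch.nn as nn

VOCAB, H = 30_000, 768

# Stand-in for the encoder's word-embedding layer.
word_embeddings = nn.Embedding(VOCAB, H)

def mean_pool(states, attention_mask):
    # Average over real tokens only, using the attention mask.
    mask = attention_mask.unsqueeze(-1).float()
    return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# (2) Freeze the embedding model; only the head below receives gradients.
word_embeddings.requires_grad_(False)

# Small trainable classification head over [u, v, |u - v|].
classifier = nn.Sequential(nn.Linear(3 * H, 256), nn.ReLU(), nn.Linear(256, 2))

B, T = 8, 128
ids1 = torch.randint(0, VOCAB, (B, T))
ids2 = torch.randint(0, VOCAB, (B, T))
mask = torch.ones(B, T, dtype=torch.long)

# (1) Mean of the word embeddings, not of the last hidden state.
u = mean_pool(word_embeddings(ids1), mask)
v = mean_pool(word_embeddings(ids2), mask)

logits = classifier(torch.cat([u, v, (u - v).abs()], dim=-1))
labels = torch.randint(0, 2, (B,))
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()  # only the classifier accumulates gradients
```

With the encoder frozen, each step only backpropagates through the small head, which should also cut the 10-hour training time dramatically.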
