Slightly different output from sentence-transformers vs. the transformers library vs. deploying the model in TorchServe

#9
by Mihail - opened

Hi there,

First of all, thanks for the great model! I wanted to use it within TorchServe, so I checked whether I get the same output when running the scripts provided in the model card with the sentence-transformers library, with the Hugging Face transformers library, and when serving the model in TorchServe, and I get slightly different embeddings in all three cases. Do you know what the reason might be? I applied mean pooling both in TorchServe and when using the transformers library. Has anyone noticed this and knows what the cause might be? Perhaps some random initialisation is at work here?

I fixed all possible seeds, and repeated runs of each setup give identical results, so random initialisation does not seem to be the cause.
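For anyone who wants to reproduce the check, something like this minimal sketch works (the model name below is just a placeholder, and the mean pooling follows the usual recipe from sentence-transformers model cards):

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

# Placeholder model name; substitute the model from this repo.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
sentences = ["This is a test sentence."]

# 1) sentence-transformers (forced onto CPU so both paths use the same device/numerics)
st_emb = SentenceTransformer(model_name, device="cpu").encode(sentences, convert_to_tensor=True)

# 2) plain transformers + mean pooling over non-padding tokens
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state

mask = encoded["attention_mask"].unsqueeze(-1).float()
hf_emb = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# If the sentence-transformers pipeline also normalises embeddings, mirror that step:
# hf_emb = torch.nn.functional.normalize(hf_emb, p=2, dim=1)

print(torch.max(torch.abs(st_emb - hf_emb)))
```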

Owner

Hi, thanks! No, there shouldn't be a difference, but could it be that you do the encoding in batches / as a list of strings?

Of course it could also be something with libraries and rounding, but the most prominent effect could be that you encode the inputs in batches. If so, and if not every input has the same token length, the tokenizer will pad each input to the same length by adding a "pad" token, thereby slightly distorting the shorter inputs.
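You can check the effect by encoding a short sentence once on its own and once batched together with a longer one; any difference between the two comes from padding / batched numerics, not from randomness. A quick sketch (the model name is just a placeholder):

```python
import torch
from sentence_transformers import SentenceTransformer

# Placeholder model name; substitute the model from this repo.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

short = "A short sentence."
long_text = "A much longer sentence that forces the short one to be padded when they are batched together."

emb_alone = model.encode([short], convert_to_tensor=True)[0]
emb_batched = model.encode([short, long_text], convert_to_tensor=True)[0]

# Any difference here comes from padding / batched numerics, not from randomness.
print(torch.max(torch.abs(emb_alone - emb_batched)))
```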

My suggestion is to use transformers and send your inputs in batches, but only with inputs of the same token length.

To do so:

  1. define a function that gives the token length of each input (using the tokenizer)
  2. group by token length and only put inputs of the same length into the same batch
  3. calculate again. Now transformers is on par with sentence-transformers for a single input, and your embeddings are clean (a sketch follows below).

(I don't know whether sentence-transformers does this automatically, but this is the proper way.)
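A rough sketch of those three steps with plain transformers and mean pooling (the model name is just a placeholder):

```python
from collections import defaultdict

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder model name; substitute the model from this repo.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

texts = ["first input", "second input", "a somewhat longer third input"]

# 1) token length of each input (from the tokenizer)
def token_length(text: str) -> int:
    return len(tokenizer(text)["input_ids"])

# 2) group by token length so only same-length inputs share a batch
groups = defaultdict(list)
for idx, text in enumerate(texts):
    groups[token_length(text)].append((idx, text))

def mean_pool(hidden, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# 3) encode each group; padding is a no-op here because every sequence in a
#    group already has the same token length
embeddings = [None] * len(texts)
with torch.no_grad():
    for items in groups.values():
        indices, batch_texts = zip(*items)
        enc = tokenizer(list(batch_texts), padding=True, return_tensors="pt")
        pooled = mean_pool(model(**enc).last_hidden_state, enc["attention_mask"])
        for i, emb in zip(indices, pooled):
            embeddings[i] = emb
```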
