Markus28 committed on
Commit
488c818
1 Parent(s): 0f4070e

Fix sorting heuristic


We saw issues where models instantiated via `AutoModel` performed poorly on MTEB. During evaluation, most embeddings produced by this model matched those of a working model, with only a few exceptions per batch. The mismatch appears to be caused by mixing `sorted` and `np.argsort`, which likely break ties differently when the input contains duplicate lengths. As a consequence, sentences with a unique length in their batch were embedded correctly, but sentences with non-unique lengths could end up swapped. This commit fixes the issue by deriving both the permutation and its inverse from the same `np.argsort` call.

Files changed (1)
  1. modeling_bert.py +3 -2
modeling_bert.py CHANGED
@@ -1211,8 +1211,9 @@ class JinaBertModel(JinaBertPreTrainedModel):
         self.to(device)

         # TODO: Maybe use better length heuristic?
-        inverse_permutation = np.argsort(np.argsort([-len(i) for i in sentences]))
-        sentences = sorted(sentences, key=len, reverse=True)
+        permutation = np.argsort([-len(i) for i in sentences])
+        inverse_permutation = np.argsort(permutation)
+        sentences = [sentences[idx] for idx in permutation]

         padding = tokenizer_kwargs.pop('padding', True)
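A minimal standalone sketch of the fixed heuristic (sentence values are made up for illustration): sorting and inverting are both derived from a single `np.argsort` permutation, so the round trip restores the original order even when several sentences tie on length.

```python
import numpy as np

# Two sentences ("aa" and "cc") tie on length, which is exactly the
# case where sorted() and np.argsort can break ties differently.
sentences = ["aa", "bbbb", "cc", "d"]

# One permutation sorts by descending length; its argsort is the
# inverse permutation that maps sorted positions back to input order.
permutation = np.argsort([-len(s) for s in sentences])
inverse_permutation = np.argsort(permutation)

sorted_sentences = [sentences[idx] for idx in permutation]

# Pretend sorted_sentences were embedded in this order; restoring the
# input order with the inverse permutation is exact, ties included.
restored = [sorted_sentences[idx] for idx in inverse_permutation]
assert restored == sentences
```

Because both directions come from the same permutation, any tie-breaking choice `np.argsort` makes is undone exactly; the old code's `sorted(..., key=len, reverse=True)` could order tied sentences differently from the `np.argsort`-derived inverse, swapping their embeddings.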