Markus28 committed on
Commit
488c818
1 Parent(s): 0f4070e

Fix sorting heuristic


We saw issues where models instantiated via `AutoModel` performed poorly on MTEB. During evaluation, most embeddings produced by this model matched those of a working model, with only a few exceptions per batch. The mismatch appears to be caused by mixing `sorted` and `np.argsort`, which likely break ties differently when the input contains duplicate lengths. As a consequence, sentences with a unique length in their batch were embedded correctly, but sentences with non-unique lengths could end up swapped. This commit fixes the issue by deriving both the permutation and its inverse from the same `np.argsort` call.

Files changed (1)
  1. modeling_bert.py +3 -2
modeling_bert.py CHANGED
@@ -1211,8 +1211,9 @@ class JinaBertModel(JinaBertPreTrainedModel):
         self.to(device)

         # TODO: Maybe use better length heuristic?
-        inverse_permutation = np.argsort(np.argsort([-len(i) for i in sentences]))
-        sentences = sorted(sentences, key=len, reverse=True)
+        permutation = np.argsort([-len(i) for i in sentences])
+        inverse_permutation = np.argsort(permutation)
+        sentences = [sentences[idx] for idx in permutation]

         padding = tokenizer_kwargs.pop('padding', True)
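A minimal standalone sketch of the fixed heuristic (sentence values are made up for illustration): sorting and inverting are both derived from a single `np.argsort` permutation, so the round trip restores the original order even when several sentences tie on length.

```python
import numpy as np

# Two sentences ("aa" and "cc") tie on length, which is exactly the
# case where sorted() and np.argsort can break ties differently.
sentences = ["aa", "bbbb", "cc", "d"]

# One permutation sorts by descending length; its argsort is the
# inverse permutation that maps sorted positions back to input order.
permutation = np.argsort([-len(s) for s in sentences])
inverse_permutation = np.argsort(permutation)

sorted_sentences = [sentences[idx] for idx in permutation]

# Pretend sorted_sentences were embedded in this order; restoring the
# input order with the inverse permutation is exact, ties included.
restored = [sorted_sentences[idx] for idx in inverse_permutation]
assert restored == sentences
```

Because both directions come from the same permutation, any tie-breaking choice `np.argsort` makes is undone exactly; the old code's `sorted(..., key=len, reverse=True)` could order tied sentences differently from the `np.argsort`-derived inverse, swapping their embeddings.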