intfloat committed
Commit
109caac
1 Parent(s): c921bc5

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -5287,7 +5287,7 @@ license: mit
 
 ## E5-mistral-7b-instruct
 
-**[TODO] Technical details on the model training and evaluation will be available before 2024-01-01.**
+**[TODO] Technical report on the model training and evaluation will be available before 2024-01-01.**
 
 Some highlights for preview:
 * This model is only fine-tuned for less than 1000 steps, no contrastive pre-training is used.
@@ -5305,19 +5305,17 @@ import torch.nn.functional as F
 
 from torch import Tensor
 from transformers import AutoTokenizer, AutoModel
-from transformers.file_utils import PaddingStrategy
 
 
 def last_token_pool(last_hidden_states: Tensor,
                     attention_mask: Tensor) -> Tensor:
-    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
     if left_padding:
-        return last_hidden[:, -1]
+        return last_hidden_states[:, -1]
     else:
         sequence_lengths = attention_mask.sum(dim=1) - 1
-        batch_size = last_hidden.shape[0]
-        return last_hidden[torch.arange(batch_size, device=last_hidden.device), sequence_lengths]
+        batch_size = last_hidden_states.shape[0]
+        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
 
 
 def get_detailed_instruct(task_description: str, query: str) -> str:
@@ -5336,7 +5334,7 @@ model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
 
 max_length = 4096
 # Tokenize the input texts
-batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=PaddingStrategy.DO_NOT_PAD, truncation=True)
+batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
 # append eos_token_id to every input_ids
 batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
 batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
@@ -5371,6 +5369,8 @@ Yes, this is how the model is trained, otherwise you will see a performance degradation.
 The task definition should be a one-sentence instruction that describes the task.
 This is a way to customize text embeddings for different scenarios through natural language instructions.
 
+Please check out [unilm/e5/utils.py](https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a692c6e4380bd43e70a40cd/e5/utils.py#L93) for instructions we used for evaluation.
+
 On the other hand, there is no need to add instructions to the document side.
 
 **2. Why are my reproduced results slightly different from reported in the model card?**
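For readers skimming the pooling change above, here is a minimal sketch (not part of the commit) of how the updated `last_token_pool` behaves; the toy tensor shapes and values below are invented purely for illustration.

```python
import torch
from torch import Tensor


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # Same logic as the updated README snippet in the diff above.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, every sequence ends at the final position.
        return last_hidden_states[:, -1]
    else:
        # With right padding, gather the hidden state of the last non-padding token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Toy batch: 2 sequences, 4 positions, hidden size 3 (values are arbitrary).
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
# Right-padded attention mask: the second sequence has only 2 real tokens.
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])
print(last_token_pool(hidden, mask).shape)  # torch.Size([2, 3]); row 1 comes from position 1, not the padded tail
```

The function handles both left- and right-padded batches, so it works regardless of how `tokenizer.pad` is configured.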
 
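As a usage note for the instruction FAQ touched by this commit: a minimal sketch of how queries and documents might be prepared, assuming the `Instruct: {task}\nQuery: {query}` template that the full model card uses for `get_detailed_instruct` (its body is not shown in this diff excerpt); the task description and example texts are placeholders.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Assumed template; see the full README for the authoritative version.
    return f'Instruct: {task_description}\nQuery: {query}'


# Hypothetical one-sentence task definition and example texts.
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [get_detailed_instruct(task, 'how much protein should a female eat')]
# Per the FAQ, documents are embedded without any instruction prefix.
documents = ['Protein requirements vary with age, sex, and activity level.']
input_texts = queries + documents  # feed to the tokenizer exactly as in the README snippet
```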