intfloat committed
Commit
109caac
1 Parent(s): c921bc5

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -5287,7 +5287,7 @@ license: mit
 
 ## E5-mistral-7b-instruct
 
-**[TODO] Technical details on the model training and evaluation will be available before 2024-01-01.**
+**[TODO] Technical report on the model training and evaluation will be available before 2024-01-01.**
 
 Some highlights for preview:
 * This model is only fine-tuned for less than 1000 steps, no contrastive pre-training is used.
@@ -5305,19 +5305,17 @@ import torch.nn.functional as F
 
 from torch import Tensor
 from transformers import AutoTokenizer, AutoModel
-from transformers.file_utils import PaddingStrategy
 
 
 def last_token_pool(last_hidden_states: Tensor,
                     attention_mask: Tensor) -> Tensor:
-    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
     if left_padding:
-        return last_hidden[:, -1]
+        return last_hidden_states[:, -1]
     else:
         sequence_lengths = attention_mask.sum(dim=1) - 1
-        batch_size = last_hidden.shape[0]
-        return last_hidden[torch.arange(batch_size, device=last_hidden.device), sequence_lengths]
+        batch_size = last_hidden_states.shape[0]
+        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
 
 
 def get_detailed_instruct(task_description: str, query: str) -> str:
@@ -5336,7 +5334,7 @@ model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
 
 max_length = 4096
 # Tokenize the input texts
-batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=PaddingStrategy.DO_NOT_PAD, truncation=True)
+batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
 # append eos_token_id to every input_ids
 batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
 batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
@@ -5371,6 +5369,8 @@ Yes, this is how the model is trained, otherwise you will see a performance degradation.
 The task definition should be a one-sentence instruction that describes the task.
 This is a way to customize text embeddings for different scenarios through natural language instructions.
 
+Please check out [unilm/e5/utils.py](https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a692c6e4380bd43e70a40cd/e5/utils.py#L93) for instructions we used for evaluation.
+
 On the other hand, there is no need to add instructions to the document side.
 
 **2. Why are my reproduced results slightly different from reported in the model card?**
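For readers skimming the pooling change above, here is a minimal sketch (not part of the commit) of how the updated `last_token_pool` behaves; the toy tensor shapes and values below are invented purely for illustration.

```python
import torch
from torch import Tensor


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # Same logic as the updated README snippet in the diff above.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, every sequence ends at the final position.
        return last_hidden_states[:, -1]
    else:
        # With right padding, gather the hidden state of the last non-padding token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Toy batch: 2 sequences, 4 positions, hidden size 3 (values are arbitrary).
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
# Right-padded attention mask: the second sequence has only 2 real tokens.
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])
print(last_token_pool(hidden, mask).shape)  # torch.Size([2, 3]); row 1 comes from position 1, not the padded tail
```

The function handles both left- and right-padded batches, so it works regardless of how `tokenizer.pad` is configured.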
 
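As a usage note for the instruction FAQ touched by this commit: a minimal sketch of how queries and documents might be prepared, assuming the `Instruct: {task}\nQuery: {query}` template that the full model card uses for `get_detailed_instruct` (its body is not shown in this diff excerpt); the task description and example texts are placeholders.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Assumed template; see the full README for the authoritative version.
    return f'Instruct: {task_description}\nQuery: {query}'


# Hypothetical one-sentence task definition and example texts.
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [get_detailed_instruct(task, 'how much protein should a female eat')]
# Per the FAQ, documents are embedded without any instruction prefix.
documents = ['Protein requirements vary with age, sex, and activity level.']
input_texts = queries + documents  # feed to the tokenizer exactly as in the README snippet
```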