How to extract general text embeddings

Opened by daisyyayueyue

Is it possible for you to share an example of extracting a general text embedding? I would love to test a clustering task based on semantic similarity.

Hi @daisyyayueyue,
I hope the code below helps you.

import torch.nn.functional as F

def embedding_generation(query):
    # query can be a single string or a list of strings
    batch = tokenizer(query, max_length=4096, padding=True, truncation=True, return_tensors='pt')
    # note: this pools the model's input (token) embeddings, not the contextual hidden states
    inputs_embeds = model.model.embed_tokens(batch.input_ids)
    embeddings = average_pool(inputs_embeds, batch['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    # scores = (embeddings[:2] @ embeddings[2:].T) * 100
    # print(scores.tolist())
    return embeddings
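
Since the original question was about clustering, here is a minimal follow-up sketch that feeds these embeddings into KMeans. It assumes scikit-learn is installed, that the tokenizer/model and the average_pool helper (posted further down in this thread) are already loaded, and that the example texts and n_clusters=2 are arbitrary choices of mine, not part of the original post:

import torch
from sklearn.cluster import KMeans

texts = [
    "The weather is sunny today.",
    "It looks like rain this afternoon.",
    "The stock market rallied on Friday.",
    "Shares fell sharply after the earnings report.",
]

with torch.no_grad():
    embs = embedding_generation(texts)   # (num_texts, hidden_dim), L2-normalized

# KMeans on L2-normalized vectors roughly corresponds to clustering by cosine similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embs.cpu().numpy())
print(labels)  # cluster assignment per text, e.g. weather vs. finance

Because the embeddings are L2-normalized, Euclidean distance is a monotone function of cosine similarity, so plain KMeans is a reasonable default here.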

@mrhimanshu

Where is average_pool defined? I only see last_token_pool.

@pduggi use the function below:

from torch import Tensor

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # zero out the padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
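
For intuition, a tiny self-contained check of what the masking does (the shapes and values here are made up for illustration):

import torch

hidden = torch.arange(12.).reshape(1, 4, 3)   # (batch=1, seq_len=4, hidden=3)
mask = torch.tensor([[1, 1, 1, 0]])           # the last position is padding
print(average_pool(hidden, mask))             # tensor([[3., 4., 5.]]) = mean of the first 3 rows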

@pduggi did it help you? :)

@mrhimanshu does this only work with a batch size of 1? Is there any way to use a larger batch size?

@mrhimanshu this worked for me:

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # with left padding, the last position is always a real token
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # with right padding, pick the hidden state of each sequence's last real token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


documents = [
    "As a general guideline...",
    "Definition of summit..."
]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
# model.to('cuda')

max_length = 4096

# tokenize without padding first so an EOS token can be appended to every sequence
batch_dict = tokenizer(documents, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)

# append the EOS token, then pad the batch to a common length and build the attention mask
batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
# batch_dict.to('cuda')

outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

embeddings = F.normalize(embeddings, p=2, dim=1)
embeddings = embeddings.tolist()

Move the model and the batch to 'cuda' (the commented-out lines) if you need the GPU.
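
On the batch-size question above: the snippet embeds every document in a single forward pass, which can run out of GPU memory for large corpora. A minimal sketch of splitting the corpus into mini-batches, reusing the tokenizer, model, max_length and last_token_pool defined above (the batch_size of 4 and the helper name embed_in_batches are arbitrary choices of mine):

import torch
import torch.nn.functional as F

def embed_in_batches(texts, batch_size=4):
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        batch = tokenizer(chunk, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
        batch['input_ids'] = [ids + [tokenizer.eos_token_id] for ids in batch['input_ids']]
        batch = tokenizer.pad(batch, padding=True, return_attention_mask=True, return_tensors='pt')
        # batch = batch.to('cuda')  # uncomment if the model is on the GPU
        with torch.no_grad():
            out = model(**batch)
        emb = last_token_pool(out.last_hidden_state, batch['attention_mask'])
        all_embeddings.append(F.normalize(emb, p=2, dim=1))
    return torch.cat(all_embeddings, dim=0)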

Does the above code work for text embeddings in a retrieval task? My data is in chunks of 3000 characters.
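
For retrieval, the model card for intfloat/e5-mistral-7b-instruct suggests embedding documents as-is and prefixing only the query with a task instruction. Below is a sketch based on my reading of the card, reusing the imports and objects defined above; the task description and query strings are placeholder examples. Chunks of roughly 3000 characters should fit comfortably under the 4096-token limit for English-like text.

task = 'Given a web search query, retrieve relevant passages that answer the query'
query = 'example user question about the corpus'

# query gets the instruction prefix; documents are embedded as-is
texts = [f'Instruct: {task}\nQuery: {query}'] + documents

batch_dict = tokenizer(texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
batch_dict['input_ids'] = [ids + [tokenizer.eos_token_id] for ids in batch_dict['input_ids']]
batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = F.normalize(last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask']), p=2, dim=1)

# cosine similarity of the query against each document chunk
scores = embeddings[:1] @ embeddings[1:].T
print(scores.tolist())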
