Default tokenization differs between sentence_transformers and transformers version

#6
by jokokojote

Using the provided example code for the sentence_transformers and transformers libraries leads to different embeddings for the same sentence. The cause is different truncation of the inputs: sentence_transformers uses a maximum sequence length of 128 tokens, while the transformers version uses 512.
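For reference, both limits can be read off directly. This short check uses the same model as the full reproduction below; max_seq_length and model_max_length are the standard sentence_transformers / transformers attributes, and the values in the comments are what I see for this model:

from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
hf_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

print(st_model.max_seq_length)        # 128 -> applied by SentenceTransformer.encode()
print(hf_tokenizer.model_max_length)  # 512 -> applied when truncation=True and no max_length is passed

Full reproduction: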

import torch 
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

st_model_id = 'paraphrase-multilingual-mpnet-base-v2'
hf_model_id = 'sentence-transformers/' + st_model_id

sentences = ['S' * 10000]  # dummy sentence, long enough to exceed both truncation limits

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'  # fall back to CPU if no GPU is available

st_model = SentenceTransformer(st_model_id, device=device)
st_tokenizer = st_model.tokenizer

print(st_model.get_max_seq_length()) # max seq length 128

st_tokens = st_model.tokenize(sentences)
print(st_tokens['input_ids'].shape) # seq length 128


# Load tokenizer from the Hugging Face Hub
hf_tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
# Tokenize sentences
hf_tokens = hf_tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

print(hf_tokens['input_ids'].shape) # seq length 512 ?

# Get embeddings with transformers and sentence_transformers
st_embedding = st_model.encode(sentences)

hf_model = AutoModel.from_pretrained(hf_model_id)
# Compute token embeddings
with torch.no_grad():
    hf_out = hf_model(**hf_tokens)

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Perform mean pooling as described in the README
hf_embedding = mean_pooling(hf_out, hf_tokens['attention_mask'])

# compare embeddings
embedding_diff = torch.abs(torch.tensor(st_embedding).to(device) - hf_embedding.to(device))
print(embedding_diff)
print(torch.sum(embedding_diff > 1e-6)) # embeddings for same sentence are different!

# use sequence length of 128 explicitly with transformers
hf_tokens = hf_tokenizer(sentences, padding=True, truncation='longest_first', max_length=128, return_tensors='pt') # tokenize with a max. sequence length of 128, as sentence_transformers does

# Compute token embeddings again with new tokens
with torch.no_grad():
    hf_out = hf_model(**hf_tokens)

hf_embedding = mean_pooling(hf_out, hf_tokens['attention_mask'])

# Compare embeddings again
print(torch.sum(torch.abs(torch.tensor(st_embedding).to(device) - hf_embedding.to(device)) > 1e-6)) # embeddings match!

Is this on purpose, or is it an error in the tokenizer configuration for the transformers version? I would suggest at least updating the transformers example code in the README to give a hint about this. Getting different "default" results for the same model can cause some confusion.
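For example, the README's transformers snippet could make the limit explicit, something along these lines (just a sketch; the variable names mirror the README example):

# Tokenize sentences, truncating to the 128-token limit that
# sentence_transformers applies by default for this model
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')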

Additional question: Why is the maximum input sequence length set to 128 tokens by default in sentence_transformers at all, when the architecture can support longer sequences?
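For what it's worth, the default can apparently be overridden per loaded model if longer inputs are needed. Continuing from the script above (512 matches the tokenizer's model_max_length for this model):

st_model.max_seq_length = 512                # lift the sentence_transformers default
print(st_model.get_max_seq_length())         # 512
long_embedding = st_model.encode(sentences)  # now truncates at 512 tokens, like the transformers default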

PS: I think the same problem affects other models as well, e.g. paraphrase-xlm-r-multilingual-v1.
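A quick way to check any other model for the same mismatch (I have not verified the values here beyond the models mentioned):

model_id = 'paraphrase-xlm-r-multilingual-v1'
print(SentenceTransformer(model_id).get_max_seq_length())                                  # sentence_transformers limit
print(AutoTokenizer.from_pretrained('sentence-transformers/' + model_id).model_max_length) # transformers limit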

