Why do you add a normalize layer at the end of the model? Does it affect how we fine-tune?

#13
by nps798 - opened

Hi,
I created this discussion to ask about the Normalize() layer in your model.
I don't see a normalize layer in some other embedding models.

Additionally, does this affect my fine-tuning strategy?
For STS (semantic textual similarity) we use cosine similarity. That makes sense for a model with normalization, but why do earlier models not have a normalize layer? Any rationale behind this would be helpful.

Thanks

nli-distilroberta-base-v2

SentenceTransformer(
   (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: RobertaModel 
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 )

BAAI/bge-large-en

SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Normalize()
 )
Beijing Academy of Artificial Intelligence org

Hi, dot product and cosine similarity are both commonly used methods for calculating sentence similarity. Many models employ cosine similarity, such as SimCSE. One advantage of cosine similarity is that it constrains similarity to the range from -1 to 1. Using a normalization layer makes computing cosine similarity more convenient.
For fine-tuning, you should set a temperature (0.01-0.05 works well) when using cosine similarity (https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/finetune/modeling.py#L93); the other settings are the same as when fine-tuning with dot product.
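
Not the exact FlagEmbedding code, but a minimal sketch (in PyTorch, with illustrative names) of that objective: cross-entropy over temperature-scaled cosine similarities, using the other passages in the batch as in-batch negatives.

import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.02):
    # q_emb: (batch, dim) query embeddings; p_emb: (batch, dim) positive passage embeddings.
    # L2-normalize, so the dot product below is cosine similarity.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    # (batch, batch) similarity matrix, scaled by the temperature.
    scores = q @ p.T / temperature
    # The i-th passage is the positive for the i-th query; the rest act as negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)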

Thanks for your prompt response.

Cosine similarity does not consider the Euclidean distance of different vectors; it just focuses on the angle between vectors. Without normalization, the embedding model would not be optimized well if using cosine similarity as the loss function. ---> Does this agree with your points?

Additionally,
I found out later that sentence-transformers/all-MiniLM-L6-v2 also includes a Normalize() in its architecture.

SentenceTransformer(
   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Normalize()
)

So maybe it's just nli-distilroberta-base-v2 that comes without Normalize().
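
As a side note, here is a minimal sketch (assuming the sentence-transformers library) of how to print a model's module list and still get unit-length embeddings from a model whose architecture lacks a Normalize() module, via the normalize_embeddings flag of encode():

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/nli-distilroberta-base-v2")
print(model)  # shows the module list: Transformer, Pooling, and (if present) Normalize

# Even without a Normalize() module, embeddings can be L2-normalized at encode time,
# which makes the dot product equal to cosine similarity.
emb = model.encode(["an example sentence"], normalize_embeddings=True)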

Beijing Academy of Artificial Intelligence org

Cosine similarity does not consider the Euclidean distance of different vectors; it just focuses on the angle between vectors.

Yes, cosine similarity only focuses on the angle between vectors.

Without normalization, the embedding model would not be optimized well if using cosine similarity as the loss function.

Computing cosine similarity requires normalizing the embeddings anyway: cosine(q, p) = (q^T p) / (||q|| * ||p||) = (q/||q||)^T (p/||p||) = norm(q)^T norm(p)
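
To make the identity concrete, here is a small numerical check with random vectors (illustrative only):

import numpy as np

q = np.random.randn(768)
p = np.random.randn(768)

# cosine similarity vs. dot product of the L2-normalized vectors
cosine = q @ p / (np.linalg.norm(q) * np.linalg.norm(p))
dot_of_normalized = (q / np.linalg.norm(q)) @ (p / np.linalg.norm(p))
assert np.isclose(cosine, dot_of_normalized)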
