Why do you add a normalize layer at the end of the model? Does it affect how we fine-tune?

#13
by nps798 - opened

Hi,
I created this discussion to ask about the Normalize() layer in your model.
I don't see a normalize layer in some other embedding models.

Additionally, does this affect my fine-tuning strategy?
For STS (semantic textual similarity) we use cosine similarity. That makes sense for a model with normalization, but why do earlier models not have a normalize layer? Any rationale behind this would be helpful.

Thanks

nli-distilroberta-base-v2

SentenceTransformer(
   (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: RobertaModel 
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
 )

BAAI/bge-large-en

SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Normalize()
 )
Beijing Academy of Artificial Intelligence org

Hi, dot product and cosine similarity are both commonly used methods for calculating sentence similarity. Many models employ cosine similarity, such as SimCSE. One advantage of cosine similarity is that it constrains similarity to the range from -1 to 1. Using a normalization layer makes computing cosine similarity more convenient.
For fine-tuning, you should set a temperature (0.01-0.05 works well) when using cosine similarity (https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/finetune/modeling.py#L93); the other settings are the same as when fine-tuning with dot product.
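
Not the exact FlagEmbedding code, but a minimal sketch (in PyTorch, with illustrative names) of that objective: cross-entropy over temperature-scaled cosine similarities, using the other passages in the batch as in-batch negatives.

import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb, temperature=0.02):
    # q_emb: (batch, dim) query embeddings; p_emb: (batch, dim) positive passage embeddings.
    # L2-normalize, so the dot product below is cosine similarity.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    # (batch, batch) similarity matrix, scaled by the temperature.
    scores = q @ p.T / temperature
    # The i-th passage is the positive for the i-th query; the rest act as negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)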

Thanks for your prompt response.

Cosine similarity does not consider the Euclidean distance of different vectors; it just focuses on the angle between vectors. Without normalization, the embedding model would not be optimized well if using cosine similarity as the loss function. ---> Does this agree with your points?

Additionally,
I found out later that sentence-transformers/all-MiniLM-L6-v2 also includes a Normalize() in its architecture.

SentenceTransformer(
   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Normalize()
)

So maybe it's just nli-distilroberta-base-v2 that comes without Normalize().
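
As a side note, here is a minimal sketch (assuming the sentence-transformers library) of how to print a model's module list and still get unit-length embeddings from a model whose architecture lacks a Normalize() module, via the normalize_embeddings flag of encode():

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/nli-distilroberta-base-v2")
print(model)  # shows the module list: Transformer, Pooling, and (if present) Normalize

# Even without a Normalize() module, embeddings can be L2-normalized at encode time,
# which makes the dot product equal to cosine similarity.
emb = model.encode(["an example sentence"], normalize_embeddings=True)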

Beijing Academy of Artificial Intelligence org

Cosine similarity does not consider the Euclidean distance of different vectors; it just focuses on the angle between vectors.

Yes, cosine similarity only focuses on the angle between vectors.

Without normalization, the embedding model would not be optimized well if using cosine similarity as the loss function.

Computing cosine similarity requires normalizing the embeddings anyway: cosine(q, p) = (q^T p) / (||q|| * ||p||) = (q/||q||)^T (p/||p||) = norm(q)^T norm(p)
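
To make the identity concrete, here is a small numerical check with random vectors (illustrative only):

import numpy as np

q = np.random.randn(768)
p = np.random.randn(768)

# cosine similarity vs. dot product of the L2-normalized vectors
cosine = q @ p / (np.linalg.norm(q) * np.linalg.norm(p))
dot_of_normalized = (q / np.linalg.norm(q)) @ (p / np.linalg.norm(p))
assert np.isclose(cosine, dot_of_normalized)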
