Could you please describe the loss function used in the first pretraining stage, and explain why it can align image features with text embeddings?