Finetune BGE models

#8
by Navanit-shorthills - opened

Hi,
I want to know how to fine-tune bge-small-en on my own datasets.
Is there any code or reference for it?

@Shitao Sir, kindly guide me to a reference if you know of one.

Beijing Academy of Artificial Intelligence org

Thanks for your interest in our work!
For the fine-tuning pipeline, see our GitHub repo: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune

@Shitao I have fine-tuned my model, and I have a doubt about it.
If I score training-set pairs with the fine-tuned model, will they get a 100% cosine similarity score or not?
Example
example set = {"query": "Five women walk along a beach wearing flip-flops.", "pos": ["Some women with flip-flops on, are walking along the beach"], "neg": ["The 4 women are sitting on the beach.", "There was a reform in 1996.", "She's not going to court to clear her record.", "The man is talking about hawaii.", "A woman is standing outside.", "The battle was over. ", "A group of people plays volleyball."]}

Checking the cosine similarity between "Five women walk along a beach wearing flip-flops." and "Some women with flip-flops on, are walking along the beach" should now give a score of 1.

Tell me if there is any misconception.
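A quick way to check this yourself is to encode both sentences and compute cosine similarity directly. The sketch below uses a plain NumPy cosine function with toy vectors; the commented-out model loading follows the FlagEmbedding README, and the checkpoint path is a placeholder you would replace with your own:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Loading a fine-tuned checkpoint (sketch, per the FlagEmbedding README):
# from FlagEmbedding import FlagModel
# model = FlagModel("path/to/your-finetuned-bge-small-en")
# q, p = model.encode(["Five women walk along a beach wearing flip-flops.",
#                      "Some women with flip-flops on, are walking along the beach"])
# print(cosine_similarity(q, p))  # typically high, but not necessarily 1.0

# Sanity check with toy vectors: similar but not identical directions
q = np.array([1.0, 0.9, 0.0])
p = np.array([1.0, 1.0, 0.1])
print(round(cosine_similarity(q, p), 3))  # close to, but below, 1.0
```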

Beijing Academy of Artificial Intelligence org

The similarity score of text pairs in the training set will not be 1.0. We optimize the model with a contrastive loss (InfoNCE), which aims to increase the gap between positive pairs and negative pairs. When the score of a positive pair is significantly larger than that of the negative pairs, the loss is already zero. Therefore, the similarity score of positive pairs does not need to be 100%.
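This can be illustrated numerically. The sketch below implements the standard InfoNCE formula (negative log-softmax of the positive score over all candidates) at the same temperature the training script uses; the scores are made up for illustration:

```python
import math

def info_nce_loss(pos_score: float, neg_scores: list[float],
                  temperature: float = 0.02) -> float:
    """InfoNCE: -log softmax of the positive score over all candidate scores."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max to keep the softmax numerically stable
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# Positive pair scores only 0.8 (not 1.0), negatives 0.2:
loss = info_nce_loss(0.8, [0.2] * 5)
print(loss)  # essentially 0: the gap already dominates, so the positive
             # similarity never needs to reach 1.0
```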

@Shitao thanks a ton. We were able to fine-tune the embedding model. Now we want to change the hyperparameters to improve our accuracy.
Below are the hyperparameters:
"""
torchrun --nproc_per_node {number of gpus} \
-m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-large-zh-v1.5 \
--train_data ./toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size {large batch size; set 1 for toy data} \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--query_instruction_for_retrieval ""
"""

Now, regarding "train_group_size": the default is 2. If we set it to 5, can the "pos":List[str] and "neg":List[str] groups stay the same, or do we have to enlarge them too?

Beijing Academy of Artificial Intelligence org

When generating batch data, we always sample exactly one positive for each query; a larger train_group_size samples more negatives from "neg":List[str].
So, if you set it to 5, the "neg" list should have about 5 items (if it has fewer, the program performs multiple samplings, which means some neg items may be repeated).
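A sketch of that sampling behavior (the helper and its name are hypothetical; the actual dataset code lives in the FlagEmbedding repo):

```python
import random

def build_group(example: dict, train_group_size: int,
                rng: random.Random) -> list[str]:
    """One positive plus (train_group_size - 1) negatives; negatives are
    re-sampled (with repetition) when the "neg" list is too short."""
    pos = rng.choice(example["pos"])
    num_negs = train_group_size - 1
    negs = example["neg"]
    if len(negs) >= num_negs:
        sampled = rng.sample(negs, num_negs)   # enough negatives: no repeats
    else:
        sampled = [rng.choice(negs) for _ in range(num_negs)]  # repeats allowed
    return [pos] + sampled

rng = random.Random(0)
example = {"pos": ["p1"], "neg": ["n1", "n2"]}
group = build_group(example, train_group_size=5, rng=rng)
print(group)  # 5 items; with only 2 negatives available, some repeat
```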

Hey @Shitao, here is a sample from the data set:

{"query": "A woman with violet hair rides her bicycle outside.", "pos": ["A woman is riding her bike."], "neg": ["A woman is jogging in the park.", "The street was lined with white-painted houses.", "A group watches a movie inside.", "man at picnics cut steak", "Several chefs are sitting down and talking about food."]}

Could you tell me the appropriate value of train_group_size for this data set, which has 5 negatives and 1 positive? I also have one more query: in your repo you gave a toy data set with 7 negatives, yet the example command sets train_group_size to 2. Can you explain why?

Beijing Academy of Artificial Intelligence org

Hi @amit44, a larger train_group_size will result in higher computational costs (more negatives per query). In our experiments, setting it between 2 and 8 is sufficient.

Hey @Shitao, I've fine-tuned the 'BAAI/bge-large-en-v1.5' model using your GitHub repository. It performs well in most cases, but in some cases the results are not as expected.

example set = {"query": "Five women walk along a beach wearing flip-flops.", "pos": ["Some women with flip-flops on, are walking along the beach"], "neg": ["The 4 women are sitting on the beach.", "There was a reform in 1996.", "She's not going to court to clear her record.", "The man is talking about hawaii.", "A woman is standing outside."]}

When checking the cosine similarity between the query and pos ("Five women walk along a beach wearing flip-flops." and "Some women with flip-flops on, are walking along the beach"), the similarity increases in the fine-tuned model compared to the off-the-shelf model, as expected. However, the cosine similarity between the pos and neg[0] ("Some women with flip-flops on, are walking along the beach" and "The 4 women are sitting on the beach.") also increases, when it is supposed to decrease.

Could you help me understand whether there is a misconception here, or why the cosine similarity between the positive and negative examples also increases?
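One observation that may be relevant (a toy calculation, not specific to your model): the InfoNCE loss described earlier in this thread depends only on the *gaps* between the positive and negative scores, because the softmax is shift-invariant. Training can therefore push all similarities upward together without any penalty, as long as the positive stays well above the negatives:

```python
import math

def info_nce(pos: float, negs: list[float], t: float = 0.02) -> float:
    """InfoNCE loss: -log softmax of the positive score over all scores."""
    logits = [pos / t] + [n / t for n in negs]
    m = max(logits)  # stabilize the softmax
    return -(logits[0] - m - math.log(sum(math.exp(l - m) for l in logits)))

# Same gap (0.3) between positive and negatives, different absolute levels:
low  = info_nce(0.5, [0.2, 0.2])
high = info_nce(0.9, [0.6, 0.6])  # every score shifted up by 0.4
print(abs(low - high) < 1e-9)  # True: the loss only sees the gaps
```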
