base model

#6
by ambivalent02 - opened

Thanks team for the great work. I wonder, did you pretrain an 8k-context XLM-RoBERTa version with MLM loss? Is there any increase/decrease in performance compared to the previous 512-sequence-length release? Thanks

Beijing Academy of Artificial Intelligence org

Thanks for your interest in our work!
We extend the position embeddings of the original xlm-roberta and use the RetroMAE method for further pretraining (see the appendix of our report: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf).
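
(A minimal sketch of this kind of position-embedding extension, assuming the simple approach of copying the trained 512-length table into a larger one before RetroMAE pretraining; the exact base checkpoint and initialization used for bge-m3 are described in the report and may differ:)

# Minimal sketch (an assumption, not the authors' exact code): grow the
# XLM-RoBERTa position-embedding table from 512 to 8192 tokens, keeping the
# trained rows and leaving the new positions at their fresh initialization.
import torch
from transformers import AutoConfig, AutoModel

old = AutoModel.from_pretrained("xlm-roberta-large")
cfg = AutoConfig.from_pretrained("xlm-roberta-large")
cfg.max_position_embeddings = 8194  # 8192 tokens + the 2 offset positions RoBERTa reserves
new = AutoModel.from_config(cfg)

state = old.state_dict()
state.pop("embeddings.position_embeddings.weight")  # shape changed, copied manually below
state.pop("embeddings.position_ids", None)          # buffer, present only in some versions
new.load_state_dict(state, strict=False)            # load everything else unchanged

with torch.no_grad():
    old_pos = old.embeddings.position_embeddings.weight
    new.embeddings.position_embeddings.weight[: old_pos.size(0)] = old_pos
# "new" can then be further pretrained with RetroMAE on long sequences.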

We didn't test the performance of the pre-trained model on other tasks (like GLUE). But on retrieval tasks, the new pre-trained model shows a significant improvement. We will add more results to our report and open-source the pre-trained model.

oh, I missed this point in the report, thanks for your response

will the xlm 8k model be released in the near future :D

Beijing Academy of Artificial Intelligence org

Hi @ambivalent02 , we have released our xlm-roberta-8k model: https://huggingface.co/BAAI/bge-m3-retromae
Besides, we have added results using different base models to our report (section 4.4).

really appreciate it, hope to see more advancements in the future, good luck team

@Shitao Another question: I notice that most reports on text embeddings omit checkpoint selection. How did you pick the checkpoint, and can you share the loss curves of the unsup and sup stages? Thanks!

Beijing Academy of Artificial Intelligence org

Due to issues with the machine, we had to restart the experiment multiple times, so there isn't a complete loss curve. Below is the loss from step 0 to 9000 in the unsup stage:
[image: loss curve for steps 0-9000 of the unsupervised stage]

The curve fluctuates significantly because we place data of the same task and length in each batch, so the retrieval loss varies from batch to batch.
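
(A minimal sketch of such task- and length-homogeneous batching; the grouping keys and data layout here are illustrative assumptions, not the actual bge-m3 training code:)

# Sketch: group examples by (task, length bucket) so each batch is homogeneous.
# Field names and the bucketing rule are assumptions for illustration.
import random
from collections import defaultdict

def homogeneous_batches(examples, batch_size, bucket=64):
    """examples: list of dicts with 'task' and 'query' fields (assumed layout)."""
    groups = defaultdict(list)
    for ex in examples:
        key = (ex["task"], len(ex["query"].split()) // bucket)  # task + rough length bucket
        groups[key].append(ex)
    batches = []
    for group in groups.values():
        random.shuffle(group)
        for i in range(0, len(group) - batch_size + 1, batch_size):
            batches.append(group[i:i + batch_size])  # every batch comes from a single group
    random.shuffle(batches)  # mix tasks across training steps, not within a batch
    return batches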

For model selection, we experimented with a small portion of the fine-tuning data to test the effectiveness of different unsup checkpoints and chose a comparatively better one.
For the sup stage, we just ran one epoch for both warm-up and unified fine-tuning, without tuning the number of training steps.

yeah, thanks team

@Shitao Any idea on this case? Thanks. I fine-tune your BGE-m3 unsup model on English data: MSMARCO, AllNLI, Natural Questions (3 hard negatives, concatenated and shuffled),
with your FlagEmbedding codebase:
python3 -m torch.distributed.run --nproc_per_node 6 --master_port 1234 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --deepspeed ds_config.json \
    --output_dir ./save_checkpoints \
    --model_name_or_path $1 \
    --tokenizer_name $1 \
    --train_data ./data/english_data/english_hard_mine_shuffle.jsonl \
    --per_device_train_batch_size 6 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --save_steps 10000 \
    --logging_steps 1000 \
    --negatives_cross_device \
    --dataloader_drop_last True \
    --query_max_len 256 \
    --passage_max_len 512 \
    --train_group_size 4 \
    --log_dir "./eval_log_${job_name}.jsonl" \
    --fp16 \
    --warmup_ratio 0.05 \
    --report_to "tensorboard"

and evaluate on a subset of the TriviaQA dataset (~3k queries + ~3k contexts). I observe that the model tends to lose its original knowledge and relearn:

  • BGE-m3 unsup (before fine-tuning): 'Recall@1': 0.73147, 'Recall@3': 0.86739, 'Recall@5': 0.89619
  • Step 10k: 'Recall@1': 0.19862, 'Recall@3': 0.28833
  • Step 30k: 'Recall@1': 0.30213, 'Recall@3': 0.44224

Furthermore, I see a similar pattern when I use Vie data and English+Vie data, whereas I didn't observe it in my past work with the 512-context-length unsup model.

Beijing Academy of Artificial Intelligence org

@ambivalent02 , the result looks very strange. There may be an issue with the model loading. Could you show the results of the first few steps (e.g., step 100, 1k, ...)?

@Shitao yep, let me rerun the first 1k steps. I just downloaded the cached model and load it as below, so I don't think the problem is model loading:
self.model = AutoModel.from_pretrained(model_name, cache_dir='./save_pretrained/models--BAAI--bge-m3-unsupervised/snapsho/7fc495886877e25ca4d5e6a48cfe5a8219b80b7a')
in modeling.py
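
(As an aside, one quick sanity check, assuming the public BAAI/bge-m3-unsupervised checkpoint name and the standard transformers API, is to load the model by name and confirm that a related query/passage pair scores higher than an unrelated one, which rules out a broken local cache:)

# Hedged sanity-check sketch, not part of the thread's original code.
import torch
from transformers import AutoModel, AutoTokenizer

name = "BAAI/bge-m3-unsupervised"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def embed(text):
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    cls = out.last_hidden_state[:, 0]  # bge-m3 uses the CLS vector for dense retrieval
    return torch.nn.functional.normalize(cls, dim=-1)

q = embed("who wrote the novel Moby-Dick")
p_pos = embed("Moby-Dick is a novel by Herman Melville, published in 1851.")
p_neg = embed("The Great Wall of China is over 13,000 miles long.")
print((q @ p_pos.T).item(), (q @ p_neg.T).item())  # the first score should be clearly higher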

Hi,
I am interested to know: what exactly is "compute_score"?

Beijing Academy of Artificial Intelligence org


For bge-m3, compute_score is a function to compute the scores (including dense scores, sparse scores, and colbert scores) of a query-passage pair.
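
(A short usage sketch based on the FlagEmbedding README; the weighting argument shown here is an assumption and may differ between versions:)

# Sketch: score a query-passage pair with bge-m3's compute_score.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
pairs = [["what is bge-m3?",
          "BGE-M3 is a multilingual embedding model supporting dense, sparse and multi-vector retrieval."]]
scores = model.compute_score(pairs, weights_for_different_modes=[0.4, 0.2, 0.4])
print(scores)  # dict with dense, sparse, and colbert scores plus their weighted combinations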


I am working on a project related to a tourist site. I have customers' experiences along with pictures related to those experiences. Our goal is to score the relevance between the pictures and the experiences. I extracted captions from the images, then used your model to compute similarity scores. But I don't have any labels for evaluating whether a pair is really relevant or not.
I thought maybe you could offer a suggestion for my problem.
Thanks

Beijing Academy of Artificial Intelligence org

A possible solution is to use a multi-modal LLM to evaluate the relevance.

@ambivalent02 : Hi,
Can you tell me about the data format you used for unsup training and how you created it?
