Sampling Negative Examples for training bge-m3

#32
by mehti - opened

Hello!

First of all, thank you for your incredible and outstanding work!

My question is fairly simple, but it might be tricky depending on the dataset: which approach should I use for sampling negative examples in order to fine-tune the bge-m3 model, as described in the project repo, for multilingual semantic search?
Currently I have (query, document) pairs and need to generate negative examples per record.

There are plenty of approaches, such as:

  1. Explicit Negatives - not really applicable here, because the data is static.
  2. Random Negatives - sampling a random document from another record.
  3. BM25 Negatives - quite popular, but introduces some biases when retrieving documents.
  4. Gold Negatives - similar to random negatives, but uses a specific document.
  5. In-batch Negatives - taking negatives from the other examples in the same batch.
  6. Cross-batch Negatives - more complicated; gathers negatives across batches/GPUs.
  7. Approximate Nearest Neighbors - quite costly; not sure it is reasonable to use.
  8. Hybrid - combines BM25 with another approach.
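To make item 5 concrete, here is a minimal sketch of how in-batch negatives work in a contrastive (InfoNCE-style) loss. This is a generic illustration in NumPy, not the actual bge-m3 training code; the function name and temperature value are my own choices:

```python
import numpy as np

def in_batch_negatives_loss(query_embs, doc_embs, temperature=0.05):
    """Contrastive loss with in-batch negatives.

    Each query's positive is the document at the same batch index;
    every other document in the batch serves as a negative.
    """
    # Similarity matrix, shape (batch, batch); diagonal = positives.
    scores = query_embs @ doc_embs.T / temperature
    # Softmax cross-entropy where the target class is the diagonal index.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    batch = len(query_embs)
    loss = -np.log(probs[np.arange(batch), np.arange(batch)]).mean()
    return loss
```

The appeal of this approach is that the negatives come for free: no extra forward passes are needed, since the documents are already encoded for the other queries in the batch.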

What would you suggest from your model's point of view? I could try them all, but I would prefer to save some time and follow the approach you used previously.

Thank you in advance.

Beijing Academy of Artificial Intelligence org
•
edited Mar 26

Thanks for your interest in our work!
We recommend using In-batch Negatives together with BM25 Negatives or Approximate Nearest Neighbors.
We provide a script to mine hard negatives (Approximate Nearest Neighbors). For the fine-tuning script, you can set use_inbatch_neg (default value is True) to use in-batch negatives.
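The core idea behind ANN-based hard-negative mining can be sketched as follows: search the encoded corpus for each query's nearest neighbors and keep the top-ranked documents that are not the gold positive. This is only an illustration with a brute-force dot-product search standing in for a real ANN index (e.g. FAISS), and the function name is hypothetical, not the repo's actual script:

```python
import numpy as np

def mine_hard_negatives(query_embs, corpus_embs, positive_ids, k=5):
    """For each query, return the k corpus documents most similar to it
    that are NOT its gold positive -- i.e. 'hard' negatives.

    A brute-force similarity search is used here for clarity; a real
    pipeline would use an approximate index over a large corpus.
    """
    sims = query_embs @ corpus_embs.T  # (n_queries, n_docs)
    hard_negs = []
    for qi, pos in enumerate(positive_ids):
        order = np.argsort(-sims[qi])            # most similar first
        negs = [d for d in order if d != pos][:k]
        hard_negs.append(negs)
    return hard_negs
```

The mined negatives are "hard" because they look similar to the query under the current encoder, which makes them much more informative during fine-tuning than random negatives.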

Hello!

Thanks for your quick response!

I am trying to reproduce (in a Jupyter notebook) the script you mentioned in your comment, and I get the following error:

File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:349, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    346 self._memory_tracker.start()
    348 # set the correct log level depending on the node
--> 349 log_level = args.get_process_log_level()
    350 logging.set_verbosity(log_level)
    352 # force device and distributed setup init explicitly

TypeError: TrainingArguments.get_process_log_level() missing 1 required positional argument: 'self'

The transformers version is 4.33.0 (as specified in the setup file).
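For what it's worth, this particular TypeError shape ("missing 1 required positional argument: 'self'") is what Python raises when an instance method is called on the class object itself, e.g. if the TrainingArguments class were passed to Trainer without being instantiated. This is only an illustration of that error pattern, not a confirmed diagnosis of the notebook:

```python
# Minimal reproduction of the error pattern: calling an instance method
# on the class itself instead of on an instance.
class TrainingArguments:  # stand-in class, not the real transformers one
    def get_process_log_level(self):
        return 20  # e.g. logging.INFO

# Correct: call the method on an instance.
args = TrainingArguments()
assert args.get_process_log_level() == 20

# Incorrect: calling on the class raises the same error shape.
try:
    TrainingArguments.get_process_log_level()
except TypeError as e:
    print(e)  # ... missing 1 required positional argument: 'self'
```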

If I upgrade transformers to the latest version (4.39.2), the following error arises instead:

File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:4255, in Trainer.create_accelerator_and_postprocess(self)
   4249 gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
   4251 # create accelerator object
   4252 self.accelerator = Accelerator(
   4253     deepspeed_plugin=self.args.deepspeed_plugin,
   4254     gradient_accumulation_plugin=gradient_accumulation_plugin,
-> 4255     **self.args.accelerator_config.to_dict(),
   4256 )
   4257 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
   4258 self.gather_function = self.accelerator.gather_for_metrics

AttributeError: 'NoneType' object has no attribute 'to_dict'

I have tried setting the additional accelerator_config argument on the RetrieverTrainingArguments class, but without success. Even instantiating the AcceleratorConfig class from transformers.trainer_pt_utils doesn't help, despite it being suggested in the transformers documentation code.

Any ideas or thoughts on how I can get rid of either of these errors?

FYI: I opened an issue on the transformers GitHub repo.
