Performance on other languages

#11
by DamianS89 - opened

Hey,
I tested this reranker on a list of questions within my domain. Simple setup - 150 relevant chunks get retrieved and reranked. I noticed that this reranker works really great on german input related to law... Is there an explanation for that?

It's not perfect, but better than other rerankers - including your newest multilingual bge-m3...

Do you have an idea whether it makes sense to:
a) fine-tune this model further on my specific domain
b) if so, how much data is needed to get a significant boost
c) what type of data is needed (and in what format) - for example, I have a list of 2 million questions and perfect answers to them

Best,

Damian

Beijing Academy of Artificial Intelligence org

Hi, thanks for your feedback!
Actually, we didn't use any German data to train the reranker (see https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker#fine-tune for the details of the training data).
Therefore, we highly recommend fine-tuning this reranker with German data.
Approximately a few thousand data points should be sufficient for fine-tuning.
We provide an example to show how to fine-tune the reranker: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker
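For illustration, here is a rough sketch (not the official tooling) of converting existing question/answer pairs into the JSONL format used in that example, i.e. one `{"query": ..., "pos": [...], "neg": [...]}` object per line. The `qa_pairs` variable and the random negative sampling are placeholders - ideally the negatives would be hard negatives mined with your retriever:

```python
import json
import random

# Sketch: convert (question, answer) pairs into the JSONL training format used by
# the FlagEmbedding fine-tuning examples.
# `qa_pairs` is a placeholder; the random negatives below should ideally be
# replaced by hard negatives mined with your retriever.
qa_pairs = [
    ("Wie lange ist die gesetzliche Kündigungsfrist?", "Die gesetzliche Kündigungsfrist beträgt ..."),
    # ... the rest of your ~2M question/answer pairs
]

all_answers = [answer for _, answer in qa_pairs]

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for question, answer in qa_pairs:
        candidates = random.sample(all_answers, k=min(8, len(all_answers)))
        negatives = [c for c in candidates if c != answer][:7]
        record = {"query": question, "pos": [answer], "neg": negatives}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```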

Besides, bge-m3 is a retrieval model. We recommend using it for retrieval and then employing a reranker for further filtering.
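A minimal sketch of that pipeline with the FlagEmbedding package (the `chunks` corpus and the top-150 cutoff are just placeholders matching your setup):

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel, FlagReranker

# Sketch: dense retrieval with bge-m3, then reranking the top hits with bge-reranker-large.
retriever = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

chunks = ["...", "..."]  # your document chunks
query = "Wie lange ist die gesetzliche Kündigungsfrist?"

# 1) Retrieval: embed corpus and query, keep the top 150 by dense similarity.
chunk_vecs = retriever.encode(chunks)["dense_vecs"]
query_vec = retriever.encode([query])["dense_vecs"][0]
top_ids = np.argsort(chunk_vecs @ query_vec)[::-1][:150]

# 2) Reranking: score (query, chunk) pairs with the cross-encoder.
pairs = [[query, chunks[i]] for i in top_ids]
scores = reranker.compute_score(pairs)
reranked = [chunks[i] for i, s in sorted(zip(top_ids, scores), key=lambda x: -x[1])]
```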

Thanks for the response! You are right - it was silly of me to use m3 as a reranker; I didn't read carefully. Overnight I prepared my data using m3 for retrieval and it even outperformed jina ai's new German embedding model! Kudos for that! Do you think there is also room to grow here specifically? Fine-tuning M3 on ~2M German data points?

Beijing Academy of Artificial Intelligence org

Fine-tuning on high-quality, task-relevant data is certain to yield improvements in downstream tasks. You can refer to our examples for fine-tuning: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune . If the fine-tuned results do not meet expectations, you can try increasing the batch size and mining hard negatives.
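By "hard negatives" we mean passages the retriever itself ranks highly but that are not correct answers. A rough sketch of the idea (the rank window and variable names are illustrative; the FlagEmbedding repository also provides a hard-negative mining script):

```python
import random
import numpy as np
from FlagEmbedding import BGEM3FlagModel

# Sketch of hard-negative mining: rank the corpus with the embedding model and
# sample negatives from just below the top results (ranks 10-100 here), skipping
# the gold answer. The window and variable names are illustrative, not tuned values.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

corpus = ["...", "..."]  # candidate passages
corpus_vecs = model.encode(corpus)["dense_vecs"]

def mine_hard_negatives(query, positive, n_neg=7, lo=10, hi=100):
    q_vec = model.encode([query])["dense_vecs"][0]
    ranked = np.argsort(corpus_vecs @ q_vec)[::-1]
    window = [corpus[i] for i in ranked[lo:hi] if corpus[i] != positive]
    return random.sample(window, k=min(n_neg, len(window)))
```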

We also recommend trying hybrid retrieval (dense + sparse).
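A minimal sketch of what hybrid scoring with bge-m3 could look like (the 0.5/0.5 weighting is an arbitrary example value, not a recommendation):

```python
from FlagEmbedding import BGEM3FlagModel

# Sketch of hybrid (dense + sparse) scoring with bge-m3; the 0.5/0.5 weights are
# arbitrary example values.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

query = "Wie lange ist die gesetzliche Kündigungsfrist?"
chunks = ["...", "..."]

q_out = model.encode([query], return_dense=True, return_sparse=True)
c_out = model.encode(chunks, return_dense=True, return_sparse=True)

hybrid_scores = []
for i in range(len(chunks)):
    dense = float(q_out["dense_vecs"][0] @ c_out["dense_vecs"][i])
    sparse = model.compute_lexical_matching_score(
        q_out["lexical_weights"][0], c_out["lexical_weights"][i]
    )
    hybrid_scores.append(0.5 * dense + 0.5 * sparse)
```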

Hey,
small feedback: fine-tuning bge-reranker-large on a small amount of data (approx. 2000 queries, each with 1 pos + 3 neg) works extremely well - I am actually surprised how well it works. The only downside is the max seq length of 512. Are you familiar with any alternative in the range of 768-1024 max seq length that can be used as a reranker?

The reason is basically that my embedding model is fine-tuned for a chunk size of 768. So the reranker gets chunks of that size and obviously a certain amount of info gets cut off.

Beijing Academy of Artificial Intelligence org

Hi, there are very few rerankers that can be used with a max length of 1024 directly. You can fine-tune a model with long-text capability (such as bge-m3 or jina-bert) on your data as a reranker. Note that you should fine-tune the model as a reranker using the reranker script linked above, not the script for the embedding model.
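To make that concrete, a sketch of what initializing bge-m3 as a cross-encoder reranker means; the classification head here is randomly initialized, so the scores only become meaningful after fine-tuning with the reranker script, and max_length=1024 is simply your target value:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sketch: load bge-m3's backbone with a single-logit classification head, i.e. a
# cross-encoder that scores (query, passage) pairs jointly. The head is randomly
# initialized, so scores are meaningless until after reranker-style fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-m3", num_labels=1)
model.eval()

pairs = [["Wie lange ist die Kündigungsfrist?", "Die gesetzliche Kündigungsfrist beträgt ..."]]
inputs = tokenizer(pairs, padding=True, truncation=True, max_length=1024, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.view(-1)
```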

I actually tried it with bge-m3, and the results were okay-ish (I tried it raw as well as via LM-Cocktail with several different weight settings). Here I was more or less underwhelmed.
Do you have any recommendations for:
a) how much data is needed to fine-tune a reranker like bge-m3?
b) what the data should look like? (I don't mean q, pos, neg, but more general advice on the content itself)
c) info about the weights (your default was 0.5 x 0.5)

In general I noticed that I could train with more data, but I am hitting hardware limits on my machine (2x 4090, TR + 256 GB RAM): when training with a lot of data (or trying to), I am forced to use batch size = 1 etc. Any tips on hyperparameter tuning in this kind of scenario?

Thanks

Beijing Academy of Artificial Intelligence org

@DamianS89 , it seems like you fine-tuned BGE-M3 as an embedding model to do re-ranking. We suggest initializing an AutoModelForSequenceClassification model from BGE-M3 (or another long-context model) and then fine-tuning it as a cross-encoder model (like https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker), rather than continuing to fine-tune it as an embedding model. Additionally, LM-Cocktail is meant for merging two models with the same functionality. After fine-tuning BGE-M3 into a reranker model, it is not recommended to merge it back with the original embedding model.

A few thousand training samples should be sufficient. The training data should align with the downstream task. For reranker models, you can increase the effective batch size by adjusting gradient_accumulation_steps. The most crucial parameter is train_group_size; in our experiments, generally, the larger it is, the better the performance.
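For illustration, the batch-size arithmetic with example values (not recommendations):

```python
# Example values only: gradient accumulation restores a larger effective batch when
# VRAM forces per_device_train_batch_size down, and train_group_size (1 positive +
# N-1 negatives per query) multiplies the query-passage pairs scored per update.
num_gpus = 2                      # e.g. 2x 4090
per_device_train_batch_size = 1   # limited by GPU memory
gradient_accumulation_steps = 32
train_group_size = 16             # 1 positive + 15 negatives per query

effective_queries_per_step = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
pairs_per_step = effective_queries_per_step * train_group_size
print(effective_queries_per_step, pairs_per_step)  # 64 queries, 1024 pairs per update
```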

Thanks.
No, I am doing it correctly (after you pointed it out in the beginning).
Embedding model: bge-m3 base
Reranker: bge-reranker-large, fine-tuned

So nope, I didn't use m3 as a reranker and/or merge a reranker and an embedding model into one.

DamianS89 changed discussion status to closed
