Fine-tuning on a larger dataset gives a lower eval result

#16
by glacierwn - opened

Hello,

I'm fine-tuning the model on a proprietary patent dataset with a max length of 512 and two sets of data: one with 218 rows and one with 5800 rows. Each query+instruction comes with 3 passages that form positive pairs; I didn't include any negative pairs in the dataset.
I train for 3 epochs and, for the larger dataset, use a learning rate of 1e-05.
I use Sentence Transformers to do the fine-tuning.
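
For reference, the training setup is roughly shaped like this (the base model name, batch size, toy data, and the loss are placeholders/assumptions rather than my exact script):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model -- a stand-in for the actual model being fine-tuned.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model.max_seq_length = 512

# Each (query + instruction, passage) positive pair becomes one training example.
# Toy pairs shown here; the real data comes from the proprietary patent dataset.
pairs = [
    ("Instruct: retrieve relevant passages. Query: wireless charging coil",
     "An induction coil used for contactless power transfer."),
]
train_examples = [InputExample(texts=[q, p]) for q, p in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# With positive-only pairs, MultipleNegativesRankingLoss treats the other
# passages in the batch as in-batch negatives (assumption -- the loss I
# actually used may differ).
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    optimizer_params={"lr": 1e-5},
)
```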
Evaluation is done by computing the correlation between anchor/target cosine similarities and the ground-truth scores from https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching.
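
Roughly, the evaluation looks like this (the model path and file path are placeholders, and I show Pearson correlation here, though Spearman works the same way):

```python
import pandas as pd
import torch
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

# Placeholder path to the fine-tuned checkpoint.
model = SentenceTransformer("output/finetuned-patent-model")

# train.csv from the Kaggle competition has `anchor`, `target`, and `score` columns.
df = pd.read_csv("us-patent-phrase-to-phrase-matching/train.csv")

anchor_emb = model.encode(df["anchor"].tolist(), convert_to_tensor=True)
target_emb = model.encode(df["target"].tolist(), convert_to_tensor=True)

# Cosine similarity per anchor/target pair, correlated with the ground-truth scores.
cos_sim = torch.nn.functional.cosine_similarity(anchor_emb, target_emb, dim=1)
corr, _ = pearsonr(cos_sim.cpu().numpy(), df["score"].values)
print(f"Pearson correlation: {corr:.4f}")
```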

Surprisingly, the model fine-tuned on 5800 rows gets a lower score than the one fine-tuned on the 218-row set.

Any idea what could have gone wrong?
