Random train/test split?

#3
by AmelieSchreiber - opened

The paper mentions that you randomly selected sequences for the test dataset. Shouldn't we consider something like sequence similarity, and also make sure that no sequences in the test dataset also appear in the training dataset, to prevent data leakage? I was thinking that clustering protein-protein complexes using a protein language model might be helpful in this regard. Also, have you released the code for how you trained the model? I would like to try to replicate what you have done, and perhaps also use a different train/test split.
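For what it's worth, here is a minimal sketch of the kind of leakage-free split I have in mind. Real pipelines typically cluster with MMseqs2 or CD-HIT (or embed sequences with a protein language model and cluster the embeddings); this toy version just approximates sequence similarity with 3-mer Jaccard overlap and keeps whole clusters on one side of the split. All function names and thresholds here are illustrative, not from the paper.

```python
# Sketch: similarity-aware train/test split to prevent leakage.
# Similarity is approximated with 3-mer Jaccard overlap; the 0.5
# threshold is arbitrary and only for illustration.

def kmers(seq, k=3):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similar(a, b, threshold=0.5):
    """True if two sequences share enough k-mers (Jaccard index)."""
    ka, kb = kmers(a), kmers(b)
    if not ka or not kb:
        return False
    return len(ka & kb) / len(ka | kb) >= threshold

def cluster_sequences(seqs, threshold=0.5):
    """Greedy single-linkage-style clustering by k-mer similarity."""
    clusters = []  # each cluster is a list of indices into seqs
    for i, s in enumerate(seqs):
        for c in clusters:
            if similar(seqs[c[0]], s, threshold):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def leakage_free_split(seqs, test_frac=0.2, threshold=0.5):
    """Assign whole clusters to test until test_frac is reached,
    so no similar pair is split across train and test."""
    clusters = cluster_sequences(seqs, threshold)
    train, test = [], []
    target = test_frac * len(seqs)
    for c in sorted(clusters, key=len):
        (test if len(test) < target else train).extend(c)
    return train, test
```

The key property is that two similar sequences can never end up on opposite sides of the split, since they land in the same cluster and clusters are assigned atomically.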

Gleghorn Lab org

Hi! Thanks for your interest.

There is a section in the discussion about how trimming for similarity greatly reduces the dataset size. There have been some recent efforts to address this by building new datasets from scratch with different strategies: https://www.biorxiv.org/content/10.1101/2023.12.12.571298v1

I think your idea is definitely a useful one. We have some larger nonredundant datasets compiled from STRING, BioGRID, IntAct, etc., where we did something similar.

We aren't releasing the training details of SYNTERACT until we submit it for peer review; we are waiting on experimental validation with collaborators. The SYNTERACT 2.0 preprint will be out within the year, and we plan to be more open source in that effort.

Awesome. Thanks so much for the response. I'm looking forward to SYNTERACT 2.0!

lhallee changed discussion status to closed
