How to fine-tune with a dataset of multiple .csv.gz files and add some features or labels? Example with the OAS database (antibody db)

#34
by BioinfoProteo - opened

Hi,
I would like to fine-tune ProtGPT2 on the OAS database.
When you download it, you get several .csv.gz files whose headers contain information such as the species, among others. Each line has several pieces of information, but I am interested in the numbering (a way to partition an antibody sequence into its regions) and the type of chain (heavy/light). How can I train ProtGPT2 on a sectioned sequence (using "/" to delimit each region, for example) together with some features or labels?

I did not see any information about supervised fine-tuning, so I would be thankful if anyone could guide me.

For more details, a line could look like this:

<|endoftext|>
EVQLVESGGGLVKPGGSLRLSCAAS/GFTFSSYT/MNWVRQAPGKGLEWVSS/ISTSSRY
I/YYADSMKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYC/AREFRQWLYGMDV/WGQGT
TVTVSS<|human|><|heavy|>
<|endoftext|>
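
As a rough sketch of how a block like the one above could be built from already-split regions: the helper below joins the regions with "/", wraps the result at 60 characters (the width the example appears to use), and appends the label tokens to the last line. The function name `format_record` and its parameters are hypothetical, chosen only for illustration.

```python
def format_record(regions, labels, width=60):
    """Build one training block from a list of regions and a list of label names.

    `format_record`, `regions`, `labels` and `width` are illustrative names,
    not part of ProtGPT2 or of the OAS files.
    """
    body = "/".join(regions)
    if not body:
        raise ValueError("cannot format an empty sequence")
    # Wrap the delimited sequence at a fixed width, as in the example above.
    lines = [body[i:i + width] for i in range(0, len(body), width)]
    # Label tokens follow the sequence directly, on its last line.
    lines[-1] += "".join(f"<|{label}|>" for label in labels)
    return "\n".join(["<|endoftext|>"] + lines)


# Regions read off the "/" delimiters in the example above.
example_regions = [
    "EVQLVESGGGLVKPGGSLRLSCAAS",
    "GFTFSSYT",
    "MNWVRQAPGKGLEWVSS",
    "ISTSSRYI",
    "YYADSMKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYC",
    "AREFRQWLYGMDV",
    "WGQGTTVTVSS",
]
print(format_record(example_regions, ["human", "heavy"]))
```

Each block starts with `<|endoftext|>`, so writing the blocks one after another gives the same layout as the example.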

Thank you


Interesting question!
I've seen a few people fine-tune in a supervised fashion with success, so it's possible.
The way you describe your processed dataset (with the labels human and heavy) sounds completely fine. If the database is large enough, the model will capture the extra tokens and generate sequences with them too. I only worry that the tokenizer might split those labels in a very weird way. How many such labels do you expect to have?

Good to know that others have achieved it!
Oh, I hadn't thought about the tokenizer splitting them. I can't think of any other way to encode my labels.
As for how many there are, I don't expect more than 10.

For the multiple .csv.gz files, would you suggest concatenating all the data into a single csv file?

I am also wondering whether you can train the model on two protein sequences that are in complex, and then ask it to predict a possible sequence that could bind the one given as input.
In any case, thank you very much for your reply!

Don't worry too much about the labels: they will be split the same way every time, so the model may be able to learn them.
I am not sure what you mean by the csv.gz files; I thought all your training sequences were in the same training.txt file.
The second question is also interesting. Theoretically, yes, you could fine-tune the model by having two sequences in complex. In that case I'd add a special token in between the two sequences so that the model doesn't assume they are the same sequence. Then during inference, you could input the first sequence and the special token, and the model should complete the second one. I assume you would need quite a lot of data for this to work but who knows. I think it's worth trying!
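
To make the idea concrete, here is a minimal sketch of the string handling only, not of the fine-tuning itself. The separator `<|sep|>` and the three helper names are hypothetical; ProtGPT2 does not ship with such a token, and the sequences in the usage lines are toy placeholders.

```python
SEP = "<|sep|>"          # hypothetical separator token, an assumption for this sketch
EOS = "<|endoftext|>"    # record delimiter used in the training file


def training_example(seq_a, seq_b):
    """One training record: first sequence, separator, then its binding partner."""
    return f"{EOS}\n{seq_a}{SEP}{seq_b}"


def inference_prompt(seq_a):
    """Prompt for generation: first sequence plus separator, partner left open."""
    return f"{EOS}\n{seq_a}{SEP}"


def extract_partner(generated_text):
    """Pull the completed second sequence out of the generated text."""
    after_sep = generated_text.split(SEP, 1)[1]
    # Stop at the next record delimiter if the model kept generating.
    partner = after_sep.split(EOS, 1)[0]
    return partner.replace("\n", "").strip()


# Toy placeholder sequences, for illustration only.
print(training_example("ACDEFG", "HIKLMN"))
print(extract_partner(inference_prompt("ACDEFG") + "HIKLMN\n" + EOS))
```

The prompt is exactly the training record with the second sequence cut off, which is what lets the model continue with a candidate partner.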

OK, great, thank you for the explanation about the splitting.
I'll have to reprocess the files to match the required input format.
Super, thank you so much for all your answers. I look forward to seeing what I can achieve with your model, it is so promising.
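
For the reprocessing step, a sketch along these lines could merge several .csv.gz files into one training.txt. It rests on assumptions that should be checked against the actual downloads: that the first line of each file holds JSON metadata with "Species" and "Chain" keys, that the second line is the CSV header, and that the regions sit in columns named as in `REGION_COLUMNS`. The function names are hypothetical too.

```python
import csv
import gzip
import json
from pathlib import Path

# Assumed region column names; adjust to the real header of the files.
REGION_COLUMNS = ["fwr1_aa", "cdr1_aa", "fwr2_aa", "cdr2_aa",
                  "fwr3_aa", "cdr3_aa", "fwr4_aa"]


def iter_records(path):
    """Yield (regions, labels) for every complete row of one .csv.gz file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        # Assumption: line 1 is a JSON metadata record, line 2 the header.
        metadata = json.loads(",".join(next(reader)))
        labels = [str(metadata[key]).lower() for key in ("Species", "Chain")]
        header = next(reader)
        for row in reader:
            record = dict(zip(header, row))
            regions = [record.get(column, "") for column in REGION_COLUMNS]
            if all(regions):  # skip rows with a missing region
                yield regions, labels


def write_training_file(input_dir, output_path, width=60):
    """Stream every *.csv.gz in input_dir into one text file; return the record count."""
    count = 0
    with open(output_path, "w") as out:
        for path in sorted(Path(input_dir).glob("*.csv.gz")):
            for regions, labels in iter_records(path):
                body = "/".join(regions)
                lines = [body[i:i + width] for i in range(0, len(body), width)]
                lines[-1] += "".join(f"<|{label}|>" for label in labels)
                out.write("<|endoftext|>\n" + "\n".join(lines) + "\n")
                count += 1
    return count
```

Reading the files one at a time and writing straight to the output avoids building one large concatenated csv first, and it keeps the per-file species and chain labels attached to the right rows.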
