Model for feature generation requires very high memory.

#5
by dipayan26 - opened

Feature generation for protein sequences of length ~1000 uses very high RAM; Google Colab's 12 GB of GPU memory runs into an 'out of memory' error after processing just 6 such protein sequences.

Rostlab org

Maybe try casting the model to half precision before running feature extraction.
Also, I would recommend using our ProtT5-XL model because it proved to be better in all of our benchmarks:
https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc
Also, if you only hit OOM after embedding 6 sequences of identical length, you have a memory leak somewhere.
Once you manage to embed a single protein of e.g. 1k residues, it should not make any difference whether you repeat the process x times.
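
A minimal sketch of what this could look like with transformers and PyTorch (the sequence string is a placeholder, and the no_grad/CPU-offload pattern is a common way to avoid the kind of leak described above, not something prescribed in this thread):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# torch_dtype=torch.float16 loads the weights in half precision,
# roughly halving GPU memory compared to float32.
model = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16
).to(device)
model.eval()

# T5Tokenizer (not AutoTokenizer) is used deliberately here; see the
# tokenizer issue discussed further down in this thread.
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)

# ProtT5 expects space-separated residues; this sequence is a placeholder.
sequence = " ".join("MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVF")
inputs = tokenizer(sequence, return_tensors="pt").to(device)

# no_grad() stops PyTorch from keeping activations for backprop --
# holding on to them across sequences is a typical cause of creeping
# OOM like the one described above.
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state

# Move the result off the GPU right away so its memory can be reused
# for the next sequence.
embedding = embedding.detach().cpu()
print(embedding.shape)  # (1, sequence_length + 1, 1024)
```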

When using the https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc model, the tokenizer raises the error "Exception: You're trying to run a Unigram model but you're file was trained with a different algorithm".

Rostlab org

Yeah, I guess you are running into this issue: https://github.com/huggingface/transformers/issues/9871
I think your problem should be solved by loading BertTokenizer or T5Tokenizer instead of AutoTokenizer.
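
A minimal sketch of that workaround (passing do_lower_case=False follows the Rostlab model card convention and is an assumption here, not something stated in this thread):

```python
from transformers import T5Tokenizer

# Loading the concrete (slow, sentencepiece-based) tokenizer class skips
# the slow-to-fast conversion that AutoTokenizer may attempt, which is
# what raises the Unigram error above.
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)
```

Passing use_fast=False to AutoTokenizer should have the same effect, since it also forces the slow tokenizer path.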

mheinz changed discussion status to closed
