Expanding the maximum input size (2048 tokens) of a pre-trained Geneformer?
#262
by patrick-yu · opened
Just wondering if there's any way to expand the maximum 2048-token input length for Geneformer (e.g., for larger inputs/datasets)?
Or is there an easy way to use/pretrain a different (e.g., BERT-like) model that accommodates >2048 input tokens while still reusing some of the learned weights from the pretrained (6L/12L) Geneformer?
Thanks in advance!
Thank you for your interest in Geneformer! Yes, you can use the pretraining code in the example on this repository and increase the maximum input size to pretrain a model with a larger input size on Genecorpus-30M.
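A minimal sketch of what that configuration change could look like, assuming the Hugging Face `transformers` BERT architecture that Geneformer is built on. The specific values below (vocabulary size, hidden size, layer/head counts) are illustrative assumptions, not confirmed Geneformer hyperparameters — take the actual values from the pretraining example in the repository and change only `max_position_embeddings`:

```python
from transformers import BertConfig, BertForMaskedLM

# Illustrative config sketch: the architecture hyperparameters below are
# assumptions for a small 6-layer model, NOT the official Geneformer values.
# The key change for a larger input size is max_position_embeddings.
config = BertConfig(
    vocab_size=25426,                # assumed gene-token vocabulary size
    max_position_embeddings=4096,    # increased from the default 2048
    num_hidden_layers=6,
    num_attention_heads=4,
    hidden_size=256,
    intermediate_size=512,
)

# A model built from this config accepts inputs up to 4096 tokens and
# would then be pretrained from scratch (e.g., on Genecorpus-30M).
model = BertForMaskedLM(config)
print(model.config.max_position_embeddings)
```

Note that because the learned position embeddings only cover 2048 positions in the released checkpoints, a model with a longer maximum input size generally needs to be pretrained rather than initialized directly from the existing weights.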
ctheodoris changed discussion status to closed