Downstream application to bulk RNA-seq and increasing input size

#134
by noamharel - opened

Hello,
First of all, thank you for the great work! Very interesting and useful.
I am looking into using the trained model for a downstream task of classifying bulk RNA-seq samples.
I see that the tokenizer and the model limit the input size to a vector of 2048 genes, and I wonder whether this could be increased while still using the embeddings successfully? I am asking because bulk RNA-seq data is much denser than single-cell RNA-seq, so a larger vector size could be more useful here.
Thank you in advance, Noam

Thank you for your question and interest in Geneformer! Yes, Geneformer employs fully dense attention across the input space of 2048 tokens. Dense attention ensures each gene attends to every other gene in the input, unlike sparse attention approaches, which usually combine local attention with sparse global attention. Sparse attention would be problematic for transcriptomes because, in a rank value encoding, genes that are closer together are not necessarily more informative than those that are more distant (and in fact are likely to be less informative than the top- and bottom-ranked genes). 2048 is a fairly large input size for fully dense attention, whose cost scales quadratically with input length; we selected this size because it fully encompassed 93% of the cells in the 30M-cell single-cell training corpus Genecorpus-30M, balancing minimal information loss against the required compute.

However, as you note, bulk RNA-seq has many more genes detected per sample. We have not tried using Geneformer with bulk RNA-seq data, and we would certainly recommend fine-tuning for this representation since it is out of distribution relative to what the model saw during pretraining. You could consider ways of selecting 2048 genes in an unbiased way from the initial bulk RNA-seq profile to present to the model, for example by intersecting with genes detected in that cell type in typical single-cell RNA-seq data (depending on whether you have single-cell data close enough to your cells of interest).
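As an illustration of that selection step, here is a minimal sketch assuming pandas-style inputs; `select_genes_for_model` and both of its arguments are hypothetical names for this example, not part of the Geneformer repository:

```python
import pandas as pd

def select_genes_for_model(bulk_counts: pd.Series,
                           sc_detected_genes: set,
                           max_genes: int = 2048) -> list:
    """Pick up to `max_genes` genes from one bulk RNA-seq sample.

    bulk_counts: Series mapping gene IDs to (normalized) expression values
    sc_detected_genes: genes detected in single-cell data from a
        cell type close to the cells of interest
    """
    # Intersect with the single-cell-detected gene set so the input
    # resembles the gene space the model saw during pretraining.
    shared = bulk_counts[bulk_counts.index.isin(sc_detected_genes)]
    # Drop undetected genes, then keep the top-ranked genes by expression,
    # consistent with the rank value encoding's emphasis on top-ranked genes.
    expressed = shared[shared > 0]
    return expressed.sort_values(ascending=False).head(max_genes).index.tolist()
```

The resulting gene list could then go through the usual tokenization pipeline like any single-cell profile.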

Alternatively, you could consider pretraining a model with an extended input size on a large corpus of bulk RNA-seq data (e.g. the Recount2 dataset). Our modeling approach is integrated with Hugging Face, so this would be relatively straightforward. For example, you could use the transcriptome tokenizer we provide in this repository but change the "truncate" function to use your larger input size as the upper limit. You may also want to generate new normalization factors for the bulk data using the code we provide for obtaining nonzero median digests. Then, you could choose any of the transformers models (https://huggingface.co/docs/transformers/index) that allow a larger input size and follow our pretraining example, substituting that model for BERT. This review may be helpful for choosing the right model for your goal: https://arxiv.org/abs/2009.06732
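To make those steps concrete, here is a minimal sketch assuming a genes-by-samples counts DataFrame for the bulk corpus; the `nonzero_medians` helper and all model hyperparameters are illustrative placeholders (here simply extending BERT's dense attention to a longer context), not the code or configuration actually used in this repository:

```python
import numpy as np
import pandas as pd
from transformers import BertConfig, BertForMaskedLM

EXTENDED_INPUT_SIZE = 4096  # also use this as the tokenizer's truncation limit

def nonzero_medians(counts: pd.DataFrame) -> pd.Series:
    """Per-gene median over nonzero values across bulk samples,
    serving as new normalization factors for the rank value encoding."""
    return counts.replace(0, np.nan).median(axis=1, skipna=True)

# Fully dense attention at an extended input size: cost is quadratic in
# length, so expect roughly 4x the attention compute of a 2048-token model.
config = BertConfig(
    vocab_size=25_426,                            # placeholder gene vocabulary size
    max_position_embeddings=EXTENDED_INPUT_SIZE,  # extended input size
    num_hidden_layers=6,
    num_attention_heads=4,
    hidden_size=256,
    intermediate_size=512,
)
model = BertForMaskedLM(config)
```

From there, pretraining proceeds with the standard Hugging Face Trainer and a masked-language-modeling data collator; if dense attention at this length is too costly, the review linked above surveys efficient alternatives that could be swapped in via the same config-and-model pattern.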

ctheodoris changed discussion status to closed
