ctheodoris/Geneformer · Classifying genes

Jul 6, 2023

Hello, and thank you for making Geneformer available. I have a few questions I would like to discuss. The first is about the tokenizer: could you explain in a bit more detail how it works? My understanding is that it takes a cell's sequence and assigns IDs based on which genes are expressed in it. Is that correct? Does the level at which each gene is expressed play a role, and if so, how?

I have seen examples where you classify genes within a cell (using token classification) or classify cells as diseased or not. However, if I have one set of genes with some feature and another set of genes without that feature, can I use Geneformer to predict which genes in a larger set will have the feature? Or is that not possible, so that I can only work within the boundaries of gene/token classification within a cell or cell classification?

One more question: since the order of the expressed genes in a cell's sequence does not really matter, at least not the way word order matters in text, what is the intuition for why token classification of genes would work?

ctheodoris

Owner Jul 7, 2023

Thank you for your interest in Geneformer! Regarding how the transcriptome tokenizer works, please see the section "Rank value encoding of single-cell transcriptomes" in our manuscript Methods. This is also relevant to your question about genes not being expressed in a positional sequence. As discussed in the manuscript, our method encodes the transcriptome of each single cell as a non-parametric rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M, prioritizing genes that distinguish cell state. This rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript count values, whereas the overall relative ranking of genes within each cell remains more stable. This also opens the opportunity to use transformer-based architectures for any data type that can be represented as ranks that encode relevant features, as opposed to purely sequential data.
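To make the ranking idea concrete, here is a minimal sketch of a rank value encoding in plain Python. It is an illustration only, not the Geneformer tokenizer itself: the function name `rank_value_encode`, the gene names, the normalization factors, the token IDs, and the `max_len` truncation are all hypothetical. It assumes each gene has a corpus-wide normalization factor and that undetected genes are left out of the encoding.

```python
from typing import Dict, List


def rank_value_encode(
    cell_counts: Dict[str, float],
    norm_factors: Dict[str, float],
    token_ids: Dict[str, int],
    max_len: int = 2048,  # hypothetical cap on the encoded length
) -> List[int]:
    """Return token IDs ordered from highest to lowest normalized expression."""
    total = sum(cell_counts.values())
    scored = []
    for gene, count in cell_counts.items():
        # Skip undetected genes and genes with no normalization factor or token.
        if count <= 0 or gene not in norm_factors or gene not in token_ids:
            continue
        # Scale by the cell's total counts, then by the corpus-wide factor,
        # so genes that are high in every cell are pushed down the ranking.
        scored.append((count / total / norm_factors[gene], gene))
    # Highest normalized expression first; the rank is the position in the list.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [token_ids[gene] for _, gene in scored[:max_len]]


# Made-up example values: GENE_A has the highest raw count, but it is also
# high across the (hypothetical) corpus, so it ends up ranked last.
counts = {"GENE_A": 50, "GENE_B": 5, "GENE_C": 10, "GENE_D": 0}
norm = {"GENE_A": 0.8, "GENE_B": 0.01, "GENE_C": 0.1, "GENE_D": 0.05}
vocab = {"GENE_A": 2, "GENE_B": 3, "GENE_C": 4, "GENE_D": 5}

tokens = rank_value_encode(counts, norm, vocab)
print(tokens)  # [3, 4, 2]
```

Note that the raw counts only influence the order of the tokens; the magnitudes themselves are not passed to the model in this sketch.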

Regarding your question about classifying genes that have a particular feature, if I understand your question, you are trying to classify genes that have feature X vs. genes that do not have feature X. This application can be accomplished by fine-tuning the model for gene classification by providing the model with example genes and their accompanying labels of having or lacking feature X. The model could then predict whether unseen genes have feature X or not.
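As a rough sketch of how such labels could be prepared for token classification, the snippet below assigns a label to each gene token in a tokenized cell. The helper `label_tokens`, the token IDs, and the gene sets are hypothetical. It assumes the common convention that positions labeled `-100` are ignored by the loss (the default `ignore_index` of PyTorch's cross-entropy loss), so genes with no known label do not contribute to training.

```python
from typing import List, Set

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def label_tokens(
    input_ids: List[int],
    positive_tokens: Set[int],
    negative_tokens: Set[int],
) -> List[int]:
    """Label each gene token: 1 = has feature X, 0 = lacks it, -100 = unknown."""
    labels = []
    for token in input_ids:
        if token in positive_tokens:
            labels.append(1)
        elif token in negative_tokens:
            labels.append(0)
        else:
            labels.append(IGNORE_INDEX)
    return labels


# Hypothetical tokenized cell and labeled gene sets (token IDs are made up).
cell_input_ids = [3, 4, 2, 7]
genes_with_x = {3}
genes_without_x = {2}

cell_labels = label_tokens(cell_input_ids, genes_with_x, genes_without_x)
print(cell_labels)  # [1, -100, 0, -100]
```

Genes held out from both sets stay unlabeled during fine-tuning, and the fine-tuned model's predictions at those positions are what would be read off for the unseen genes.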

We provide example code here for gene classification, cell classification, and in silico perturbation/treatment analysis. However, Geneformer can be fine-tuned for additional downstream tasks with the relevant fine-tuning task objective and is not at all bounded by these examples. We have integrated with Huggingface to allow users to take full advantage of their extensive and user-friendly tools. For example, Geneformer can be readily used for any of the BERT fine-tuning objectives that Huggingface provides (https://huggingface.co/docs/transformers/model_doc/bert) by swapping out the model type in the example classification notebooks when loading the model. Additional applications beyond those provided by Huggingface can be designed by adding a head layer specific to the task or by using the gene or cell embeddings output by Geneformer as the input to a model on top of Geneformer.
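A minimal sketch of what swapping the model type might look like with the `transformers` library is shown below. The task names, the `HEAD_CLASSES` mapping, and the `load_with_head` helper are hypothetical; the three BERT classes are standard `transformers` classes, and which one suits a given task is an assumption here rather than something the example notebooks prescribe. The model directory is whatever local path or hub ID holds the pretrained weights.

```python
# Hypothetical mapping from a downstream task to a BERT head class name.
HEAD_CLASSES = {
    "cell_classification": "BertForSequenceClassification",
    "gene_classification": "BertForTokenClassification",
    "masked_gene_prediction": "BertForMaskedLM",
}


def load_with_head(model_dir: str, task: str, num_labels: int = 2):
    """Load pretrained weights with the head class chosen for `task`."""
    # Imported lazily so the mapping above can be inspected without
    # having transformers installed.
    import transformers

    model_class = getattr(transformers, HEAD_CLASSES[task])
    if task.endswith("classification"):
        # Classification heads need to know how many labels to predict.
        return model_class.from_pretrained(model_dir, num_labels=num_labels)
    return model_class.from_pretrained(model_dir)


# Usage (requires transformers, torch, and access to the weights):
# model = load_with_head("path/to/pretrained/model", "gene_classification")
```

Only the head class changes between tasks; the pretrained encoder weights are loaded the same way in each case, and the newly added head is trained during fine-tuning.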

ctheodoris changed discussion status to closed Jul 7, 2023