possible source of inconsistency in gene count normalization
Hi,
While working on #102, I noticed that the gene count normalization with loom done here:
https://huggingface.co/ctheodoris/Geneformer/blob/77eb432430e05760c80cc931b0e17844fe18f391/geneformer/tokenizer.py#L173
might result in inconsistent behaviour.
Specifically, the count matrix is filtered to the genes found in GENE_MEDIAN_FILE
here https://huggingface.co/ctheodoris/Geneformer/blob/77eb432430e05760c80cc931b0e17844fe18f391/geneformer/tokenizer.py#L167, yet the counts are normalized by the total counts computed before filtering. Furthermore, the gene filtering is silent, so the user might not know about this inconsistent behavior.
I think this can be quite problematic in cases where the number of filtered genes is high.
Thank you for your interest in Geneformer and for your pull request! We will review the pull request as soon as possible.
Regarding your question here, it is actually quite important that the counts are normalized by total counts before filtering. The count normalization is done to account for differing sequencing depth that can result in variable scaling of the counts for each cell. It would be very problematic if the normalization by total counts was done without all of the genes being present as this would lead to inconsistency depending on the proportion of genes present in the cell that are protein-coding or miRNA genes.
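To illustrate the point above, here is a minimal NumPy sketch that compares the two orderings on toy data. It is not the tokenizer's actual code: the function names, the toy counts, the median values, and the `target_sum` of 10,000 are all assumptions made for the example, and every cell is assumed to have a nonzero total.

```python
import numpy as np

def normalize_then_filter(counts, in_vocab, medians, target_sum=10_000):
    """Scale by per-cell totals over ALL genes, then keep vocabulary genes."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # totals include filtered-out genes
    scaled = counts / totals * target_sum
    return scaled[:, in_vocab] / medians

def filter_then_normalize(counts, in_vocab, medians, target_sum=10_000):
    """Keep vocabulary genes first, then scale by the totals of what is left."""
    kept = np.asarray(counts, dtype=float)[:, in_vocab]
    totals = kept.sum(axis=1, keepdims=True)  # totals ignore filtered-out genes
    return kept / totals * target_sum / medians

# Toy data (hypothetical): 2 cells x 3 genes, the last gene is outside the vocabulary.
counts = np.array([[10, 30, 60],
                   [1, 3, 96]])
in_vocab = np.array([True, True, False])
medians = np.array([1.0, 2.0])  # hypothetical median factors for the 2 kept genes

before = normalize_then_filter(counts, in_vocab, medians)
after = filter_then_normalize(counts, in_vocab, medians)
```

Both cells have the same total of 100 counts, but the vocabulary genes make up 40% of the first cell and only 4% of the second. Normalizing before filtering preserves that tenfold difference (`[[1000, 1500], [100, 150]]`), while filtering first makes the two cells indistinguishable (`[2500, 3750]` for both).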
For your question about the filtering of genes, the transcriptome tokenizer filters for genes that are within the 25,424 protein-coding or miRNA genes that comprise the model vocabulary. These represent all protein-coding and miRNA genes present within the ~30 million cells from a broad range of human tissues in Genecorpus-30M. We focus on protein-coding and miRNA genes as these are likely to have the most relevant effects on the gene networks governing cell state. The 25,424 genes that comprise the model vocabulary are provided in the token dictionary. If users would like to add new genes to the token dictionary, they can certainly do so. There will not be a nonzero median normalization factor for these genes since they were not detected in any of the ~30 million cells in Genecorpus-30M. Therefore, one could choose to add a "neutral" normalization factor based on the provided normalization factors (e.g. could choose the median of these factors to represent the neutral state). However, it should be noted that these genes would not have been pretrained given their absence from the ~30 million cells in Genecorpus-30M. In general, the recommended usage would be to use the transcriptome tokenizer as provided and filter for genes within the model's 25,424 protein-coding or miRNA gene vocabulary that has been used for pretraining.
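For anyone who does want to try adding genes, the "neutral" factor idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `add_genes_with_neutral_factor`, the gene IDs, and the dictionary contents are hypothetical, and the sketch assumes the median and token dictionaries are plain Python dicts keyed by gene ID.

```python
import statistics

def add_genes_with_neutral_factor(gene_medians, token_dict, new_genes):
    """Return copies of both dictionaries extended with the new genes.

    Each new gene gets the median of the existing normalization factors
    as a "neutral" factor, and the next unused token ID.
    """
    neutral = statistics.median(gene_medians.values())
    medians = dict(gene_medians)
    tokens = dict(token_dict)
    next_id = max(tokens.values()) + 1
    for gene in new_genes:
        if gene in tokens:
            continue  # already in the vocabulary, keep its existing entries
        medians[gene] = neutral
        tokens[gene] = next_id
        next_id += 1
    return medians, tokens

# Hypothetical toy dictionaries, not the real Geneformer files.
gene_medians = {"ENSG_A": 1.0, "ENSG_B": 2.0, "ENSG_C": 4.0}
token_dict = {"<pad>": 0, "<mask>": 1, "ENSG_A": 2, "ENSG_B": 3, "ENSG_C": 4}

new_medians, new_tokens = add_genes_with_neutral_factor(
    gene_medians, token_dict, ["ENSG_NEW", "ENSG_A"]
)
```

In this toy case `ENSG_NEW` receives a factor of 2.0 and token ID 5, while `ENSG_A` is left unchanged. As noted above, a gene added this way was not seen during pretraining.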
Thank you again for your pull request and we will review it as soon as possible!
Thanks a lot @ctheodoris for the in-depth explanation; it greatly helped me understand the issue. I will modify the PR to respect that.