ctheodoris/Geneformer · ensembl_id and in silico perturbation

Jun 15, 2023

Hello, congrats on the work

I used the provided gene_name_id_dict.pkl to map gene names, and ~10k genes have no proper mapping. Most of these are pseudogenes, though around 500 are not. Do I need to remove the unmapped genes before fine-tuning and in silico perturbation?

For in silico perturbation, I used a subset of ~2k cells. The perturbation stats output (deleting all genes, across all cells) only included ~3k genes. Is this expected because of the low cell number or the similarity between the start and goal states? What threshold is applied when exporting the top perturbed stats?

Thank you~

ctheodoris

Owner Jun 15, 2023

Thank you for your question. The model's vocabulary is 25,424 protein-coding and miRNA genes (see token_dictionary.pkl). The gene_name_id_dict.pkl may not use the same gene names as your annotation. We suggest converting your gene names to Ensembl IDs with Ensembl BioMart, which should map as many of them as possible, and then running the transcriptome tokenizer to convert the data to rank value encodings. Genes outside the 25,424 in the model's vocabulary will not be tokenized by the transcriptome tokenizer, so you do not need to remove them yourself before tokenization.
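If it helps to see how many of your Ensembl IDs fall inside the vocabulary before tokenizing, a minimal sketch is below. It assumes token_dictionary.pkl unpickles to a dict keyed by Ensembl ID; the helper names (load_vocab, split_by_vocab), the toy vocabulary, its token values, and the out-of-vocabulary ID are made up for illustration and are not part of Geneformer.

```python
import pickle


def load_vocab(path):
    """Load a pickled dict; assumed here to map Ensembl ID -> token."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def split_by_vocab(ensembl_ids, vocab):
    """Partition IDs into those present in the vocabulary and those absent."""
    in_vocab = [g for g in ensembl_ids if g in vocab]
    out_of_vocab = [g for g in ensembl_ids if g not in vocab]
    return in_vocab, out_of_vocab


# Toy stand-in for the real dictionary (token values are arbitrary).
# With the real file you would call: vocab = load_vocab("token_dictionary.pkl")
toy_vocab = {"ENSG00000000003": 2, "ENSG00000000005": 3}

# The last ID is a hypothetical placeholder for a gene outside the vocabulary.
my_ids = ["ENSG00000000003", "ENSG00000000005", "ENSG_HYPOTHETICAL_UNMAPPED"]

kept, dropped = split_by_vocab(my_ids, toy_vocab)
print(f"{len(kept)} in vocabulary, {len(dropped)} outside")
```

The genes in the second list are the ones the tokenizer would skip, so this is only a diagnostic; nothing needs to be filtered by hand.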

Regarding how many genes to expect as output from the in silico perturbation, this depends on the number of genes tested in the perturbation, not on any threshold for reporting top predicted genes. The output should include all genes tested, along with statistics indicating whether each gene perturbation is statistically significant. The number of genes tested depends on various factors. As you mentioned, if the ~2K cells you selected are very similar, and only ~3K genes are detected in total among the ~2K cells, then only ~3K genes would be tested by in silico perturbation. However, if you are subsetting ~2K cells from a larger dataset, we would expect more diversity in the genes detected and therefore tested by perturbation. The mode you are running the in silico perturber in can also affect the number of genes in the output. In particular, if you are running a combination in silico deletion with an anchor gene, Gene A, then it will test only the cells where Gene A is detected, deleting each other gene in combination with Gene A. If Gene A is only detected in a few cells, then this will limit the co-expressed genes available for testing with in silico deletion.
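To make the counting argument concrete, here is a toy sketch of how the set of genes available for testing shrinks when only cells containing an anchor gene are considered. This is not Geneformer's implementation: the function genes_tested and the gene labels "A" to "E" are hypothetical, and each cell is represented simply as a list of its detected genes.

```python
def genes_tested(cells, anchor=None):
    """Return the set of genes detected across `cells`.

    cells  : list of per-cell lists of detected genes.
    anchor : if given, only cells where the anchor is detected are used,
             and the anchor itself is excluded from the result.
    """
    if anchor is not None:
        cells = [cell for cell in cells if anchor in cell]
    detected = set()
    for cell in cells:
        detected.update(cell)
    detected.discard(anchor)  # no-op when anchor is None
    return detected


# Three toy cells; gene "A" is detected in only one of them.
cells = [["A", "B", "C"], ["B", "C", "D"], ["D", "E"]]

all_genes = genes_tested(cells)               # every gene detected in any cell
with_anchor = genes_tested(cells, anchor="A")  # partners co-detected with "A"
print(len(all_genes), len(with_anchor))
```

Running the same kind of count on your own tokenized subset (number of distinct genes across the ~2K cells) would show whether ~3K is simply the number of genes detected there.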

That being said, the number of ~3K genes is much lower than any number we had as output in our analyses. Please see Supplementary Tables 3, 5, and 12 as examples. The number of genes we are perturbing in these analyses in cardiomyocytes is ~17-19K.

ctheodoris changed discussion status to closed Jun 15, 2023