Disagreement between 30M and 95M cell state perturbation results
Hi Christina!
I have run cell state perturbation with both the 30M-finetuned and 95M-pretrained models and obtained results for all genes. My setup iterates over all genes to see which gene most shifts my cells from the initial state toward the target state.
However, I found a disagreement in N_detections between the two sets of results. I used exactly the same dataset for the perturbation, yet the genes' N_detections differ substantially between the two models.
For example, for the gene 'C7', N is 3628 in the 30M results but only 142 in the 95M results. The same pattern holds for almost all genes.
In addition, the 30M results iterate over 17802 genes, whereas the 95M results contain only 9672 rows.
What causes this large difference?
Thank you for your question.
Generally in our experiments the perturbation results (cosine shifts) are correlated between the two models, though of course the results will not be exactly the same as the models are different sizes/shapes, trained on different data, and have different weights.
With regard to the N detections, we suggest making sure you are using the correct dictionary for each model; otherwise a given token will represent a different gene. The dictionary needs to be provided to the in silico perturber and the stats module, as well as the tokenizer, etc.
Since the models have different genes in the dictionary, there may be some genes detected in one and not the other. Furthermore, the different input sizes and the different gene median values may result in some genes being included in one model and not the other, depending on whether they fall outside the given input size for the rank value encoding.
Finally, since one of the models takes longer to run, we recommend confirming you aren't reaching a time limit for your job that would lead to fewer cells being analyzed for the larger model.
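On the dictionary point above, here is a minimal sketch of how you could confirm that the two models' dictionaries assign different token ids to the same gene, which is why the matching dictionary must be supplied at every step. The file names and the Ensembl ID for C7 are placeholders to adjust to your setup:

```python
import pickle

# Placeholder paths: point these at the token dictionaries distributed with
# each model (the 30M and 95M models ship with different dictionaries).
with open("token_dictionary_gc30M.pkl", "rb") as f:
    dict_30m = pickle.load(f)
with open("token_dictionary_gc95M.pkl", "rb") as f:
    dict_95m = pickle.load(f)

gene = "ENSG00000112936"  # Ensembl ID for C7 (verify against your gene mapping file)

# The same Ensembl ID generally maps to a *different* token id in each
# dictionary, so mixing dictionaries makes a token represent another gene.
print("C7 token in 30M dict:", dict_30m.get(gene))
print("C7 token in 95M dict:", dict_95m.get(gene))
print("30M vocabulary size:", len(dict_30m))
print("95M vocabulary size:", len(dict_95m))
```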
Thank you so much for your reply.
I'm very sure the correct dictionary is used for the respective model when tokenizing, perturbing, and computing stats. Besides, I'm sure no genes fall outside the 4098 input size, because every cell in my test dataset has fewer than 4098 genes.
However, the 95M model did run for a very long time, about 4 weeks. Although I did not notice any warnings or errors and the code seemed to run through smoothly, I am concerned by your point that a time limit could affect the N values.
I will further check some genes like 'C7'… to see their actual counts in my tokenized dataset. If there is still a disagreement between the perturbation results and the tokenized dataset, I think there may be a potential bug that would hinder the scientific findings.
If your code ran through without errors, there shouldn't be an inherent issue with the N values. Sometimes clusters have time limits, though, that may end jobs early. If you don't think this is an issue, I would simply count the number of times a particular gene occurs in the tokenized dataset and ensure it matches your in silico perturbation results. If it doesn't, I would start by extracting the data from the intermediate in silico perturbation results files and checking the number of occurrences of the genes without using the stats module. That will help narrow down whether there is a problem at either of those steps.
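For instance, a minimal sketch of counting how often a gene appears in the tokenized dataset. File paths and the Ensembl ID are placeholders, and it assumes the tokenized dataset was saved with the Hugging Face datasets library, as the tokenizer does:

```python
import pickle
from datasets import load_from_disk

# Placeholder paths: the tokenized dataset and the token dictionary that was
# used to tokenize it (it must be the dictionary matching the model in question).
dataset = load_from_disk("my_dataset.dataset")
with open("token_dictionary_gc95M.pkl", "rb") as f:
    token_dict = pickle.load(f)

gene = "ENSG00000112936"  # Ensembl ID for C7 (verify against your gene mapping file)
gene_token = token_dict[gene]

# Number of cells whose rank value encoding contains this gene's token,
# i.e. the number of cells in which the gene could be detected and perturbed.
n_detections = sum(gene_token in cell for cell in dataset["input_ids"])
print(f"{gene} occurs in {n_detections} of {len(dataset)} cells")
```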
*Responding here to consolidate the discussion between this one and discussion 452.
You can extract the data in the in silico perturbation results files with the function read_dictionaries, or by writing an analogous one to answer your question. The format of those files, which you can examine by opening one of the pickle files, is a dictionary with keys being the perturbation and values being the cosine shifts for each cell in that batch. So you should search for entries whose key contains your desired gene's token number along with "cell_emb", indicating that the perturbation effect measured is on the cell's embedding. The total number of cosine shift values (combined across all the intermediate files) is the number of times this gene was perturbed and had the effect of its perturbation measured.
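For example, a rough sketch of such an analogous count over the intermediate pickle files; the directory name and token id are placeholders, and the key/value format is as described above:

```python
import pickle
from pathlib import Path

# Placeholder path: directory containing the intermediate in silico
# perturbation pickle files written during the run.
results_dir = Path("isp_output_95M")
gene_token = 12345  # placeholder: this gene's token id from the model's own dictionary

n_shifts = 0
for pickle_file in results_dir.glob("*.pickle"):
    with open(pickle_file, "rb") as f:
        batch_dict = pickle.load(f)
    for key, cosine_shifts in batch_dict.items():
        # keep entries whose key contains the gene's token together with
        # "cell_emb", i.e. the effect of perturbing that gene on the cell embedding
        if isinstance(key, tuple) and gene_token in key and "cell_emb" in key:
            n_shifts += len(cosine_shifts)

print(f"token {gene_token}: {n_shifts} cosine shifts across all intermediate files")
```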