in-silico_perturbation/IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

#291
by miladmrv - opened

Hello, while I was trying to run the in-silico_perturbation script (the exact file provided in the examples) using the provided cardiomyopathy data, I encountered the following error: "IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)".

For your information, I am using the latest version from Hugging Face, running on an Amazon EC2 g5.8xlarge instance with GPU support. I would really appreciate any help resolving this problem. I'm attaching the code and error message below:

Screenshot 2024-01-18 at 1.07.47 AM.png
Screenshot 2024-01-18 at 1.08.05 AM.png

Thanks,
Milad

Thank you for your interest in Geneformer! Unfortunately, we are unable to reproduce this error. When running with the provided example code as is and also when running the code with your modifications (in max_ncells and forward_batch_size), the code runs without error (by the way, in the future it would be great to paste in the text of your code rather than an image so we can directly copy it and ensure we caught all the changes). It looks like in your case the code is not able to finish the first batch of 50 and encounters this error - is that correct? Could you try changing the batch size to check if that's related?

Recently, I ran into the same issue with an in-silico perturbation script that previously worked without errors. The problem arose during the execution of the cos_sim_shift function within the perturber_utils.py file, specifically when handling the end_emb variable. This variable, originating from state_embs_dict[state] in the quant_cos_sims function and passed into the cos_sim_shift function, unexpectedly had the shape torch.Size([256]), despite the anticipated shape being torch.Size([1, 256]).

To address this, I modified the cos_sim_shift function by inserting end_emb = end_emb.unsqueeze(0) right before the cosine similarity computation, which successfully resolved the error. This adjustment ensures end_emb matches the expected dimensionality, facilitating the correct computation of cosine similarity.
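To make the failure mode concrete, here is a minimal, pure-Python sketch of the shape problem (not the actual perturber_utils.py code; `cos_sim_dim1` and `unsqueeze0` are hypothetical stand-ins for the torch operations, using nested lists in place of tensors):

```python
import math

def unsqueeze0(vec):
    """Stand-in for tensor.unsqueeze(0): wrap a 1-D vector into shape (1, n)."""
    return [vec]

def cos_sim_dim1(a, b):
    """Rough stand-in for torch.nn.CosineSimilarity(dim=1): both inputs must
    be 2-D (batch, features); a 1-D input has no dimension 1 to reduce over."""
    for x in (a, b):
        if not isinstance(x[0], list):
            raise IndexError(
                "Dimension out of range (expected to be in range of [-1, 0], but got 1)"
            )
    if len(b) == 1:
        b = b * len(a)  # broadcast the single comparison row across the batch

    def cos(u, v):
        dot = sum(p * q for p, q in zip(u, v))
        return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

    return [cos(u, v) for u, v in zip(a, b)]

original_cell_emb = [[1.0, 0.0, 0.0]]   # shape (1, 3), standing in for (1, 256)
end_emb = [1.0, 0.0, 0.0]               # shape (3,), the problematic 1-D embedding

try:
    cos_sim_dim1(original_cell_emb, end_emb)
except IndexError as err:
    print("IndexError:", err)

# The reported fix: give end_emb a leading batch dimension first.
print(cos_sim_dim1(original_cell_emb, unsqueeze0(end_emb)))  # → [1.0]
```

The same logic explains the original traceback: with `dim=1`, a 1-D tensor of shape `[256]` only has dimensions 0 and -1, hence "expected to be in range of [-1, 0], but got 1".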

However, it remains unclear why the dimensionality of state_embs_dict[state] differs across various environments, leading to this discrepancy. I'm sharing this solution in hopes it might assist others facing similar issues and to potentially spark a discussion on the underlying cause of such dimensionality variances.

Hiroshi

Hello Hiroshi,

Thank you very much for your response! I was also able to resolve this issue in a similar way by adding "unsqueeze(0)".

Hello Christina,

Thanks for your comment. The specific problem I was facing has been resolved, so I can now run the "in_silico_perturbation" code without errors. However, I am still struggling to choose values for "max_ncells" and "forward_batch_size" when "genes_to_perturb" is set to "all". Basically, I was not able to run the code with more than 200 cells (i.e., max_ncells = 200) without a memory error, even though I am using a fairly powerful GPU instance, "g5.16xlarge (64 cores - 256 GB)". In your example script you set max_ncells = 2000, and I am wondering how these parameters affect accuracy.

Thanks,
Milad

Thank you for following up.

@Happy-Thomas Thank you for the information; it helps us look into this, since we were not able to reproduce the error before and so could not tell where it was coming from. To clarify: for the state_embs_dict you are passing to the model, now that it is generated separately, could you let us know the dimensions of the embeddings for the start and end states in your dictionary before they are passed to the model for in silico perturbation?

@miladmrv For the memory issues, have you tried reducing the forward_batch_size? If so, and it's not the individual batch that is causing the memory issue but rather the accumulation of data across cells, we currently have it set to clear the memory every 1000 cells. If you are not able to get past 200 cells, then you could try setting this to a lower amount. The output is saved every 100 cells, so you could set this to 200, for example, in line 854 in in_silico_perturber.py (commit 316d817). The model takes up some fixed space so as you expand the size of your GPU, the space available for data batches can become quite a bit larger. We have performed these in silico perturbations on 40G and 80G GPUs, so if your GPU memory is smaller than that, you may need to adjust from the default batch size etc.
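As a rough illustration of the save/clear cadence described above (the loop and constants below are a hypothetical sketch, not the actual in_silico_perturber.py code): output is flushed every 100 cells, and lowering the clear interval from 1000 to, say, 200 caps how much accumulates in memory between clears.

```python
SAVE_EVERY = 100   # output is saved every 100 cells
CLEAR_EVERY = 200  # lowered from the default 1000, per the suggestion above

buffer, saves, clears = [], 0, 0
for cell in range(600):          # pretend we process 600 cells
    buffer.append(cell)          # accumulate per-cell results (sketch)
    if (cell + 1) % SAVE_EVERY == 0:
        saves += 1               # flush results to disk (sketch)
    if (cell + 1) % CLEAR_EVERY == 0:
        buffer.clear()           # free accumulated data
        clears += 1

print(saves, clears, len(buffer))  # → 6 3 0
```

Because saving happens more often than clearing, nothing is lost by clearing more aggressively; the trade-off is only a bit of extra I/O.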

@miladmrv
Glad to hear that it worked out well!

@ctheodoris
Thank you for your comment. It's indeed an exciting and interesting model, and we are currently exploring its applicability to our data. Regarding the dimensions of the embeddings at the stage before in silico perturbation, our state_embs_dict contains a tensor of torch.Size([256]) for each of its three keys. Furthermore, the shapes during the subsequent calculations are:

- full_original_emb: torch.Size([1, 2048, 256])
- original_cell_emb: torch.Size([1, 256])
- full_perturbation_emb: torch.Size([2048, 2047, 256])
- perturbation_cell_emb: torch.Size([2048, 256])
- state_embs_dict[state]: torch.Size([256])
These calculations were performed on Google Colab with an A100 40GB. It seems the error arises because original_cell_emb, of size [1, 256], is passed to torch.nn.CosineSimilarity together with the 1-D state_embs_dict[state]. However, when I calculated the cosine similarity between tensors of torch.Size([1, 256]) and torch.Size([256]) on my local machine (not Google Colab), no error occurred, so it's unclear why the error arises during the cosine similarity calculation; a difference in PyTorch versions may explain the environment-dependent behavior, since broadcasting support in the cosine similarity functions has changed across releases.
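One way to see why an elementwise comparison can succeed while CosineSimilarity(dim=1) fails: broadcasting aligns trailing dimensions, so (1, 256) and (256,) are compatible for elementwise ops, but dim=1 simply does not exist on a 1-D tensor and is rejected before any broadcasting happens. A pure-Python sketch of the numpy/torch broadcasting rule (`broadcast_shape` is a hypothetical helper, not part of Geneformer):

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Numpy/torch-style broadcasting: align shapes from the trailing end;
    each dimension pair must be equal or contain a 1."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        out.append(max(x, y))
    return tuple(reversed(out))

# Elementwise ops can broadcast (1, 256) with (256,) ...
print(broadcast_shape((1, 256), (256,)))   # → (1, 256)
# ... but CosineSimilarity(dim=1) indexes dimension 1 up front, and a
# (256,) tensor only has dimensions 0 and -1, hence the IndexError.
```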

@ctheodoris
Thank you for all your assistance with addressing the memory issue in the in-silico perturbation task!

Another observation I've made relates to the efficacy of fine-tuning in in-silico perturbation. Upon testing across various oncology datasets, I've found that fine-tuning DOES NOT impact the prediction of perturbation shift_to_goal_state values. In fact, whether utilizing the 3-state hcm/dcm model provided in the paper or fine-tuning on CRC disease-specific data, the resulting significant genes remain very similar when applied to the same CRC data (instead, I've noticed a pivotal role played by "max_ncells" in the "InSilicoPerturber" function in determining the output genes).

I'm curious whether this observation about fine-tuning for in-silico perturbation has been documented elsewhere. (It's worth noting that this behavior doesn't seem to extend to other tasks like cell classification, where fine-tuning clearly remains important for accuracy.)

Thank you for your question! The purpose of fine-tuning is to better separate the classes within the embedding space so that the model can better distinguish which in silico perturbations shift between the now better-separated states. If the pretrained model has already well-separated the states, fine-tuning is likely not necessary and will likely not impact the results. Additionally, fine-tuning the model to separate given classes that are irrelevant to your classes is not necessarily expected to worsen separation of your classes of interest. Finally, in the case that the pretrained model does not well-separate the classes of interest, then the fine-tuning should be confirmed to appreciably improve the separation; otherwise the lack of change in performance is likely because the fine-tuning was not successful.

Regarding the max_ncells, this is expected to cause a large change in the output genes, as discussed elsewhere in this repository's discussions. Like any statistical test, the number of observations affects the power of the comparison. With more observations, more subtle differences are more likely to be detected as significant. Additionally, for certain relatively rarely expressed genes, the larger number of cells will mean these genes are detected at all to be included in the in silico perturbation test.
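The effect of the number of observations can be sketched with a back-of-the-envelope t-statistic (the effect size and spread below are made-up numbers, purely illustrative): the same shift becomes far more detectable at max_ncells = 2000 than at 200.

```python
import math

effect, sd = 0.05, 0.5  # hypothetical mean cosine shift and its spread
for n in (200, 2000):
    # t-statistic grows with sqrt(n): same effect, more power with more cells
    t = effect / (sd / math.sqrt(n))
    print(n, round(t, 2))  # → 200 1.41, then 2000 4.47
```

With n = 200 the shift sits below conventional significance thresholds; with n = 2000 the identical shift is strongly significant, which is why the set of "significant" genes changes so much with max_ncells.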

ctheodoris changed discussion status to closed

Thank you so much for your response!
