Error adding deletions through inSilicoPerturber

#260
by AMCalejandro - opened

Hi,

As always, thank you so much for this amazing work.

I ran into two errors while trying to run deletions with InSilicoPerturber.

1. The first error occurs when I pass only one gene to perturb (I am using the latest version of Geneformer, after the recent pushes addressing single-gene issues).

Data and model

# Data and models
model_cellclassifier = '/home/jupyter/AMC_WD/HACKATHON_DATA_DIR/Geneformer/fine_tuned_models/geneformer-6L-30M_CellClassifier_cardiomyopathies_220224/'
#model_cellclassifier = '/home/jupyter/AMC_WD/HACKATHON_DATA_DIR/Geneformer/'

#cardio_data = '/home/jupyter/AMC_WD/HACKATHON_DATA_DIR/data/datasets_examples_finetuning/CELL_CLAS_DATA.dataset/'
cardio_data = '/home/jupyter/AMC_WD/HACKATHON_DATA_DIR/data/datasets_examples_finetuning/DISEASE_CLASSIFIER.dataset'

output_dir = '/home/jupyter/AMC_WD/INS_PERTURB/OUTPUT/'

output_prefix = 'out'

Params

# Model params
#ensembleList = ["ENSG00000145335"]
ensembleList = ["ENSG00000173175"]

#maxCell = 2000
#maxCell = None
maxCell = 100
bSize = 40
nproc = 8

#toFilter = None
#toFilter['cell_type'] = [cellType]
toFilter={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]}

# States to model
states_toModel = {'state_key': 'disease', 
              'start_state': 'dcm', 
              'goal_state': 'nf', 
              'alt_states': ['hcm']}


perturbType = "delete" # Deletion does not work
#perturbType = "overexpress" # It works, we will need to interpret results

Run in silico perturber

from geneformer import InSilicoPerturber

isp = InSilicoPerturber(perturb_type=perturbType,
                        genes_to_perturb=ensembleList,
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier", 
                        emb_mode="cell",
                        num_classes=3,
                        cell_emb_style="mean_pool",
                        filter_data=toFilter,
                        cell_states_to_model=states_toModel,
                        max_ncells= maxCell,
                        emb_layer=0,
                        forward_batch_size=bSize,
                        nproc=nproc)

# outputs intermediate files from in silico perturbation
isp.perturb_data(model_cellclassifier,
                 cardio_data,
                 output_dir,
                 output_prefix)

ERROR
```
Embeddings are not the same dimensions. original_emb is torch.Size([40, 2046, 256]). minibatch_emb is torch.Size([40, 2047, 256]).

RuntimeError Traceback (most recent call last)
Cell In[15], line 2
1 # outputs intermediate files from in silico perturbation
----> 2 isp.perturb_data(model_cellclassifier,
3 cardio_data,
4 output_dir,
5 output_prefix)

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:981, in InSilicoPerturber.perturb_data(self, model_directory, input_data_file, output_directory, output_prefix)
977 return example[state_name] in [start_state]
979 filtered_input_data = filtered_input_data.filter(filter_for_origin, num_proc=self.nproc)
--> 981 self.in_silico_perturb(model,
982 filtered_input_data,
983 layer_to_quant,
984 state_embs_dict,
985 output_directory,
986 output_prefix)

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:1059, in InSilicoPerturber.in_silico_perturb(self, model, filtered_input_data, layer_to_quant, state_embs_dict, output_directory, output_prefix)
1056 perturbation_batch = filtered_input_data.map(make_group_perturbation_batch, num_proc=self.nproc)
1057 indices_to_perturb = perturbation_batch["perturb_index"]
-> 1059 cos_sims_data = quant_cos_sims(model,
1060 self.perturb_type,
1061 perturbation_batch,
1062 self.forward_batch_size,
1063 layer_to_quant,
1064 filtered_input_data,
1065 self.tokens_to_perturb,
1066 indices_to_perturb,
1067 self.perturb_group,
1068 self.cell_states_to_model,
1069 state_embs_dict,
1070 self.pad_token_id,
1071 model_input_size,
1072 self.nproc)
1074 perturbed_genes = tuple(self.tokens_to_perturb)
1075 original_lengths = filtered_input_data["length"]

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:459, in quant_cos_sims(model, perturb_type, perturbation_batch, forward_batch_size, layer_to_quant, original_emb, tokens_to_perturb, indices_to_perturb, perturb_group, cell_states_to_model, state_embs_dict, pad_token_id, model_input_size, nproc)
454 cos_sims_vs_alt_dict[state] += cos_sim_shift(original_emb,
455 minibatch_emb,
456 state_embs_dict[state],
457 perturb_group)
458 elif perturb_group == True:
--> 459 cos_sims_vs_alt_dict[state] += cos_sim_shift(original_minibatch_emb,
460 minibatch_emb,
461 state_embs_dict[state],
462 perturb_group,
463 original_minibatch_lengths,
464 minibatch_lengths)
465 del outputs
466 del minibatch_emb

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:495, in cos_sim_shift(original_emb, minibatch_emb, end_emb, perturb_group, original_minibatch_lengths, minibatch_lengths)
489 if original_emb.size() != minibatch_emb.size():
490 logger.error(
491 f"Embeddings are not the same dimensions. "
492 f"original_emb is {original_emb.size()}. "
493 f"minibatch_emb is {minibatch_emb.size()}. "
494 )
--> 495 raise
496 if not perturb_group:
497 original_emb = torch.mean(original_emb,dim=1,keepdim=True)

RuntimeError: No active exception to reraise
```
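
A side note on the final `RuntimeError: No active exception to reraise`: that message does not describe the shape mismatch itself. It is what Python produces when a bare `raise` runs outside an `except` block, as on the `raise` line shown in the traceback. Below is a minimal, self-contained sketch of that behavior. `check_bare`, `check_descriptive`, and `error_of` are hypothetical helpers written for illustration (they are not Geneformer functions), and shapes are plain tuples rather than tensors.

```python
def check_bare(original_shape, minibatch_shape):
    """Mimics the pattern in the traceback: detect a mismatch, then a bare raise."""
    if original_shape != minibatch_shape:
        raise  # no exception is being handled, so Python raises RuntimeError


def check_descriptive(original_shape, minibatch_shape):
    """Alternative: raise an exception that carries the mismatch details."""
    if original_shape != minibatch_shape:
        raise ValueError(
            "Embeddings are not the same dimensions. "
            f"original_emb is {original_shape}. "
            f"minibatch_emb is {minibatch_shape}."
        )


def error_of(func, *args):
    """Run func and return (exception type name, message), or None if it passes."""
    try:
        func(*args)
    except Exception as err:
        return type(err).__name__, str(err)
    return None


orig, pert = (40, 2046, 256), (40, 2047, 256)  # shapes reported in the log line
bare_result = error_of(check_bare, orig, pert)
descriptive_result = error_of(check_descriptive, orig, pert)
print(bare_result)
print(descriptive_result)
```

So the useful signal in this report is the logged line above the traceback (2046 vs 2047 along the gene dimension); the `RuntimeError` is a side effect of how the check exits.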



**2. Also, if I pass two genes to the model, it fails somewhere else**

**PARAMS**

ensembleList = ["ENSG00000173175","ENSG00000145335"]

#maxCell = 2000
#maxCell = None
maxCell = 100
bSize = 40
nproc = 8

#toFilter = None
#toFilter['cell_type'] = [cellType]
toFilter={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]}

# States to model
states_toModel = {'state_key': 'disease',
                  'start_state': 'dcm',
                  'goal_state': 'nf',
                  'alt_states': ['hcm']}

perturbType = "delete" # Deletion does not work
#perturbType = "overexpress" # It works, we will need to interpret results


**ERROR**

TypeError Traceback (most recent call last)
Cell In[18], line 2
1 # outputs intermediate files from in silico perturbation
----> 2 isp.perturb_data(model_cellclassifier,
3 cardio_data,
4 output_dir,
5 output_prefix)

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:981, in InSilicoPerturber.perturb_data(self, model_directory, input_data_file, output_directory, output_prefix)
977 return example[state_name] in [start_state]
979 filtered_input_data = filtered_input_data.filter(filter_for_origin, num_proc=self.nproc)
--> 981 self.in_silico_perturb(model,
982 filtered_input_data,
983 layer_to_quant,
984 state_embs_dict,
985 output_directory,
986 output_prefix)

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:1059, in InSilicoPerturber.in_silico_perturb(self, model, filtered_input_data, layer_to_quant, state_embs_dict, output_directory, output_prefix)
1056 perturbation_batch = filtered_input_data.map(make_group_perturbation_batch, num_proc=self.nproc)
1057 indices_to_perturb = perturbation_batch["perturb_index"]
-> 1059 cos_sims_data = quant_cos_sims(model,
1060 self.perturb_type,
1061 perturbation_batch,
1062 self.forward_batch_size,
1063 layer_to_quant,
1064 filtered_input_data,
1065 self.tokens_to_perturb,
1066 indices_to_perturb,
1067 self.perturb_group,
1068 self.cell_states_to_model,
1069 state_embs_dict,
1070 self.pad_token_id,
1071 model_input_size,
1072 self.nproc)
1074 perturbed_genes = tuple(self.tokens_to_perturb)
1075 original_lengths = filtered_input_data["length"]

File ~/.local/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:426, in quant_cos_sims(model, perturb_type, perturbation_batch, forward_batch_size, layer_to_quant, original_emb, tokens_to_perturb, indices_to_perturb, perturb_group, cell_states_to_model, state_embs_dict, pad_token_id, model_input_size, nproc)
424 num_perturbed = len(tokens_to_perturb)
425 indices_to_perturb_minibatch = []
--> 426 end_range = [i for i in range(orig_max_len - tokens_to_perturb, orig_max_len)]
427 for idx in indices_to_perturb[i:i+max_range]:
428 if idx == [-100]:

TypeError: unsupported operand type(s) for -: 'int' and 'list'


When I used the latest version of the code to perturb a single gene with the fine-tuned model and dataset from the example, I hit the same error as point 1 described by @AMCalejandro. Can anyone help solve it?
The code is below:

ensems = ['ENSG00000183878']
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=ensems,
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=3,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
                        cell_states_to_model={'state_key': 'disease',
                                              'start_state': 'dcm',
                                              'goal_state': 'nf',
                                              'alt_states': ['hcm']},
                        max_ncells=2000,
                        emb_layer=0,
                        forward_batch_size=100,
                        nproc=16)

isp.perturb_data("./fine_tuned_models/geneformer-6L-30M_CellClassifier_cardiomyopathies_220224",
                 "./Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
                 "./output/",
                 "output_prefix")

Embeddings are not the same dimensions. original_emb is torch.Size([100, 2046, 256]). minibatch_emb is torch.Size([100, 2047, 256]).
RuntimeError Traceback (most recent call last)
Cell In[8], line 2
1 # outputs intermediate files from in silico perturbation
----> 2 isp.perturb_data("./fine_tuned_models/geneformer-6L-30M_CellClassifier_cardiomyopathies_220224",
3 "./Genecorpus-30M/example_input_files/cell_classification/disease_classification/human_dcm_hcm_nf.dataset",
4 "./output/in_silico_perturbation/human_dcm_hcm_nf/",
5 "cardiomyopathies_220224")

File ~/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/in_silico_perturber.py:983, in InSilicoPerturber.perturb_data(self, model_directory, input_data_file, output_directory, output_prefix)
979 return example[state_name] in [start_state]
981 filtered_input_data = filtered_input_data.filter(filter_for_origin, num_proc=self.nproc)
--> 983 self.in_silico_perturb(model,
984 filtered_input_data,
985 layer_to_quant,
986 state_embs_dict,
987 output_directory,
988 output_prefix)

File ~/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/in_silico_perturber.py:1061, in InSilicoPerturber.in_silico_perturb(self, model, filtered_input_data, layer_to_quant, state_embs_dict, output_directory, output_prefix)
1058 perturbation_batch = filtered_input_data.map(make_group_perturbation_batch, num_proc=self.nproc)
1059 indices_to_perturb = perturbation_batch["perturb_index"]
-> 1061 cos_sims_data = quant_cos_sims(model,
1062 self.perturb_type,
1063 perturbation_batch,
1064 self.forward_batch_size,
1065 layer_to_quant,
1066 filtered_input_data,
1067 self.tokens_to_perturb,
1068 indices_to_perturb,
1069 self.perturb_group,
1070 self.cell_states_to_model,
1071 state_embs_dict,
1072 self.pad_token_id,
1073 model_input_size,
1074 self.nproc)
1076 perturbed_genes = tuple(self.tokens_to_perturb)
1077 original_lengths = filtered_input_data["length"]

File ~/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/in_silico_perturber.py:461, in quant_cos_sims(model, perturb_type, perturbation_batch, forward_batch_size, layer_to_quant, original_emb, tokens_to_perturb, indices_to_perturb, perturb_group, cell_states_to_model, state_embs_dict, pad_token_id, model_input_size, nproc)
456 cos_sims_vs_alt_dict[state] += cos_sim_shift(original_emb,
457 minibatch_emb,
458 state_embs_dict[state],
459 perturb_group)
460 elif perturb_group == True:
--> 461 cos_sims_vs_alt_dict[state] += cos_sim_shift(original_minibatch_emb,
462 minibatch_emb,
463 state_embs_dict[state],
464 perturb_group,
465 original_minibatch_lengths,
466 minibatch_lengths)
467 del outputs
468 del minibatch_emb

File ~/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/in_silico_perturber.py:497, in cos_sim_shift(original_emb, minibatch_emb, end_emb, perturb_group, original_minibatch_lengths, minibatch_lengths)
491 if original_emb.size() != minibatch_emb.size():
492 logger.error(
493 f"Embeddings are not the same dimensions. "
494 f"original_emb is {original_emb.size()}. "
495 f"minibatch_emb is {minibatch_emb.size()}. "
496 )
--> 497 raise
498 if not perturb_group:
499 original_emb = torch.mean(original_emb,dim=1,keepdim=True)

RuntimeError: No active exception to reraise

When I print the sizes of original_emb and minibatch_emb for each minibatch, I get the following (in each pair of lines, the size of original_emb comes first, then the size of minibatch_emb):
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2047, 256])
torch.Size([100, 2046, 256])
torch.Size([100, 2047, 256])

With a batch size of 400, I get the same error.
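
Reading the printed sizes as (original_emb, minibatch_emb) pairs, the first 13 pairs match and only the 14th differs. A small sketch of how such a log could be scanned for the first mismatching minibatch is below. `first_mismatch` is a hypothetical helper, and the shapes are typed in by hand from the output above rather than produced by Geneformer.

```python
def first_mismatch(shape_pairs):
    """Return the index of the first pair whose two shapes differ, else None."""
    for index, (original_shape, minibatch_shape) in enumerate(shape_pairs):
        if original_shape != minibatch_shape:
            return index
    return None


matching = ((100, 2047, 256), (100, 2047, 256))
mismatching = ((100, 2046, 256), (100, 2047, 256))
logged_pairs = [matching] * 13 + [mismatching]  # transcribed from the printed sizes

bad_index = first_mismatch(logged_pairs)
print(bad_index)  # 0-based index of the failing minibatch
```

Knowing which minibatch fails narrows the search to the cells in that batch, which may help when comparing their lengths against the rest of the dataset.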

I'm not sure if you both are still having these issues but I've recently started working with the in silico perturber and I think I've solved both of the issues @AMCalejandro was having.

1: I don't think indices should be removed from the embedding batch when perturbing in delete mode, only in overexpress mode. Removing this line of code allowed delete to run, but not overexpress, so I added a condition so that it only runs when overexpressing. Code is below.

Line 435 in in_silico_perturber.py:
if perturb_type == 'overexpress':  # added this line, as the indices should only be removed in overexpression
    original_minibatch_emb = remove_indices_from_emb_batch(original_minibatch_emb,
                                                           indices_to_perturb_minibatch,
                                                           gene_dim=1)

2: I think this was just an error in the code? As the error message says, tokens_to_perturb is a list, and the length of this list has already been assigned to num_perturbed, so subtracting num_perturbed from orig_max_len instead allows the code to run:

Line 427 in in_silico_perturber.py:
end_range = [i for i in range(orig_max_len - num_perturbed, orig_max_len)]  # changed to fix the error: tokens_to_perturb replaced with num_perturbed
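
As a standalone illustration of the second fix, the sketch below shows why the original expression fails and what the replacement evaluates to. The token ids and the value of `orig_max_len` are made-up illustration values, not taken from the Geneformer token dictionary or from this run.

```python
orig_max_len = 2048              # assumed value, for illustration only
tokens_to_perturb = [101, 202]   # hypothetical token ids for two genes
num_perturbed = len(tokens_to_perturb)

# Original expression: int - list is not defined, so this raises TypeError.
try:
    end_range = list(range(orig_max_len - tokens_to_perturb, orig_max_len))
except TypeError as err:
    type_error_message = str(err)

# Proposed replacement: subtract the number of perturbed tokens instead.
end_range = list(range(orig_max_len - num_perturbed, orig_max_len))

print(type_error_message)
print(end_range)
```

With two tokens, the replacement yields the last two positions of the sequence, which is consistent with the variable name `end_range`.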

Let me know if these help - I think these enable the intended function of the code but please let me know if there are any unexpected consequences for you.

@tobyclark Thank you very much for your code. But after I tried this modification, my error message became:

"RuntimeError: The size of tensor a (2048) must match the size of tensor b (2047) at non-singleton dimension 1".

(It looks like this did not fix my issue.)

But many thanks for your advice.
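
For context, this new message is PyTorch's standard complaint when two tensors in an elementwise operation cannot be broadcast together: comparing dimensions from the last one backwards, each pair of sizes must be equal or one of them must be 1. The sketch below applies that rule to plain shape tuples. `find_broadcast_conflict` is a hypothetical helper, and the shapes are assumptions chosen to match the numbers in the message (2048 vs 2047 at dimension 1); the thread does not say which tensors were involved.

```python
from itertools import zip_longest


def find_broadcast_conflict(shape_a, shape_b):
    """Scan dimensions from last to first and return the index of the first
    incompatible one (indexed within the longer shape), or None if the shapes
    can be broadcast together.

    A pair of sizes is compatible when they are equal or either one is 1.
    Missing leading dimensions are treated as size 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for offset, (size_a, size_b) in enumerate(pairs):
        if size_a != size_b and size_a != 1 and size_b != 1:
            return ndim - 1 - offset
    return None


tensor_a = (100, 2048, 256)  # assumed shape, sequence length 2048
tensor_b = (100, 2047, 256)  # assumed shape, sequence length 2047
pooled = (100, 1, 256)       # e.g. a mean over the sequence dimension with keepdim

conflict = find_broadcast_conflict(tensor_a, tensor_b)
no_conflict = find_broadcast_conflict(pooled, tensor_b)
print(conflict, no_conflict)
```

In other words, the two embeddings being compared still differ by one position along the sequence dimension after the modification.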

Thank you for your interest in Geneformer and for your patience! We pushed an update that should resolve this issue. If you continue to face errors after pulling the updated code, please let us know by either reopening this discussion if it's the same error or opening a new discussion if it's a new error. Thank you!

ctheodoris changed discussion status to closed
