Problems with get_state_embs

#434
by ZYSK-huggingface - opened

Hi !
I encountered an error when I tried to run the following:
from geneformer import EmbExtractor

emb = EmbExtractor(
    model_type="Pretrained",
    num_classes=0,
    emb_mode="cls",
    filter_data={"cell_type": ["Endo", "Neuron"]},
    max_ncells=None,
    emb_layer=-1,
    emb_label=["cell_type"],
    # labels_to_plot=["disease", "cell_type"],
    forward_batch_size=64,
    nproc=24,
    summary_stat="exact_mean",  # or "exact_median"
)

state_embs_dict = emb.get_state_embs(
    cell_states_to_model={"state_key": "cell_type", "start_state": "Endo", "goal_state": "Neuron", "alt_states": []},
    model_directory="/home/Geneformer-2/gf-20L-95M-i4096/",
    input_data_file="/data/02_Datasets/cell_state_test.dataset",
    output_directory="/data/03_Results/export_embedding/endo+neuron/",
    output_prefix="get_embed_dict_endo+neuron",
    output_torch_embs=True,
)

ERROR:

Screenshot 2024-10-17 232330.png

Thank you so much !

Thank you for your question!

Could you please add the full error trace so we have the context of where the error occurs?

Could you please also print the set of cell_type labels that exist in your dataset?
set(dataset["cell_type"])

Also, does this occur with every dataset you attempt or is it specific to this one?
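
To illustrate the label check, here is a minimal sketch that compares the states requested in cell_states_to_model against the labels present in the dataset. The plain dict is a stand-in for the real tokenized dataset, and the helper name missing_states is hypothetical (it is not part of Geneformer); the "Astro" label is made up for the example.

```python
# Stand-in for the tokenized dataset (assumption for illustration).
# A Hugging Face Dataset also returns a column as a list via dataset["cell_type"].
dataset = {"cell_type": ["Endo", "Neuron", "Endo", "Astro"]}

cell_states_to_model = {
    "state_key": "cell_type",
    "start_state": "Endo",
    "goal_state": "Neuron",
    "alt_states": [],
}


def missing_states(dataset, states):
    """Return the requested states that are absent from the label column.

    Hypothetical helper, not a Geneformer function.
    """
    labels = set(dataset[states["state_key"]])
    requested = [states["start_state"], states["goal_state"], *states["alt_states"]]
    return [state for state in requested if state not in labels]


# The set of labels that exist in the dataset.
print(sorted(set(dataset["cell_type"])))

# An empty list means every requested state has at least one matching cell.
print(missing_states(dataset, cell_states_to_model))
```

If the second list is not empty, the listed states have no cells in the dataset, which would be worth ruling out before looking deeper into the error.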

Full error trace as follows:

image.png

And dataset structure:

image.png

Thank you so much for your patience !

Hi!
I think I have solved the problems above: the code runs smoothly after removing the 'emb_label' parameter. So what is the correct usage of this parameter?

Besides, I have another question. I began to run in silico perturbation with cell states specified and genes_to_perturb set to "all", and in the process I obtained a series of result files for different batches, like:

image.png

There are so many files that I am not sure whether I should pass them to the perturber stats module all together or one by one. I also wonder what the number of files depends on: the number of genes, or the number of cells? Is the in silico perturbation done gene by gene, or cell by cell? And how can I select a range of genes of interest? (The gene list option says it will perturb the genes in combination, not individually.)

Thank you for your question! emb_label is for labeling the output embeddings. Since you were trying to get the state_embs_dict, which has just one embedding per condition, the output isn't labeled in that way; that seems to be what was causing the error. Thank you for bringing this up so we are aware of this potential scenario.

For the in silico perturbation output files, these are just broken up based on the batch size to avoid memory constraints. When you use the perturber stats module, you can provide the parent directory so that it processes all the files together.

The batches are either oriented as genes for each cell (for the "all" mode) or cells (for perturbing a specific gene or list of genes), depending on the most efficient method to batch the operation.

For the "interested gene range", I'm not completely sure what you mean, but in the modes relevant to what I believe you are trying to do, InSilicoPerturberStats outputs a separate line for each gene, so you can run it on all the files and then pull out the genes of interest from the output csv.
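
Pulling the genes of interest out of the output csv could look roughly like the sketch below. It uses only the Python standard library and an in-memory stand-in for the stats file. The column names ("Gene_name", "Shift_to_goal_end"), the gene names, and the values are assumptions made up for illustration, so check them against the header of your actual output file.

```python
import csv
import io

# In-memory stand-in for the stats csv (made-up columns and values).
# For a real file, open it with open(path, newline="") instead.
stats_csv = io.StringIO(
    "Gene_name,Shift_to_goal_end\n"
    "GENE_A,0.012\n"
    "GENE_B,-0.004\n"
    "GENE_C,0.020\n"
)

# Hypothetical list of genes of interest.
genes_of_interest = {"GENE_A", "GENE_C"}


def select_genes(handle, genes, gene_column="Gene_name"):
    """Keep only the rows whose gene column value is in `genes`."""
    return [row for row in csv.DictReader(handle) if row[gene_column] in genes]


selected = select_genes(stats_csv, genes_of_interest)
for row in selected:
    print(row["Gene_name"], row["Shift_to_goal_end"])
```

Filtering after the stats step means the perturbation itself does not need to be rerun when the list of genes of interest changes.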

ctheodoris changed discussion status to closed
