Problems of get_state_embed

#434

by ZYSK-huggingface - opened Oct 17, 2024

Hi!
I encountered an error when I tried to run the following:
from geneformer import EmbExtractor

emb = EmbExtractor(
    model_type="Pretrained",
    num_classes=0,
    emb_mode="cls",
    filter_data={"cell_type": ["Endo", "Neuron"]},
    max_ncells=None,
    emb_layer=-1,
    emb_label=["cell_type"],
    # labels_to_plot=["disease", "cell_type"],
    forward_batch_size=64,
    nproc=24,
    summary_stat="exact_mean",  # "exact_median"
)
state_embs_dict = emb.get_state_embs(
    cell_states_to_model={"state_key": "cell_type", "start_state": "Endo", "goal_state": "Neuron", "alt_states": []},
    model_directory="/home/Geneformer-2/gf-20L-95M-i4096/",
    input_data_file="/data/02_Datasets/cell_state_test.dataset",
    output_directory="/data/03_Results/export_embedding/endo+neuron/",
    output_prefix="get_embed_dict_endo+neuron",
    output_torch_embs=True,
)

ERROR:

Thank you so much!

ctheodoris

Owner Oct 17, 2024

Thank you for your question!

Could you please add the full error trace so we have the context of where the error occurs?

Could you please also print the set of cell_type labels that exist in your dataset?
set(dataset["cell_type"])

Also, does this occur with every dataset you attempt or is it specific to this one?
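As a minimal sketch of the label check suggested above: the helper below compares the labels requested in filter_data and cell_states_to_model against the labels actually present. The function name check_labels is hypothetical (not part of Geneformer), and the labels list is a stand-in for dataset["cell_type"].

```python
def check_labels(dataset_labels, filter_data, cell_states_to_model):
    """Return requested labels that are absent from the dataset's label set."""
    present = set(dataset_labels)
    state_key = cell_states_to_model["state_key"]
    # Gather every label the run refers to for this state key.
    requested = set(filter_data.get(state_key, []))
    requested.add(cell_states_to_model["start_state"])
    requested.add(cell_states_to_model["goal_state"])
    requested.update(cell_states_to_model.get("alt_states", []))
    return sorted(requested - present)


# Stand-in for dataset["cell_type"]; real values come from the loaded dataset.
labels = ["Endo", "Neuron", "Endo", "Astro"]
states = {"state_key": "cell_type", "start_state": "Endo",
          "goal_state": "Neuron", "alt_states": []}
missing = check_labels(labels, {"cell_type": ["Endo", "Neuron"]}, states)
print("Missing labels:", missing)
```

An empty list means every requested state exists in the dataset; any name listed would point to a typo or a filtering mismatch.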

ZYSK-huggingface

Oct 18, 2024

Full error trace as follows:

And dataset structure:

Thank you so much for your patience!

ZYSK-huggingface

Oct 19, 2024

edited Oct 19, 2024

Hi!
I think I have solved the problem above: the code runs smoothly after removing the 'emb_label' parameter. So what is the correct usage of this parameter?

Besides, I have another question. I started running in silico perturbation with cell states specified and genes_to_perturb set to "all", and in the process I obtained a series of result files for different batches, like:

There are so many files that I am not sure whether I should pass them to perturber_stats together or one by one. I also wonder what the number of files depends on: the number of genes, or the number of cells? Is the in silico perturbation done gene by gene, or cell by cell? And how can I select a gene range of interest? (The gene list option says it perturbs the genes in combination, not individually.)

ctheodoris

Owner Oct 23, 2024

Thank you for your question! For emb_label, this is for labeling output embeddings. Since you were trying to get the state_embs_dict that just has 1 emb per condition, it isn't really labeled in that way - it seems that is what was causing the error. Thank you for bringing this up so we are aware of this potential scenario.

For the in silico perturbation output files, these are just broken up based on the batch size to avoid memory constraints. When you use the perturber stats module, you can provide the parent directory so that it processes all the files together.
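Before handing the parent directory to the stats module, it can help to see how many batch files it actually holds. The sketch below uses only the standard library; the helper name list_batch_files and the ".pickle" suffix are assumptions for illustration, so adjust the suffix to whatever your output files use.

```python
import tempfile
from pathlib import Path


def list_batch_files(parent_dir, suffix=".pickle"):
    """Return the sorted files in parent_dir whose names end with suffix."""
    parent = Path(parent_dir)
    return sorted(
        p for p in parent.iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"example_batch{i}.pickle").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: wrong suffix
    batch_names = [p.name for p in list_batch_files(tmp)]

print(len(batch_names), "batch files found")
```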

The batches are either oriented as genes for each cell (for the "all" mode) or cells (for perturbing a specific gene or list of genes), depending on the most efficient method to batch the operation.

For the "interested gene range", I'm not completely sure what you mean, but for the modes relevant to what I believe you are trying to do, InSilicoPerturberStats outputs a separate line for each gene, so you can just run it on all the files and then pull out the genes of interest from the output csv.
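Pulling genes of interest out of the output csv could look roughly like the sketch below. The column names ("Gene_name", "Shift_to_goal_end") are assumptions about the csv layout, and the rows are made-up placeholders rather than real results; check the header of your own file and pass the matching gene_column.

```python
import csv
import io


def select_genes(csv_file, genes_of_interest, gene_column="Gene_name"):
    """Return the csv rows whose gene_column value is in genes_of_interest."""
    wanted = set(genes_of_interest)
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[gene_column] in wanted]


# Placeholder csv content; a real run would open the stats output file instead.
example = io.StringIO(
    "Gene_name,Shift_to_goal_end\n"
    "GATA4,0.02\n"
    "TBX5,0.01\n"
    "NKX2-5,-0.03\n"
)
rows = select_genes(example, ["GATA4", "NKX2-5"])
for row in rows:
    print(row["Gene_name"], row["Shift_to_goal_end"])
```

With a file on disk, the same function can be called as select_genes(open(path, newline=""), genes), where path is the location of your own output csv.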

ctheodoris changed discussion status to closed Oct 23, 2024
