Issue in Embedding Extraction
@ctheodoris
I found a minor issue in the 'emb_extractor.py' file, in this line:
embs_df[0 : emb_dims - 1].mean(axis="rows")
Here, the snippet takes the mean of only the first 511 (emb_dims - 1) cells rather than of all the cells requested via the argument "max_ncells" = 700 (see the configuration below).
"embs_df" is a DataFrame, and as I understand it the intent was to select the columns [0 : emb_dims - 1]. However, this slice selects rows, so the mean is always taken over only 511 cells, no matter how many are passed via max_ncells.
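To make the row-versus-column behaviour concrete, here is a minimal sketch with a toy DataFrame. The sizes (6 cells, 4 embedding dimensions) are assumed stand-ins for illustration, not the values Geneformer uses.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumed): 6 cells (rows) and 4 embedding dimensions (columns).
n_cells, emb_dims = 6, 4
embs_df = pd.DataFrame(
    np.arange(n_cells * emb_dims, dtype=float).reshape(n_cells, emb_dims)
)

# An integer slice inside [] selects ROWS by position, not columns.
row_slice = embs_df[0 : emb_dims - 1]
print(row_slice.shape)  # (3, 4): only emb_dims - 1 cells are kept

# The mean is therefore computed over 3 cells instead of all 6.
mean_of_slice = row_slice.mean(axis=0)  # axis=0 averages down the rows
mean_of_all = embs_df.mean(axis=0)
print(mean_of_slice.iloc[0], mean_of_all.iloc[0])  # 4.0 10.0
```

With the real sizes, the same slice would keep 511 of the 700 cells.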
I suggest changing this line so that it first selects the embedding columns:
embs_df = embs_df.loc[:, embs_df.columns[0:emb_dims]]
instead of just embs_df[0 : emb_dims - 1].mean(axis="rows") as in the linked line. The mean is then taken over columns 0, 1, ..., 511 across all max_ncells cells, producing a single 512-dimensional vector that summarizes every cell passed.
In addition, the original line throws an error, "raise TypeError(f"Could not convert {x} to numeric")", which the change above also fixes.
The configuration I used is:
get_embeds = EmbExtractor(
    model_type="CellClassifier",
    num_classes=47,
    max_ncells=700,
    emb_layer=-1,
    emb_mode="cls",  # "cell" or "gene"
    summary_stat="exact_mean",  # returns mean of all cell embeddings
    emb_label=["cl46sub_malig (47)"],
    labels_to_plot=["cl46sub_malig (47)"],
    forward_batch_size=4,
    nproc=4
)
The same applies to summary_stat="exact_median".
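As a rough illustration of the column-wise summary I have in mind, here is a sketch on toy data. The "label" column is a hypothetical stand-in for a non-numeric metadata column; my guess (an assumption, not something I confirmed in the source) is that a column like this is what triggers the TypeError when the mean is taken over the whole DataFrame.

```python
import numpy as np
import pandas as pd

emb_dims = 4  # toy value (assumed); the model in this thread uses 512
rng = np.random.default_rng(0)

# 7 cells x emb_dims embedding columns, plus a hypothetical text column.
embs_df = pd.DataFrame(rng.normal(size=(7, emb_dims)))
embs_df["label"] = ["a", "b", "a", "b", "a", "b", "a"]

# Keep only the first emb_dims columns, i.e. the numeric embedding dimensions.
emb_only = embs_df.loc[:, embs_df.columns[0:emb_dims]]

# Summarize across all cells (rows): one value per embedding dimension.
exact_mean = emb_only.mean(axis=0)
exact_median = emb_only.median(axis=0)
print(len(exact_mean), len(exact_median))  # 4 4
```

Both summaries have one entry per embedding dimension and use every cell in the DataFrame.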
https://huggingface.co/ctheodoris/Geneformer/blob/69e6887bd55003007f3294e2141e648dc9fd286d/geneformer/emb_extractor.py#L651
Could you please clarify this, or am I missing something? @ctheodoris
Thank you for noticing this and for bringing it to our attention! A fix has been pushed.
@ctheodoris
Thank you for your quick response.
However, because 1 is subtracted in
embs_df.iloc[:, 0 : emb_dims - 1],
the last dimension is excluded from the mean/median computation: only the first 511 dimensions are used, but the model's embedding size is 512.
The "- 1" should not be there.
I cross-verified this in the output.
It should be written without the "- 1":
embs_df.iloc[:, 0 : emb_dims]
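A quick check of the off-by-one, sketched with an assumed toy size of emb_dims = 4 rather than 512. Python slices exclude the end index, so 0 : emb_dims already stops at the last column.

```python
import numpy as np
import pandas as pd

emb_dims = 4  # toy value (assumed); the model in this thread uses 512
embs_df = pd.DataFrame(np.ones((5, emb_dims)))

# The slice end is exclusive, so subtracting 1 drops the last dimension.
short = embs_df.iloc[:, 0 : emb_dims - 1]
full = embs_df.iloc[:, 0 : emb_dims]
print(short.shape[1], full.shape[1])  # 3 4

# Only the full slice gives a summary vector with emb_dims entries.
summary = full.mean(axis=0)
print(len(summary))  # 4
```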
Thank you! This has been updated.
Thank you so much