Gene embedding

#27
by Leonskra - opened

Hello, thanks for your fantastic work. I am trying to quantify the impact of common batch-dependent technical artifacts on Geneformer gene embeddings, but I could not find where to obtain the gene embeddings. Could you please give some advice?

Many Thanks

Thank you for your interest in Geneformer. Gene embeddings can be extracted from the model as with any transformers model: set the model to evaluation mode and do a forward pass with output_hidden_states=True to obtain the hidden states of every layer in the outputs.

Please see this relevant page from Huggingface transformers documentation:
https://huggingface.co/docs/transformers/main_classes/output
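
As a rough illustration of the forward pass described above, here is a minimal sketch. It assumes a BERT-style masked language model and uses a tiny, randomly initialized BertForMaskedLM as a stand-in so it runs without downloading weights; in practice you would presumably load the pretrained model with from_pretrained instead. The config values and the random input_ids are hypothetical placeholders, not Geneformer's real settings or tokenized cells.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Assumption: a tiny random BERT stands in for the pretrained model.
# These sizes are illustrative only, not Geneformer's actual configuration.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)
model.eval()  # evaluation mode: disables dropout

# Hypothetical tokenized cells: each row is a sequence of gene token ids.
input_ids = torch.randint(1, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():  # no gradients needed for embedding extraction
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )

# Tuple with the embedding-layer output first, then one tensor per
# transformer layer, each of shape (batch, sequence_length, hidden_size).
hidden_states = outputs.hidden_states
last_layer = hidden_states[-1]
second_to_last = hidden_states[-2]
```

Each position along the sequence dimension corresponds to one gene token in the input, so last_layer[i, j] is the contextual embedding of the j-th gene in the i-th cell.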

In the methods of the manuscript, we state which layer's embeddings we used for each analysis. Which embeddings to extract depends on the question at hand. Generally speaking, the last layer embeddings are the most specific to the predictions optimized by the learning objective, so if your question is closely aligned with that objective, you could use the last layer embeddings; if not, the second-to-last layer may be a better choice. There are also methods that concatenate or pool embeddings from multiple layers. You can explore which option works best for your specific application.
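
The layer choices mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the hidden states are random NumPy arrays standing in for real model outputs (for example, each tensor converted with .numpy()), and averaging is just one possible way to pool across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, batch, seq_len, hidden = 4, 2, 16, 32  # hypothetical sizes

# Stand-in for the model's hidden states: index 0 is the embedding-layer
# output, indices 1..num_layers are the transformer layers.
hidden_states = tuple(
    rng.normal(size=(batch, seq_len, hidden)) for _ in range(num_layers + 1)
)

# Option 1: last layer (closest to the pretraining objective).
last_layer = hidden_states[-1]

# Option 2: second-to-last layer (often more general).
second_to_last = hidden_states[-2]

# Option 3: concatenate the last two layers along the feature dimension.
concat_last_two = np.concatenate(
    [hidden_states[-2], hidden_states[-1]], axis=-1
)

# Option 4: pool the last two layers by averaging them element-wise.
mean_last_two = np.mean(np.stack(hidden_states[-2:]), axis=0)
```

Concatenation doubles the embedding width, while averaging keeps it unchanged, which may matter for downstream memory use and for comparing embeddings across settings.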

Update:
We have now added a function to extract and plot cell embeddings (a function for extracting gene embeddings will also be added in the future). Please see the example here:
https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/extract_and_plot_cell_embeddings.ipynb

ctheodoris changed discussion status to closed
