Seeking Geneformer-related help

#289
by rongguangda - opened

Dear Christina V. Theodoris:

My name is Guangda Rong. I am a Ph.D. student at Nanfang Hospital, currently working on deep learning models in oncology.

I recently had the opportunity to read your paper "Transfer learning enables predictions in network biology" published in Nature, and I was impressed by the depth of your research and the insights presented. Your work was particularly relevant to my current research endeavours.

As I delved into the details of your paper, I came across a few points where I had some questions and was hoping to seek clarification. I believe that your expertise would be of great help to me in gaining a better understanding. Here are the specific questions I have:

I have transcriptomic data (bulk RNAseq) and clinical data, and I would like to use them to predict gene dosage sensitivity. However, I don't know how to convert these data into the input file expected by the notebook "gene_classification.ipynb", which is shared on Hugging Face (https://huggingface.co/ctheodoris/Geneformer/tree/main/examples). Could you please tell me how to do this? The file formats of the transcriptomic and clinical data are shown in the attachments.

I understand that you have a busy schedule, but any insight or guidance you could provide would be invaluable to me. I appreciate your time and consideration.

Thank you for your contributions to the field and I look forward to hearing from you.

Yours sincerely,

Guangda Rong
Nanfang Hospital, Southern Medical University, Guangzhou, China.

clinical_data.png

transcriptomic_data.png

Thank you for your interest in Geneformer!

The input of Geneformer is single-cell RNAseq data. Please see here for more information on the input format and how to convert it to a tokenized dataset for use with the model:
https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/tokenizing_scRNAseq_data.ipynb

We have not tested using bulk RNAseq data but this question has arisen previously - please check the closed discussions to read about some potential approaches to using bulk RNAseq as an input format.

The gene_classification notebook is an example for fine-tuning the model towards gene classification. You can proceed with the notebook as is to obtain a fine-tuned model for this purpose, and then you can apply that fine-tuned model to make predictions about gene dosage sensitivity of particular genes in a given single cell presented to the model.

Regarding clinical data, you may consider incorporating it as a cell state by fine-tuning a relevant cell classifier and then using in silico perturbation to identify genes that are important for that cell state (e.g., a particular disease). Please see the example for hyperparameter tuning if you are interested in this type of downstream task. This would be analogous to our approach to identifying candidate therapeutic targets for cardiomyopathy as presented in the manuscript.

ctheodoris changed discussion status to closed

Dear Christina V. Theodoris,

I hope this email finds you well. I am writing to express my sincere gratitude for your previous detailed responses to my enquiries regarding Geneformer. Your insights have been immensely valuable and have contributed greatly to my understanding of the subject.

Since then, my research has deepened and new questions have arisen. I used a dataset from Genecorpus-30M (https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/bivalent_promoters/panglao_SRA553822-SRS2119548.dataset) as the input file for the notebook "gene_classification.ipynb". When y_score was calculated in the function "classifier_predict", the results contained NaN values. I ran the code many times, but the ROC curve and the confusion matrix were displayed in the console only once. Could you please tell me how to solve this problem?

Your expertise has been instrumental in shaping the direction of my research and I really appreciate your willingness to share your knowledge.

Thank you again for your time and help. I look forward to the opportunity to continue our fruitful collaboration.

Yours sincerely,
Guangda Rong
Nanfang Hospital, Southern Medical University, Guangzhou, China.

Thank you for following up! If you'd like to test fine-tuning Geneformer to distinguish dosage sensitive transcription factors as in the example notebook, please use the data here: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/dosage_sensitive_tfs

The gene set labels are in that directory. The input data, as described in the README.md, are 10k random cells from Genecorpus-30M, which is here: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/genecorpus_30M_2048.dataset

You can select 10k random cells by first shuffling the dataset and then using the .select method to select 10k cells. See the Hugging Face Datasets API documentation here: https://huggingface.co/docs/datasets/process
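The shuffle-then-select step described above could look roughly like the following. This is a minimal sketch rather than the exact code used to build the example input: the helper names select_random_cells and subsample_from_disk, the seed value, and the local dataset path are illustrative assumptions. Only Dataset.shuffle, Dataset.select, and load_from_disk come from the Hugging Face Datasets API.

```python
def select_random_cells(dataset, n_cells=10_000, seed=42):
    """Return n_cells randomly chosen rows (cells) from a Hugging Face Dataset.

    Hypothetical helper: the name and the default seed are illustrative.
    """
    # Never ask for more rows than the dataset actually contains.
    n_cells = min(n_cells, len(dataset))
    # Shuffle with a fixed seed so the same cells are picked on every run.
    shuffled = dataset.shuffle(seed=seed)
    # Keep the first n_cells rows of the shuffled dataset.
    return shuffled.select(range(n_cells))


def subsample_from_disk(dataset_path, n_cells=10_000, seed=42):
    """Load a dataset saved on disk and subsample it.

    dataset_path is assumed to point to a local copy of a dataset directory
    such as genecorpus_30M_2048.dataset.
    """
    # Imported here so select_random_cells itself does not require the library.
    from datasets import load_from_disk

    dataset = load_from_disk(dataset_path)
    return select_random_cells(dataset, n_cells=n_cells, seed=seed)
```

For example, assuming the dataset directory has been downloaded locally, subsample_from_disk("genecorpus_30M_2048.dataset") would return a 10k-cell subset, which can then be written out with its save_to_disk method for use as the notebook input. Fixing the seed makes the selection reproducible across runs.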
