Example code for other fine-tuning applications (e.g. gene-level experiments, gene/drug perturbation)?
Thanks again for the great work - I was wondering whether you have plans to release code for other downstream tasks for Geneformer?
For example, I am very interested in gene-level tasks such as your gene network centrality, chromatin dynamics, TF target identification, or dosage sensitivity tasks.
I'm also thinking of applying your model to drug and gene perturbation experiments, and I was wondering whether that would be feasible. One idea would be to train the model by 1) prepending the perturbed gene or drug tokens to the input, before the gene tokens that represent the single cell being perturbed, and 2) applying some type of "reconstruction error" between the predicted gene tokens at the output and the actual rank value encoding of the perturbed cell's expression.
Thanks again for your advice!
-Patrick
Thank you for your interest in Geneformer!
Example code for the gene classification tasks is here: https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/gene_classification.ipynb
There is also an open PR that we are currently testing that provides a modular approach to this. Please note that although the hyperparameter optimization example is written as a cell classification application, we also recommend hyperparameter optimization for gene classification applications (as with all deep learning models, hyperparameter optimization can have a large impact on how well the model is trained).
Please also see the dataset repository for example gene classification labels/datasets:
https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification
For your question about drugs/gene perturbations:
For gene perturbations, you could directly encode the perturbation by moving the given gene up or down within the rank value encoding (as we do with the in silico perturbation code). Alternatively, you could take a prompt-based approach, as long as you use a special token to clearly mark the separation between the prompt and the encoded transcriptome.
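To make the first option concrete, here is a minimal sketch of moving a gene within a rank value encoding. It treats the encoding as a plain Python list of token ids ordered from highest to lowest rank; the function name `shift_gene_rank` and the token ids are hypothetical and are not part of the Geneformer code base, so please treat this as an illustration rather than the in silico perturbation implementation itself.

```python
from typing import List


def shift_gene_rank(encoding: List[int], gene_token: int, new_rank: int) -> List[int]:
    """Return a copy of `encoding` with `gene_token` moved to index `new_rank`.

    `encoding` is assumed to be ordered from highest rank (index 0) to lowest.
    `new_rank` is clamped to the valid range of positions.
    """
    if gene_token not in encoding:
        raise ValueError(f"token {gene_token} is not present in this cell's encoding")
    # Remove the gene, then re-insert it at the requested position.
    rest = [tok for tok in encoding if tok != gene_token]
    new_rank = max(0, min(new_rank, len(rest)))
    return rest[:new_rank] + [gene_token] + rest[new_rank:]


# Toy cell with five hypothetical gene token ids, highest rank first.
cell = [17, 42, 8, 99, 23]

# Move gene 99 up to the top rank (an activation-like perturbation).
moved_up = shift_gene_rank(cell, 99, 0)

# Move gene 42 down to the bottom rank (an inhibition-like perturbation).
moved_down = shift_gene_rank(cell, 42, len(cell) - 1)

print(moved_up)    # [99, 17, 42, 8, 23]
print(moved_down)  # [17, 8, 99, 23, 42]
```

The original list is left unchanged, so the unperturbed and perturbed encodings can be passed through the model side by side and compared.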
For the drug perturbations, the approach you outlined would be great to try since the drugs would be a new vocabulary that is not already represented in the rank value encoding of the transcriptome.
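As a rough sketch of what that input construction could look like: the snippet below extends a toy gene vocabulary with drug tokens and a separator token, then places the drug prompt and separator in front of the cell's rank value encoding. Everything here is an assumption made for illustration, including the vocabulary contents, the `<sep>` token name, the `drug:` prefix, the helper names, and the `max_len` default; none of it is taken from the Geneformer repository.

```python
from typing import Dict, List


def extend_vocab(gene_vocab: Dict[str, int], new_symbols: List[str]) -> Dict[str, int]:
    """Return a copy of `gene_vocab` with `new_symbols` given fresh token ids."""
    vocab = dict(gene_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for symbol in new_symbols:
        if symbol not in vocab:
            vocab[symbol] = next_id
            next_id += 1
    return vocab


def build_prompted_input(
    vocab: Dict[str, int],
    drug_names: List[str],
    cell_encoding: List[int],
    sep_token: str = "<sep>",
    max_len: int = 2048,  # assumed model input size; adjust to your model
) -> List[int]:
    """Build [drug tokens] + [separator] + [rank value encoding], truncated to max_len."""
    prompt = [vocab[name] for name in drug_names] + [vocab[sep_token]]
    room = max(0, max_len - len(prompt))
    # Truncate from the low-rank end so the highest-ranked genes are kept.
    return prompt + cell_encoding[:room]


# Toy vocabulary: special tokens plus three placeholder gene ids.
gene_vocab = {"<pad>": 0, "<mask>": 1, "ENSG_A": 2, "ENSG_B": 3, "ENSG_C": 4}
vocab = extend_vocab(gene_vocab, ["<sep>", "drug:dexamethasone"])

cell = [vocab["ENSG_B"], vocab["ENSG_A"], vocab["ENSG_C"]]  # highest rank first
tokens = build_prompted_input(vocab, ["drug:dexamethasone"], cell)
print(tokens)  # [6, 5, 3, 2, 4]
```

Because the new ids fall outside the pretrained vocabulary, the model's embedding matrix would also need to grow before fine-tuning; with Hugging Face transformers models this is typically done with `model.resize_token_embeddings(len(vocab))`, and the new embeddings start out untrained.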