Some questions about the gene type prediction task
Hi, congrats on your great work! I took a look at the supplementary table for the gene type prediction task and found the second dataset a little ambiguous. I cannot find that dataset (15K embryonic stem cells (ESCs)29) in PanglaoDB. Could you please provide more information? Thanks a lot.
Thank you for your interest in Geneformer. The dataset used for fine-tuning the model to distinguish bivalent promoters is from PanglaoDB, accession SRA553822-SRS2119548. In the example_input_files directory, we added the labels for the genes in the 56 highly conserved regions reported in Bernstein et al. 2006.
15K refers to the number of cells. [Update: we have stored the embryonic stem cell .dataset in the dataset repository: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files]
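For anyone trying to locate these files, here is a minimal sketch that lists what is stored under example_input_files in the dataset repository. It assumes the `huggingface_hub` package is installed and the repository is publicly readable. The helper names `files_under` and `list_example_inputs` are made up for illustration and are not part of Geneformer.

```python
from typing import Iterable, List

REPO_ID = "ctheodoris/Genecorpus-30M"  # taken from the URL in the reply above
SUBDIR = "example_input_files"         # directory named in the reply above


def files_under(paths: Iterable[str], subdir: str) -> List[str]:
    """Return the repository paths that live under `subdir`, sorted."""
    prefix = subdir.rstrip("/") + "/"
    return sorted(p for p in paths if p.startswith(prefix))


def list_example_inputs() -> List[str]:
    """Hypothetical helper: list the files under SUBDIR in the dataset repo."""
    # Imported lazily so files_under() works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return files_under(list_repo_files(REPO_ID, repo_type="dataset"), SUBDIR)


if __name__ == "__main__":
    # Requires network access to the Hugging Face Hub.
    for path in list_example_inputs():
        print(path)
```

If the embryonic stem cell .dataset entry turns out to be a directory written by `datasets.Dataset.save_to_disk` (an assumption, not confirmed in this thread), it could be downloaded and then opened with `datasets.load_from_disk`.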