Need a Gene Name Dictionary for Mouse Gene Names

#366
by bbutaney - opened

I am trying to run a pretrained Geneformer model on mouse scRNA-seq data in a zero-shot setting. The only gene name dictionary I can find in the repo is for human genes, which follow a different naming convention than mouse genes. As a result, I get "ValueError: Only 0.00% genes in the dataset are in the vocabulary!" when running data.InputData(adata_dataset_path = in_dataset_path).preprocess_data(gene_col = gene_col, model_type = "geneformer", save_ext = "loom", gene_name_id_dict = geneform.gene_name_id, preprocessed_path = preprocessed_path). Is there another gene name dictionary I can use with Geneformer for mouse gene names? Thank you!

bbutaney changed discussion title from Pretrained model on mouse genes to Need a Gene Name Dictionary for Mouse Gene Names

Thank you for your question! If you'd like to map the genes to orthologs and take advantage of the pretraining on the human genes, you can label your loom or h5ad file with the orthologous human Ensembl IDs. If you'd like to train the model further on the mouse genes, you can add new tokens to the dictionary that correspond to the mouse genes (either all mouse genes or only the ones without orthologs).
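
As a rough illustration of the two options, here is a minimal sketch in plain Python. It is not code from the Geneformer repo: the function names `map_to_human_orthologs` and `extend_token_dict`, the small ortholog table, and the toy token dictionary are all hypothetical, and the IDs are illustrative only. In practice the ortholog table would come from a resource such as Ensembl BioMart, and the token dictionary would be the one shipped with the model.

```python
def map_to_human_orthologs(mouse_genes, ortholog_table):
    """Option 1: relabel mouse genes with orthologous human Ensembl IDs.

    mouse_genes    : list of mouse gene names, in dataset order
    ortholog_table : dict of mouse gene name -> human Ensembl ID (assumed input)

    Returns (kept_indices, human_ids). Genes without an ortholog are dropped,
    and so are genes whose human ID is shared by several mouse genes, so the
    result contains only unambiguous one-to-one mappings.
    """
    hits = {}
    for idx, gene in enumerate(mouse_genes):
        human_id = ortholog_table.get(gene)
        if human_id is not None:
            hits.setdefault(human_id, []).append(idx)
    kept = sorted(idxs[0] for idxs in hits.values() if len(idxs) == 1)
    return kept, [ortholog_table[mouse_genes[i]] for i in kept]


def extend_token_dict(token_dict, new_gene_ids):
    """Option 2: add new gene tokens after the largest existing token id.

    Returns a new dict and leaves the original untouched. IDs that are
    already present keep their existing token.
    """
    extended = dict(token_dict)
    next_id = max(extended.values(), default=-1) + 1
    for gene_id in new_gene_ids:
        if gene_id not in extended:
            extended[gene_id] = next_id
            next_id += 1
    return extended


# Illustrative inputs only; verify real IDs against Ensembl before use.
orthologs = {
    "Trp53": "ENSG00000141510",
    "Gapdh": "ENSG00000111640",
    "Actb": "ENSG00000075624",
}
mouse_genes = ["Trp53", "Gm12345", "Gapdh"]  # "Gm12345" has no entry in the table
kept, human_ids = map_to_human_orthologs(mouse_genes, orthologs)

toy_tokens = {"<pad>": 0, "<mask>": 1, "ENSG00000141510": 2}  # toy dictionary
extended = extend_token_dict(toy_tokens, ["ENSMUSG00000059552"])
```

For the first option, `kept` can be used to subset the genes of the expression matrix, and `human_ids` would then be stored as the gene ID annotation that the preprocessing step reads (an `ensembl_id` attribute, as far as I know, for Geneformer's own tokenizer). For the second option, the added tokens have no pretrained embeddings, so the model's embedding matrix would need to be resized and the model trained further before those genes carry any signal.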

ctheodoris changed discussion status to closed
