license: cc-by-nc-nd-4.0
language:
- en
base_model: EleutherAI/pythia-410m
library_name: transformers
tags:
- biology
- scRNAseq
Overview
This is the C2S-Pythia-410m-diverse-single-and-multi-cell-tasks model. It is based on the Pythia-410m architecture developed by EleutherAI and was fine-tuned with Cell2Sentence (C2S) on a wide array of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. Cell2Sentence adapts large language models (LLMs) to single-cell biology by converting scRNA-seq data into "cell sentences": sequences of gene names ordered by expression level. The model has been trained on a broad range of single-cell and multi-cell tasks, listed under Tasks.
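To make the idea of a cell sentence concrete, here is a minimal sketch of the conversion for one cell. It assumes genes are ranked by descending expression, that undetected genes are dropped, and that names are joined with spaces; the function name `cell_to_sentence`, the alphabetical tie-breaking, and the toy counts are illustrative assumptions, not the Cell2Sentence library's own preprocessing (which may also normalize counts first).

```python
def cell_to_sentence(expression, top_k=None):
    """Return gene names ordered by descending expression, skipping zeros.

    `expression` maps gene name -> expression value for a single cell.
    This is an illustrative helper, not part of the cell2sentence package.
    """
    # Keep only genes that were actually detected in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending value; ties are broken alphabetically (an assumption)
    # so the output is deterministic.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in expressed]
    # Optionally keep only the top_k highest-ranked genes.
    if top_k is not None:
        genes = genes[:top_k]
    return " ".join(genes)


# Toy counts for one cell (made-up values for illustration only).
cell = {"MALAT1": 120, "CD3E": 8, "ACTB": 45, "GAPDH": 0, "B2M": 45}
print(cell_to_sentence(cell))     # MALAT1 ACTB B2M CD3E
print(cell_to_sentence(cell, 2))  # MALAT1 ACTB
```

Because the sentence is rank-ordered, truncating it to the first N names keeps the N most highly expressed genes.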
Training Data
This model was trained on over 57 million human and mouse cells drawn from more than 800 scRNA-seq datasets in CellxGene and the Human Cell Atlas. The combined data covers a broad range of cell types and conditions across multiple tissues in both species.
This model was trained with a variable number of genes per cell sentence, with a maximum context length of 8192 tokens. The context length of the default Pythia model was extended using rotary positional embeddings prior to C2S training.
- Cells: For multi-cell samples, each training sample contained between 5 and 20 cells, and every cell in a given sample had the same number of genes.
- Genes: For single-cell samples, each cell sentence contained between 100 and 2048 genes. For multi-cell samples, the cell sentence for each cell contained between 100 and 400 genes.
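The multi-cell constraints above can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the helper name `build_multi_cell_sample`, uniform sampling of the cell and gene counts, and the placeholder gene names are all hypothetical, and the actual training pipeline may select cells and gene counts differently.

```python
import random


def build_multi_cell_sample(cell_sentences, rng, min_cells=5, max_cells=20,
                            min_genes=100, max_genes=400):
    """Pick a group of cells and truncate each one to a shared gene count.

    `cell_sentences` is a list of ranked gene-name lists, one per cell.
    """
    if len(cell_sentences) < min_cells:
        raise ValueError("need at least min_cells cells to build a sample")
    # Number of cells in this sample (5 to 20 by default).
    n_cells = rng.randint(min_cells, min(max_cells, len(cell_sentences)))
    # One gene count shared by every cell in the sample (100 to 400 by default).
    n_genes = rng.randint(min_genes, max_genes)
    chosen = rng.sample(cell_sentences, n_cells)
    # Cell sentences are rank-ordered, so slicing keeps the top-ranked genes.
    return [sentence[:n_genes] for sentence in chosen]


# Toy input: 30 cells, each a ranked list of 500 placeholder gene names.
cells = [[f"GENE{c}_{g}" for g in range(500)] for c in range(30)]
sample = build_multi_cell_sample(cells, random.Random(0))
print(len(sample), len(sample[0]))
```

Every cell in the returned sample has the same length, matching the constraint that cells within one sample share a gene count.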
Tasks
This model is designed for the following tasks:
Single-Cell Tasks
- Unconditional single-cell generation: Generate single cell sentences unconditionally.
- Cell type prediction: Predict the cell type of a given single cell.
- Cell type-conditioned generation: Generate a single cell sentence conditioned on a specific cell type.
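As a rough sketch of how cell type-conditioned generation might be run with the `transformers` library, the example below loads a causal language model and samples a continuation. The prompt template in `build_prompt` is hypothetical and is not the format used during C2S training; consult the Cell2Sentence repository for the actual prompts. `model_id` is left as a parameter and should be set to this model's Hub repository id; the sampling settings are arbitrary choices.

```python
def build_prompt(cell_type):
    """Hypothetical prompt template; NOT the format used in C2S training."""
    return f"Generate a cell sentence for a cell of type: {cell_type}.\nCell sentence:"


def generate_cell_sentence(cell_type, model_id, max_new_tokens=256):
    """Sample one cell sentence for `cell_type` from the model at `model_id`."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(cell_type), return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,   # sampling settings here are illustrative assumptions
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


print(build_prompt("T cell"))
```

Calling `generate_cell_sentence("T cell", model_id)` downloads the weights on first use, so it is left out of the snippet's top-level code.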
Multi-Cell Tasks
- Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
- Tissue prediction: Predict the tissue of origin for a group of cells.
- Cell type prediction: Predict the cell type of each cell in a group of cells.
- Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
- Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each individual cell.
- Multi-cells to abstract: Generate a research paper abstract based on the provided multi-cell sentences.
- Abstract to multi-cells: Generate multiple cell sentences based on a given research paper abstract.
Gene Set Tasks
- Gene set name to genes: Generate an alphabetical list of genes given a gene set name.
- Genes to gene set name: Generate the name of a gene set given an alphabetical list of genes.
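Both gene set tasks work with alphabetically ordered gene lists, so input genes should be sorted before being placed in a prompt. A minimal sketch follows; the helper name `format_gene_set`, the deduplication, the case-insensitive ordering, and the space separator are assumptions for illustration rather than the documented C2S format.

```python
def format_gene_set(genes):
    """Deduplicate gene symbols, sort them alphabetically, and join with spaces."""
    unique = set(genes)
    # Case-insensitive sort, with the raw symbol as a deterministic tie-breaker.
    return " ".join(sorted(unique, key=lambda g: (g.upper(), g)))


print(format_gene_set(["TP53", "BRCA1", "EGFR", "BRCA1", "Actb"]))
# Actb BRCA1 EGFR TP53
```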
Cell2Sentence Links
- GitHub: https://github.com/vandijklab/cell2sentence
- Paper: https://www.biorxiv.org/content/10.1101/2023.09.11.557287v3
Pythia Links
- Paper: https://arxiv.org/pdf/2304.01373
- Hugging Face: https://huggingface.co/EleutherAI/pythia-410m