Model Card for databio/r2v-luecken2021-hg38-v2
Model Details
This is a single-cell Region2Vec (r2v) model designed to be used with scEmbed. It was trained on the Luecken2021 dataset and should be used to generate embeddings of single cells from scATAC-seq experiments. It produces a 100-dimensional embedding for each cell.
Model Sources
- Repository: https://github.com/databio/geniml
- Paper: https://www.biorxiv.org/content/10.1101/2023.08.01.551452v1
Uses
This model should be used to produce low-dimensional embeddings of single cells. These embeddings can be used for downstream clustering or classification tasks.
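As a rough illustration of the clustering use case, the sketch below runs a plain k-means over 100-dimensional vectors. The vectors are synthetic stand-ins, not output from this model, and the `kmeans` helper and its farthest-first initialization are assumptions made for the example; in practice you would more likely pass real embeddings to an established clustering library.

```python
import math
import random


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    # Farthest-first initialization: start from the first point, then keep
    # adding the point that lies farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Synthetic stand-in for model output: two well-separated groups of
# 100-dimensional vectors (NOT real embeddings from this model).
rng = random.Random(42)
group_a = [[rng.gauss(0.0, 0.1) for _ in range(100)] for _ in range(20)]
group_b = [[rng.gauss(1.0, 0.1) for _ in range(100)] for _ in range(20)]
embeddings = group_a + group_b

labels = kmeans(embeddings, k=2)
```

Here each entry of `labels` is the cluster index assigned to the corresponding vector, so cells with similar embeddings end up sharing a label.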
Bias, Risks, and Limitations
This model was trained on the Luecken2021 dataset, which captures the accessibility profiles of bone-marrow cells. Bone marrow is the site of several stages of erythrocyte differentiation and B cell maturation, so the model may not generalize as well to cell types from other tissues. Reads from these experiments were aligned to hg38, so this model should only be used with data aligned to hg38.
Recommendations
If fine-tuning on your own data, we recommend 100 epochs, although fewer may be sufficient.
How to Get Started with the Model
You can use the geniml Python library to download this model and start encoding your single-cell data:
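import scanpy as sc
from geniml.scembed import ScEmbed

# Load the scATAC-seq data (must be aligned to hg38)
adata = sc.read_h5ad("path/to/adata.h5ad")
# Load the pretrained model by its repository name
model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
# Produce one 100-dimensional embedding per cell
embeddings = model.encode(adata)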
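Once embeddings are available, one simple way to approach the classification use case is a nearest-centroid classifier over labeled reference cells. The sketch below is only an illustration: the helper names `fit_centroids` and `predict`, the 3-dimensional toy vectors, and the cell-type labels are all hypothetical and are not part of geniml or of this model's output.

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the vectors of each label into one centroid per label."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }


def predict(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Hypothetical stand-in data: 3-dimensional vectors keep the example readable;
# real embeddings from this model would have 100 dimensions.
reference = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.0, 1.0, 0.9], [0.1, 0.9, 1.0]]
reference_labels = ["B cell", "B cell", "erythroblast", "erythroblast"]

centroids = fit_centroids(reference, reference_labels)
query_label = predict([0.8, 0.2, 0.1], centroids)
```

The query vector sits closest to the centroid of the first two reference vectors, so it receives their label. The same pattern applies unchanged to 100-dimensional embeddings, since `math.dist` works for any matching dimensionality.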
Training Details
Training Data
The data for this model comes from Luecken2021. It is a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.