Model Card for databio/r2v-luecken2021-hg38
Model Details
This is a Region2Vec model trained for cross-modal retrieval of genomic region sets.
Model Sources
- Repository: https://github.com/databio/gitk
- Paper: (Coming soon...)
Uses
This model should be used to embed region sets in the search database.
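As a rough illustration of how embedded region sets might be searched, the sketch below ranks database entries by cosine similarity to a query embedding. The toy vectors, the helper names (`cosine_similarity`, `rank_by_similarity`), and the choice of cosine similarity are assumptions made for illustration; they are not taken from the actual search pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, database):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real embeddings would come from the model.
database = {
    "region_set_a": [0.9, 0.1, 0.0],
    "region_set_b": [0.0, 1.0, 0.2],
    "region_set_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, database)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

In practice the query and database vectors would need to come from the same model so that they share one embedding space.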
Bias, Risks, and Limitations
This model is trained on the Luecken2021 dataset, which captures the accessibility profiles of bone-marrow cells. Bone marrow is the site of several stages of erythrocyte differentiation and B cell maturation. Reads from these experiments were aligned to hg38, so this model should only be used with data that is also aligned to hg38.
Recommendations
If fine-tuning on your own data, we recommend 100 epochs, although fewer may be sufficient.
How to Get Started with the Model
You can use the gitk Python library to download this model and start encoding your single-cell data:
import scanpy as sc
from gitk.scembed import ScEmbed
# Load the single-cell dataset from an .h5ad file
adata = sc.read_h5ad("path/to/adata.h5ad")
model = ScEmbed("databio/r2v-luecken2021-hg38")
embeddings = model.encode(adata)
Training Details
Universe: all human cCREs (hg38)
Context window size: 50
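To make these two settings more concrete, here is a minimal sketch of how a universe and a context window could interact in a word2vec-style setup: each input region is mapped to the universe regions it overlaps (its tokens), and each token is then paired with its neighbors inside the window. The toy universe, the helper names (`tokenize`, `context_pairs`), and the overlap rule are assumptions for illustration only; this is not the gitk implementation.

```python
def tokenize(regions, universe):
    """Map each (chrom, start, end) region to the universe regions it overlaps."""
    tokens = []
    for chrom, start, end in regions:
        for u_chrom, u_start, u_end in universe:
            # Half-open intervals overlap when each one starts before the other ends.
            if chrom == u_chrom and start < u_end and u_start < end:
                tokens.append(f"{u_chrom}:{u_start}-{u_end}")
    return tokens


def context_pairs(tokens, window):
    """Pair every token with the tokens at most `window` positions away from it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy universe and query regions (not real cCRE coordinates).
universe = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
regions = [("chr1", 150, 250), ("chr1", 350, 360), ("chr2", 500, 600)]

tokens = tokenize(regions, universe)       # the chr2 region has no overlap and is dropped
pairs = context_pairs(tokens, window=50)   # window size taken from the setting above
print(tokens)
print(pairs)
```

Regions that overlap nothing in the universe produce no token, which is one reason the universe and the input data should share the same genome assembly.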
Training Data
3,647 hg38 BED files from ENCODE