Model Card for databio/r2v-luecken2021-hg38

Model Details

This is a Region2Vec model trained for cross-modal retrieval of genomic region sets.

Model Sources

Repository: https://github.com/databio/gitk
Paper: (Coming soon...)

Uses

This model should be used to embed region sets in the search database.
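
Once region sets are embedded, retrieval usually reduces to ranking the stored vectors by similarity to a query vector. The following is a minimal sketch of that step, assuming cosine similarity as the metric. The function names and the toy 3-dimensional vectors are hypothetical, and the metric the search database actually uses is not stated in this card.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_region_sets(query, database):
    """Return database keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in database.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings, made up for illustration; real ones come from the model.
database = {
    "set_a": [1.0, 0.0, 0.0],
    "set_b": [0.0, 1.0, 0.0],
    "set_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_region_sets(query, database)
print(ranking)
```

In a real search database an approximate nearest-neighbor index would typically replace this linear scan.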

Bias, Risks, and Limitations

The Luecken2021 dataset consists of accessibility profiles of bone-marrow cells. Bone marrow is the site of several stages of erythrocyte differentiation and B cell maturation. Reads from these experiments were aligned to hg38, so this model should only be used with other data aligned to hg38.

Recommendations

If fine-tuning on your own data, we recommend 100 epochs, although fewer epochs may be sufficient.

How to Get Started with the Model

You can use the gitk Python library to download this model and start encoding your single-cell data:

import scanpy as sc
from gitk.scembed import ScEmbed

adata = sc.read_h5ad("path/to/adata.h5ad")

model = ScEmbed("databio/r2v-luecken2021-hg38")
embeddings = model.encode(adata)

Training Details

Universe: all human cCRE (hg38)
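
A universe acts as the model's vocabulary: input regions are typically mapped to the universe regions they overlap before being embedded. The sketch below illustrates that idea with a simple overlap test on half-open intervals. The helper names and coordinates are hypothetical, and the exact tokenization procedure used by gitk is not described in this card.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def tokenize(regions, universe):
    """Map query regions to the indices of the universe regions they overlap."""
    tokens = []
    for region in regions:
        for idx, u in enumerate(universe):
            # Keep each universe index once, in order of first appearance.
            if overlaps(region, u) and idx not in tokens:
                tokens.append(idx)
    return tokens

# Toy universe and query regions, made up for illustration.
universe = [("chr1", 100, 200), ("chr1", 500, 700), ("chr2", 100, 200)]
regions = [("chr1", 150, 180), ("chr1", 650, 900), ("chr3", 10, 20)]
tokens = tokenize(regions, universe)
print(tokens)
```

Regions that overlap nothing in the universe, such as the chr3 region here, produce no token. The nested loop is quadratic and only suitable for illustration; an interval tree or sorted sweep would be the usual choice at genome scale.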

Context window size: 50
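
Assuming the context window has its usual word2vec-style meaning, each region token is trained against the tokens within 50 positions on either side of it. The sketch below shows how such (center, context) pairs are formed, using a window of 1 to keep the output small. The function name is hypothetical, and this is an illustration of the general technique rather than the training code behind this model.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for tokens within `window` positions of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy token sequence, made up for illustration.
tokens = [10, 11, 12, 13]
pairs = context_pairs(tokens, window=1)
print(pairs)
```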

Training Data

3,647 hg38 BED files from ENCODE