Converge-SC for Embeddings: How to Use

Task Description

Single-cell embeddings are vector representations of cells that capture their biological characteristics in a high-dimensional space. They encode gene expression patterns, which makes cells efficient to analyze, visualize, and compare computationally. The task is to generate embeddings for single-cell RNA-seq data using the pre-trained Converge-SC model. The resulting embeddings can be used for downstream analyses such as clustering, visualization, and dataset integration.
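
As an illustration of what comparing cells in embedding space can look like, here is a minimal sketch that computes cosine similarity between vectors. The three 4-dimensional vectors are invented for the example and are not Converge-SC output; real embeddings would come from the model and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration (not model output).
cell_a = [0.9, 0.1, 0.0, 0.4]
cell_b = [0.8, 0.2, 0.1, 0.5]
cell_c = [0.0, 0.9, 0.7, 0.0]

sim_ab = cosine_similarity(cell_a, cell_b)  # similar profiles, close to 1
sim_ac = cosine_similarity(cell_a, cell_c)  # dissimilar profiles, close to 0
```

Cells whose embeddings point in similar directions score near 1, which is the property that clustering and nearest-neighbor methods build on.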

Basic Usage

The examples folder, found under the "Files and versions" tab, contains both the notebook and the gene-mapping JSON file.

Go to the examples/get_embeddings.ipynb notebook to see how to generate embeddings for your single-cell data.

Pipeline Description

The pipeline uses the pre-trained Converge-SC model to generate embeddings for each cell in your dataset. The workflow involves:

  1. Loading your single-cell data (as an AnnData object)
  2. Preprocessing and normalizing the data
  3. Loading the pre-trained Converge-SC model and tokenizer
  4. Generating embeddings for each cell
  5. Storing the embeddings for downstream tasks

Input Data Requirements

Your data should be in the form of an AnnData object (.h5ad file) with:

  1. Expression Data: Gene expression measurements in adata.X
  2. Gene Information: Gene identifiers in adata.var_names

Preprocessing Steps

Before generating embeddings, you should preprocess your data:

  1. Normalization: Normalize your data to a common scale
import scanpy as sc
   
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)  # Log-transform the data
  2. Gene Name Mapping: Converge-SC's vocabulary uses gene symbols, not Ensembl IDs, so if your data uses Ensembl IDs you need to map them to gene symbols
import json
   
# Load the mapping file
with open('examples/ensembl_to_gene_symbol.json', 'r') as file:
    ensg_to_symbol = json.load(file)
    
# Map gene names
adata.var_names = adata.var_names.map(lambda col: ensg_to_symbol.get(col, col))
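
To make the two preprocessing steps concrete, here is a plain-Python sketch of the arithmetic behind them for a single cell. It mirrors the idea of the calls above (scale counts to a fixed total, apply a natural-log `log1p`, then map IDs with a fallback to the original name) but is not the scanpy implementation. The function names, the toy counts, and the placeholder ID `ENSG00000000000` are assumptions made for illustration.

```python
import math

def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # an empty cell stays all zeros
    return [math.log1p(c * target_sum / total) for c in counts]

def map_gene_names(names, id_to_symbol):
    """Map IDs to symbols, keeping the original name when no mapping exists."""
    mapped = [id_to_symbol.get(name, name) for name in names]
    # Names without an entry: unknown IDs, or names that are already symbols.
    unmapped = [name for name in names if name not in id_to_symbol]
    return mapped, unmapped

# Toy cell with 3 genes and 50 total counts (values invented for illustration).
raw_counts = [10, 0, 40]
normalized = normalize_and_log(raw_counts)  # [log1p(2000), 0.0, log1p(8000)]

# One-entry mapping for illustration; the real file has many more entries.
toy_mapping = {"ENSG00000141510": "TP53"}
mapped, unmapped = map_gene_names(
    ["ENSG00000141510", "ENSG00000000000", "GAPDH"], toy_mapping
)
```

Because unmapped names are kept as-is, it is worth inspecting `unmapped` after mapping: Ensembl IDs left in the list will not match a vocabulary made of gene symbols.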

Generating Embeddings

Load model and tokenizer

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
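
Rather than pasting an access token into the code, one option is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; the helper name `get_hf_token` is hypothetical and not part of Converge-SC or Transformers.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in env_var, or fail with a clear message."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a Hugging Face access token"
        )
    return token

# Possible usage with the loading calls (commented out here because it needs
# network access and accepted access conditions for the repository):
# model = AutoModel.from_pretrained(
#     'ConvergeBio/ConvergeSC-embeddings', token=get_hf_token(), trust_remote_code=True
# )
```

This keeps the token out of notebooks and version control, which matters for a gated repository like this one.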

Compute Embeddings

# gene_names: gene symbols for a cell; gene_values: the matching expression values
tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
embedding = model(**tokenized_cell)
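
The snippet above leaves `gene_names` and `gene_values` undefined. One way to build them for a single cell is sketched below. The helper `cell_inputs` is hypothetical, and passing only genes with non-zero expression is an assumption: this page does not state what the tokenizer expects, so check examples/get_embeddings.ipynb for the actual procedure.

```python
def cell_inputs(expression_row, var_names):
    """Return (gene_names, gene_values) for genes with non-zero expression in one cell."""
    if len(expression_row) != len(var_names):
        raise ValueError("expression_row and var_names must have the same length")
    pairs = [
        (name, float(value))
        for name, value in zip(var_names, expression_row)
        if value != 0  # assumption: unexpressed genes are dropped
    ]
    gene_names = [name for name, _ in pairs]
    gene_values = [value for _, value in pairs]
    return gene_names, gene_values

# Toy cell with three genes (values invented for illustration).
var_names = ["TP53", "GAPDH", "ACTB"]
row = [0.0, 7.6, 8.9]
gene_names, gene_values = cell_inputs(row, var_names)
```

With an AnnData object, `row` would be one row of `adata.X`. If `adata.X` is a SciPy sparse matrix, a dense row can be obtained with `adata.X[i].toarray().ravel()`.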

Model size: 69.3M params (Safetensors, tensor type F32)