BrainScope scGPT Pipeline

This repository contains the reusable pipeline code for running embedding, reference mapping, and cell-type annotation with scGPT on brain single-cell / single-nucleus RNA-seq datasets.

It is designed to support two execution modes:

  • small mode: one-pass workflows for smaller datasets such as LIBD
  • large mode: chunked / sharded workflows for large datasets such as full BrainScope

This repo is the main user-facing entry point. The model weights live in separate Hugging Face repos:

  • Original model repo: YOUR_USERNAME/brainscope-scgpt-og
  • Disease fine-tuned model repo: YOUR_USERNAME/brainscope-scgpt-disease

What this pipeline does

The pipeline currently supports:

  • Embed
    • generate scGPT cell embeddings from .h5ad
  • Reference map
    • embed query cells
    • search a FAISS reference index
    • assign labels by nearest-neighbor voting
  • Annotate
    • run the classification / CLS-head path for cell-type prediction
  • Error analysis
    • summarize per-class performance and confusion patterns
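
As a rough illustration of the error-analysis step, the sketch below computes per-class precision and recall and lists the most frequent confusions from two label lists. It is not the code in error_analysis.py; the function names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, support)} for every label seen."""
    pairs = confusion_counts(y_true, y_pred)
    report = {}
    for lab in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(lab, lab)]
        predicted = sum(n for (t, p), n in pairs.items() if p == lab)
        support = sum(n for (t, p), n in pairs.items() if t == lab)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        report[lab] = (precision, recall, support)
    return report


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent off-diagonal (true, predicted) pairs."""
    pairs = confusion_counts(y_true, y_pred)
    errors = [(pair, n) for pair, n in pairs.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:k]


# Toy labels (hypothetical), standing in for true and predicted cell types.
y_true = ["Astro", "Astro", "Oligo", "Oligo", "Micro", "Micro"]
y_pred = ["Astro", "Oligo", "Oligo", "Oligo", "Micro", "Astro"]

report = per_class_report(y_true, y_pred)
confusions = top_confusions(y_true, y_pred)
```

Here `report` maps each cell type to its precision, recall, and number of true cells, and `confusions` ranks the misassignments, which is usually the first thing to inspect when two related cell types are being mixed up.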

Each of these steps can run at either scale:

  • small datasets that fit in memory (small mode)
  • large datasets that require chunking / sharding (large mode)
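
The idea behind chunked processing can be sketched as follows: split the cells into fixed-size row blocks, embed each block, and stitch the results back together in order. This is only an illustration; `iter_chunks`, `embed_in_chunks`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for the real model forward pass.

```python
import numpy as np


def iter_chunks(n_cells, chunk_size):
    """Yield (start, stop) row ranges that cover n_cells without overlap."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


def embed_in_chunks(matrix, embed_fn, chunk_size):
    """Apply embed_fn to row blocks and stack the results in original order."""
    parts = [
        embed_fn(matrix[start:stop])
        for start, stop in iter_chunks(matrix.shape[0], chunk_size)
    ]
    return np.vstack(parts)


def toy_embed(block):
    """Hypothetical stand-in for the model: keep two columns and scale them."""
    return block[:, :2] * 2.0


expr = np.arange(50, dtype=np.float32).reshape(10, 5)  # 10 cells x 5 genes
chunked = embed_in_chunks(expr, toy_embed, chunk_size=4)
one_pass = toy_embed(expr)
```

Because each cell is embedded independently in this sketch, `chunked` and `one_pass` are identical. Note that the sketch still stacks every block in memory at the end; a workflow for a dataset that does not fit in memory would more plausibly write each block to disk as a shard.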

Intended use

This pipeline is intended for:

  • disease-aware cell-type annotation
  • reference mapping against a healthy or external atlas
  • reproducible inference on brain sc/snRNA-seq data
  • research workflows on BrainScope, LIBD, and related datasets

This is a research pipeline, not a clinical product.


Repository structure

A typical structure looks like this:

src/brainscope_scgpt/
  cli.py
  preprocess.py
  reference_map.py
  tokenize.py
  model_factory.py
  freeze.py
  infer.py
  evaluate.py
  error_analysis.py

configs/
  og_scgpt.yaml
  disease_scgpt.yaml
  moe_scgpt.yaml

Installation

This project depends on a local scGPT fork and an environment built around Python 3.9 and PyTorch 1.13.

Option A: clean environment

Use the provided clean environment file if available:

conda env create -f environment.clean.yml
conda activate brainscope_scgpt_clean

Option B: existing scGPT environment

If you already have your scGPT environment working locally or on BlueHive, activate that environment.

On BlueHive, load modules before activating the environment:

module load cuda/11.8 git gcc/11.2.0/b1 anaconda3/2023.07-2
conda activate scgpt_finetune

Then install the local scGPT fork (assumed here to be checked out at ../scGPT) and this package in editable mode:

pip install -e ../scGPT
pip install -e .

Input requirements

The pipeline expects an AnnData .h5ad input with:

  • genes aligned to the expected gene vocabulary
  • required metadata columns such as:
    • a cell-type label column (per-cell metadata, .obs)
    • a batch column (per-cell metadata, .obs)
    • a gene-name column (per-gene metadata, .var)
  • preprocessing consistent with the selected config

Exact required field names depend on the config you use.
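
Before launching a run, it can help to check these requirements up front. The sketch below is one way to do that on plain Python collections; `check_inputs`, the column names `cell_type` and `batch`, the toy vocabulary, and the 50% overlap threshold are all assumptions for illustration, not values taken from the pipeline configs.

```python
def check_inputs(obs_columns, gene_names, vocab, required_obs, min_overlap=0.5):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    # 1. Every required per-cell metadata column must be present.
    present = set(obs_columns)
    missing = [col for col in required_obs if col not in present]
    if missing:
        problems.append(f"missing obs columns: {missing}")

    # 2. Enough of the dataset's genes must appear in the model vocabulary.
    genes = list(gene_names)
    matched = sum(1 for gene in genes if gene in vocab)
    overlap = matched / len(genes) if genes else 0.0
    if overlap < min_overlap:
        problems.append(f"only {overlap:.0%} of genes are in the vocabulary")

    return problems


# Toy vocabulary and column names (hypothetical).
vocab = {"GFAP", "MBP", "SNAP25", "AQP4"}
required = ["cell_type", "batch"]

ok = check_inputs(["cell_type", "batch"], ["GFAP", "MBP", "SNAP25"], vocab, required)
bad = check_inputs(["cell_type"], ["GFAP", "XYZ1", "XYZ2"], vocab, required)
```

With an AnnData object loaded, the first two arguments would typically be `adata.obs.columns` and either `adata.var_names` or the configured gene-name column in `adata.var`.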


Example usage

1. Annotation

python -m brainscope_scgpt annotate \
  --input data/query.h5ad \
  --model-repo YOUR_USERNAME/brainscope-scgpt-disease \
  --output results/query_annotated.h5ad \
  --mode small

2. Reference mapping

python -m brainscope_scgpt reference-map \
  --input data/query.h5ad \
  --model-dir ../save/scGPT_human \
  --faiss-index-dir ../save/CellXGene_faiss_index \
  --output results/query_rm.h5ad \
  --mode small
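
Conceptually, the label-assignment part of reference mapping is a nearest-neighbor majority vote in embedding space. The sketch below shows that step with a brute-force NumPy search swapped in for the FAISS index lookup, so it runs without an index on disk. The function name `knn_vote`, the value of k, the Euclidean metric, and the toy embeddings are assumptions; this README does not state the metric or neighbor count the pipeline uses.

```python
from collections import Counter

import numpy as np


def knn_vote(query_emb, ref_emb, ref_labels, k=3):
    """Label each query cell by majority vote among its k nearest reference cells."""
    # Squared Euclidean distance between every query and every reference cell.
    # (Brute force; the pipeline uses a FAISS index for this search instead.)
    diffs = query_emb[:, None, :] - ref_emb[None, :, :]
    dists = np.einsum("qrd,qrd->qr", diffs, diffs)

    # Indices of the k closest reference cells for each query, closest first.
    neighbors = np.argsort(dists, axis=1, kind="stable")[:, :k]

    predictions = []
    for row in neighbors:
        votes = Counter(ref_labels[i] for i in row)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy 2-D "embeddings" (hypothetical): two well-separated reference clusters.
ref_emb = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
ref_labels = ["Astro", "Astro", "Astro", "Oligo", "Oligo", "Oligo"]
query_emb = np.array([[0.2, 0.1], [10.5, 10.2]])

preds = knn_vote(query_emb, ref_emb, ref_labels, k=3)
```

In this toy case the first query cell sits in the first cluster and the second in the other, so `preds` is `["Astro", "Oligo"]`. The brute-force distance matrix grows with queries times references, which is the reason an approximate index such as FAISS is used at atlas scale.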

3. Large dataset mode

python -m brainscope_scgpt annotate \
  --input data/brainscope_full.h5ad \
  --model-repo YOUR_USERNAME/brainscope-scgpt-disease \
  --output results/brainscope_full_annotated.h5ad \
  --mode large

Related model repositories

1. Original scGPT baseline

YOUR_USERNAME/brainscope-scgpt-og

Contains the packaged original scGPT baseline artifacts for reuse.

2. Disease fine-tuned scGPT

YOUR_USERNAME/brainscope-scgpt-disease

Contains the disease-adapted fine-tuned model and metadata.


Limitations

  • This pipeline depends on a modified scGPT fork and environment assumptions specific to this project.
  • Large-scale reference mapping may require a FAISS index that is not stored in this repository.
  • Performance depends strongly on preprocessing consistency, vocabulary matching, and label definitions.
  • This release is intended for research reproducibility, not for clinical decision-making.

Citation

If you use this pipeline, please cite the original scGPT paper and, once it is available, the manuscript for this project.

scGPT

@article{cui2024scgpt,
  title={scGPT: toward building a foundation model for single-cell multi-omics using generative AI},
  author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and others},
  journal={Nature Methods},
  year={2024}
}

Contact

Yuesong Huang
University of Rochester
Email: yhu116@ur.rochester.edu
