BrainScope scGPT Pipeline
This repository contains the reusable pipeline code for running embedding, reference mapping, and cell-type annotation with scGPT on brain single-cell / single-nucleus RNA-seq datasets.
It is designed to support two execution modes:
- small mode: one-pass workflows for smaller datasets such as LIBD
- large mode: chunked / sharded workflows for large datasets such as full BrainScope
This repo is the main user-facing entry point. The model weights live in separate Hugging Face repos:
- Original model repo: YOUR_USERNAME/brainscope-scgpt-og
- Disease fine-tuned model repo: YOUR_USERNAME/brainscope-scgpt-disease
What this pipeline does
The pipeline currently supports:
- Embed
  - generate scGPT cell embeddings from .h5ad files
- Reference map
  - embed query cells
  - search a FAISS reference index
  - assign labels by nearest-neighbor voting
- Annotate
  - run the classification / CLS-head path for cell-type prediction
- Error analysis
  - summarize per-class performance and confusion patterns
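The last reference-mapping step, label assignment by nearest-neighbor voting, can be sketched in a few lines. This is a minimal illustration rather than the pipeline's own code: the function name `vote_labels`, the `"unassigned"` fallback, and the tie-breaking rule are assumptions made for the example. It expects the neighbor indices that a nearest-neighbor search (such as a FAISS index search) would return for each query cell.

```python
from collections import Counter
from typing import List, Sequence


def vote_labels(
    neighbor_ids: Sequence[Sequence[int]],
    reference_labels: Sequence[str],
) -> List[str]:
    """Assign each query cell the most common label among its neighbors.

    neighbor_ids[i] holds the reference-cell indices found for query
    cell i, ordered nearest first. Hypothetical helper for illustration.
    """
    predictions = []
    for ids in neighbor_ids:
        # Negative ids mark missing neighbors (FAISS pads with -1), so skip them.
        votes = Counter(reference_labels[j] for j in ids if j >= 0)
        if not votes:
            # No usable neighbor: fall back to a placeholder label (assumption).
            predictions.append("unassigned")
            continue
        # Counter keeps first-seen order for equal counts, so a tie goes
        # to the label that appeared first, i.e. the nearer neighbor.
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

A weighted vote (for example, by similarity score) is a common alternative; the unweighted version is shown only because it is the simplest form of the idea.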
The codebase also supports both:
- small datasets that fit in memory
- large datasets that require chunking / sharding
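As a rough illustration of what chunking means here, the sketch below yields non-overlapping row ranges that together cover a dataset. The helper name `iter_chunks` and the chunk size are hypothetical; the actual large-mode implementation in this repository may shard data differently.

```python
from typing import Iterator, Tuple


def iter_chunks(n_cells: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges that cover n_cells without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cells, chunk_size):
        # The final chunk is truncated so it never runs past the last cell.
        yield start, min(start + chunk_size, n_cells)


# Example: 10 cells in chunks of 4 rows.
for start, stop in iter_chunks(10, 4):
    print(start, stop)
```

With AnnData, one way to use such ranges is to open the file in backed mode (`anndata.read_h5ad(path, backed="r")`) and load one slice at a time, so that only a single chunk is held in memory.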
Intended use
This pipeline is intended for:
- disease-aware cell-type annotation
- reference mapping against a healthy or external atlas
- reproducible inference on brain sc/snRNA-seq data
- research workflows on BrainScope, LIBD, and related datasets
This is a research pipeline, not a clinical product.
Repository structure
A typical structure looks like this:
src/brainscope_scgpt/
    cli.py
    preprocess.py
    reference_map.py
    tokenize.py
    model_factory.py
    freeze.py
    infer.py
    evaluate.py
    error_analysis.py
configs/
    og_scgpt.yaml
    disease_scgpt.yaml
    moe_scgpt.yaml
Installation
This project uses a local scGPT fork plus a Python environment built around Python 3.9 and PyTorch 1.13.
Option A: clean environment
Use the provided clean environment file if available:
conda env create -f environment.clean.yml
conda activate brainscope_scgpt_clean
Option B: existing scGPT environment
If you already have your scGPT environment working locally or on BlueHive, activate that environment.
On BlueHive, load modules before activating the environment:
module load cuda/11.8 git gcc/11.2.0/b1 anaconda3/2023.07-2
conda activate scgpt_finetune
Then install the local code:
pip install -e ../scGPT
pip install -e .
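After installing, it can help to confirm that the active environment matches the versions this project is built around (Python 3.9 and PyTorch 1.13). The following is a small sketch for that check; the function names are hypothetical and not part of the package.

```python
import sys
from importlib import metadata


def version_matches(found: str, expected: str) -> bool:
    """True when `found` has the same major.minor pair as `expected`."""
    return found.split(".")[:2] == expected.split(".")[:2]


def check_environment() -> dict:
    """Report the Python and torch versions and whether they match."""
    python_version = "{}.{}".format(*sys.version_info[:2])
    try:
        # Reads the installed package metadata without importing torch.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    return {
        "python": (python_version, version_matches(python_version, "3.9")),
        "torch": (
            torch_version,
            torch_version is not None and version_matches(torch_version, "1.13"),
        ),
    }


if __name__ == "__main__":
    for name, (version, ok) in check_environment().items():
        print(name, version, "OK" if ok else "MISMATCH")
```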
Input requirements
The pipeline expects an AnnData .h5ad input with:
- genes aligned to the expected gene vocabulary
- required metadata columns such as:
  - cell type column
  - batch column
  - gene column
- preprocessing consistent with the selected config
Exact required field names depend on the config you use.
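Because the required field names vary by config, a quick pre-flight check can catch a missing column before a long run starts. The sketch below is one way to do that; the helper name and the column names used in the example (`cell_type`, `batch`, `gene_name`) are placeholders, not the names this pipeline requires.

```python
from typing import Dict, Iterable, List


def find_missing_columns(
    obs_columns: Iterable[str],
    var_columns: Iterable[str],
    required_obs: Iterable[str],
    required_var: Iterable[str],
) -> Dict[str, List[str]]:
    """Return the required columns that are absent, grouped by table."""
    obs_present = set(obs_columns)
    var_present = set(var_columns)
    return {
        "obs": [c for c in required_obs if c not in obs_present],
        "var": [c for c in required_var if c not in var_present],
    }


# Example with placeholder column names (assumptions, not the real config).
missing = find_missing_columns(
    obs_columns=["cell_type", "donor"],
    var_columns=["gene_name"],
    required_obs=["cell_type", "batch"],
    required_var=["gene_name"],
)
print(missing)
```

For a loaded AnnData object `adata`, the first two arguments would be `adata.obs.columns` and `adata.var.columns`, with the required names taken from the selected config.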
Example usage
1. Annotation
python -m brainscope_scgpt annotate --input data/query.h5ad --model-repo YOUR_USERNAME/brainscope-scgpt-disease --output results/query_annotated.h5ad --mode small
2. Reference mapping
python -m brainscope_scgpt reference-map --input data/query.h5ad --model-dir ../save/scGPT_human --faiss-index-dir ../save/CellXGene_faiss_index --output results/query_rm.h5ad --mode small
3. Large dataset mode
python -m brainscope_scgpt annotate --input data/brainscope_full.h5ad --model-repo YOUR_USERNAME/brainscope-scgpt-disease --output results/brainscope_full_annotated.h5ad --mode large
Related model repositories
1. Original scGPT baseline
YOUR_USERNAME/brainscope-scgpt-og
Contains the packaged original scGPT baseline artifacts for reuse.
2. Disease fine-tuned scGPT
YOUR_USERNAME/brainscope-scgpt-disease
Contains the disease-adapted fine-tuned model and metadata.
Limitations
- This pipeline depends on a modified scGPT fork and environment assumptions specific to this project.
- Large-scale reference mapping may require a FAISS index that is not stored in this repository.
- Performance depends strongly on preprocessing consistency, vocabulary matching, and label definitions.
- This release is intended for research reproducibility, not for clinical decision-making.
Citation
If you use this pipeline, please cite the original scGPT paper and, when available, this project's manuscript.
scGPT
@article{cui2024scgpt,
  title={scGPT: toward building a foundation model for single-cell multi-omics using generative AI},
  author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and others},
  journal={Nature Methods},
  year={2024}
}
Contact
Yuesong Huang
University of Rochester
Email: yhu116@ur.rochester.edu