PreprocessReadMe.md

The preprocess.py script preprocesses gene expression data for use in our models. It reads expression values from data.raw.X or data.X, applies the preprocessing steps described below, and prepares the result for training or inference.

General Workflow

The script follows these main steps:

0. Load Data and Metadata: It loads the gene expression data from an AnnData file and the metadata from a JSON file.
1. Set Raw Layer: It checks whether data.raw.X is set in the AnnData object. If not, it sets it from the integer counts in data.X.
2. Initialize Processed Layer: It initializes data.layers['processed'] in the AnnData object, which is the layer that all later preprocessing steps modify.
3. Filter Genes by Reference ID: It filters genes based on reference IDs if specified in the hyperparameters.
4. Remove Assays: It removes specified assays from the data.
5. Filter Cells by Gene Counts: It filters out cells with gene counts below a specified threshold.
6. Filter Cells by Mitochondrial Fraction: It removes cells whose mitochondrial gene fraction exceeds a specified threshold.
7. Filter Highly Variable Genes: It retains only highly variable genes, selected with the specified method.
8. Normalize Data: It applies row-wise normalization, scaling each cell's total expression across genes to a specified value.
9. Scale Columns by Median: It scales columns (genes) based on median values from a specified dictionary.
10. Log Transform: It applies a log(1 + x) transformation to the data.
11. Compute Medians: It computes and saves medians of the processed data if specified.
12. Update Metadata: It updates the metadata with cell counts and processing arguments.
13. Save and Cleanup: It saves the processed data and metadata to disk and performs garbage collection.
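
As a rough illustration of the cell filtering, normalization, and log steps, here is a minimal sketch on a dense NumPy matrix (rows are cells, columns are genes). It is not the code in preprocess.py: the function name preprocess_sketch and the is_mito mask are hypothetical, and "gene counts" is assumed here to mean the number of genes with a nonzero count in a cell. The keyword names mirror the preprocessing arguments described below.

```python
import numpy as np

def preprocess_sketch(counts, is_mito, min_gene_counts=2,
                      max_mitochondrial_prop=0.5,
                      normalized_total=10_000.0, log1p=True):
    """Hypothetical sketch: counts is a cells x genes matrix of raw counts."""
    x = np.asarray(counts, dtype=float)
    is_mito = np.asarray(is_mito, dtype=bool)

    # Filter cells by gene counts (assumption: number of detected genes).
    genes_detected = (x > 0).sum(axis=1)
    keep = genes_detected >= min_gene_counts

    # Filter cells by the mitochondrial share of their total counts.
    totals = x.sum(axis=1)
    mito_frac = np.divide(x[:, is_mito].sum(axis=1), totals,
                          out=np.zeros_like(totals), where=totals > 0)
    keep &= mito_frac <= max_mitochondrial_prop
    x = x[keep]

    # Normalize each cell (row) so that its counts sum to normalized_total.
    row_sums = x.sum(axis=1, keepdims=True)
    x = np.divide(x * normalized_total, row_sums,
                  out=np.zeros_like(x), where=row_sums > 0)

    # Optional log(1 + x) transform.
    if log1p:
        x = np.log1p(x)
    return x, keep

# Illustrative values only; the last gene is flagged as mitochondrial.
args = {"min_gene_counts": 2, "max_mitochondrial_prop": 0.5,
        "normalized_total": 100.0, "log1p": False}
counts = np.array([[5, 0, 5, 0],
                   [0, 0, 1, 0],
                   [1, 1, 0, 8],
                   [2, 2, 2, 2]])
is_mito = np.array([False, False, False, True])
processed, kept = preprocess_sketch(counts, is_mito, **args)
print(kept)
print(processed)
```

In this toy input the second cell is dropped for having too few detected genes and the third for its high mitochondrial fraction. The real script operates on an AnnData object and may define these thresholds differently.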

Preprocessing Arguments

The script uses several preprocessing arguments to control its behavior. Here is an explanation of each argument and the steps they influence:

reference_id_only
- Description: Specifies whether to filter genes by reference ID.
- Impact: If enabled, the script filters genes based on reference IDs.
remove_assays
- Description: List of assays to remove from the data.
- Impact: The script removes specified assays from the data.
min_gene_counts
- Description: Minimum gene counts required for cells to be retained.
- Impact: The script filters out cells with gene counts below this threshold.
max_mitochondrial_prop
- Description: Maximum mitochondrial gene fraction allowed for cells.
- Impact: The script removes cells with a mitochondrial gene fraction above this threshold.
hvg_method
- Description: Method to use for filtering highly variable genes.
- Impact: The script filters genes to retain only highly variable ones using the specified method.
normalized_total
- Description: Target value for each cell's total gene expression after normalization.
- Impact: The script scales each row (cell) so that its total expression across genes equals this value.
median_dict
- Description: Path to a JSON file containing median values for scaling columns.
- Impact: The script scales columns based on median values from the specified dictionary.
median_column
- Description: Column name to use for looking up median values.
- Impact: The script uses this column to look up median values for scaling.
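
One way median_dict and median_column could work together is sketched below. This is an assumption-laden illustration, not the script's implementation: it assumes the JSON maps a per-gene key (the values found in the median_column column) to a median, that scaling means dividing each gene column by its median, and that genes with a missing or zero median are left unscaled. The function name and the gene IDs are placeholders.

```python
import json
import numpy as np

def scale_columns_by_median(x, gene_keys, medians):
    """Divide each gene column by its reference median.

    gene_keys holds, per column, the lookup key taken from median_column.
    Genes without a usable median (missing or zero) keep a factor of 1.0.
    """
    factors = np.array([medians.get(key, 1.0) or 1.0 for key in gene_keys],
                       dtype=float)
    return np.asarray(x, dtype=float) / factors

# Stand-in for the contents of the JSON file that median_dict points to.
medians = json.loads('{"ENSG01": 2.0, "ENSG02": 4.0}')
gene_keys = ["ENSG01", "ENSG02", "ENSG03"]  # placeholder IDs
x = np.array([[2.0, 4.0, 3.0],
              [4.0, 8.0, 0.0]])
scaled = scale_columns_by_median(x, gene_keys, medians)
print(scaled)
```

Leaving unmatched genes unscaled is only one possible policy; dropping them or raising an error would be equally plausible choices.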
log1p
- Description: Indicates whether to apply a log(1 + x) transformation to the data.
- Impact: If enabled, the script applies a log(1 + x) transformation to the data.
compute_medians
- Description: Indicates whether to compute and save medians of the processed data.
- Impact: If enabled, the script computes and saves medians of the processed data.
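
The medians produced by compute_medians could then serve as a median_dict in a later run. The sketch below shows one plausible way to compute and serialize them. It assumes per-gene medians taken over nonzero entries only, which is a common convention for sparse expression data but is not confirmed by this document; the function name and gene IDs are placeholders.

```python
import json
import numpy as np

def compute_gene_medians(x, gene_keys):
    """Per-gene median over nonzero entries; all-zero genes get 0.0."""
    x = np.asarray(x, dtype=float)
    medians = {}
    for j, key in enumerate(gene_keys):
        column = x[:, j]
        nonzero = column[column > 0]
        medians[key] = float(np.median(nonzero)) if nonzero.size else 0.0
    return medians

# Toy processed matrix: 3 cells x 3 genes.
processed_layer = np.array([[1.0, 0.0, 0.0],
                            [3.0, 2.0, 0.0],
                            [5.0, 0.0, 0.0]])
gene_medians = compute_gene_medians(processed_layer,
                                    ["ENSG01", "ENSG02", "ENSG03"])
payload = json.dumps(gene_medians, indent=2)  # text that would be saved to disk
print(payload)
```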