# PreprocessReadMe.md
The `preprocess.py` script preprocesses gene expression data for use in our models. It reads counts from `data.raw.X` or `data.X`, applies a configurable sequence of filtering, normalization, and transformation steps, and prepares the result for training or inference.
# General Workflow
The script follows these main steps:
0. **Load Data and Metadata**: The script starts by loading the gene expression data from an AnnData file and metadata from a JSON file.
1. **Set Raw Layer**: It checks whether `data.raw.X` is set in the AnnData object. If not, it populates it from the integer counts in `data.X`.
2. **Initialize Processed Layer**: It initializes `data.layers['processed']` in the AnnData object, which is the layer that all later preprocessing steps modify.
3. **Filter Genes by Reference ID**: It filters genes based on reference IDs if specified in the hyperparameters.
4. **Remove Assays**: It removes specified assays from the data.
5. **Filter Cells by Gene Counts**: It filters out cells with gene counts below a specified threshold.
6. **Filter Cells by Mitochondrial Fraction**: It removes cells with a high mitochondrial gene fraction.
7. **Filter Highly Variable Genes**: It filters genes to retain only highly variable ones using the specified method.
8. **Normalize Data**: It normalizes the data by applying row (cell-level) normalization and scaling, so that each cell's total expression matches a target value.
9. **Scale Columns by Median**: It scales columns based on median values from a specified dictionary.
10. **Log Transform**: It applies a `log(1 + x)` (log1p) transformation to the data.
11. **Compute Medians**: It computes and saves medians of the processed data if specified.
12. **Update Metadata**: It updates the metadata with cell counts and processing arguments.
13. **Save and Cleanup**: It saves the processed data and metadata to disk and performs garbage collection.
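
The numeric core of the workflow (steps 5, 6, 8, 9, and 10) can be sketched on a plain cells x genes array. This is a minimal illustration and not the code in `preprocess.py`: the function name `preprocess_sketch` and its signature are hypothetical. It also assumes that "gene counts" means the number of genes detected per cell, and that median scaling means dividing each gene column by its median.

```python
import numpy as np


def preprocess_sketch(counts, is_mito, min_gene_counts, max_mitochondrial_prop,
                      normalized_total, gene_medians=None, log1p=True):
    """Illustrative version of steps 5, 6, 8, 9 and 10 (rows = cells, columns = genes)."""
    x = np.asarray(counts, dtype=float)
    is_mito = np.asarray(is_mito, dtype=bool)

    # Step 5 (assumption): "gene counts" = number of genes with nonzero counts in a cell.
    genes_detected = (x > 0).sum(axis=1)
    x = x[genes_detected >= min_gene_counts]

    # Step 6: drop cells whose mitochondrial fraction exceeds the threshold.
    totals = x.sum(axis=1)
    safe_totals = np.where(totals == 0, 1.0, totals)  # avoid division by zero
    mito_fraction = x[:, is_mito].sum(axis=1) / safe_totals
    x = x[mito_fraction <= max_mitochondrial_prop]

    # Step 8: rescale each cell (row) so its counts sum to normalized_total.
    totals = x.sum(axis=1, keepdims=True)
    x = x / np.where(totals == 0, 1.0, totals) * normalized_total

    # Step 9 (assumption): scaling by median = dividing each gene column by its median.
    if gene_medians is not None:
        x = x / np.asarray(gene_medians, dtype=float)

    # Step 10: log(1 + x) transform.
    if log1p:
        x = np.log1p(x)
    return x


# Toy data: 4 cells x 4 genes, where the last gene is mitochondrial.
counts = np.array([
    [4, 0, 6, 0],  # kept
    [1, 1, 1, 7],  # mitochondrial fraction 0.7, removed in step 6
    [0, 0, 2, 0],  # only one gene detected, removed in step 5
    [5, 5, 0, 0],  # kept
])
is_mito = [False, False, False, True]

result = preprocess_sketch(counts, is_mito, min_gene_counts=2,
                           max_mitochondrial_prop=0.2, normalized_total=10.0)
scaled = preprocess_sketch(counts, is_mito, min_gene_counts=2,
                           max_mitochondrial_prop=0.2, normalized_total=10.0,
                           gene_medians=[2.0, 1.0, 1.0, 1.0], log1p=False)
```

Filtering runs before normalization here so that the thresholds apply to raw counts, which matches the order of the steps listed above.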
# Preprocessing Arguments
The script uses several preprocessing arguments to control its behavior. Here is an explanation of each argument and the steps they influence:
- `reference_id_only`
  - Description: Specifies whether to filter genes by reference ID.
  - Impact: If enabled, the script keeps only genes that match the reference IDs.
- `remove_assays`
  - Description: List of assays to remove from the data.
  - Impact: The script removes the specified assays from the data.
- `min_gene_counts`
  - Description: Minimum gene counts required for cells to be retained.
  - Impact: The script filters out cells with gene counts below this threshold.
- `max_mitochondrial_prop`
  - Description: Maximum mitochondrial gene fraction allowed for cells.
  - Impact: The script removes cells with a mitochondrial gene fraction above this threshold.
- `hvg_method`
  - Description: Method to use for filtering highly variable genes.
  - Impact: The script filters genes to retain only highly variable ones using the specified method.
- `normalized_total`
  - Description: Value to normalize the total gene expression to.
  - Impact: The script applies row (cell-level) normalization and scaling so that each cell's total expression matches this value.
- `median_dict`
  - Description: Path to a JSON file containing median values for scaling columns.
  - Impact: The script scales columns based on median values from the specified dictionary.
- `median_column`
  - Description: Column name to use for looking up median values.
  - Impact: The script uses this column to look up median values for scaling.
- `log1p`
  - Description: Indicates whether to apply a `log(1 + x)` transformation to the data.
  - Impact: If enabled, the script applies a `log(1 + x)` transformation to the data.
- `compute_medians`
  - Description: Indicates whether to compute and save medians of the processed data.
  - Impact: If enabled, the script computes and saves medians of the processed data.
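
As an illustration, the arguments above could be collected in a single mapping, with the median file written by one run and read back by another. Everything beyond the argument names is an assumption made for this sketch: the values, the `{key: median}` JSON layout, and the helpers `compute_gene_medians` and `lookup_medians` are hypothetical and may not match what `preprocess.py` does.

```python
import json
import statistics

# Hypothetical argument values. The key names come from the list above,
# but every value here is invented for illustration.
preprocess_args = {
    "reference_id_only": True,
    "remove_assays": ["example assay"],
    "min_gene_counts": 200,
    "max_mitochondrial_prop": 0.1,
    "hvg_method": None,
    "normalized_total": 10000,
    "median_dict": "medians.json",
    "median_column": "gene_id",
    "log1p": True,
    "compute_medians": False,
}


def compute_gene_medians(rows, keys):
    """Per-gene median over cells (assumed meaning of `compute_medians`)."""
    columns = zip(*rows)  # transpose: one tuple of values per gene
    return {key: statistics.median(col) for key, col in zip(keys, columns)}


def lookup_medians(median_json, gene_table, median_column):
    """Return one median per gene, in gene order (assumed layout: {key: median})."""
    medians = json.loads(median_json)
    return [medians[row[median_column]] for row in gene_table]


# Toy gene annotation table and processed matrix (3 cells x 2 genes).
gene_table = [{"gene_id": "G1"}, {"gene_id": "G2"}]
processed = [[1.0, 4.0], [3.0, 8.0], [2.0, 6.0]]

column = preprocess_args["median_column"]
medians = compute_gene_medians(processed, [gene[column] for gene in gene_table])
per_gene = lookup_medians(json.dumps(medians), gene_table, column)
```

Keying the medians by the values in `median_column` keeps the file independent of gene order, so it can be reused on a dataset whose genes are sorted differently.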