
Training Data Curation and Processing

The data folder and its subfolders hold all raw data and processed data used to assemble FusOn-DB, as well as all processing scripts. Additional benchmarking datasets can be found in the benchmarking folder.

From raw data to train/val/test splits and head/tail data

This section outlines the pipeline for converting the raw FusionPDB and FOdb datasets into the train/val/test splits used in FusOn-pLM. The process includes data cleaning, clustering, and splitting. During cleaning, we also extracted data about the heads and tails of each fusion oncoprotein.

data/
└── clustering/
    ├── input.fasta
    ├── mmseqs_full_results.csv
└── head_tail_data/
    └── uniprot_idmap_inputs/
└── raw_data/
    ├── FOdb_all.csv
    ├── FOdb_puncta.csv
    ├── FOdb_SD5.csv
    ├── FusionPDB_cleaned.csv
    ├── FusionPDB.txt
    ├── gene_to_ensembl_dict.pkl
└── splits/
    ├── combined_plot.png
    ├── train_df.csv
    ├── train_cluster_split.csv
    ├── val_df.csv
    ├── val_cluster_split.csv
    ├── test_df.csv
    ├── test_cluster_split.csv
├── clean.py
├── cluster.py
├── config.py
├── split.py
├── split_vis.py
├── data_cleaning_log.txt
├── clustering_log.txt
├── splitting_log.txt
├── fuson_db.csv
  • clean.py: script for cleaning the datasets in raw_data. Print statements in this code produce data_cleaning_log.txt.
  • cluster.py: script for clustering the processed data in fuson_db.csv. Print statements in this code produce clustering_log.txt.
  • config.py: configs for the cleaning, clustering, and splitting scripts.
  • split.py: script for splitting the data, post-clustering. Print statements in this code produce splitting_log.txt.
  • split_vis.py: script with code for the plots in splits/combined_plot.png, which describe the content of the train, validation, and test splits (length distribution, Shannon entropy, amino acid frequencies, and cluster sizes).
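
One of the quantities plotted for the splits is Shannon entropy. The exact computation in split_vis.py is not shown here, so the following is only a minimal sketch, assuming entropy is computed per sequence over its amino acid frequencies and reported in bits. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue distribution of one sequence.

    Assumption: entropy is taken over the observed amino acid frequencies
    of a single sequence; split_vis.py may define it differently.
    """
    if not seq:
        return 0.0
    counts = Counter(seq)
    n = len(seq)
    # H = -sum(p * log2(p)) over every residue type present in the sequence
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy sequences (not taken from FusOn-DB)
low_complexity = shannon_entropy("AAAAAAAA")
mixed = shannon_entropy("ACDEACDE")
print(low_complexity, mixed)
```

A sequence made of a single residue type has zero entropy, and a sequence that uses four residue types equally often has an entropy of two bits.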

Usage

To repeat our cleaning, clustering, and splitting process, proceed as follows.

  1. Install MMseqs2 at /*/FusOn-pLM/fuson_plm/mmseqs2 according to these instructions: https://github.com/soedinglab/MMseqs2. Make sure that CLUSTER.PATH_TO_MMSEQS in config.py points to your MMseqs2 installation.
  2. Run the cleaning script:
python clean.py

This script will create the following files:

  • fuson_db.csv: FusOn-DB, our full database of 44,414 fusion oncoproteins.
  • raw_data/FusionPDB_cleaned.csv: a processed version of the FusionPDB database with the following columns: aa_seq,n_fusiongenes,fusiongenes,cancers,primary_sources,secondary_source.
  • head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt and head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt: all unique Ensembl IDs and gene symbols for all unique head/tail proteins corresponding to any fusion oncoproteins in FusOn-DB. These were submitted to the UniProt ID-mapping tool to create head_tail_data/ensembl_ht_idmap.txt and head_tail_data/genename_ht_idmap.txt, respectively.
  • head_tail_data/uniprot_idmap_inputs/gene_to_ensembl_dict.pkl: a dictionary mapping each unique gene symbol to a comma-separated list of its associated Ensembl IDs, according to FusionPDB.
  • head_tail_data/uniprot_idmap_inputs/htgenes_uniprotids.csv: a file with each unique gene symbol (Gene), a comma-separated list of all associated UniProt IDs (UniProtID), and a concatenated string of 1s and 0s indicating whether each ID in the UniProtID column is reviewed (Reviewed).
    • For example, a Reviewed value of "100" means the first ID in the UniProtID column of the same row is reviewed (1) and the second and third are not (0).
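
The pairing between the UniProtID and Reviewed columns can be sketched as below. This is an illustration only: the helper name `pair_ids_with_review_flags` and the placeholder IDs are hypothetical, and the sketch assumes the Reviewed value has exactly one character per ID. If the file is loaded with pandas, reading the Reviewed column as a string is advisable, since a value such as "011" would otherwise be parsed as the integer 11.

```python
def pair_ids_with_review_flags(uniprot_ids: str, reviewed: str) -> list[tuple[str, bool]]:
    """Pair each UniProt ID in a row with its reviewed flag.

    uniprot_ids: comma-separated IDs from the UniProtID column.
    reviewed:    string of 1s and 0s from the Reviewed column.
    """
    ids = [x.strip() for x in uniprot_ids.split(",") if x.strip()]
    if len(ids) != len(reviewed):
        raise ValueError(
            f"{len(ids)} IDs but {len(reviewed)} reviewed flags; "
            "was the Reviewed column read as a number?"
        )
    return [(uid, flag == "1") for uid, flag in zip(ids, reviewed)]


# Placeholder IDs, not real UniProt accessions
pairs = pair_ids_with_review_flags("ID_A,ID_B,ID_C", "100")
print(pairs)
```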
  3. Run the clustering script:
python cluster.py

This script passes the following command to MMseqs2:

mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.8 --cov-mode 0
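
In this command, --min-seq-id 0.3 sets a minimum sequence identity of 30%, -c 0.8 requires 80% alignment coverage, and --cov-mode 0 applies that coverage requirement to both query and target. How cluster.py assembles the call is not shown here; the sketch below is one way to build the same command from Python, with the hypothetical helper `build_easy_cluster_cmd` and default arguments copied from the command above. In the real pipeline, the binary path would presumably come from CLUSTER.PATH_TO_MMSEQS in config.py.

```python
import subprocess


def build_easy_cluster_cmd(
    mmseqs_bin: str = "mmseqs",  # assumption: binary is on PATH
    fasta: str = "clustering/input.fasta",
    out_prefix: str = "clustering/raw_output/mmseqs",
    tmp_dir: str = "clustering/raw_output",
    min_seq_id: float = 0.3,
    coverage: float = 0.8,
    cov_mode: int = 0,
) -> list[str]:
    """Return the easy-cluster command as an argument list for subprocess."""
    return [
        mmseqs_bin, "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
    ]


cmd = build_easy_cluster_cmd()
print(" ".join(cmd))
# Uncomment to actually run MMseqs2 (requires a working installation):
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single shell string avoids quoting problems if any path contains spaces.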

This script will cluster all sequences of length 2,000 or shorter (the cutoff is set in config.py) and create the following files:

  • clustering/input.fasta: the input file used by MMseqs2 to cluster the fusion oncoprotein sequences. Headers are our assigned sequence IDs (found in the seq_id column of fuson_db.csv).
  • clustering/mmseqs_full_results.csv: clustering results. Columns:
    • representative seq_id: the seq_id of the sequence representing this cluster
    • member seq_id: the seq_id of a member of the cluster
    • representative seq: the amino acid sequence of the cluster representative (representative seq_id)
    • member seq: the amino acid sequence of the cluster member
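
As an illustration of how this file can be used, the sketch below counts the members of each cluster. It assumes the column names listed above and one row per cluster member, with the representative also appearing as a member of its own cluster. The rows in `demo_csv` are toy data, not real FusOn-DB entries, and `cluster_sizes` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def cluster_sizes(rows) -> Counter:
    """Count members per cluster, keyed by the representative's seq_id."""
    return Counter(row["representative seq_id"] for row in rows)


# Toy stand-in for clustering/mmseqs_full_results.csv
demo_csv = io.StringIO(
    "representative seq_id,member seq_id,representative seq,member seq\n"
    "seq1,seq1,MKT,MKT\n"
    "seq1,seq2,MKT,MKA\n"
    "seq3,seq3,GHL,GHL\n"
)
sizes = cluster_sizes(csv.DictReader(demo_csv))
print(sizes)
```

To run it on the real file, open clustering/mmseqs_full_results.csv with `newline=""` and pass the resulting `csv.DictReader` to `cluster_sizes`.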
  4. Run the splitting script:
python split.py

This script will create the following files:

  • splits/train_cluster_split.csv, splits/val_cluster_split.csv, splits/test_cluster_split.csv: The subsets of clustering/mmseqs_full_results.csv that have been partitioned into the train, validation, and test sets respectively.
  • splits/train_df.csv, splits/val_df.csv, splits/test_df.csv: The train, validation, and testing splits used to train FusOn-pLM. Columns: sequence,member length
  • splits/combined_plot.png: plot displaying the composition of the train, validation, and test splits.
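
Because the splits are built from clusters, a useful sanity check is that no cluster ends up in more than one split. The sketch below is not part of split.py. It assumes the *_cluster_split.csv files keep the representative seq_id column of clustering/mmseqs_full_results.csv, and it uses toy cluster IDs in place of the real files. The helper name `shared_clusters` is hypothetical.

```python
def shared_clusters(split_to_reps: dict[str, list[str]]) -> dict[str, set[str]]:
    """Return each cluster representative found in more than one split.

    split_to_reps maps a split name to the representative seq_ids of its rows.
    An empty result means no cluster is shared between splits.
    """
    owners: dict[str, set[str]] = {}
    for split, reps in split_to_reps.items():
        for rep in set(reps):
            owners.setdefault(rep, set()).add(split)
    return {rep: splits for rep, splits in owners.items() if len(splits) > 1}


# Toy example in which cluster "c2" leaks between train and test
leaks = shared_clusters({
    "train": ["c1", "c1", "c2"],
    "val": ["c3"],
    "test": ["c2", "c4"],
})
print(leaks)
```

With the real data, each list would hold the representative seq_id values read from the corresponding splits/*_cluster_split.csv file.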

BLAST

We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the blast folder for more details.