Training Data Curation and Processing
The data folder and its subfolders hold all raw and processed data used to assemble FusOn-DB, as well as all processing scripts. Additional benchmarking datasets can be found in the benchmarking folder.
From raw data to train/val/test splits and head/tail data
This section outlines the pipeline for converting the raw FusionPDB and FOdb datasets into the train/val/test splits used in FusOn-pLM. The process consists of data cleaning, clustering, and splitting. During cleaning, we also extracted data about the heads and tails of each fusion oncoprotein.
data/
├── clustering/
│   ├── input.fasta
│   └── mmseqs_full_results.csv
├── head_tail_data/
│   └── uniprot_idmap_inputs/
├── raw_data/
│   ├── FOdb_all.csv
│   ├── FOdb_puncta.csv
│   ├── FOdb_SD5.csv
│   ├── FusionPDB_cleaned.csv
│   ├── FusionPDB.txt
│   └── gene_to_ensembl_dict.pkl
├── splits/
│   ├── combined_plot.png
│   ├── train_df.csv
│   ├── train_cluster_split.csv
│   ├── val_df.csv
│   ├── val_cluster_split.csv
│   ├── test_df.csv
│   └── test_cluster_split.csv
├── clean.py
├── cluster.py
├── config.py
├── split.py
├── split_vis.py
├── data_cleaning_log.txt
├── clustering_log.txt
├── splitting_log.txt
└── fuson_db.csv
- clean.py: script for cleaning the datasets in raw_data. Print statements in this code produce data_cleaning_log.txt.
- cluster.py: script for clustering the processed data in fuson_db.csv. Print statements in this code produce clustering_log.txt.
- config.py: configs for the cleaning, clustering, and splitting scripts.
- split.py: script for splitting the data, post-clustering. Print statements in this code produce splitting_log.txt.
- split_vis.py: script with code for the plots in splits/combined_plot.png, which describe the content of the train, validation, and test splits (length distribution, Shannon entropy, amino acid frequencies, and cluster sizes).
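Shannon entropy is one of the per-sequence statistics shown in the split plots. As a rough illustration of that quantity, and not necessarily the way split_vis.py computes it, the entropy of a sequence's amino acid distribution could be calculated as below. The function name and the choice of log base 2 (bits) are assumptions made for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue frequencies in one sequence.

    Hypothetical helper for illustration; the base-2 logarithm is an assumption.
    """
    if not seq:
        return 0.0
    counts = Counter(seq)
    total = len(seq)
    # H = -sum(p * log2(p)) over the residues that occur in the sequence
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A sequence made of a single repeated residue has entropy 0, while a sequence with four equally frequent residues has entropy 2 bits.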
Usage
To repeat our cleaning, clustering, and splitting process, proceed as follows.
- Install MMSeqs2 at /*/FusOn-pLM/fuson_plm/mmseqs2 according to these instructions: https://github.com/soedinglab/MMseqs2. Make sure that in config.py, CLUSTER.PATH_TO_MMSEQS points to your mmseqs installation.
- Run the cleaning script:
python clean.py
This script will create the following files:
  - fuson_db.csv: FusOn-DB. Our full database of 44,414 fusion oncoproteins.
  - raw_data/FusionPDB_cleaned.csv: a processed version of the FusionPDB database with the following columns: aa_seq, n_fusiongenes, fusiongenes, cancers, primary_sources, secondary_source.
  - head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt and head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt: all unique Ensembl IDs and gene symbols for all unique head/tail proteins corresponding to any fusion oncoprotein in FusOn-DB. These were submitted to the UniProt ID-mapping tool to create head_tail_data/ensembl_ht_idmap.txt and head_tail_data/genename_ht_idmap.txt, respectively.
  - head_tail_data/uniprot_idmap_inputs/gene_to_ensembl_dict.pkl: a dictionary mapping each unique gene symbol to a comma-separated list of its associated Ensembl IDs, according to FusionPDB.
  - head_tail_data/uniprot_idmap_inputs/htgenes_uniprotids.csv: a file with each unique gene symbol (Gene), a comma-separated list of all associated UniProt IDs (UniProtID), and a concatenated string of 1s and 0s representing whether each ID in the UniProtID column is reviewed or not (Reviewed).
    - For example, a Reviewed value of "100" means the first ID in the UniProtID column of the same row is reviewed (1) and the second and third are not (0).
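The Reviewed encoding described for htgenes_uniprotids.csv can be decoded by pairing each flag character with the ID at the same position. The following is a minimal sketch, assuming the IDs are separated by commas with no embedded commas; the function name and the example IDs are hypothetical.

```python
def pair_ids_with_review_status(uniprot_ids: str, reviewed: str) -> list[tuple[str, bool]]:
    """Pair each UniProt ID with its reviewed flag (True for '1', False for '0').

    Hypothetical helper: `uniprot_ids` is one UniProtID cell and `reviewed`
    is the Reviewed cell from the same row.
    """
    ids = [uid.strip() for uid in uniprot_ids.split(",") if uid.strip()]
    if len(ids) != len(reviewed):
        raise ValueError("Reviewed must contain exactly one flag per UniProt ID")
    return [(uid, flag == "1") for uid, flag in zip(ids, reviewed)]


# Placeholder IDs, not real UniProt accessions.
pairs = pair_ids_with_review_status("ID_A,ID_B,ID_C", "100")
```

If the file is loaded with pandas, reading the Reviewed column as a string (for example with dtype={"Reviewed": str}) avoids values such as "011" being parsed as the integer 11 and losing their leading zero.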
- Run the clustering script:
python cluster.py
The command entered by this script to the clustering software is:
mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.8 --cov-mode 0
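For reference, the same command can be assembled and launched from Python. This is only a sketch of how such a call might look, not the code in cluster.py: the function names and default arguments are assumptions, and the flag values are copied from the command above. In MMseqs2, --min-seq-id sets the minimum sequence identity, -c sets the minimum alignment coverage, and --cov-mode 0 applies that coverage threshold to both query and target.

```python
import subprocess


def build_mmseqs_command(
    mmseqs_path: str = "mmseqs",  # assumed to be on PATH; otherwise pass the full path
    input_fasta: str = "clustering/input.fasta",
    output_prefix: str = "clustering/raw_output/mmseqs",
    tmp_dir: str = "clustering/raw_output",
    min_seq_id: float = 0.3,
    coverage: float = 0.8,
    cov_mode: int = 0,
) -> list[str]:
    """Build the argument list for `mmseqs easy-cluster` (hypothetical helper)."""
    return [
        mmseqs_path, "easy-cluster", input_fasta, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
    ]


def run_clustering(command: list[str]) -> None:
    """Run MMseqs2 and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

Calling run_clustering(build_mmseqs_command()) from the data folder would execute the command shown above, provided MMseqs2 is installed.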
The cluster.py script clusters all sequences of length 2000 or shorter (see config.py) and creates the following files:
  - clustering/input.fasta: the input file used by MMSeqs2 to cluster the fusion oncoprotein sequences. Headers are our assigned sequence IDs (found in the seq_id column of fuson_db.csv).
  - clustering/mmseqs_full_results.csv: clustering results. Columns:
    - representative seq_id: the seq_id of the sequence representing this cluster
    - member seq_id: the seq_id of a member of the cluster
    - representative seq: the amino acid sequence of the cluster representative (representative seq_id)
    - member seq: the amino acid sequence of the cluster member
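Because each row of clustering/mmseqs_full_results.csv links one member to its cluster representative, cluster sizes can be recovered by counting rows per representative. The sketch below uses only the standard library and a small in-memory stand-in for the file; the helper name, the example IDs, and the example sequences are made up, and it assumes the column headers are spelled exactly as listed above.

```python
import csv
import io
from collections import Counter


def cluster_sizes(csv_file) -> Counter:
    """Count members per cluster, keyed by the representative's seq_id."""
    reader = csv.DictReader(csv_file)
    return Counter(row["representative seq_id"] for row in reader)


# In-memory stand-in for clustering/mmseqs_full_results.csv (made-up rows).
example = io.StringIO(
    "representative seq_id,member seq_id,representative seq,member seq\n"
    "seq1,seq1,MKT,MKT\n"
    "seq1,seq2,MKT,MKS\n"
    "seq3,seq3,GGA,GGA\n"
)
sizes = cluster_sizes(example)
```

To run this on the real file, an open file handle can be passed instead, for example open("clustering/mmseqs_full_results.csv", newline="").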
- Run the splitting script:
python split.py
This script will create the following files:
  - splits/train_cluster_split.csv, splits/val_cluster_split.csv, splits/test_cluster_split.csv: the subsets of clustering/mmseqs_full_results.csv that have been partitioned into the train, validation, and test sets, respectively.
  - splits/train_df.csv, splits/val_df.csv, splits/test_df.csv: the train, validation, and test splits used to train FusOn-pLM. Columns: sequence, member length.
  - splits/combined_plot.png: plot displaying the composition of the train, validation, and test splits.
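One sanity check that may be useful after splitting is to confirm that no cluster appears in more than one split. The sketch below assumes that split.py assigns whole clusters to a single split, which this README implies but does not state outright, and that the *_cluster_split.csv files keep the column headers of clustering/mmseqs_full_results.csv. The helper names and the in-memory example rows are hypothetical.

```python
import csv
import io


def representative_ids(csv_file) -> set[str]:
    """Collect the cluster representatives listed in one *_cluster_split.csv file."""
    return {row["representative seq_id"] for row in csv.DictReader(csv_file)}


def shared_clusters(train: set[str], val: set[str], test: set[str]) -> set[str]:
    """Return cluster representatives that occur in more than one split."""
    return (train & val) | (train & test) | (val & test)


# In-memory stand-ins for the three split files (made-up IDs and sequences).
header = "representative seq_id,member seq_id,representative seq,member seq\n"
train = representative_ids(io.StringIO(header + "seq1,seq1,MKT,MKT\nseq1,seq2,MKT,MKS\n"))
val = representative_ids(io.StringIO(header + "seq3,seq3,GGA,GGA\n"))
test = representative_ids(io.StringIO(header + "seq4,seq4,PLV,PLV\n"))
leaked = shared_clusters(train, val, test)
```

An empty result means the three splits share no cluster; any returned ID identifies a cluster whose members ended up in more than one split.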
BLAST
We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the blast folder for more details.