ChatterjeeLab/FusOn-pLM

FusOn-pLM/fuson_plm/data/config.py

svincoff

uploading data folder

1e6a1f0 about 1 month ago

2.21 kB

	# config.py
	from fuson_plm.utils.logging import CustomParams

	CLEAN = CustomParams(
	    ### Changing these parameters is not recommended
	    FODB_PATH = '../data/raw_data/FOdb_all.csv', # path to raw FOdb database
	    FODB_PUNCTA_PATH = '../data/raw_data/FOdb_puncta.csv', # path to raw FOdb puncta experimental data
	    FUSIONPDB_PATH = '../data/raw_data/FusionPDB.txt', # path to raw FusionPDB Level 1 .txt download
	)

	# Clustering Parameters
	CLUSTER = CustomParams(
	    MAX_SEQ_LENGTH = 2000, # INCLUSIVE max length (amino acids) of a sequence for training, validation, or testing

	    # MMseqs2 parameters: see GitHub or MMseqs2 Wiki for guidance
	    MIN_SEQ_ID = 0.3, # minimum sequence identity, as a fraction (0.3 = 30%)
	    C = 0.8, # minimum coverage, as a fraction of sequence length (0.8 = 80%)
	    COV_MODE = 0, # cov-mode: 0 = bidirectional, 1 = target coverage, 2 = query coverage, 3 = target-in-query length coverage.

	    # File paths
	    INPUT_PATH = '../data/fuson_db.csv',
	    PATH_TO_MMSEQS = '../mmseqs' # path to where you installed MMseqs2
	)
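
The three MMseqs2 values above correspond to the command-line flags `--min-seq-id`, `-c`, and `--cov-mode`. The sketch below shows one way they could be assembled into an `mmseqs easy-cluster` call. It is an illustration under assumptions, not the repository's own clustering code: the helper name `build_mmseqs_command`, the FASTA file name, the output prefix, and the temporary directory are all hypothetical, and the repository may use a different MMseqs2 workflow or convert `INPUT_PATH` to FASTA differently.

```python
def build_mmseqs_command(mmseqs_bin, fasta_path, output_prefix, tmp_dir,
                         min_seq_id=0.3, c=0.8, cov_mode=0):
    """Return an argv list for `mmseqs easy-cluster` (hypothetical helper).

    Defaults mirror MIN_SEQ_ID, C, and COV_MODE from the CLUSTER config.
    """
    return [
        mmseqs_bin, "easy-cluster", fasta_path, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),   # minimum sequence identity (fraction)
        "-c", str(c),                      # minimum alignment coverage (fraction)
        "--cov-mode", str(cov_mode),       # 0 = coverage of both query and target
    ]


# Placeholder paths for illustration only.
cmd = build_mmseqs_command("mmseqs", "fuson_db.fasta", "clusterRes", "tmp")
print(" ".join(cmd))
# mmseqs easy-cluster fuson_db.fasta clusterRes tmp --min-seq-id 0.3 -c 0.8 --cov-mode 0
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once MMseqs2 is installed and the input FASTA exists.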

	# Splitting Parameters
	# We randomly split clusters in two rounds to arrive at a Train, Validation, and Test set.
	# Round 1) All clusters -> Train (final) and Other (temp). Round 2) Other (temp) clusters -> Val (final) and Test (final)
	SPLIT = CustomParams(
	    FUSON_DB_PATH = '../data/fuson_db.csv',
	    CLUSTER_OUTPUT_PATH = '../data/clustering/mmseqs_full_results.csv',
	    RANDOM_STATE_1 = 2, # random_state_1 = state for splitting all data into train & other
	    TEST_SIZE_1 = 0.18, # test size for the all -> train/other split. e.g. 0.20 means 80% of clusters in train, 20% in other
	    RANDOM_STATE_2 = 6, # random_state_2 = state for splitting other from ^ into val and test
	    TEST_SIZE_2 = 0.44 # test size for the other -> val/test split. e.g. 0.50 means 50% of other clusters in val, 50% in test
	)
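
To make the two-round split concrete, here is a minimal sketch of how the SPLIT values could translate into final proportions. It assumes `TEST_SIZE_2` is the fraction of "other" clusters sent to the test set, as the parameter name suggests. It uses Python's `random` module for the shuffling, not the repository's actual splitting code, so the specific cluster assignments will not match the real splits. The function name `split_clusters` and the cluster IDs are hypothetical.

```python
import random


def split_clusters(cluster_ids, test_size_1=0.18, random_state_1=2,
                   test_size_2=0.44, random_state_2=6):
    """Two-round random split of cluster IDs into (train, val, test)."""

    def one_round(ids, held_out_fraction, seed):
        ids = sorted(ids)                      # fixed starting order for reproducibility
        random.Random(seed).shuffle(ids)       # seeded shuffle, independent of global state
        n_held_out = round(len(ids) * held_out_fraction)
        return ids[n_held_out:], ids[:n_held_out]   # (kept, held out)

    # Round 1: all clusters -> train (final) and other (temporary)
    train, other = one_round(cluster_ids, test_size_1, random_state_1)
    # Round 2: other -> val (final) and test (final)
    val, test = one_round(other, test_size_2, random_state_2)
    return train, val, test


clusters = [f"cluster_{i}" for i in range(100)]   # hypothetical cluster IDs
train, val, test = split_clusters(clusters)
print(len(train), len(val), len(test))
# 82 10 8
```

With the configured values, about 82% of clusters land in train, 0.18 × 0.56 ≈ 10% in validation, and 0.18 × 0.44 ≈ 8% in test. Because the split is over clusters, not individual sequences, the sequence-level proportions depend on cluster sizes.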