Predictor of Francisella tularensis MICs
Updated: Tue 1 Apr 08:02:46 BST 2025
Trained on the Francisella tularensis, WT accumulator phenotype subset of the human-curated SPARK dataset (9671 rows in total for Francisella tularensis).
Model details
This model was trained using our Duvida framework. It was chosen from a hyperparameter search as the model that performed best on unseen test data (from a scaffold split).
Duvida also saves the training data in this checkpoint, which allows uncertainty metrics to be calculated from that training data.
This model is the best regression model from a hyperparameter search, determined by Pearson's $$r$$ on a held-out test set not used in training or early stopping.
Model architecture
- Regression
{
"dropout": 0.0,
"ensemble_size": 3,
"extra_featurizers": null,
"learning_rate": 1e-05,
"model_class": "ChempropModelBox",
"n_hidden": 3,
"n_units": 8,
"use_2d": false,
"use_fp": true
}
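
The configuration above sets "ensemble_size" to 3. This card does not state how Duvida combines the ensemble members, so the following is only a minimal sketch of one common approach: average the members' predictions per compound and use their spread as a rough disagreement signal. The function name `aggregate_ensemble` and the input layout are hypothetical and are not part of the Duvida API.

```python
import statistics


def aggregate_ensemble(member_predictions):
    """Combine per-member predictions into a mean and a spread per compound.

    member_predictions: one list of predictions per ensemble member,
    all of the same length (hypothetical layout, not Duvida's own).
    """
    # Transpose so that each tuple holds every member's prediction for one compound.
    per_compound = list(zip(*member_predictions))
    means = [statistics.fmean(preds) for preds in per_compound]
    # Population standard deviation across members; 0.0 means full agreement.
    spreads = [statistics.pstdev(preds) for preds in per_compound]
    return means, spreads


# Three members (matching ensemble_size = 3), two compounds, made-up values.
means, spreads = aggregate_ensemble([[1.0, 2.0], [2.0, 2.0], [3.0, 2.0]])
```

A larger spread for a compound suggests that the members disagree, which is often read as lower confidence in that prediction.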
Model usage
You can use this model with:
from duvida.autoclasses import AutoModelBox
modelbox = AutoModelBox.from_pretrained("hf://scbirlab/spark-dv-2503-ftul")
modelbox.predict(filename=..., inputs=[...], columns=[...]) # make predictions on your own data
Training details
- Dataset: SPARK, WT accumulator, Francisella tularensis subset (9671 rows in total for Francisella tularensis)
- Input column: smiles
- Output column: pmic
- Split type: Murcko scaffold
- Split proportions:
  - 70% training (6770 rows)
  - 15% validation (for early stopping) (1450 rows)
  - 15% test (for selecting hyperparameters) (1451 rows)
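
To illustrate what a scaffold split guarantees, here is a minimal sketch of a greedy group split. It assumes the Murcko scaffold of each row has already been computed (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`) and is passed in as a string. The function name `scaffold_split` and the greedy assignment rule are assumptions for illustration; the exact procedure used to build this dataset is not described in this card.

```python
from collections import defaultdict


def scaffold_split(scaffolds, fractions=(0.70, 0.15, 0.15)):
    """Return (train, validation, test) row indices with no shared scaffolds.

    scaffolds: one precomputed scaffold string per row.
    """
    # Group row indices by scaffold so that a scaffold is never divided.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n_rows = len(scaffolds)
    targets = [round(f * n_rows) for f in fractions]
    splits = ([], [], [])
    last = len(targets) - 1
    for _, members in ordered:
        # Put the group in the first split with room; the last split takes overflow.
        for i, target in enumerate(targets):
            if i == last or len(splits[i]) + len(members) <= target:
                splits[i].extend(members)
                break
    return splits


# Toy input: 20 rows over five made-up scaffold labels.
toy = ["a"] * 8 + ["b"] * 6 + ["c"] * 3 + ["d"] * 2 + ["e"]
train, valid, test = scaffold_split(toy)
```

Because whole scaffold groups are assigned together, the resulting proportions are only approximately 70/15/15, which is consistent with the slightly uneven row counts listed above.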
Training log: [figure not available]

Evaluation scores:
Train (6770 rows):
{
"Pearson r": 0.516938771017399,
"RMSE": 0.12588317692279816,
"Spearman rho": 0.6954019742202038
}

Validation (1450 rows):
{
"Pearson r": 0.2963203237123644,
"RMSE": 0.13911816477775574,
"Spearman rho": 0.5852028507710829
}

Test (1451 rows):
{
"Pearson r": 0.3693269511911911,
"RMSE": 0.12272950261831284,
"Spearman rho": 0.6347207507195295
}
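
For readers who want to compute the same three metrics on their own predictions, here is a minimal pure-Python sketch. The function names are hypothetical and this is not necessarily how Duvida computes its scores. Spearman's rho is implemented as Pearson's r on average ranks, which is the standard definition.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def evaluate(y_true, y_pred):
    """Return the metrics under the same keys as the score blocks above."""
    return {
        "Pearson r": pearson_r(y_true, y_pred),
        "RMSE": rmse(y_true, y_pred),
        "Spearman rho": spearman_rho(y_true, y_pred),
    }


# Made-up values: the ordering is preserved but the relationship is not linear.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 10.0])
```

In this toy example the rank order is perfectly preserved, so Spearman's rho is higher than Pearson's r. The scores above show the same ordering of the two metrics in every split, which may (but need not) indicate that the model ranks compounds better than it predicts their values linearly.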

Training data details
The training data were collated by the authors of:
Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell. Shared Platform for Antibiotic Research and Knowledge: A Collaborative Tool to SPARK Antibiotic Discovery. ACS Infectious Diseases 2018, 4 (11), 1536-1539. DOI: 10.1021/acsinfecdis.8b00193
We cleaned the original SPARK dataset to subset the most relevant columns, remove empty values, give succinct column titles, and split by species.
This particular dataset retains only measurements on bacteria with wild-type accumulation phenotypes.
Dataset Sources
- Repository: https://www.collaborativedrug.com/spark-data-downloads
- Paper: https://doi.org/10.1021/acsinfecdis.8b00193
Data Collection and Processing
Data were processed using schemist, a tool for processing chemical datasets.
The SMILES strings were canonicalized. For each species with more than 1000 entries, the data were split by Murcko scaffold into training (70%), validation (15%), and test (15%) sets. Additional features such as molecular weight and topological polar surface area were also calculated.
Who are the source data producers?
Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell