Exoplanet Classification using NASA Datasets (Kepler, K2, TESS) — README
This README explains how we trained a mission-agnostic classifier to predict whether a detected object is CONFIRMED, a CANDIDATE, or a FALSE POSITIVE, using public catalogs from Kepler, K2, and TESS. It documents data ingestion, harmonization, feature engineering (including the SNR logic), preprocessing, model selection, evaluation, class-imbalance handling, explainability, and the inference artifacts.
Project Structure (key files)
```
.
├─ Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb   # main notebook
└─ artifacts/
   ├─ exoplanet_best_model.joblib        # full sklearn Pipeline (preprocessing + estimator)
   ├─ exoplanet_feature_columns.json     # exact feature order used during training
   ├─ exoplanet_class_labels.json        # label names in prediction index order
   └─ exoplanet_metadata.json            # summary (best model name, n_features, timestamp)
```
Datasets
- Kepler Cumulative: `cumulative_YYYY.MM.DD_HH.MM.SS.csv`
- K2 Planets & Candidates: `k2pandc_YYYY.MM.DD_HH.MM.SS.csv`
- TESS Objects of Interest (TOI): `TOI_YYYY.MM.DD_HH.MM.SS.csv`
Each table contains per-candidate features and a mission-specific disposition field. These missions are complementary (different cadences/systems), so merging increases coverage and diversity.
Environment
- Python ≥ 3.9
- Libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `imblearn`, `joblib`; optional: `xgboost`, `shap`.
  If `xgboost` or `shap` aren't installed, the notebook automatically skips those steps.
Reproducibility
- Random seeds set to 42 in model splits and estimators.
- Stratified splits used for consistent class composition.
- The exact feature order and class label order are saved to JSON and re-used at inference.
Data Loading (robust, no low_memory)
CSV parsing is robust:
- Try multiple separators and engines (prefers the Python engine with `on_bad_lines="skip"`).
- Normalize column names to `snake_case`.
- Re-detect the header if the first read looks suspicious (e.g., numeric column names).
- Drop columns that are entirely empty.

We deliberately do not pass `low_memory` (it is unsupported by the Python engine and can degrade type inference).
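A minimal sketch of this loading strategy, assuming only the behavior described above (the helper name `robust_read_csv` is illustrative, and header re-detection is omitted for brevity):

```python
import pandas as pd

def robust_read_csv(path):
    """Try common separators with the Python engine and keep the widest parse."""
    best = None
    for sep in [",", ";", "\t", "|"]:
        try:
            df = pd.read_csv(path, sep=sep, engine="python", on_bad_lines="skip")
        except Exception:
            continue  # try the next separator
        if best is None or df.shape[1] > best.shape[1]:
            best = df
    if best is None:
        raise ValueError(f"Could not parse {path}")
    # Normalize column names to snake_case and drop fully empty columns
    best.columns = (best.columns.str.strip().str.lower()
                        .str.replace(r"[^\w]+", "_", regex=True).str.strip("_"))
    return best.dropna(axis=1, how="all")
```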
Label Harmonization
Different catalogs use different labels. We map all to a unified set:
- CONFIRMED: `CONFIRMED`, `CP`, `KP`
- CANDIDATE: `CANDIDATE`, `PC`, `CAND`, `APC`
- FALSE POSITIVE: `FALSE POSITIVE`, `FP`, `FA`, `REFUTED`
Rows without a recognized disposition are dropped before modeling.
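A sketch of the mapping step, assuming each catalog's disposition column name is passed in as `disposition_col` (illustrative helper):

```python
import pandas as pd

LABEL_MAP = {
    "CONFIRMED": "CONFIRMED", "CP": "CONFIRMED", "KP": "CONFIRMED",
    "CANDIDATE": "CANDIDATE", "PC": "CANDIDATE", "CAND": "CANDIDATE", "APC": "CANDIDATE",
    "FALSE POSITIVE": "FALSE POSITIVE", "FP": "FALSE POSITIVE",
    "FA": "FALSE POSITIVE", "REFUTED": "FALSE POSITIVE",
}

def harmonize_labels(df: pd.DataFrame, disposition_col: str) -> pd.DataFrame:
    # Normalize to upper case before mapping; unrecognized values become NaN and are dropped
    labels = df[disposition_col].astype(str).str.strip().str.upper().map(LABEL_MAP)
    return df.assign(label=labels).dropna(subset=["label"])
```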
Feature Harmonization (mission-agnostic)
We build a canonical numeric feature set by scanning a list of alias groups and picking the first numeric-coercible column present per mission, e.g.:
- Period: `['koi_period','orbital_period','period','pl_orbper','per']`
- Duration: `['koi_duration','duration','tran_dur']`
- Depth: `['koi_depth','depth','tran_depth']`
- Stellar context: `['koi_steff','st_teff','teff']`, `['koi_slogg','st_logg','logg']`, `['koi_smet','st_metfe','feh']`, `['koi_srad','st_rad','star_radius']`
- Transit geometry: `['koi_impact','impact','b']`, `['koi_sma','sma','a_rs','semi_major_axis']`
- SNR/MES (broad): `['koi_snr','koi_model_snr','koi_mes','mes','max_mes','snr','detection_snr','transit_snr','signal_to_noise','koi_max_snr']`
We coerce each candidate column to numeric (`errors='coerce'`) and accept it if any non-null values remain.
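A sketch of this resolution step (the function name `resolve_alias` is illustrative):

```python
from typing import List, Optional
import pandas as pd

def resolve_alias(df: pd.DataFrame, aliases: List[str]) -> Optional[pd.Series]:
    """Return the first alias column that yields any numeric values, else None."""
    for name in aliases:
        if name in df.columns:
            values = pd.to_numeric(df[name], errors="coerce")
            if values.notna().any():
                return values
    return None

# e.g. period = resolve_alias(toi_df, ['koi_period','orbital_period','period','pl_orbper','per'])
```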
Feature Engineering
Physics-motivated features improve separability:
- Duty cycle: `duty_cycle = koi_duration / (koi_period * 24.0)` (transit duration relative to orbital period)
- Log transforms: `log_koi_period = log10(koi_period)`, `log_koi_depth = log10(koi_depth)`
- Equilibrium-temperature proxy:
  - Simple: `teq_proxy = koi_steff`
  - Refined (if `a/R*` is available or derivable): `teq_proxy = koi_steff / sqrt(2 * a_rs)`
- SNR logic (guaranteed SNR-like feature):
  - If any mission exposes a usable SNR/MES, we use `koi_snr` and also compute `log_koi_snr`.
  - Otherwise we compute a mission-agnostic proxy: `snr_proxy = koi_depth * sqrt(koi_duration / (koi_period * 24.0))` and `log_snr_proxy = log10(snr_proxy)`.
The exact features used in training are whatever appear in `artifacts/exoplanet_feature_columns.json` (this list is created dynamically from the actual files you load).
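A sketch of the derivation, assuming the harmonized table already uses the canonical column names above (`koi_period` in days, `koi_duration` in hours):

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Transit duration (hours) relative to the orbital period (days -> hours)
    out["duty_cycle"] = out["koi_duration"] / (out["koi_period"] * 24.0)
    out["log_koi_period"] = np.log10(out["koi_period"])
    out["log_koi_depth"] = np.log10(out["koi_depth"])
    out["teq_proxy"] = out["koi_steff"]  # refined variant: koi_steff / sqrt(2 * a_rs)
    if "koi_snr" in out.columns:
        out["log_koi_snr"] = np.log10(out["koi_snr"])
    else:
        # Mission-agnostic detection-strength proxy
        out["snr_proxy"] = out["koi_depth"] * np.sqrt(out["duty_cycle"])
        out["log_snr_proxy"] = np.log10(out["snr_proxy"])
    return out
```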
Mission-Agnostic Policy
- We keep a temporary `mission` column only for auditing, then drop it before training.
- Features are derived from physical quantities, not from mission identity.
Preprocessing & Split
- Keep numeric columns only.
- Imputation: median (`SimpleImputer(strategy="median")`).
- Scaling: `RobustScaler()` (robust to outliers).
- Label encoding: `LabelEncoder` → integer target (needed by XGBoost).
- Split: `train_test_split(test_size=0.2, stratify=y, random_state=42)`.
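A sketch of the split, assuming `df` is the harmonized table with a `label` column (imputation and scaling happen inside the model pipeline shown in the next section):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = df.drop(columns=["label"]).select_dtypes(include="number")  # numeric columns only
le = LabelEncoder()
y = le.fit_transform(df["label"])  # integer targets (needed by XGBoost)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```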
Model Selection & Training
We wrap preprocessing and the classifier in a single `Pipeline`, and optimize macro-F1 with `RandomizedSearchCV` using `StratifiedKFold`:
- Random Forest (`class_weight="balanced"`): `n_estimators` 150–600 (fast mode: 150/300), `max_depth` None or small integers, `min_samples_split`, `min_samples_leaf`
- SVM (RBF) (`class_weight="balanced"`): `C` logspace grid, `gamma` in `["scale", "auto"]`
- XGBoost (optional; only if installed): `objective="multi:softprob"`, `eval_metric="mlogloss"`, `tree_method="hist"`; grid over `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `n_estimators`

We use a `FAST` flag (defaults to `True`) to keep grids small for iteration speed. Disabling `FAST` expands the search.
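A sketch of the Random Forest branch (the grid shown is an illustrative FAST-style grid, not the notebook's exact one):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

param_distributions = {
    "clf__n_estimators": [150, 300],        # expands toward 600 when FAST=False
    "clf__max_depth": [None, 8, 16],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42, n_jobs=-1,
)
search.fit(X_train, y_train)
```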
Evaluation
- Primary metric: Macro-F1 (treats classes evenly).
- Also report accuracy and per-class precision/recall/F1.
- Confusion matrices (shared color scale across models).
- ROC AUC (one-vs-rest) when probability estimates are available.
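A sketch of the evaluation step, assuming the fitted `search` and the `LabelEncoder` `le` from the sketches above:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score, roc_auc_score

best = search.best_estimator_
y_pred = best.predict(X_test)
print("Macro-F1:", f1_score(y_test, y_pred, average="macro"))
print(classification_report(y_test, y_pred, target_names=le.classes_))
print(confusion_matrix(y_test, y_pred))

# One-vs-rest ROC AUC, only when the model exposes probabilities
if hasattr(best, "predict_proba"):
    print("OvR ROC AUC:", roc_auc_score(y_test, best.predict_proba(X_test), multi_class="ovr"))
```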
Handling Class Imbalance
We provide an optional SMOTE section using `imblearn.Pipeline`:

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier

smote_pipe = ImbPipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),  # or SVC(probability=True)
])
```

We avoid nesting sklearn Pipelines inside the imblearn pipeline to prevent:
`TypeError: All intermediate steps of the chain should not be Pipelines.`
The SMOTE models are tuned on macro-F1 and compared against the non-SMOTE runs.
Explainability
- Permutation importance computed on the full pipeline (so it reflects preprocessing).
- SHAP (if installed) for tree-based models:
- Global bar chart of mean |SHAP| across classes
- Per-class bar chart (e.g., CONFIRMED)
- SHAP summary bar (optional)
If the best model isn’t tree-based, SHAP is skipped automatically.
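A sketch of the permutation-importance step on the full pipeline (names reused from the sketches above):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Score drops are computed through the whole pipeline, so preprocessing is reflected
result = permutation_importance(
    search.best_estimator_, X_test, y_test,
    scoring="f1_macro", n_repeats=10, random_state=42,
)
importances = pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(importances.head(15))
```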
Saved Artifacts
After selecting the best model (highest macro-F1 on the held-out test set), we serialize:
- `artifacts/exoplanet_best_model.joblib` – the full sklearn Pipeline
- `artifacts/exoplanet_feature_columns.json` – exact feature order expected at inference
- `artifacts/exoplanet_class_labels.json` – class names in output index order
- `artifacts/exoplanet_metadata.json` – name, n_features, labels, timestamp
These provide stable inference even if the notebook or environment changes later.
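A sketch of the serialization step (the metadata keys here are illustrative; the notebook's exact schema may differ):

```python
import json
from datetime import datetime, timezone
from pathlib import Path
import joblib

ARTIFACTS_DIR = Path("artifacts")
ARTIFACTS_DIR.mkdir(exist_ok=True)

joblib.dump(search.best_estimator_, ARTIFACTS_DIR / "exoplanet_best_model.joblib")
(ARTIFACTS_DIR / "exoplanet_feature_columns.json").write_text(json.dumps(list(X_train.columns)))
(ARTIFACTS_DIR / "exoplanet_class_labels.json").write_text(json.dumps(list(le.classes_)))
(ARTIFACTS_DIR / "exoplanet_metadata.json").write_text(json.dumps({
    "best_model": type(search.best_estimator_[-1]).__name__,
    "n_features": X_train.shape[1],
    "labels": list(le.classes_),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}))
```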
Inference Usage
We include a helper that builds an input row with the exact training feature order, warns about unknown/missing keys, and prints predictions and class probabilities:
```python
from pathlib import Path
import json
import joblib
import numpy as np
import pandas as pd

ARTIFACTS_DIR = Path("artifacts")
model = joblib.load(ARTIFACTS_DIR / "exoplanet_best_model.joblib")
feature_columns = json.loads((ARTIFACTS_DIR / "exoplanet_feature_columns.json").read_text())
class_labels = json.loads((ARTIFACTS_DIR / "exoplanet_class_labels.json").read_text())

def predict_with_debug(model, feature_columns, class_labels, params: dict):
    # Build a single-row frame in the exact training feature order;
    # unknown keys are dropped and missing ones become NaN (filled by the imputer).
    X = pd.DataFrame([params], dtype=float).reindex(columns=feature_columns)
    y_idx = int(model.predict(X)[0])
    label = class_labels[y_idx]
    print("Prediction:", label)
    try:
        proba = model.predict_proba(X)[0]
        for lbl, p in sorted(zip(class_labels, proba), key=lambda t: t[1], reverse=True):
            print(f"  {lbl:>15s}: {p:.3f}")
    except Exception:
        pass  # estimator may not expose predict_proba (e.g., SVC without probability=True)
    return label
```
Engineered features to compute in your inputs
- `duty_cycle = koi_duration / (koi_period * 24.0)`
- `log_koi_period = log10(koi_period)`
- `log_koi_depth = log10(koi_depth)`
- `teq_proxy = koi_steff` (or `koi_steff / sqrt(2 * a_rs)` if using the refined variant)
- If the trained feature list includes `koi_snr`: also send `log_koi_snr = log10(koi_snr)`
- Else, if it includes `snr_proxy`: send `snr_proxy = koi_depth * sqrt(koi_duration / (koi_period*24.0))` and `log_snr_proxy`

You can check which SNR flavor was used by inspecting `exoplanet_feature_columns.json`.
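A usage example with purely illustrative values (Kepler-style column names; which keys your model actually expects depends on the trained feature list):

```python
import numpy as np

# Hypothetical candidate; all values are illustrative only
params = {
    "koi_period": 12.3,                  # days
    "koi_duration": 3.1,                 # hours
    "koi_depth": 450.0,                  # ppm
    "koi_steff": 5600.0,                 # K
    "duty_cycle": 3.1 / (12.3 * 24.0),
    "log_koi_period": np.log10(12.3),
    "log_koi_depth": np.log10(450.0),
    "teq_proxy": 5600.0,
}
predict_with_debug(model, feature_columns, class_labels, params)
```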
Design Choices & Rationale
- Mission-agnostic: we drop mission identity and rely on physical features to generalize.
- Macro-F1: a more balanced assessment when class counts differ (especially for FALSE POSITIVE).
- `RobustScaler`: less sensitive to outliers than `StandardScaler`.
- `LabelEncoder`: ensures XGBoost gets integers (fixes "Invalid classes inferred" errors).
- SNR strategy: if SNR/MES exists we use it; otherwise we inject a physics-motivated proxy to preserve detection-strength information across missions.
Troubleshooting
- CSV ParserError / "python engine" warnings: we avoid `low_memory` and use the Python engine with `on_bad_lines="skip"`. The loader tries alternate separators and re-tries with a detected header line.
- SMOTE "intermediate steps should not be Pipelines": we use a single `imblearn.Pipeline` with inline steps (impute → scale → SMOTE → clf).
- `PicklingError: QuantileCapper`: we removed custom transformers and rely on standard sklearn components to ensure picklability.
- `AttributeError: 'numpy.ndarray' has no attribute 'columns'`: we compute permutation importance on the pipeline (not the bare estimator) and take feature names from `X_train.columns`.
- XGBoost "Invalid classes inferred": we label-encode `y` to integers (`LabelEncoder`) before fitting.
- Different inputs return the "same" class: often caused by (a) too few features provided (most get imputed), (b) features outside training ranges (scaled to similar z-scores), or (c) providing keys not used by the model. Use `predict_with_debug` to see recognized/unknown/missing features and adjust your dict.
How to Re-train
- Open the notebook `Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb`.
- Ensure the three CSV files are available at the paths set in the Data Loading cell.
- (Optional) Set `FAST = False` in the training cell to run a broader hyper-parameter search.
- Run all cells. Artifacts are written to `./artifacts/`.
Extending the Work
- Add more SNR/MES name variants if future catalogs introduce new headers.
- Experiment with calibrated probabilities (e.g., `CalibratedClassifierCV`) for better ROC behavior.
- Try additional models (LightGBM/CatBoost) if available.
- Incorporate vetting-pipeline shape features (e.g., V-shape metrics) when harmonizable across missions.
- Add temporal cross-validation per mission or per release to stress test generalization.