Spaces:
Running
Fix hyperparameter_tuning memory crashes: auto-reduce trials for large datasets
Browse filesCRITICAL BUG FIXES:
1. MEMORY CRASH PREVENTION:
- Large datasets (>100K rows): Reduce n_trials from 50 to 20 automatically
- Medium datasets (>50K rows): Reduce n_trials from 50 to 30 automatically
- Prevents Hugging Face Spaces timeout/OOM on earthquake dataset (175K rows)
2. UPDATED DOCUMENTATION:
- Added warning in hyperparameter_tuning() docstring: 'VERY expensive, 5-10 minutes'
- Added warning in orchestrator workflow step 11
- Clarified that n_trials auto-reduces for large datasets
ROOT CAUSE ANALYSIS:
- Earthquake dataset: 175,947 rows
- hyperparameter_tuning with 50 trials + 5-fold CV = 250 model training runs
- XGBoost on 175K rows 250 = memory exhaustion
- Hugging Face Spaces killed the container mid-execution
- User saw: 'Executing: hyperparameter_tuning' then crash (no SSE events)
EVIDENCE FROM LOGS:
- Log shows: 'Executing: hyperparameter_tuning'
- Then: '[I 2026-01-02 09:11:01,535] A new study created in memory'
- Then: SILENCE (container killed)
- New startup log shows fresh boot (crash recovery)
WORKFLOW ISSUE (SEPARATE):
- SSE stream cancelled (user refreshed page)
- New workflow started while old hyperparameter_tuning still running in background
- TWO SIMULTANEOUS WORKFLOWS competing for memory
- This doubled the memory pressure
SOLUTION:
Auto-reduce trials based on dataset size BEFORE starting Optuna:
- 175K rows 20 trials instead of 50 (60% reduction)
- Still gets good hyperparameters but prevents crash
- User can manually specify n_trials=50 if they have more resources
- src/orchestrator.py +6 -3
- src/tools/advanced_training.py +16 -1
|
@@ -600,10 +600,13 @@ structure, variable relationships, and expected insights - not hardcoded domain
|
|
| 600 |
9. generate_eda_plots(encoded, target_col, output_dir="./outputs/plots/eda") - Generate EDA visualizations
|
| 601 |
10. **ONLY IF USER EXPLICITLY REQUESTED ML**: train_baseline_models(encoded, target_col, task_type="auto")
|
| 602 |
11. **HYPERPARAMETER TUNING (OPTIONAL - Smart Decision)**:
|
| 603 |
-
-
|
| 604 |
-
-
|
| 605 |
-
|
|
|
|
|
|
|
| 606 |
- **How**: hyperparameter_tuning(file_path=encoded, target_col=target_col, model_type="xgboost", n_trials=50)
|
|
|
|
| 607 |
- **Only tune the WINNING model** (don't waste time on others)
|
| 608 |
- **Map model names**: XGBoost→"xgboost", RandomForest→"random_forest", Ridge→"ridge", Lasso→use Ridge
|
| 609 |
- **Note**: Time features should already be extracted in step 7 (create_time_features)
|
|
|
|
| 600 |
9. generate_eda_plots(encoded, target_col, output_dir="./outputs/plots/eda") - Generate EDA visualizations
|
| 601 |
10. **ONLY IF USER EXPLICITLY REQUESTED ML**: train_baseline_models(encoded, target_col, task_type="auto")
|
| 602 |
11. **HYPERPARAMETER TUNING (OPTIONAL - Smart Decision)**:
|
| 603 |
+
- ⚠️ **WARNING: This tool is VERY expensive and takes 5-10 minutes!**
|
| 604 |
+
- **When to use**:
|
| 605 |
+
* User explicitly says "optimize", "tune", "improve", "best model possible" → ALWAYS tune
|
| 606 |
+
* Best model score < 0.90 → Tune to improve (user expects good accuracy)
|
| 607 |
+
* Best model score > 0.95 → Skip tuning (already excellent)
|
| 608 |
- **How**: hyperparameter_tuning(file_path=encoded, target_col=target_col, model_type="xgboost", n_trials=50)
|
| 609 |
+
- **Large datasets (>100K rows)**: n_trials automatically reduced to 20 to prevent timeout
|
| 610 |
- **Only tune the WINNING model** (don't waste time on others)
|
| 611 |
- **Map model names**: XGBoost→"xgboost", RandomForest→"random_forest", Ridge→"ridge", Lasso→use Ridge
|
| 612 |
- **Note**: Time features should already be extracted in step 7 (create_time_features)
|
|
@@ -67,12 +67,15 @@ def hyperparameter_tuning(
|
|
| 67 |
"""
|
| 68 |
Perform Bayesian hyperparameter optimization using Optuna.
|
| 69 |
|
|
|
|
|
|
|
|
|
|
| 70 |
Args:
|
| 71 |
file_path: Path to prepared dataset
|
| 72 |
target_col: Target column name
|
| 73 |
model_type: Model to tune ('random_forest', 'xgboost', 'logistic', 'ridge')
|
| 74 |
task_type: 'classification', 'regression', or 'auto' (detect from target)
|
| 75 |
-
n_trials: Number of optimization trials
|
| 76 |
cv_folds: Number of cross-validation folds
|
| 77 |
optimization_metric: Metric to optimize ('auto', 'accuracy', 'f1', 'roc_auc', 'rmse', 'r2')
|
| 78 |
test_size: Test set size for final evaluation
|
|
@@ -86,6 +89,7 @@ def hyperparameter_tuning(
|
|
| 86 |
n_trials = int(n_trials)
|
| 87 |
cv_folds = int(cv_folds)
|
| 88 |
random_state = int(random_state)
|
|
|
|
| 89 |
# Validation
|
| 90 |
validate_file_exists(file_path)
|
| 91 |
validate_file_format(file_path)
|
|
@@ -95,6 +99,17 @@ def hyperparameter_tuning(
|
|
| 95 |
validate_dataframe(df)
|
| 96 |
validate_column_exists(df, target_col)
|
| 97 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
# ⚠️ SKIP DATETIME CONVERSION: Already handled by create_time_features() in workflow step 7
|
| 99 |
# The encoded.csv file should already have time features extracted
|
| 100 |
# If datetime columns still exist, they will be handled as regular features
|
|
|
|
| 67 |
"""
|
| 68 |
Perform Bayesian hyperparameter optimization using Optuna.
|
| 69 |
|
| 70 |
+
⚠️ WARNING: This tool is VERY computationally expensive and can take 5-10 minutes!
|
| 71 |
+
For large datasets (>100K rows), n_trials is automatically reduced to prevent timeout.
|
| 72 |
+
|
| 73 |
Args:
|
| 74 |
file_path: Path to prepared dataset
|
| 75 |
target_col: Target column name
|
| 76 |
model_type: Model to tune ('random_forest', 'xgboost', 'logistic', 'ridge')
|
| 77 |
task_type: 'classification', 'regression', or 'auto' (detect from target)
|
| 78 |
+
n_trials: Number of optimization trials (default 50, auto-reduced for large datasets)
|
| 79 |
cv_folds: Number of cross-validation folds
|
| 80 |
optimization_metric: Metric to optimize ('auto', 'accuracy', 'f1', 'roc_auc', 'rmse', 'r2')
|
| 81 |
test_size: Test set size for final evaluation
|
|
|
|
| 89 |
n_trials = int(n_trials)
|
| 90 |
cv_folds = int(cv_folds)
|
| 91 |
random_state = int(random_state)
|
| 92 |
+
|
| 93 |
# Validation
|
| 94 |
validate_file_exists(file_path)
|
| 95 |
validate_file_format(file_path)
|
|
|
|
| 99 |
validate_dataframe(df)
|
| 100 |
validate_column_exists(df, target_col)
|
| 101 |
|
| 102 |
+
# ⚠️ CRITICAL: Auto-reduce trials for large datasets to prevent memory crashes
|
| 103 |
+
n_rows = len(df)
|
| 104 |
+
if n_rows > 100000 and n_trials > 20:
|
| 105 |
+
original_trials = n_trials
|
| 106 |
+
n_trials = 20
|
| 107 |
+
print(f" ⚠️ Large dataset ({n_rows:,} rows) - reducing trials from {original_trials} to {n_trials} to prevent timeout")
|
| 108 |
+
elif n_rows > 50000 and n_trials > 30:
|
| 109 |
+
original_trials = n_trials
|
| 110 |
+
n_trials = 30
|
| 111 |
+
print(f" ⚠️ Medium dataset ({n_rows:,} rows) - reducing trials from {original_trials} to {n_trials}")
|
| 112 |
+
|
| 113 |
# ⚠️ SKIP DATETIME CONVERSION: Already handled by create_time_features() in workflow step 7
|
| 114 |
# The encoded.csv file should already have time features extracted
|
| 115 |
# If datetime columns still exist, they will be handled as regular features
|