BerkIGuler committed on
Commit 4fa78a3 · 1 Parent(s): 71dbdc8

new README.md and Trainer design

Files changed (6)
  1. README.md +432 -2
  2. requirements.txt +5 -2
  3. src/main/parser.py +145 -36
  4. src/main/train_helpers.py +0 -265
  5. src/main/trainer.py +472 -110
  6. src/utils.py +0 -17
README.md CHANGED
@@ -1,7 +1,437 @@
1
- # Official implementation of [AdaFortiTran: An Adaptive Transformer Model for Robust OFDM Channel Estimation](https://arxiv.org/abs/2505.09076) accepted at ICC 2025, Montreal, Canada.
2
 
 
 
 
3
 
4
- ## License
5
 
6
  This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
7
 
 
1
+ # AdaFortiTran: Adaptive Transformer Model for Robust OFDM Channel Estimation
2
 
3
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
4
+ [![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://www.python.org/)
5
+ [![PyTorch](https://img.shields.io/badge/PyTorch-1.8+-red.svg)](https://pytorch.org/)
6
 
7
+ Official implementation of [AdaFortiTran: An Adaptive Transformer Model for Robust OFDM Channel Estimation](https://arxiv.org/abs/2505.09076) accepted at ICC 2025, Montreal, Canada.
8
+
9
+ ## 📖 Overview
10
+
11
+ AdaFortiTran is an adaptive transformer-based model for OFDM channel estimation that dynamically adjusts to varying channel conditions (SNR, delay spread, Doppler shift). It combines transformer attention with channel-aware adaptation mechanisms to maintain robust performance across diverse wireless environments.
12
+
13
+ ### Key Features
14
+ - **🔄 Adaptive Architecture**: Dynamically adapts to channel conditions using meta-information
15
+ - **⚡ High Performance**: State-of-the-art results on OFDM channel estimation tasks
16
+ - **🧠 Transformer-Based**: Leverages attention mechanisms for long-range dependencies
17
+ - **🎯 Robust**: Maintains performance across varying SNR, delay spread, and Doppler conditions
18
+ - **🚀 Production Ready**: Training pipeline with checkpointing, mixed-precision support, early stopping, and TensorBoard logging
19
+
20
+ ## 🏗️ Architecture
21
+
22
+ The project implements three model variants:
23
+
24
+ 1. **Linear Estimator**: Simple learned linear transformation baseline
25
+ 2. **FortiTran**: Fixed transformer-based channel estimator
26
+ 3. **AdaFortiTran**: Adaptive transformer with channel condition awareness
27
+
28
+ ### Model Comparison
29
+
30
+ | Model | Channel Adaptation | Complexity | Performance |
31
+ |-------|-------------------|------------|-------------|
32
+ | Linear | ❌ | Low | Baseline |
33
+ | FortiTran | ❌ | Medium | Good |
34
+ | AdaFortiTran | ✅ | High | **Best** |
35
+
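+ The snippet below is a rough sketch of how a variant is constructed and called; import paths and signatures follow the README's own evaluation example and the forward-pass handling in `src/main/trainer.py`, so treat the details as assumptions rather than a fixed API:
+
+ ```python
+ # Sketch only: construction mirrors the Manual Evaluation example below;
+ # the call patterns are inferred from _forward_pass in src/main/trainer.py.
+ from src.config import load_config
+ from src.models import AdaFortiTranEstimator
+
+ system_config, model_config = load_config(
+     'config/system_config.yaml',
+     'config/adafortitran.yaml'
+ )
+ model = AdaFortiTranEstimator(system_config, model_config)
+
+ # Linear / FortiTran:  estimate = model(ls_estimate)
+ # AdaFortiTran:        estimate = model(ls_estimate, meta_data)  # SNR, DS, Doppler
+ ```
+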
36
+ ## 🚀 Quick Start
37
+
38
+ ### Installation
39
+
40
+ 1. **Clone the repository**:
41
+ ```bash
42
+ git clone https://github.com/your-username/AdaFortiTran.git
43
+ cd AdaFortiTran
44
+ ```
45
+
46
+ 2. **Install dependencies**:
47
+ ```bash
48
+ pip install -r requirements.txt
49
+ ```
50
+
51
+ 3. **Verify installation**:
52
+ ```bash
53
+ python -c "import torch; print(f'PyTorch {torch.__version__}')"
54
+ ```
55
+
56
+ ### Basic Training
57
+
58
+ Train an AdaFortiTran model with default settings:
59
+
60
+ ```bash
61
+ python src/main.py \
62
+ --model_name adafortitran \
63
+ --system_config_path config/system_config.yaml \
64
+ --model_config_path config/adafortitran.yaml \
65
+ --train_set data/train \
66
+ --val_set data/val \
67
+ --test_set data/test \
68
+ --exp_id my_experiment
69
+ ```
70
+
71
+ ### Advanced Training
72
+
73
+ Use all available features for optimal performance:
74
+
75
+ ```bash
76
+ python src/main.py \
77
+ --model_name adafortitran \
78
+ --system_config_path config/system_config.yaml \
79
+ --model_config_path config/adafortitran.yaml \
80
+ --train_set data/train \
81
+ --val_set data/val \
82
+ --test_set data/test \
83
+ --exp_id advanced_experiment \
84
+ --batch_size 128 \
85
+ --lr 5e-4 \
86
+ --max_epoch 100 \
87
+ --patience 10 \
88
+ --weight_decay 1e-4 \
89
+ --gradient_clip_val 1.0 \
90
+ --use_mixed_precision \
91
+ --save_every_n_epochs 5 \
92
+ --num_workers 8 \
93
+ --test_every_n 5
94
+ ```
95
+
96
+ ## 📁 Project Structure
97
+
98
+ ```
99
+ AdaFortiTran/
100
+ ├── config/ # Configuration files
101
+ │ ├── system_config.yaml # OFDM system parameters
102
+ │ ├── adafortitran.yaml # AdaFortiTran model config
103
+ │ ├── fortitran.yaml # FortiTran model config
104
+ │ └── linear.yaml # Linear model config
105
+ ├── data/ # Dataset directory
106
+ │ ├── train/ # Training data
107
+ │ ├── val/ # Validation data
108
+ │ └── test/ # Test data (DS, MDS, SNR sets)
109
+ ├── src/ # Source code
110
+ │ ├── main/ # Training pipeline
111
+ │ │ ├── trainer.py # Enhanced ModelTrainer
112
+ │ │ └── parser.py # Command-line argument parser
113
+ │ ├── models/ # Model implementations
114
+ │ │ ├── adafortitran.py # AdaFortiTran model
115
+ │ │ ├── fortitran.py # FortiTran model
116
+ │ │ ├── linear.py # Linear model
117
+ │ │ └── blocks/ # Model building blocks
118
+ │ ├── data/ # Data loading
119
+ │ │ └── dataset.py # Dataset and DataLoader classes
120
+ │ ├── config/ # Configuration management
121
+ │ │ ├── config_loader.py # YAML configuration loader
122
+ │ │ └── schemas.py # Pydantic validation schemas
123
+ │ └── utils.py # Utility functions
124
+ ├── requirements.txt # Python dependencies
125
+ ├── README.md # This file
126
+ ```
127
+
128
+ ## ⚙️ Configuration
129
+
130
+ ### System Configuration (`config/system_config.yaml`)
131
+
132
+ Defines OFDM system parameters:
133
+
134
+ ```yaml
135
+ ofdm:
+   num_scs: 120        # Number of subcarriers
+   num_symbols: 14     # Number of OFDM symbols
+
+ pilot:
+   num_scs: 12         # Number of pilot subcarriers
+   num_symbols: 2      # Number of pilot symbols
142
+ ```
143
+
144
+ ### Model Configuration (`config/adafortitran.yaml`)
145
+
146
+ Defines model architecture parameters:
147
+
148
+ ```yaml
149
+ model_type: 'adafortitran'
150
+ patch_size: [3, 2] # Patch dimensions
151
+ num_layers: 6 # Transformer layers
152
+ model_dim: 128 # Model dimension
153
+ num_head: 4 # Attention heads
154
+ activation: 'gelu' # Activation function
155
+ dropout: 0.1 # Dropout rate
156
+ max_seq_len: 512 # Maximum sequence length
157
+ pos_encoding_type: 'learnable' # Positional encoding
158
+ channel_adaptivity_hidden_sizes: [7, 42, 560] # Adaptation layers
159
+ adaptive_token_length: 6 # Adaptive token length
160
+ ```
161
+
162
+ ## 🎯 Training Features
163
+
164
+ ### Advanced Training Options
165
+
166
+ | Feature | Description | Default |
167
+ |---------|-------------|---------|
168
+ | `--use_mixed_precision` | Enable mixed precision training | False |
169
+ | `--gradient_clip_val` | Gradient clipping value | None |
170
+ | `--weight_decay` | Weight decay for optimizer | 0.0 |
171
+ | `--save_checkpoints` | Enable model checkpointing | True |
172
+ | `--save_best_only` | Save only best model | True |
173
+ | `--resume_from_checkpoint` | Resume from checkpoint | None |
174
+ | `--num_workers` | Data loading workers | 4 |
175
+ | `--pin_memory` | Pin memory for GPU | True |
176
+
177
+ ### Callback System
178
+
179
+ The training pipeline includes an extensible callback system:
180
+
181
+ - **TensorBoard Logging**: Automatic metric tracking and visualization
182
+ - **Checkpoint Management**: Flexible checkpoint saving strategies
183
+ - **Custom Callbacks**: Easy to add new logging or monitoring systems
184
+
185
+ ### Performance Optimizations
186
+
187
+ - **Mixed Precision Training**: Faster training on modern GPUs
188
+ - **Optimized Data Loading**: Configurable workers and memory pinning
189
+ - **Gradient Clipping**: Stable training with configurable clipping
190
+ - **Early Stopping**: Automatic training termination on plateau
191
+
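+ As a rough illustration, the snippet below (standalone PyTorch with a placeholder model and a dummy batch, not the project's trainer code) shows the pattern that `--use_mixed_precision` and `--gradient_clip_val` enable inside each training step:
+
+ ```python
+ # Minimal AMP + gradient-clipping sketch; the model, batch, and clip value are placeholders.
+ import torch
+ from torch import nn
+
+ use_amp = torch.cuda.is_available()                  # AMP is only used on CUDA devices
+ device = torch.device("cuda" if use_amp else "cpu")
+ model = nn.Linear(10, 10).to(device)
+ x, y = torch.randn(8, 10, device=device), torch.randn(8, 10, device=device)
+
+ optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+ scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # --use_mixed_precision
+ clip_val = 1.0                                       # --gradient_clip_val
+
+ optimizer.zero_grad()
+ with torch.cuda.amp.autocast(enabled=use_amp):       # mixed-precision forward + loss
+     loss = nn.functional.mse_loss(model(x), y)
+ scaler.scale(loss).backward()
+ scaler.unscale_(optimizer)                           # unscale before clipping
+ torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
+ scaler.step(optimizer)
+ scaler.update()
+ ```
+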
192
+ ## 📊 Dataset Format
193
+
194
+ ### Expected File Structure
195
+
196
+ ```
197
+ data/
198
+ ├── train/
199
+ │ ├── 1_SNR-20_DS-50_DOP-500_N-3_TDL-A.mat
200
+ │ ├── 2_SNR-20_DS-50_DOP-500_N-3_TDL-A.mat
201
+ │ └── ...
202
+ ├── val/
203
+ │ └── ...
204
+ └── test/
205
+ ├── DS_test_set/ # Delay Spread tests
206
+ │ ├── DS_50/
207
+ │ ├── DS_100/
208
+ │ └── ...
209
+ ├── SNR_test_set/ # SNR tests
210
+ │ ├── SNR_10/
211
+ │ ├── SNR_20/
212
+ │ └── ...
213
+ └── MDS_test_set/ # Multi-Doppler tests
214
+ ├── DOP_200/
215
+ ├── DOP_400/
216
+ └── ...
217
+ ```
218
+
219
+ ### File Naming Convention
220
+
221
+ Files must follow the pattern:
222
+ ```
223
+ {file_number}_SNR-{snr}_DS-{delay_spread}_DOP-{doppler}_N-{pilot_freq}_{channel_type}.mat
224
+ ```
225
+
226
+ Example: `1_SNR-20_DS-50_DOP-500_N-3_TDL-A.mat`
227
+
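+ For reference, a hypothetical helper that parses these names with a regular expression (the project's dataset code may extract the fields differently):
+
+ ```python
+ # Hypothetical filename parser for the convention above.
+ import re
+
+ PATTERN = re.compile(
+     r"(?P<file_number>\d+)_SNR-(?P<snr>-?\d+)_DS-(?P<delay_spread>\d+)"
+     r"_DOP-(?P<doppler>\d+)_N-(?P<pilot_freq>\d+)_(?P<channel_type>[\w-]+)\.mat"
+ )
+
+ match = PATTERN.fullmatch("1_SNR-20_DS-50_DOP-500_N-3_TDL-A.mat")
+ assert match is not None
+ print(match.groupdict())
+ # {'file_number': '1', 'snr': '20', 'delay_spread': '50',
+ #  'doppler': '500', 'pilot_freq': '3', 'channel_type': 'TDL-A'}
+ ```
+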
228
+ ### Data Format
229
+
230
+ Each `.mat` file must contain a variable `H` with shape `[subcarriers, symbols, 3]`:
231
+ - `H[:, :, 0]`: Ground truth channel (complex values)
232
+ - `H[:, :, 1]`: LS channel estimate with zeros for non-pilot positions
233
+ - `H[:, :, 2]`: Reserved for future use
234
+
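+ A quick way to sanity-check a file (assuming MATLAB-style `.mat` files readable by `scipy.io.loadmat`; the path below is just an example):
+
+ ```python
+ # Load one dataset file and inspect the H variable described above.
+ import numpy as np
+ from scipy.io import loadmat
+
+ mat = loadmat("data/train/1_SNR-20_DS-50_DOP-500_N-3_TDL-A.mat")
+ H = mat["H"]                      # [subcarriers, symbols, 3], complex-valued
+ print(H.shape, H.dtype)
+
+ ideal = H[:, :, 0]                # ground-truth channel
+ ls_estimate = H[:, :, 1]          # LS estimate, zeros off the pilot grid
+ print("non-zero pilot entries:", np.count_nonzero(ls_estimate))
+ ```
+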
235
+ ## 🔧 Usage Examples
236
+
237
+ ### Training Different Models
238
+
239
+ **Linear Estimator**:
240
+ ```bash
241
+ python src/main.py \
242
+ --model_name linear \
243
+ --system_config_path config/system_config.yaml \
244
+ --model_config_path config/linear.yaml \
245
+ --train_set data/train \
246
+ --val_set data/val \
247
+ --test_set data/test \
248
+ --exp_id linear_baseline
249
+ ```
250
+
251
+ **FortiTran**:
252
+ ```bash
253
+ python src/main.py \
254
+ --model_name fortitran \
255
+ --system_config_path config/system_config.yaml \
256
+ --model_config_path config/fortitran.yaml \
257
+ --train_set data/train \
258
+ --val_set data/val \
259
+ --test_set data/test \
260
+ --exp_id fortitran_experiment
261
+ ```
262
+
263
+ **AdaFortiTran**:
264
+ ```bash
265
+ python src/main.py \
266
+ --model_name adafortitran \
267
+ --system_config_path config/system_config.yaml \
268
+ --model_config_path config/adafortitran.yaml \
269
+ --train_set data/train \
270
+ --val_set data/val \
271
+ --test_set data/test \
272
+ --exp_id adafortitran_experiment
273
+ ```
274
+
275
+ ### Resume Training
276
+
277
+ ```bash
278
+ python src/main.py \
279
+ --model_name adafortitran \
280
+ --system_config_path config/system_config.yaml \
281
+ --model_config_path config/adafortitran.yaml \
282
+ --train_set data/train \
283
+ --val_set data/val \
284
+ --test_set data/test \
285
+ --exp_id resumed_experiment \
286
+ --resume_from_checkpoint runs/adafortitran_experiment/best/checkpoint_epoch_50.pt
287
+ ```
288
+
289
+ ### Hyperparameter Tuning
290
+
291
+ ```bash
292
+ python src/main.py \
293
+ --model_name adafortitran \
294
+ --system_config_path config/system_config.yaml \
295
+ --model_config_path config/adafortitran.yaml \
296
+ --train_set data/train \
297
+ --val_set data/val \
298
+ --test_set data/test \
299
+ --exp_id hyperparameter_tuning \
300
+ --batch_size 64 \
301
+ --lr 1e-3 \
302
+ --max_epoch 50 \
303
+ --patience 5 \
304
+ --weight_decay 1e-5 \
305
+ --gradient_clip_val 0.5 \
306
+ --use_mixed_precision \
307
+ --test_every_n 5
308
+ ```
309
+
310
+ ## 📈 Monitoring and Logging
311
+
312
+ ### TensorBoard Integration
313
+
314
+ Training automatically logs metrics to TensorBoard:
315
+
316
+ ```bash
317
+ tensorboard --logdir runs/
318
+ ```
319
+
320
+ Available metrics:
321
+ - Training/validation loss
322
+ - Learning rate
323
+ - Test performance across conditions
324
+ - Error visualizations
325
+ - Model hyperparameters
326
+
327
+ ### Log Files
328
+
329
+ Training logs are saved to:
330
+ - `logs/training_{exp_id}.log`: Python logging output
331
+ - `runs/{model_name}_{exp_id}/`: TensorBoard logs and checkpoints
332
+
333
+ ## 🧪 Testing and Evaluation
334
+
335
+ ### Automatic Testing
336
+
337
+ The training pipeline automatically evaluates models on:
338
+ - **DS (Delay Spread)**: Varying delay spread conditions
339
+ - **SNR**: Different signal-to-noise ratios
340
+ - **MDS (Multi-Doppler)**: Various Doppler shift scenarios
341
+
342
+ ### Manual Evaluation
343
+
344
+ ```python
345
+ import torch
+ from src.models import AdaFortiTranEstimator
346
+ from src.config import load_config
347
+
348
+ # Load configurations
349
+ system_config, model_config = load_config(
350
+ 'config/system_config.yaml',
351
+ 'config/adafortitran.yaml'
352
+ )
353
+
354
+ # Initialize model
355
+ model = AdaFortiTranEstimator(system_config, model_config)
356
+
357
+ # Load checkpoint
358
+ checkpoint = torch.load('checkpoint.pt')
359
+ model.load_state_dict(checkpoint['model_state_dict'])
360
+
361
+ # Evaluate
362
+ model.eval()
363
+ # ... evaluation code
364
+ ```
365
+
366
+ ## 🔬 Research and Development
367
+
368
+ ### Adding Custom Callbacks
369
+
370
+ ```python
371
+ from src.main.trainer import Callback, TrainingMetrics
372
+
373
+ class CustomCallback(Callback):
+     # Callback is an ABC: on_epoch_begin, on_training_begin, and
+     # on_training_end must also be overridden (no-op bodies are fine).
+     def on_epoch_end(self, epoch: int, metrics: TrainingMetrics) -> None:
+         # Custom logic here
+         print(f"Epoch {epoch}: Train Loss = {metrics.train_loss:.4f}")
377
+ ```
378
+
379
+ ### Extending Models
380
+
381
+ The modular architecture makes it easy to add new model variants:
382
+
383
+ ```python
384
+ from src.models.fortitran import BaseFortiTranEstimator
385
+
386
+ class CustomEstimator(BaseFortiTranEstimator):
+     def __init__(self, system_config, model_config):
+         super().__init__(system_config, model_config, use_channel_adaptation=True)
+         # Add custom components
390
+ ```
391
+
392
+ ## 🐛 Troubleshooting
393
+
394
+ ### Common Issues
395
+
396
+ **CUDA Out of Memory**:
397
+ - Reduce batch size: `--batch_size 32`
398
+ - Enable mixed precision: `--use_mixed_precision`
399
+ - Reduce number of workers: `--num_workers 2`
400
+
401
+ **Slow Training**:
402
+ - Increase number of workers: `--num_workers 8`
403
+ - Enable pin memory: `--pin_memory`
404
+ - Use mixed precision: `--use_mixed_precision`
405
+
406
+ **Poor Convergence**:
407
+ - Adjust learning rate: `--lr 1e-4`
408
+ - Add gradient clipping: `--gradient_clip_val 1.0`
409
+ - Increase patience: `--patience 10`
410
+
411
+ ### Getting Help
412
+
413
+ 1. Check the logs in `logs/training_{exp_id}.log`
414
+ 2. Verify dataset format matches requirements
415
+ 3. Ensure all dependencies are installed correctly
416
+ 4. Check TensorBoard for training curves
417
+
418
+ ## 📚 Citation
419
+
420
+ If you use this code in your research, please cite:
421
+
422
+ ```bibtex
423
+ @misc{guler2025adafortitranadaptivetransformermodel,
424
+ title={AdaFortiTran: An Adaptive Transformer Model for Robust OFDM Channel Estimation},
425
+ author={Berkay Guler and Hamid Jafarkhani},
426
+ year={2025},
427
+ eprint={2505.09076},
428
+ archivePrefix={arXiv},
429
+ primaryClass={cs.LG},
430
+ url={https://arxiv.org/abs/2505.09076},
431
+ }
432
+ ```
433
+
434
+ ## 📄 License
435
 
436
  This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
437
 
requirements.txt CHANGED
@@ -1,5 +1,8 @@
1
  torch
2
  pydantic
3
- yaml
4
  scipy
5
- tqdm
 
 
 
 
1
  torch
2
  pydantic
3
+ pyyaml
4
  scipy
5
+ tqdm
6
+ matplotlib
7
+ prettytable
8
+ tensorboard
src/main/parser.py CHANGED
@@ -10,7 +10,7 @@ of training runs.
10
  from pathlib import Path
11
  import argparse
12
  from pydantic import BaseModel, Field, model_validator
13
- from typing import Self
14
 
15
 
16
  class TrainingArguments(BaseModel):
@@ -41,9 +41,22 @@ class TrainingArguments(BaseModel):
41
  lr: Learning rate for optimizer
42
  max_epoch: Maximum number of training epochs
43
  patience: Early stopping patience in epochs
 
 
 
44
 
45
  # Evaluation
46
  test_every_n: Number of epochs between test evaluations
 
 
 
 
 
 
 
 
 
 
47
  """
48
 
49
  # Model Configuration
@@ -67,10 +80,23 @@ class TrainingArguments(BaseModel):
67
  lr: float = Field(default=1e-3, gt=0, description="Initial learning rate")
68
  max_epoch: int = Field(default=10, gt=0, description="Maximum number of training epochs")
69
  patience: int = Field(default=3, gt=0, description="Early stopping patience (epochs)")
 
 
 
70
 
71
  # Evaluation
72
  test_every_n: int = Field(default=10, gt=0, description="Test model every N epochs")
73
 
 
 
 
 
 
 
 
 
 
 
74
  @model_validator(mode='after')
75
  def validate_paths(self) -> Self:
76
  """Validate path-related arguments.
@@ -92,6 +118,13 @@ class TrainingArguments(BaseModel):
92
  if not self.model_config_path.suffix == '.yaml':
93
  raise ValueError(f"Model config file must be a .yaml file: {self.model_config_path}")
94
 
 
 
 
 
 
 
 
95
  return self
96
 
97
 
@@ -161,58 +194,134 @@ def parse_arguments() -> TrainingArguments:
161
  help='Experiment identifier for log folder naming'
162
  )
163
 
164
- # Optional arguments
165
- optional = parser.add_argument_group('optional arguments')
166
- optional.add_argument(
167
- '--python_log_level',
168
- type=str,
169
- default="INFO",
170
- help='Logger level for python logging module'
171
- )
172
- optional.add_argument(
173
- '--tensorboard_log_dir',
174
- type=Path,
175
- default="runs",
176
- help='Directory for tensorboard logs'
177
- )
178
- optional.add_argument(
179
- '--python_log_dir',
180
- type=Path,
181
- default="logs",
182
- help='Directory for python logging files'
183
- )
184
- optional.add_argument(
185
- '--test_every_n',
186
  type=int,
187
- default=10,
188
- help='Test model every N epochs'
 
 
 
 
 
 
189
  )
190
- optional.add_argument(
191
  '--max_epoch',
192
  type=int,
193
  default=10,
194
  help='Maximum number of training epochs'
195
  )
196
- optional.add_argument(
197
  '--patience',
198
  type=int,
199
  default=3,
200
  help='Early stopping patience (epochs)'
201
  )
202
- optional.add_argument(
203
- '--batch_size',
 
 
204
  type=int,
205
- default=64,
206
- help='Training batch size'
207
  )
208
 
209
- optional.add_argument(
210
- '--lr',
211
- type=float,
212
- default=1e-3,
213
- help='Initial learning rate'
 
 
214
  )
215
 
 
 
216
 
217
  args = parser.parse_args()
218
 
 
10
  from pathlib import Path
11
  import argparse
12
  from pydantic import BaseModel, Field, model_validator
13
+ from typing import Self, Optional
14
 
15
 
16
  class TrainingArguments(BaseModel):
 
41
  lr: Learning rate for optimizer
42
  max_epoch: Maximum number of training epochs
43
  patience: Early stopping patience in epochs
44
+ weight_decay: Weight decay for optimizer
45
+ gradient_clip_val: Gradient clipping value
46
+ use_mixed_precision: Whether to use mixed precision training
47
 
48
  # Evaluation
49
  test_every_n: Number of epochs between test evaluations
50
+
51
+ # Checkpointing
52
+ save_checkpoints: Whether to save model checkpoints
53
+ save_best_only: Whether to save only the best model
54
+ save_every_n_epochs: Save checkpoint every N epochs
55
+ resume_from_checkpoint: Path to checkpoint to resume from
56
+
57
+ # Data Loading
58
+ num_workers: Number of data loading workers
59
+ pin_memory: Whether to pin memory for faster GPU transfer
60
  """
61
 
62
  # Model Configuration
 
80
  lr: float = Field(default=1e-3, gt=0, description="Initial learning rate")
81
  max_epoch: int = Field(default=10, gt=0, description="Maximum number of training epochs")
82
  patience: int = Field(default=3, gt=0, description="Early stopping patience (epochs)")
83
+ weight_decay: float = Field(default=0.0, ge=0.0, description="Weight decay for optimizer")
84
+ gradient_clip_val: Optional[float] = Field(default=None, gt=0, description="Gradient clipping value")
85
+ use_mixed_precision: bool = Field(default=False, description="Whether to use mixed precision training")
86
 
87
  # Evaluation
88
  test_every_n: int = Field(default=10, gt=0, description="Test model every N epochs")
89
 
90
+ # Checkpointing
91
+ save_checkpoints: bool = Field(default=True, description="Whether to save model checkpoints")
92
+ save_best_only: bool = Field(default=True, description="Whether to save only the best model")
93
+ save_every_n_epochs: Optional[int] = Field(default=None, gt=0, description="Save checkpoint every N epochs")
94
+ resume_from_checkpoint: Optional[Path] = Field(default=None, description="Path to checkpoint to resume from")
95
+
96
+ # Data Loading
97
+ num_workers: int = Field(default=4, ge=0, description="Number of data loading workers")
98
+ pin_memory: bool = Field(default=True, description="Whether to pin memory for faster GPU transfer")
99
+
100
  @model_validator(mode='after')
101
  def validate_paths(self) -> Self:
102
  """Validate path-related arguments.
 
118
  if not self.model_config_path.suffix == '.yaml':
119
  raise ValueError(f"Model config file must be a .yaml file: {self.model_config_path}")
120
 
121
+ # Validate checkpoint path if provided
122
+ if self.resume_from_checkpoint is not None:
123
+ if not self.resume_from_checkpoint.exists():
124
+ raise ValueError(f"Checkpoint file not found: {self.resume_from_checkpoint}")
125
+ if not self.resume_from_checkpoint.suffix == '.pt':
126
+ raise ValueError(f"Checkpoint file must be a .pt file: {self.resume_from_checkpoint}")
127
+
128
  return self
129
 
130
 
 
194
  help='Experiment identifier for log folder naming'
195
  )
196
 
197
+ # Training hyperparameters
198
+ training = parser.add_argument_group('training hyperparameters')
199
+ training.add_argument(
200
+ '--batch_size',
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
201
  type=int,
202
+ default=64,
203
+ help='Training batch size'
204
+ )
205
+ training.add_argument(
206
+ '--lr',
207
+ type=float,
208
+ default=1e-3,
209
+ help='Initial learning rate'
210
  )
211
+ training.add_argument(
212
  '--max_epoch',
213
  type=int,
214
  default=10,
215
  help='Maximum number of training epochs'
216
  )
217
+ training.add_argument(
218
  '--patience',
219
  type=int,
220
  default=3,
221
  help='Early stopping patience (epochs)'
222
  )
223
+ training.add_argument(
224
+ '--weight_decay',
225
+ type=float,
226
+ default=0.0,
227
+ help='Weight decay for optimizer'
228
+ )
229
+ training.add_argument(
230
+ '--gradient_clip_val',
231
+ type=float,
232
+ default=None,
233
+ help='Gradient clipping value (disabled if not specified)'
234
+ )
235
+ training.add_argument(
236
+ '--use_mixed_precision',
237
+ action='store_true',
238
+ help='Use mixed precision training (requires PyTorch >= 1.6)'
239
+ )
240
+
241
+ # Evaluation settings
242
+ evaluation = parser.add_argument_group('evaluation settings')
243
+ evaluation.add_argument(
244
+ '--test_every_n',
245
  type=int,
246
+ default=10,
247
+ help='Test model every N epochs'
248
  )
249
 
250
+ # Checkpointing settings
251
+ checkpointing = parser.add_argument_group('checkpointing settings')
252
+ checkpointing.add_argument(
253
+ '--save_checkpoints',
254
+ action='store_true',
255
+ default=True,
256
+ help='Save model checkpoints'
257
+ )
258
+ checkpointing.add_argument(
259
+ '--no_save_checkpoints',
260
+ action='store_false',
261
+ dest='save_checkpoints',
262
+ help='Disable saving model checkpoints'
263
+ )
264
+ checkpointing.add_argument(
265
+ '--save_best_only',
266
+ action='store_true',
267
+ default=True,
268
+ help='Save only the best model based on validation loss'
269
+ )
270
+ checkpointing.add_argument(
271
+ '--save_every_n_epochs',
272
+ type=int,
273
+ default=None,
274
+ help='Save checkpoint every N epochs (in addition to best model)'
275
+ )
276
+ checkpointing.add_argument(
277
+ '--resume_from_checkpoint',
278
+ type=Path,
279
+ default=None,
280
+ help='Path to checkpoint file to resume training from'
281
  )
282
 
283
+ # Data loading settings
284
+ data_loading = parser.add_argument_group('data loading settings')
285
+ data_loading.add_argument(
286
+ '--num_workers',
287
+ type=int,
288
+ default=4,
289
+ help='Number of data loading workers'
290
+ )
291
+ data_loading.add_argument(
292
+ '--pin_memory',
293
+ action='store_true',
294
+ default=True,
295
+ help='Pin memory for faster GPU transfer'
296
+ )
297
+ data_loading.add_argument(
298
+ '--no_pin_memory',
299
+ action='store_false',
300
+ dest='pin_memory',
301
+ help='Disable pin memory'
302
+ )
303
+
304
+ # Logging settings
305
+ logging_group = parser.add_argument_group('logging settings')
306
+ logging_group.add_argument(
307
+ '--python_log_level',
308
+ type=str,
309
+ default="INFO",
310
+ choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
311
+ help='Logger level for python logging module'
312
+ )
313
+ logging_group.add_argument(
314
+ '--tensorboard_log_dir',
315
+ type=Path,
316
+ default="runs",
317
+ help='Directory for tensorboard logs'
318
+ )
319
+ logging_group.add_argument(
320
+ '--python_log_dir',
321
+ type=Path,
322
+ default="logs",
323
+ help='Directory for python logging files'
324
+ )
325
 
326
  args = parser.parse_args()
327
 
src/main/train_helpers.py DELETED
@@ -1,265 +0,0 @@
1
- """
2
- Training helper functions for OFDM channel estimation models.
3
-
4
- This module provides utility functions for training, evaluating, and testing
5
- deep learning models for OFDM channel estimation tasks. It includes functions
6
- for performing training epochs, model evaluation, prediction generation,
7
- and performance statistics calculation across different test conditions.
8
- """
9
-
10
- from typing import Dict, List, Tuple, Union, Callable
11
- import torch
12
- from torch import nn
13
- from torch.utils.data import DataLoader
14
- from torch.optim import Optimizer
15
- from torch.optim.lr_scheduler import ExponentialLR
16
- from src.utils import to_db, concat_complex_channel
17
-
18
- # Type aliases
19
- ComplexTensor = torch.Tensor # Complex tensor
20
- BatchType = Tuple[ComplexTensor, ComplexTensor, Union[Dict, None]]
21
- TestDataLoadersType = List[Tuple[str, DataLoader]]
22
- StatsType = Dict[int, float]
23
-
24
-
25
- def get_all_test_stats(
26
- model: nn.Module,
27
- test_dataloaders: Dict[str, TestDataLoadersType],
28
- loss_fn: Callable
29
- ) -> Tuple[StatsType, StatsType, StatsType]:
30
- """
31
- Evaluate model on all test datasets.
32
-
33
- Calculates performance statistics (MSE in dB) for a model across different
34
- test conditions: Delay Spread (DS), Max Doppler Shift (MDS), and
35
- Signal-to-Noise Ratio (SNR).
36
-
37
- Args:
38
- model: Model to evaluate
39
- test_dataloaders: Dictionary containing DataLoader objects for test sets:
40
- - "DS": Delay Spread test set
41
- - "MDS": Max Doppler Shift test set
42
- - "SNR": Signal-to-Noise Ratio test set
43
- loss_fn: Loss function for evaluation
44
-
45
- Returns:
46
- Tuple containing statistics (MSE in dB) for DS, MDS, and SNR test sets,
47
- where each set of statistics is a dictionary mapping parameter values to MSE
48
- """
49
- ds_stats = get_test_stats(model, test_dataloaders["DS"], loss_fn)
50
- mds_stats = get_test_stats(model, test_dataloaders["MDS"], loss_fn)
51
- snr_stats = get_test_stats(model, test_dataloaders["SNR"], loss_fn)
52
- return ds_stats, mds_stats, snr_stats
53
-
54
-
55
- def get_test_stats(
56
- model: nn.Module,
57
- test_dataloaders: TestDataLoadersType,
58
- loss_fn: Callable
59
- ) -> StatsType:
60
- """
61
- Evaluate model on provided test dataloaders.
62
-
63
- Calculates performance statistics (MSE in dB) for a model on a
64
- specific set of test conditions.
65
-
66
- Args:
67
- model: Model to evaluate
68
- test_dataloaders: List of (name, DataLoader) tuples for test sets,
69
- where names are in format "parameter_value"
70
- loss_fn: Loss function for evaluation
71
-
72
- Returns:
73
- Dictionary mapping test parameter values (as integers) to MSE values in dB
74
- """
75
- stats: StatsType = {}
76
- sorted_loaders = sorted(
77
- test_dataloaders,
78
- key=lambda x: int(x[0].split("_")[1])
79
- )
80
-
81
- for name, test_dataloader in sorted_loaders:
82
- var, val = name.split("_")
83
- test_loss = eval_model(model, test_dataloader, loss_fn)
84
- db_error = to_db(test_loss)
85
- print(f"{var}:{val} Test MSE: {db_error:.4f} dB")
86
- stats[int(val)] = db_error
87
-
88
- return stats
89
-
90
-
91
- def eval_model(
92
- model: nn.Module,
93
- eval_dataloader: DataLoader,
94
- loss_fn: Callable
95
- ) -> float:
96
- """
97
- Evaluate model on given dataloader.
98
-
99
- Calculates the average loss for a model on a dataset without
100
- performing parameter updates.
101
-
102
- Args:
103
- model: Model to evaluate
104
- eval_dataloader: DataLoader containing evaluation data
105
- loss_fn: Loss function for computing error
106
-
107
- Returns:
108
- Average validation loss (adjusted for complex values)
109
-
110
- Notes:
111
- Loss is multiplied by 2 to account for complex-valued matrices being
112
- represented as real-valued matrices of double size.
113
- """
114
- val_loss = 0.0
115
- model.eval()
116
-
117
- with torch.no_grad():
118
- for batch in eval_dataloader:
119
- estimated_channel, ideal_channel = _forward_pass(batch, model)
120
- output = _compute_loss(estimated_channel, ideal_channel, loss_fn)
121
- val_loss += (2 * output.item() * batch[0].size(0))
122
-
123
- val_loss /= sum(len(batch[0]) for batch in eval_dataloader)
124
- return val_loss
125
-
126
-
127
- def predict_channels(
128
- model: nn.Module,
129
- test_dataloaders: TestDataLoadersType
130
- ) -> Dict[int, Dict[str, ComplexTensor]]:
131
- """
132
- Generate channel predictions for test datasets.
133
-
134
- Creates predictions for a sample from each test dataset to enable
135
- visualization and error analysis.
136
-
137
- Args:
138
- model: Model to use for predictions
139
- test_dataloaders: List of (name, DataLoader) tuples for test sets,
140
- where names are in format "parameter_value"
141
-
142
- Returns:
143
- Dictionary mapping test parameter values (as integers) to dictionaries containing
144
- estimated and ideal channels for a single sample
145
- """
146
- channels: Dict[int, Dict[str, ComplexTensor]] = {}
147
- sorted_loaders = sorted(
148
- test_dataloaders,
149
- key=lambda x: int(x[0].split("_")[1])
150
- )
151
-
152
- for name, test_dataloader in sorted_loaders:
153
- with torch.no_grad():
154
- batch = next(iter(test_dataloader))
155
- estimated_channels, ideal_channels = _forward_pass(batch, model)
156
-
157
- var, val = name.split("_")
158
- channels[int(val)] = {
159
- "estimated_channel": estimated_channels[0],
160
- "ideal_channel": ideal_channels[0]
161
- }
162
-
163
- return channels
164
-
165
-
166
- def train_epoch(
167
- model: nn.Module,
168
- optimizer: Optimizer,
169
- loss_fn: Callable,
170
- scheduler: ExponentialLR,
171
- train_dataloader: DataLoader
172
- ) -> float:
173
- """
174
- Train model for one epoch.
175
-
176
- Performs a complete training iteration over the dataset, including:
177
- - Forward pass through the model
178
- - Loss calculation
179
- - Backpropagation
180
- - Parameter updates
181
- - Learning rate scheduling
182
-
183
- Args:
184
- model: Model to train
185
- optimizer: Optimizer for updating model parameters
186
- loss_fn: Loss function for computing error
187
- scheduler: Learning rate scheduler
188
- train_dataloader: DataLoader containing training data
189
-
190
- Returns:
191
- Average training loss for the epoch (adjusted for complex values)
192
-
193
- Notes:
194
- Loss is multiplied by 2 to account for complex-valued matrices being
195
- represented as real-valued matrices of double size.
196
- """
197
- train_loss = 0.0
198
- model.train()
199
-
200
- for batch in train_dataloader:
201
- optimizer.zero_grad()
202
- estimated_channel, ideal_channel = _forward_pass(batch, model)
203
- output = _compute_loss(estimated_channel, ideal_channel, loss_fn)
204
- output.backward()
205
- optimizer.step()
206
- train_loss += (2 * output.item() * batch[0].size(0))
207
-
208
- scheduler.step()
209
- train_loss /= sum(len(batch[0]) for batch in train_dataloader)
210
- return train_loss
211
-
212
-
213
- def _forward_pass(batch: BatchType, model: nn.Module) -> Tuple[ComplexTensor, ComplexTensor]:
214
- """
215
- Perform forward pass through model.
216
-
217
- Processes input data through the appropriate model based on its type,
218
- handling different input requirements for different model architectures.
219
-
220
- Args:
221
- batch: Tuple containing (estimated_channel, ideal_channel, metadata)
222
- model: Model to use for processing
223
-
224
- Returns:
225
- Tuple of (processed_estimated_channel, ideal_channel)
226
-
227
- Raises:
228
- ValueError: If model type is not recognized
229
- """
230
- estimated_channel, ideal_channel, meta_data = batch
231
-
232
- # All models now handle complex input directly
233
- if hasattr(model, 'use_channel_adaptation') and model.use_channel_adaptation:
234
- # AdaFortiTran uses meta_data for channel adaptation
235
- estimated_channel = model(estimated_channel, meta_data)
236
- else:
237
- # Linear and FortiTran models don't use meta_data
238
- estimated_channel = model(estimated_channel)
239
-
240
- return estimated_channel, ideal_channel.to(model.device)
241
-
242
-
243
- def _compute_loss(
244
- estimated_channel: ComplexTensor,
245
- ideal_channel: ComplexTensor,
246
- loss_fn: Callable
247
- ) -> torch.Tensor:
248
- """
249
- Calculate loss between estimated and ideal channels.
250
-
251
- Computes the loss between model output and ground truth using the specified
252
- loss function, with appropriate handling of complex values.
253
-
254
- Args:
255
- estimated_channel: Estimated channel from model
256
- ideal_channel: Ground truth ideal channel
257
- loss_fn: Loss function to compute error
258
-
259
- Returns:
260
- Computed loss value as a scalar tensor
261
- """
262
- return loss_fn(
263
- concat_complex_channel(estimated_channel),
264
- concat_complex_channel(ideal_channel)
265
- )
 
 
src/main/trainer.py CHANGED
@@ -11,9 +11,12 @@ import torch
11
  from torch import nn, optim
12
  from torch.utils.data import DataLoader
13
  from torch.utils.tensorboard.writer import SummaryWriter
14
- from typing import Dict, Tuple, Type, Union
15
  import logging
16
  from tqdm import tqdm
 
 
 
17
 
18
  from .parser import TrainingArguments
19
  from src.data.dataset import MatDataset, get_test_dataloaders
@@ -33,6 +36,291 @@ from src.config.schemas import SystemConfig, ModelConfig
33
  ModelType = Union[LinearEstimator, AdaFortiTranEstimator, FortiTranEstimator]
34
 
35
 
 
 
36
  class ModelTrainer:
37
  """Handles the training and evaluation of deep learning models.
38
 
@@ -59,6 +347,9 @@ class ModelTrainer:
59
  val_loader: DataLoader for validation set (used for validation)
60
  test_loaders: Dictionary of test set DataLoaders (used for testing)
61
  logger: Logger instance for logging messages
 
 
 
62
  """
63
 
64
  MODEL_REGISTRY: Dict[str, Type[ModelType]] = {
@@ -86,13 +377,59 @@ class ModelTrainer:
86
  self.logger = logging.getLogger(__name__)
87
 
88
  self.model = self._initialize_model()
89
- self.optimizer = optim.Adam(self.model.parameters(), lr=args.lr)
 
 
 
 
 
 
 
90
  self.scheduler = optim.lr_scheduler.ExponentialLR(self.optimizer, gamma=self.EXP_LR_GAMMA)
91
  self.early_stopper = EarlyStopping(patience=args.patience)
92
-
93
  self.training_loss = nn.MSELoss()
94
 
 
 
 
 
 
 
95
  self.train_loader, self.val_loader, self.test_loaders = self._get_dataloaders()
 
 
 
 
 
 
96
 
97
  def _setup_tensorboard(self) -> SummaryWriter:
98
  """Set up TensorBoard logging.
@@ -134,26 +471,30 @@ class ModelTrainer:
134
  return model
135
 
136
  def _get_dataloaders(self) -> Tuple[DataLoader, DataLoader, dict[str, list[tuple[str, DataLoader]]]]:
 
137
  pilot_dims = [self.system_config.pilot.num_scs, self.system_config.pilot.num_symbols]
 
138
  # Training and validation dataloaders
139
- train_dataset = MatDataset(
140
- self.args.train_set,
141
- pilot_dims
142
- )
143
- val_dataset = MatDataset(
144
- self.args.val_set,
145
- pilot_dims
146
- )
147
  train_loader = DataLoader(
148
  train_dataset,
149
  batch_size=self.args.batch_size,
150
- shuffle=True
 
 
151
  )
 
152
  val_loader = DataLoader(
153
  val_dataset,
154
  batch_size=self.args.batch_size,
155
- shuffle=True
 
 
156
  )
 
 
157
  test_loaders = {
158
  "DS": get_test_dataloaders(
159
  self.args.test_set / "DS_test_set",
@@ -173,11 +514,7 @@ class ModelTrainer:
173
  }
174
  return train_loader, val_loader, test_loaders
175
 
176
- def _log_test_results(
177
- self,
178
- epoch: int,
179
- test_stats: Dict[str, Dict]
180
- ) -> None:
181
  """Log test results to TensorBoard.
182
 
183
  Creates and logs visualizations for model performance across different test conditions.
@@ -198,7 +535,7 @@ class ModelTrainer:
198
  )
199
 
200
  # Plot error images
201
- predicted_channels = self._predict_channels(self.test_loaders[key])
202
  self.writer.add_figure(
203
  tag=f"{key} Error Images (Epoch:{epoch + 1})",
204
  figure=get_error_images(
@@ -208,15 +545,20 @@ class ModelTrainer:
208
  )
209
  )
210
 
211
- def _run_tests(self, epoch: int) -> None:
212
  """Run tests and log results.
213
 
214
  Evaluates the model on all test datasets and logs performance metrics and visualizations.
215
 
216
  Args:
217
  epoch: Current training epoch
 
 
 
218
  """
219
- ds_stats, mds_stats, snr_stats = self._get_all_test_stats()
 
 
220
 
221
  test_stats = {
222
  "DS": ds_stats,
@@ -225,6 +567,8 @@ class ModelTrainer:
225
  }
226
 
227
  self._log_test_results(epoch, test_stats)
 
 
228
 
229
  def _log_final_metrics(self, final_epoch: int) -> None:
230
  """Log final training metrics and hyperparameters.
@@ -270,92 +614,84 @@ class ModelTrainer:
270
  except Exception as e:
271
  self.writer.add_text("Error", f"Failed to log final test results: {str(e)}")
272
 
273
- def _compute_loss(self, estimated_channel, ideal_channel, loss_fn):
274
- return loss_fn(
275
- concat_complex_channel(estimated_channel),
276
- concat_complex_channel(ideal_channel)
277
- )
 
278
 
279
- def _forward_pass(self, batch, model):
280
- estimated_channel, ideal_channel, meta_data = batch
 
281
 
282
- # All models now handle complex input directly
283
- if isinstance(model, AdaFortiTranEstimator):
284
- # AdaFortiTran uses meta_data for channel adaptation
285
- estimated_channel = model(estimated_channel, meta_data)
286
- else:
287
- # Linear and FortiTran models don't use meta_data
288
- estimated_channel = model(estimated_channel)
289
-
290
- return estimated_channel, ideal_channel.to(model.device)
291
-
292
- def _train_epoch(self):
293
- train_loss = 0.0
294
- self.model.train()
295
- num_samples = 0
296
- for batch in self.train_loader:
297
- self.optimizer.zero_grad()
298
- estimated_channel, ideal_channel = self._forward_pass(batch, self.model)
299
- output = self._compute_loss(estimated_channel, ideal_channel, self.training_loss)
300
- output.backward()
301
- self.optimizer.step()
302
- batch_size = batch[0].size(0)
303
- train_loss += (2 * output.item() * batch_size)
304
- num_samples += batch_size
305
- self.scheduler.step()
306
- train_loss /= num_samples
307
- return train_loss
308
-
309
- def _eval_model(self, eval_dataloader):
310
- val_loss = 0.0
311
- self.model.eval()
312
- num_samples = 0
313
- with torch.no_grad():
314
- for batch in eval_dataloader:
315
- estimated_channel, ideal_channel = self._forward_pass(batch, self.model)
316
- output = self._compute_loss(estimated_channel, ideal_channel, self.training_loss)
317
- batch_size = batch[0].size(0)
318
- val_loss += (2 * output.item() * batch_size)
319
- num_samples += batch_size
320
- val_loss /= num_samples
321
- return val_loss
322
-
323
- def _predict_channels(self, test_dataloaders):
324
- channels = {}
325
- sorted_loaders = sorted(
326
- test_dataloaders,
327
- key=lambda x: int(x[0].split("_")[1])
328
- )
329
- for name, test_dataloader in sorted_loaders:
330
- with torch.no_grad():
331
- batch = next(iter(test_dataloader))
332
- estimated_channels, ideal_channels = self._forward_pass(batch, self.model)
333
- var, val = name.split("_")
334
- channels[int(val)] = {
335
- "estimated_channel": estimated_channels[0],
336
- "ideal_channel": ideal_channels[0]
337
- }
338
- return channels
339
 
340
- def _get_test_stats(self, test_dataloaders):
341
- stats = {}
342
- sorted_loaders = sorted(
343
- test_dataloaders,
344
- key=lambda x: int(x[0].split("_")[1])
345
- )
346
- for name, test_dataloader in sorted_loaders:
347
- var, val = name.split("_")
348
- test_loss = self._eval_model(test_dataloader)
349
- db_error = to_db(test_loss)
350
- self.logger.info(f"{var}:{val} Test MSE: {db_error:.4f} dB")
351
- stats[int(val)] = db_error
352
- return stats
 
 
 
 
 
 
 
 
353
 
354
- def _get_all_test_stats(self):
355
- ds_stats = self._get_test_stats(self.test_loaders["DS"])
356
- mds_stats = self._get_test_stats(self.test_loaders["MDS"])
357
- snr_stats = self._get_test_stats(self.test_loaders["SNR"])
358
- return ds_stats, mds_stats, snr_stats
 
 
 
 
 
 
 
 
 
359
 
360
  def train(self) -> None:
361
  """Execute the training loop.
@@ -366,21 +702,43 @@ class ModelTrainer:
366
  - Early stopping when validation loss plateaus
367
  - Logging final metrics and results
368
  """
 
 
 
 
369
  last_epoch = 0
370
  pbar = tqdm(range(self.args.max_epoch), desc="Training")
 
371
  for epoch in pbar:
372
  last_epoch = epoch
 
 
 
 
 
373
  # Training step
374
- train_loss = self._train_epoch()
375
- self.writer.add_scalar('Loss/Train', train_loss, epoch + 1)
376
-
377
  # Validation step
378
- val_loss = self._eval_model(self.val_loader)
379
- self.writer.add_scalar('Loss/Val', val_loss, epoch + 1)
 
 
 
 
 
 
 
 
 
 
 
380
 
381
  # Update progress bar with loss info
382
  pbar.set_description(
383
- f"Epoch {epoch + 1}/{self.args.max_epoch} - Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
 
 
384
 
385
  if self.early_stopper.early_stop(val_loss):
386
  pbar.write(f"Early stopping triggered at epoch {epoch + 1}")
@@ -391,8 +749,12 @@ class ModelTrainer:
391
  message = f"Test results after epoch {epoch + 1}:\n" + 50 * "-"
392
  pbar.write(message)
393
  self._run_tests(epoch)
 
394
  self._log_final_metrics(last_epoch)
395
- self.writer.close()
 
 
 
396
 
397
 
398
  def train(system_config: SystemConfig, model_config: ModelConfig, args: TrainingArguments) -> None:
 
11
  from torch import nn, optim
12
  from torch.utils.data import DataLoader
13
  from torch.utils.tensorboard.writer import SummaryWriter
14
+ from typing import Dict, Tuple, Type, Union, Optional, List, Protocol
15
  import logging
16
  from tqdm import tqdm
17
+ from dataclasses import dataclass
18
+ from pathlib import Path
19
+ from abc import ABC, abstractmethod
20
 
21
  from .parser import TrainingArguments
22
  from src.data.dataset import MatDataset, get_test_dataloaders
 
36
  ModelType = Union[LinearEstimator, AdaFortiTranEstimator, FortiTranEstimator]
37
 
38
 
39
+ @dataclass
40
+ class TrainingMetrics:
41
+ """Container for training metrics."""
42
+ train_loss: float
43
+ val_loss: float
44
+ epoch: int
45
+ learning_rate: float
46
+
47
+
48
+ @dataclass
49
+ class TestResults:
50
+ """Container for test results."""
51
+ ds_stats: Dict[int, float]
52
+ mds_stats: Dict[int, float]
53
+ snr_stats: Dict[int, float]
54
+
55
+
56
+ class Callback(ABC):
57
+ """Base class for training callbacks."""
58
+
59
+ @abstractmethod
60
+ def on_epoch_begin(self, epoch: int) -> None:
61
+ """Called at the beginning of each epoch."""
62
+ pass
63
+
64
+ @abstractmethod
65
+ def on_epoch_end(self, epoch: int, metrics: TrainingMetrics) -> None:
66
+ """Called at the end of each epoch."""
67
+ pass
68
+
69
+ @abstractmethod
70
+ def on_training_begin(self) -> None:
71
+ """Called at the beginning of training."""
72
+ pass
73
+
74
+ @abstractmethod
75
+ def on_training_end(self) -> None:
76
+ """Called at the end of training."""
77
+ pass
78
+
79
+
80
+ class CheckpointCallback(Callback):
81
+ """Callback for saving model checkpoints."""
82
+
83
+ def __init__(self, save_dir: Path, save_best_only: bool = True,
84
+ save_every_n_epochs: Optional[int] = None):
85
+ self.save_dir = save_dir
86
+ self.save_best_only = save_best_only
87
+ self.save_every_n_epochs = save_every_n_epochs
88
+ self.best_val_loss = float('inf')
89
+ self.trainer = None
90
+
91
+ def set_trainer(self, trainer: 'ModelTrainer') -> None:
92
+ """Set the trainer reference."""
93
+ self.trainer = trainer
94
+
95
+ def on_epoch_begin(self, epoch: int) -> None:
96
+ pass
97
+
98
+ def on_epoch_end(self, epoch: int, metrics: TrainingMetrics) -> None:
99
+ if self.trainer is None:
100
+ return
101
+
102
+ # Save best model
103
+ if self.save_best_only and metrics.val_loss < self.best_val_loss:
104
+ self.best_val_loss = metrics.val_loss
105
+ self.trainer.save_checkpoint(
106
+ epoch, metrics,
107
+ checkpoint_dir=self.save_dir / "best"
108
+ )
109
+
110
+ # Save every N epochs
111
+ if (self.save_every_n_epochs is not None and
112
+ (epoch + 1) % self.save_every_n_epochs == 0):
113
+ self.trainer.save_checkpoint(
114
+ epoch, metrics,
115
+ checkpoint_dir=self.save_dir / "periodic"
116
+ )
117
+
118
+ def on_training_begin(self) -> None:
119
+ self.save_dir.mkdir(parents=True, exist_ok=True)
120
+
121
+ def on_training_end(self) -> None:
122
+ pass
123
+
124
+
125
+ class TensorBoardCallback(Callback):
126
+ """Callback for TensorBoard logging."""
127
+
128
+ def __init__(self, writer: SummaryWriter):
129
+ self.writer = writer
130
+
131
+ def on_epoch_begin(self, epoch: int) -> None:
132
+ pass
133
+
134
+ def on_epoch_end(self, epoch: int, metrics: TrainingMetrics) -> None:
135
+ self.writer.add_scalar('Loss/Train', metrics.train_loss, metrics.epoch + 1)
136
+ self.writer.add_scalar('Loss/Val', metrics.val_loss, metrics.epoch + 1)
137
+ self.writer.add_scalar('Learning_Rate', metrics.learning_rate, metrics.epoch + 1)
138
+
139
+ def on_training_begin(self) -> None:
140
+ pass
141
+
142
+ def on_training_end(self) -> None:
143
+ self.writer.close()
144
+
145
+
146
+ class TrainingLoop:
147
+ """Handles the core training loop logic."""
148
+
149
+ def __init__(self, model: ModelType, optimizer: optim.Optimizer,
150
+ scheduler: optim.lr_scheduler.LRScheduler,
151
+ loss_fn: nn.Module, device: torch.device, scaler: Optional[torch.cuda.amp.GradScaler] = None,
152
+ gradient_clip_val: Optional[float] = None):
153
+ self.model = model
154
+ self.optimizer = optimizer
155
+ self.scheduler = scheduler
156
+ self.loss_fn = loss_fn
157
+ self.device = device
158
+ self.scaler = scaler
159
+ self.gradient_clip_val = gradient_clip_val
160
+
161
+ def _compute_loss(self, estimated_channel: torch.Tensor,
162
+ ideal_channel: torch.Tensor) -> torch.Tensor:
163
+ """Compute loss between estimated and ideal channels."""
164
+ return self.loss_fn(
165
+ concat_complex_channel(estimated_channel),
166
+ concat_complex_channel(ideal_channel)
167
+ )
168
+
169
+ def _forward_pass(self, batch: Tuple[torch.Tensor, torch.Tensor, Tuple],
170
+ model: ModelType) -> Tuple[torch.Tensor, torch.Tensor]:
171
+ """Perform forward pass through the model."""
172
+ estimated_channel, ideal_channel, meta_data = batch
173
+
174
+ # All models now handle complex input directly
175
+ if isinstance(model, AdaFortiTranEstimator):
176
+ # AdaFortiTran uses meta_data for channel adaptation
177
+ estimated_channel = model(estimated_channel, meta_data)
178
+ else:
179
+ # Linear and FortiTran models don't use meta_data
180
+ estimated_channel = model(estimated_channel)
181
+
182
+ return estimated_channel, ideal_channel.to(model.device)
183
+
184
+ def train_epoch(self, train_loader: DataLoader) -> float:
185
+ """Train for one epoch."""
186
+ train_loss = 0.0
187
+ self.model.train()
188
+ num_samples = 0
189
+
190
+ for batch in train_loader:
191
+ self.optimizer.zero_grad()
192
+ estimated_channel, ideal_channel = self._forward_pass(batch, self.model)
193
+
194
+ if self.scaler:
195
+ with torch.cuda.amp.autocast():
196
+ loss = self._compute_loss(estimated_channel, ideal_channel)
197
+ self.scaler.scale(loss).backward()
198
+
199
+ # Gradient clipping
200
+ if self.gradient_clip_val:
201
+ self.scaler.unscale_(self.optimizer)
202
+ torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.gradient_clip_val)
203
+
204
+ self.scaler.step(self.optimizer)
205
+ self.scaler.update()
206
+ else:
207
+ loss = self._compute_loss(estimated_channel, ideal_channel)
208
+ loss.backward()
209
+
210
+ # Gradient clipping
211
+ if self.gradient_clip_val:
212
+ torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.gradient_clip_val)
213
+
214
+ self.optimizer.step()
215
+
216
+ batch_size = batch[0].size(0)
217
+ train_loss += (2 * loss.item() * batch_size)
218
+ num_samples += batch_size
219
+
220
+ self.scheduler.step()
221
+ return train_loss / num_samples
222
+
223
+ def evaluate(self, eval_loader: DataLoader) -> float:
224
+ """Evaluate the model."""
225
+ val_loss = 0.0
226
+ self.model.eval()
227
+ num_samples = 0
228
+
229
+ with torch.no_grad():
230
+ for batch in eval_loader:
231
+ estimated_channel, ideal_channel = self._forward_pass(batch, self.model)
232
+
233
+ if self.scaler:
234
+ with torch.cuda.amp.autocast():
235
+ loss = self._compute_loss(estimated_channel, ideal_channel)
236
+ else:
237
+ loss = self._compute_loss(estimated_channel, ideal_channel)
238
+
239
+ batch_size = batch[0].size(0)
240
+ val_loss += (2 * loss.item() * batch_size)
241
+ num_samples += batch_size
242
+
243
+ return val_loss / num_samples
244
+
245
+
246
+ class ModelEvaluator:
247
+ """Handles model evaluation and testing."""
248
+
249
+ def __init__(self, model: ModelType, device: torch.device, logger: logging.Logger):
250
+ self.model = model
251
+ self.device = device
252
+ self.logger = logger
253
+
254
+ def _forward_pass(self, batch: Tuple[torch.Tensor, torch.Tensor, Tuple],
255
+ model: ModelType) -> Tuple[torch.Tensor, torch.Tensor]:
256
+ """Perform forward pass through the model."""
257
+ estimated_channel, ideal_channel, meta_data = batch
258
+
259
+ if isinstance(model, AdaFortiTranEstimator):
260
+ estimated_channel = model(estimated_channel, meta_data)
261
+ else:
262
+ estimated_channel = model(estimated_channel)
263
+
264
+ return estimated_channel, ideal_channel.to(model.device)
265
+
266
+ def predict_channels(self, test_dataloaders: List[Tuple[str, DataLoader]]) -> Dict[int, Dict]:
267
+ """Predict channels for visualization."""
268
+ channels = {}
269
+ sorted_loaders = sorted(
270
+ test_dataloaders,
271
+ key=lambda x: int(x[0].split("_")[1])
272
+ )
273
+
274
+ for name, test_dataloader in sorted_loaders:
275
+ with torch.no_grad():
276
+ batch = next(iter(test_dataloader))
277
+ estimated_channels, ideal_channels = self._forward_pass(batch, self.model)
278
+
279
+ var, val = name.split("_")
280
+ channels[int(val)] = {
281
+ "estimated_channel": estimated_channels[0],
282
+ "ideal_channel": ideal_channels[0]
283
+ }
284
+ return channels
285
+
286
+ def get_test_stats(self, test_dataloaders: List[Tuple[str, DataLoader]],
287
+ loss_fn: nn.Module) -> Dict[int, float]:
288
+ """Get test statistics for a set of dataloaders."""
289
+ stats = {}
290
+ sorted_loaders = sorted(
291
+ test_dataloaders,
292
+ key=lambda x: int(x[0].split("_")[1])
293
+ )
294
+
295
+ for name, test_dataloader in sorted_loaders:
296
+ var, val = name.split("_")
297
+ test_loss = self._evaluate_dataloader(test_dataloader, loss_fn)
298
+ db_error = to_db(test_loss)
299
+ self.logger.info(f"{var}:{val} Test MSE: {db_error:.4f} dB")
300
+ stats[int(val)] = db_error
301
+ return stats
302
+
303
+ def _evaluate_dataloader(self, dataloader: DataLoader, loss_fn: nn.Module) -> float:
304
+ """Evaluate a single dataloader."""
305
+ total_loss = 0.0
306
+ num_samples = 0
307
+ self.model.eval()
308
+
309
+ with torch.no_grad():
310
+ for batch in dataloader:
311
+ estimated_channel, ideal_channel = self._forward_pass(batch, self.model)
312
+ loss = loss_fn(
313
+ concat_complex_channel(estimated_channel),
314
+ concat_complex_channel(ideal_channel)
315
+ )
316
+
317
+ batch_size = batch[0].size(0)
318
+ total_loss += (2 * loss.item() * batch_size)
319
+ num_samples += batch_size
320
+
321
+ return total_loss / num_samples
322
+
323
+
324
  class ModelTrainer:
325
  """Handles the training and evaluation of deep learning models.
326
 
 
347
  val_loader: DataLoader for validation set (used for validation)
348
  test_loaders: Dictionary of test set DataLoaders (used for testing)
349
  logger: Logger instance for logging messages
350
+ training_loop: TrainingLoop instance for core training logic
351
+ evaluator: ModelEvaluator instance for evaluation logic
352
+ callbacks: List of training callbacks
353
  """
354
 
355
  MODEL_REGISTRY: Dict[str, Type[ModelType]] = {
 
377
  self.logger = logging.getLogger(__name__)
378
 
379
  self.model = self._initialize_model()
380
+
381
+ # Initialize optimizer with weight decay
382
+ self.optimizer = optim.Adam(
383
+ self.model.parameters(),
384
+ lr=args.lr,
385
+ weight_decay=args.weight_decay
386
+ )
387
+
388
  self.scheduler = optim.lr_scheduler.ExponentialLR(self.optimizer, gamma=self.EXP_LR_GAMMA)
389
  self.early_stopper = EarlyStopping(patience=args.patience)
 
390
  self.training_loss = nn.MSELoss()
391
 
392
+ # Initialize mixed precision training if requested
393
+ self.scaler = None
394
+ if args.use_mixed_precision and self.device.type == 'cuda':
395
+ self.scaler = torch.cuda.amp.GradScaler()
396
+ self.logger.info("Mixed precision training enabled")
397
+
398
  self.train_loader, self.val_loader, self.test_loaders = self._get_dataloaders()
399
+
400
+ # Initialize components
401
+ self.training_loop = TrainingLoop(
402
+ self.model, self.optimizer, self.scheduler, self.training_loss,
403
+ self.device, self.scaler, self.args.gradient_clip_val
404
+ )
405
+ self.evaluator = ModelEvaluator(self.model, self.device, self.logger)
406
+
407
+ # Initialize callbacks
408
+ self.callbacks = self._setup_callbacks()
409
+
410
+ # Resume from checkpoint if specified
411
+ if args.resume_from_checkpoint is not None:
412
+ self._resume_from_checkpoint(args.resume_from_checkpoint)
413
+
414
+ def _setup_callbacks(self) -> List[Callback]:
415
+ """Set up training callbacks."""
416
+ callbacks = []
417
+
418
+ # TensorBoard callback
419
+ callbacks.append(TensorBoardCallback(self.writer))
420
+
421
+ # Checkpoint callback (only if checkpointing is enabled)
422
+ if self.args.save_checkpoints:
423
+ checkpoint_dir = self.args.tensorboard_log_dir / f"{self.args.model_name}_{self.args.exp_id}"
424
+ checkpoint_callback = CheckpointCallback(
425
+ save_dir=checkpoint_dir,
426
+ save_best_only=self.args.save_best_only,
427
+ save_every_n_epochs=self.args.save_every_n_epochs
428
+ )
429
+ checkpoint_callback.set_trainer(self)
430
+ callbacks.append(checkpoint_callback)
431
+
432
+ return callbacks
433
 
434
    def _setup_tensorboard(self) -> SummaryWriter:
        """Set up TensorBoard logging.

        return model

    def _get_dataloaders(self) -> Tuple[DataLoader, DataLoader, dict[str, list[tuple[str, DataLoader]]]]:
+       """Get training, validation, and test dataloaders."""
        pilot_dims = [self.system_config.pilot.num_scs, self.system_config.pilot.num_symbols]
+
        # Training and validation dataloaders
+       train_dataset = MatDataset(self.args.train_set, pilot_dims)
+       val_dataset = MatDataset(self.args.val_set, pilot_dims)
+
        train_loader = DataLoader(
            train_dataset,
            batch_size=self.args.batch_size,
+           shuffle=True,
+           num_workers=self.args.num_workers,
+           pin_memory=self.args.pin_memory and self.device.type == 'cuda'
        )
+
        val_loader = DataLoader(
            val_dataset,
            batch_size=self.args.batch_size,
+           shuffle=False,  # No need to shuffle validation data
+           num_workers=self.args.num_workers,
+           pin_memory=self.args.pin_memory and self.device.type == 'cuda'
        )
+
+       # Test dataloaders
        test_loaders = {
            "DS": get_test_dataloaders(
                self.args.test_set / "DS_test_set",

        }
        return train_loader, val_loader, test_loaders

+   def _log_test_results(self, epoch: int, test_stats: Dict[str, Dict]) -> None:
        """Log test results to TensorBoard.

        Creates and logs visualizations for model performance across different test conditions.

            )

            # Plot error images
+           predicted_channels = self.evaluator.predict_channels(self.test_loaders[key])
            self.writer.add_figure(
                tag=f"{key} Error Images (Epoch:{epoch + 1})",
                figure=get_error_images(

                )
            )

+   def _run_tests(self, epoch: int) -> TestResults:
        """Run tests and log results.

        Evaluates the model on all test datasets and logs performance metrics and visualizations.

        Args:
            epoch: Current training epoch
+
+       Returns:
+           TestResults containing all test statistics
        """
+       ds_stats = self.evaluator.get_test_stats(self.test_loaders["DS"], self.training_loss)
+       mds_stats = self.evaluator.get_test_stats(self.test_loaders["MDS"], self.training_loss)
+       snr_stats = self.evaluator.get_test_stats(self.test_loaders["SNR"], self.training_loss)

        test_stats = {
            "DS": ds_stats,

        }

        self._log_test_results(epoch, test_stats)
+
+       return TestResults(ds_stats, mds_stats, snr_stats)

    def _log_final_metrics(self, final_epoch: int) -> None:
        """Log final training metrics and hyperparameters.

        except Exception as e:
            self.writer.add_text("Error", f"Failed to log final test results: {str(e)}")

+   def _get_all_test_stats(self) -> Tuple[Dict[int, float], Dict[int, float], Dict[int, float]]:
+       """Get all test statistics."""
+       ds_stats = self.evaluator.get_test_stats(self.test_loaders["DS"], self.training_loss)
+       mds_stats = self.evaluator.get_test_stats(self.test_loaders["MDS"], self.training_loss)
+       snr_stats = self.evaluator.get_test_stats(self.test_loaders["SNR"], self.training_loss)
+       return ds_stats, mds_stats, snr_stats

+   def save_checkpoint(self, epoch: int, metrics: TrainingMetrics,
+                       checkpoint_dir: Optional[Path] = None) -> None:
+       """Save model checkpoint.

+       Args:
+           epoch: Current epoch number
+           metrics: Current training metrics
+           checkpoint_dir: Directory to save checkpoint (defaults to tensorboard log dir)
+       """
+       if checkpoint_dir is None:
+           checkpoint_dir = self.args.tensorboard_log_dir / f"{self.args.model_name}_{self.args.exp_id}"
+
+       checkpoint_dir.mkdir(parents=True, exist_ok=True)
+       checkpoint_path = checkpoint_dir / f"checkpoint_epoch_{epoch}.pt"
+
+       checkpoint = {
+           'epoch': epoch,
+           'model_state_dict': self.model.state_dict(),
+           'optimizer_state_dict': self.optimizer.state_dict(),
+           'scheduler_state_dict': self.scheduler.state_dict(),
+           'train_loss': metrics.train_loss,
+           'val_loss': metrics.val_loss,
+           'learning_rate': metrics.learning_rate,
+           'system_config': self.system_config,
+           'model_config': self.model_config,
+           'args': self.args
+       }
+
+       # Save scaler state if using mixed precision
+       if self.scaler:
+           checkpoint['scaler_state_dict'] = self.scaler.state_dict()
+
+       torch.save(checkpoint, checkpoint_path)
+       self.logger.info(f"Checkpoint saved to {checkpoint_path}")

+   def load_checkpoint(self, checkpoint_path: Path) -> int:
+       """Load model checkpoint.
+
+       Args:
+           checkpoint_path: Path to checkpoint file
+
+       Returns:
+           Epoch number of loaded checkpoint
+       """
+       checkpoint = torch.load(checkpoint_path, map_location=self.device)
+
+       self.model.load_state_dict(checkpoint['model_state_dict'])
+       self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
+       self.scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
+
+       # Load scaler state if it exists
+       if self.scaler and 'scaler_state_dict' in checkpoint:
+           self.scaler.load_state_dict(checkpoint['scaler_state_dict'])
+
+       self.logger.info(f"Loaded checkpoint from epoch {checkpoint['epoch']}")
+       return checkpoint['epoch']

+   def _resume_from_checkpoint(self, checkpoint_path: Path) -> None:
+       """Resume training from a checkpoint.
+
+       Args:
+           checkpoint_path: Path to checkpoint file
+       """
+       start_epoch = self.load_checkpoint(checkpoint_path)
+       self.logger.info(f"Resuming training from epoch {start_epoch}")
+
+       # Update the early stopper with the best loss from checkpoint
+       checkpoint = torch.load(checkpoint_path, map_location=self.device)
+       if 'val_loss' in checkpoint:
+           self.early_stopper.min_loss = checkpoint['val_loss']
+           self.logger.info(f"Early stopper initialized with validation loss: {checkpoint['val_loss']:.4f}")

    def train(self) -> None:
        """Execute the training loop.

        - Early stopping when validation loss plateaus
        - Logging final metrics and results
        """
+       # Notify callbacks that training is beginning
+       for callback in self.callbacks:
+           callback.on_training_begin()
+
        last_epoch = 0
        pbar = tqdm(range(self.args.max_epoch), desc="Training")
+
        for epoch in pbar:
            last_epoch = epoch
+
+           # Notify callbacks that epoch is beginning
+           for callback in self.callbacks:
+               callback.on_epoch_begin(epoch)
+
            # Training step
+           train_loss = self.training_loop.train_epoch(self.train_loader)
+
            # Validation step
+           val_loss = self.training_loop.evaluate(self.val_loader)
+
+           # Create metrics object
+           metrics = TrainingMetrics(
+               train_loss=train_loss,
+               val_loss=val_loss,
+               epoch=epoch,
+               learning_rate=self.optimizer.param_groups[0]['lr']
+           )
+
+           # Notify callbacks that epoch has ended
+           for callback in self.callbacks:
+               callback.on_epoch_end(epoch, metrics)

            # Update progress bar with loss info
            pbar.set_description(
+               f"Epoch {epoch + 1}/{self.args.max_epoch} - "
+               f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}"
+           )

            if self.early_stopper.early_stop(val_loss):
                pbar.write(f"Early stopping triggered at epoch {epoch + 1}")

            message = f"Test results after epoch {epoch + 1}:\n" + 50 * "-"
            pbar.write(message)
            self._run_tests(epoch)
+
        self._log_final_metrics(last_epoch)
+
+       # Notify callbacks that training has ended
+       for callback in self.callbacks:
+           callback.on_training_end()


def train(system_config: SystemConfig, model_config: ModelConfig, args: TrainingArguments) -> None:
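
> Note: the files written by `save_checkpoint` above are plain `torch.save` dictionaries, so they can be inspected outside the trainer. A minimal sketch, assuming a checkpoint produced by this commit exists at the hypothetical path below; the keys simply mirror the dictionary assembled in `save_checkpoint`:

```python
from pathlib import Path

import torch

# Hypothetical location: save_checkpoint writes checkpoint_epoch_{epoch}.pt
# under <tensorboard_log_dir>/<model_name>_<exp_id>/.
ckpt_path = Path("runs/adafortitran_exp1/checkpoint_epoch_10.pt")

# Newer PyTorch releases may need weights_only=False here, since the
# checkpoint also stores config objects alongside the state dicts.
ckpt = torch.load(ckpt_path, map_location="cpu")

# Inspect training state without instantiating the model.
print(sorted(ckpt.keys()))
print(f"epoch={ckpt['epoch']}  train_loss={ckpt['train_loss']:.4f}  val_loss={ckpt['val_loss']:.4f}")

# Restoring a model would then follow load_checkpoint above:
# model.load_state_dict(ckpt['model_state_dict'])
```
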
src/utils.py CHANGED
@@ -180,24 +180,7 @@ def concat_complex_channel(channel_matrix):
    return cat_channel_m


- def inverse_concat_complex_channel(channel_matrix: torch.Tensor) -> torch.Tensor:
-     """
-     Reconstruct complex channel matrix from concatenated real matrix.
-
-     Reverses the operation performed by concat_complex_channel by
-     splitting the tensor and combining the parts into a complex tensor.
-
-     Args:
-         channel_matrix: Real-valued matrix of shape (B, F, 2*T)
-
-     Returns:
-         Complex matrix of shape (B, F, T)
-     """
-     split_idx = channel_matrix.shape[-1] // 2
-     return torch.complex(
-         channel_matrix[:, :split_idx],
-         channel_matrix[:, split_idx:]
-     )


def get_test_stats_plot(x_name, stats, methods, show=False):
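
> Note: with `inverse_concat_complex_channel` removed, `concat_complex_channel` remains the single point where complex channels are flattened to real tensors for the MSE losses used in training and evaluation. A minimal sketch of that convention, assuming real and imaginary parts are concatenated along the last dimension as the (B, F, 2*T) shape in the removed docstring suggests; the function below is a hypothetical stand-in, not the repo's implementation:

```python
import torch

def concat_complex_sketch(h: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for concat_complex_channel: (B, F, T) complex -> (B, F, 2*T) real."""
    return torch.cat([h.real, h.imag], dim=-1)

h_ideal = torch.randn(4, 120, 14, dtype=torch.complex64)   # example (B, F, T) channel
h_est = h_ideal + 0.1 * torch.randn_like(h_ideal)          # noisy estimate
mse = torch.nn.functional.mse_loss(concat_complex_sketch(h_est),
                                   concat_complex_sketch(h_ideal))
print(mse.item())
```
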
 