Model Card for MatroidNN
Model Details
Model Description
Model type: Neural Network with Matroid-based Feature Selection (MatroidNN)
Version: 1.0
Framework: PyTorch
Last updated: February 27, 2025
Overview
MatroidNN is a neural network architecture that uses matroid theory for feature selection. To reduce feature redundancy, it selects a maximal independent set of features before the neural network is trained, so the network only sees features that are not strongly dependent on one another.
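The selection idea can be illustrated with a small greedy sketch. This is not the actual MatroidFeatureSelector implementation: it assumes that "independence" means pairwise absolute Pearson correlation below a cutoff, and that the rank threshold plays the role of that cutoff. The function name `select_independent_features` is hypothetical.

```python
import numpy as np

def select_independent_features(X, rank_threshold=0.7):
    """Greedy sketch (assumed behaviour, not the library's code):
    keep a feature only if its absolute Pearson correlation with every
    already-kept feature is below rank_threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < rank_threshold for k in selected):
            selected.append(j)
    return selected

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
# Column 3 is a near-copy of column 0, so it should be treated as dependent.
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=500)])
selected = select_independent_features(X, rank_threshold=0.7)
print(selected)
```

Because the scan is greedy, the result depends on column order: the first feature of a correlated group is the one that is kept.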
Model Architecture
- Feature Selection Component: MatroidFeatureSelector using correlation-based dependency analysis
- Neural Network: 3-layer feedforward network with batch normalization and dropout
- Input: Varies based on the number of features selected by the matroid selector
- Hidden Layers: Configurable hidden layer sizes (default 64 → 32)
- Output: Multi-class classification (configurable number of classes)
- Parameters: ~5K-10K parameters (varies based on input/output dimensions)
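A network matching the description above might look like the following PyTorch sketch. The class name `FeedForwardSketch`, the ReLU activations, the dropout rate of 0.3, and the ordering of batch normalization and dropout are assumptions, since the card does not specify them. The last lines show how the parameter count can be computed for any choice of input and output dimensions.

```python
import torch
from torch import nn

class FeedForwardSketch(nn.Module):
    """Hypothetical stand-in for the 3-layer network described in this card."""

    def __init__(self, input_size, hidden_sizes=(64, 32), output_size=3, dropout=0.3):
        super().__init__()
        h1, h2 = hidden_sizes
        self.net = nn.Sequential(
            nn.Linear(input_size, h1), nn.BatchNorm1d(h1), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h1, h2), nn.BatchNorm1d(h2), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h2, output_size),  # raw logits; softmax is applied by the loss
        )

    def forward(self, x):
        return self.net(x)

model = FeedForwardSketch(input_size=20)
# Count trainable parameters for this particular configuration.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    logits = model(torch.randn(4, 20))
print(n_params, tuple(logits.shape))
```

The first linear layer scales with the number of selected features, so the total count changes whenever the selector keeps more or fewer columns.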
Uses
Direct Use
MatroidNN is designed for classification tasks where feature redundancy is a potential issue. It's particularly useful for:
- High-dimensional datasets with correlated features
- Feature selection in biological/medical data
- Financial prediction with multicollinear variables
- Any classification task where feature independence is desired
Out-of-Scope Use
This model is not intended for:
- Regression tasks (without modification)
- Time series prediction (without temporal adaptations)
- Raw image or text classification (without appropriate feature extraction)
Training Data
The model was developed and tested using synthetic data with deliberate feature dependencies. For real-world applications, the model should be retrained on domain-specific data.
Training Dataset
- Type: Synthetic data with controlled dependencies
- Size: 1000 samples (default), configurable
- Features: 20 initial features (default), configurable
- Classes: 3 classes (default), configurable
- Distribution: Equal class distribution in the synthetic data
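The generator used for the synthetic dataset is not shown in this card. As a rough sketch of how data with these defaults could be produced, the NumPy snippet below builds a few class-dependent base features and derives the remaining ones as noisy linear combinations. The function name `make_dependent_data` and the parameters `n_independent` and `noise` are hypothetical.

```python
import numpy as np

def make_dependent_data(n_samples=1000, n_features=20, n_classes=3,
                        n_independent=8, noise=0.1, seed=0):
    """Sketch only: synthetic classification data with deliberate dependencies."""
    rng = np.random.default_rng(seed)
    # Near-equal class sizes: labels cycle through the classes, then get shuffled.
    y = np.arange(n_samples) % n_classes
    rng.shuffle(y)
    # Base features are drawn around a class-specific centre.
    centers = rng.normal(scale=2.0, size=(n_classes, n_independent))
    base = centers[y] + rng.normal(size=(n_samples, n_independent))
    # Remaining features are noisy linear combinations of the base features.
    n_derived = n_features - n_independent
    weights = rng.normal(size=(n_independent, n_derived))
    derived = base @ weights + noise * rng.normal(size=(n_samples, n_derived))
    return np.hstack([base, derived]), y

X, y = make_dependent_data()
class_counts = np.bincount(y)
print(X.shape, class_counts)
```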
Performance
Metrics
On synthetic test data with 3 classes:
- Accuracy: 94.0%
- Macro-average F1-score: 0.93
- Per-class metrics:
  - Class 0: Precision 0.96, Recall 1.00, F1 0.98
  - Class 1: Precision 0.86, Recall 0.86, F1 0.86
  - Class 2: Precision 0.97, Recall 0.93, F1 0.95
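The per-class F1 values and the macro average follow directly from the reported precision and recall. The short check below recomputes them with the standard formula F1 = 2PR / (P + R) and an unweighted mean across classes.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, as reported in this card.
per_class = {0: (0.96, 1.00), 1: (0.86, 0.86), 2: (0.97, 0.93)}

f1_scores = {c: f1(p, r) for c, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for c, score in f1_scores.items():
    print(f"Class {c}: F1 = {score:.2f}")
print(f"Macro-average F1 = {macro_f1:.2f}")
```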
Factors
Performance may vary based on:
- Feature correlation structure in the dataset
- Number of initial features and their information content
- Class distribution balance
- Rank threshold parameter in the MatroidFeatureSelector
Limitations
- The matroid-based feature selection uses correlation as a proxy for independence, which may not capture all forms of dependency
- The current implementation assumes numerical features and may require adaptation for categorical features
- Feature selection is performed once before training and does not adapt during training
- The rank threshold parameter requires careful tuning based on the dataset
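The first limitation can be made concrete with a small NumPy example. Here one variable is a deterministic function of the other, yet their Pearson correlation is close to zero, so a purely correlation-based check would treat them as independent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)
y = x ** 2  # fully determined by x, but the relationship is not linear

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation between x and x**2: {r:.3f}")
```

Dependency measures such as mutual information or rank-based statistics can detect some of these cases, at the cost of extra computation.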
Ethical Considerations
- Feature selection might unintentionally exclude features that are important for fairness considerations
- The model inherits any biases present in the training data
- In high-stakes applications, results should be interpreted with caution and kept under human oversight
Technical Specifications
Hardware Requirements
- Training: CUDA-capable GPU recommended for larger datasets
- Inference: CPU sufficient for most applications
Software Requirements
- Python 3.8+
- PyTorch 1.8+
- NumPy 1.20+
- scikit-learn 0.24+
Training Hyperparameters
- Batch size: 32 (default)
- Learning rate: 0.001 (default)
- Optimizer: Adam
- Loss function: Cross-Entropy Loss
- Epochs: Determined by early stopping on validation loss (patience=10)
- Feature selection rank threshold: 0.7 (default, configurable)
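Early stopping with a patience of 10 can be sketched independently of the training framework. The helper class `EarlyStopping` and its `min_delta` parameter are hypothetical and only illustrate the rule: stop once the validation loss has failed to improve for `patience` consecutive epochs.

```python
class EarlyStopping:
    """Sketch of patience-based early stopping on validation loss."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

stopper = EarlyStopping(patience=10)
# Simulated validation losses: three improving epochs, then a plateau.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 20
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In practice the model weights from the best epoch are usually saved and restored after stopping.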
How to Use
from matroid_nn import MatroidFeatureSelector, MatroidNN

# X_train, X_test and num_classes are assumed to be defined by the caller.

# Initialize the feature selector
selector = MatroidFeatureSelector(rank_threshold=0.7)

# Fit the selector on the training data only, then apply it to both splits
X_train_selected = selector.fit_transform(X_train)
X_test_selected = selector.transform(X_test)

# Create the model; the input size depends on how many features were selected
model = MatroidNN(
    input_size=X_train_selected.shape[1],
    hidden_size=64,
    output_size=num_classes,
)
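The snippet above stops after constructing the model. A training loop using the hyperparameters listed in this card (Adam, learning rate 0.001, batch size 32, cross-entropy loss) might look like the sketch below. It uses a plain `nn.Sequential` stand-in and random tensors instead of MatroidNN and real data, because the MatroidNN training interface is not documented here; all names other than the PyTorch calls are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder data standing in for the selected training features and labels.
X_train_selected = torch.randn(256, 12)
y_train = torch.randint(0, 3, (256,))

# Stand-in model; in real use this would be the MatroidNN instance.
model = nn.Sequential(
    nn.Linear(12, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),
)

loader = DataLoader(TensorDataset(X_train_selected, y_train), batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a fixed, small epoch count keeps the sketch short
    epoch_loss = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * xb.size(0)
    epoch_loss /= len(loader.dataset)
    print(f"epoch {epoch}: mean training loss {epoch_loss:.4f}")
```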