---
tags:
- gpu-runtime-prediction
- code-understanding
- regression
- performance-modeling
datasets:
- RajBhope/gpu-runtime-prediction-dataset
language:
- code
library_name: scikit-learn
pipeline_tag: tabular-regression
---

# GPU Runtime Predictor 🚀⚡

Predicts GPU kernel/operation **runtime in milliseconds** given **source code** + **GPU hardware specifications**.

## How It Works

1. **Code Feature Extraction**: Analyzes source code to extract 48 features (tensor dimensions, operation types, complexity indicators)
2. **GPU Feature Encoding**: Uses 12 hardware specs (CUDA cores, memory bandwidth, compute capability, etc.)
3. **ML Prediction**: Ensemble of Gradient Boosted Trees + Random Forest + Neural Network
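
The three steps above can be sketched end to end. This is a minimal illustration, not the repo's actual extraction code: the helper name and the random feature vectors are assumptions; only the 48 + 12 feature split comes from the README.

```python
import numpy as np

def build_feature_vector(code_features, gpu_features):
    """Concatenate 48 code features and 12 GPU features into one model input row."""
    assert len(code_features) == 48 and len(gpu_features) == 12
    return np.concatenate([code_features, gpu_features])

# Stand-in values; the real pipeline derives these from source code and GPU specs.
rng = np.random.default_rng(0)
code_feats = rng.random(48)  # e.g. tensor dims, op types, complexity indicators
gpu_feats = rng.random(12)   # e.g. CUDA cores, memory bandwidth, compute capability

x = build_feature_vector(code_feats, gpu_feats)
print(x.shape)  # (60,)
```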

### Model Comparison

| Model | R² | RMSE | Spearman ρ | MAPE |
|-------|-----|------|------------|------|
| **GBR** | 0.9923 | 0.0728 | 0.9264 | 16.5% |
| **RF** | 0.9924 | 0.0724 | 0.9277 | 16.3% |
| **NN** | 0.9932 | 0.0687 | 0.9187 | 17.0% |
| **Ensemble** | 0.9930 | 0.0693 | 0.9272 | 16.3% |
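
The four metrics in the table can be computed with scikit-learn and SciPy; the arrays below are toy stand-ins, not the actual evaluation data.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error, r2_score

# Toy runtimes (ms) and predictions, for illustration only.
y_true = np.array([1.0, 2.0, 4.0, 8.0])
y_pred = np.array([1.1, 1.9, 4.2, 7.8])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rho, _ = spearmanr(y_true, y_pred)                      # rank correlation
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percent error

print(f"R²={r2:.4f} RMSE={rmse:.4f} ρ={rho:.4f} MAPE={mape:.1f}%")
```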

### GPU Catalog (12 GPUs)

| GPU | FP32 TFLOPS | Memory BW | VRAM |
|-----|-------------|-----------|------|
| NVIDIA T4 | 8.1 | 320 GB/s | 16 GB |
| NVIDIA V100 | 15.7 | 900 GB/s | 32 GB |
| NVIDIA A10G | 31.2 | 600 GB/s | 24 GB |
| NVIDIA A100 40GB | 19.5 | 1555 GB/s | 40 GB |
| NVIDIA A100 80GB | 19.5 | 2039 GB/s | 80 GB |
| NVIDIA L4 | 30.3 | 300 GB/s | 24 GB |
| NVIDIA L40S | 91.6 | 864 GB/s | 48 GB |
| NVIDIA RTX 3090 | 35.6 | 936 GB/s | 24 GB |
| NVIDIA RTX 4090 | 82.6 | 1008 GB/s | 24 GB |
| NVIDIA H100 SXM | 67.0 | 3350 GB/s | 80 GB |
| NVIDIA H100 PCIe | 48.0 | 2039 GB/s | 80 GB |
| NVIDIA RTX A6000 | 38.7 | 768 GB/s | 48 GB |
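
For GPU feature encoding, the catalog above can be held as a simple lookup table. The dict layout below is an illustrative assumption; the values are copied from the table.

```python
# (FP32 TFLOPS, memory BW GB/s, VRAM GB) per GPU, from the catalog above.
# Field layout is an assumption for illustration.
GPU_CATALOG = {
    "NVIDIA T4": (8.1, 320, 16),
    "NVIDIA V100": (15.7, 900, 32),
    "NVIDIA A10G": (31.2, 600, 24),
    "NVIDIA A100 40GB": (19.5, 1555, 40),
    "NVIDIA A100 80GB": (19.5, 2039, 80),
    "NVIDIA L4": (30.3, 300, 24),
    "NVIDIA L40S": (91.6, 864, 48),
    "NVIDIA RTX 3090": (35.6, 936, 24),
    "NVIDIA RTX 4090": (82.6, 1008, 24),
    "NVIDIA H100 SXM": (67.0, 3350, 80),
    "NVIDIA H100 PCIe": (48.0, 2039, 80),
    "NVIDIA RTX A6000": (38.7, 768, 48),
}

tflops, bw, vram = GPU_CATALOG["NVIDIA T4"]
print(tflops, bw, vram)  # 8.1 320 16
```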

### 15 Supported Workload Types

matmul, conv2d, attention, transformer_block, linear, layernorm, batchnorm,
softmax, embedding, elementwise, reduction, pooling, FFT, sort, loss+backward

## Usage

```python
# See the Gradio demo for interactive use,
# or load the models directly:
import pickle

with open('model_gbr.pkl', 'rb') as f:
    model = pickle.load(f)
```
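
The Ensemble row in the comparison table suggests averaging the three models' predictions. A minimal sketch, assuming uniform averaging and assuming `model_rf.pkl` / `model_nn.pkl` filenames (only `model_gbr.pkl` is confirmed above):

```python
import pickle
import numpy as np

# Assumed filenames: only model_gbr.pkl is named in this README;
# the other two are hypothetical.
MODEL_FILES = ["model_gbr.pkl", "model_rf.pkl", "model_nn.pkl"]

def load_models(paths):
    """Unpickle each saved model."""
    models = []
    for path in paths:
        with open(path, "rb") as f:
            models.append(pickle.load(f))
    return models

def ensemble_predict(models, X):
    """Uniform average of the per-model runtime predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)
```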

## Training

- **Dataset**: [RajBhope/gpu-runtime-prediction-dataset](https://hf.co/datasets/RajBhope/gpu-runtime-prediction-dataset)
- **51,900 samples** = 4,325 workloads × 12 GPUs
- Runtimes generated via a physics-based roofline performance model
- Based on research from [Regression Language Models](https://arxiv.org/abs/2509.26476) and [HELP](https://arxiv.org/abs/2106.08630)
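
A roofline model estimates runtime as the larger of the compute-bound and memory-bound times. The sketch below applies this idea with the T4's catalog numbers (8.1 TFLOPS, 320 GB/s) to a hypothetical fp32 matmul; the exact formula used to generate the dataset is not specified here, so treat this as illustrative only.

```python
def roofline_runtime_ms(flops, bytes_moved, peak_tflops, mem_bw_gbs):
    """Roofline estimate: runtime = max(compute time, memory time), in ms."""
    compute_s = flops / (peak_tflops * 1e12)
    memory_s = bytes_moved / (mem_bw_gbs * 1e9)
    return max(compute_s, memory_s) * 1e3

# Hypothetical 4096x4096 fp32 matmul: 2*N^3 FLOPs, ~3*N^2*4 bytes (A, B, C).
N = 4096
flops = 2 * N**3
bytes_moved = 3 * N * N * 4

# T4 specs from the GPU catalog above.
t4_ms = roofline_runtime_ms(flops, bytes_moved, peak_tflops=8.1, mem_bw_gbs=320)
print(f"T4 estimate: {t4_ms:.2f} ms")  # compute-bound at this size
```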