aeb56 committed · Commit 3fb1215 · Parent: 74f609c

Aggressive memory cleanup: 5s wait, env vars, optional model loading
README.md
CHANGED
@@ -46,25 +46,26 @@ Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
 
 ### Quick Start
 
-… (previous Quick Start steps; text not recoverable from this view)
+**Option 1: Direct Evaluation (Recommended)**
+1. Go directly to the "📊 Evaluation" tab
+2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
+3. Click "🚀 Start Evaluation"
+4. lm_eval will automatically load and evaluate the model
+5. Wait 30-60 minutes for results
+6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
+
+**Option 2: With Model Verification**
+1. **(Optional)** Click "🚀 Load Model" in the Controls tab to verify setup (5-10 min)
+2. Go to the "📊 Evaluation" tab
+3. Select benchmarks and click "🚀 Start Evaluation"
+4. The pre-loaded model will be automatically unloaded to free VRAM
+5. lm_eval will load its own fresh instance for evaluation
+6. Wait 30-60 minutes for results
+
+**View Results**
+- Evaluation results include metrics for each benchmark
+- Results are automatically formatted and displayed
+- Full results JSON files are saved for detailed analysis
 
 ## Why LM Evaluation Harness?
 
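For reference, the "Start Evaluation" button ultimately drives the LM Evaluation Harness. Below is a minimal sketch of an equivalent manual run through `subprocess`; the flags are the harness's standard CLI options, but the task names (`arc_challenge`, `truthfulqa_mc2`, `winogrande`), `parallelize=True`, and the output path are assumptions inferred from the benchmarks listed above, not taken from app.py:

```python
import subprocess

MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

# Assumed invocation using standard lm-evaluation-harness CLI flags.
# parallelize=True asks the hf backend to shard the model across GPUs.
cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", f"pretrained={MODEL_NAME},parallelize=True",
    "--tasks", "arc_challenge,truthfulqa_mc2,winogrande",
    "--batch_size", "1",
    "--output_path", "/tmp/eval_results_example/",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```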
@@ -82,12 +83,14 @@ The LM Evaluation Harness is a standard framework for evaluating language models
 
 ### Memory Management
 
-This Space is optimized for limited VRAM:
-- **…**
-- **Automatic Cleanup:** …
+This Space is optimized for limited VRAM (92GB across 4x L4):
+- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
+- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
+- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
 - **Single Instance:** Only lm_eval's model instance runs during evaluation
-- **Batch Size:** Set to 1 to minimize memory usage
+- **Batch Size:** Set to 1 to minimize memory usage during evaluation
 - **Device Mapping:** Automatic distribution across available GPUs
+- **Memory Fragmentation:** PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set by default
 
 ## Technical Details
 
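One way to check that the "Automatic Cleanup" and "Aggressive Memory Clearing" steps actually return VRAM to the driver is `torch.cuda.mem_get_info`, which reports free and total bytes per device. A minimal sketch, independent of the Space's own code:

```python
import torch

def report_vram(note: str) -> None:
    """Print free/total VRAM for every visible GPU."""
    if not torch.cuda.is_available():
        return
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        print(f"{note} cuda:{i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

report_vram("before cleanup")
# ... drop model references, gc.collect(), torch.cuda.empty_cache() ...
report_vram("after cleanup")
```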
app.py
CHANGED
@@ -5,9 +5,11 @@ import os
 import subprocess
 import json
 from datetime import datetime
+import time
 
-# Set environment
+# Set environment variables for flash-linear-attention and memory management
 os.environ["FLA_USE_TRITON"] = "1"
+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
 
 # Model configuration
 MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
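One detail the hunk above relies on: PyTorch reads `PYTORCH_CUDA_ALLOC_CONF` when the CUDA caching allocator initializes, so the variable must be in the environment before the process makes its first CUDA allocation; otherwise the setting is silently ignored. A minimal sketch of the safe ordering (the tensor allocation is purely illustrative):

```python
import os

# Must be set before the caching allocator initializes,
# i.e. before the first CUDA allocation in this process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402 - deliberately imported after the env var is set

if torch.cuda.is_available():
    # The first allocation initializes the allocator with the
    # expandable-segments behavior requested above.
    x = torch.zeros(1, device="cuda")
    print(torch.cuda.memory_allocated())
```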
@@ -121,9 +123,8 @@ class ChatBot:
 
     def run_evaluation(self, tasks_to_run):
         """Run lm_eval on selected tasks"""
-        … (removed early-return guard; only its trailing "return" is recoverable from this view)
+        # Note: We don't strictly require the model to be loaded first
+        # since we'll be unloading it anyway. The load step is just for verification.
 
         try:
             # Map friendly names to lm_eval task names
@@ -141,25 +142,39 @@
 
             yield f"🚀 **Preparing for evaluation...**\n\nTasks: {', '.join(tasks_to_run)}\n\n"
 
-            # IMPORTANT: …
-            …
-            del self.tokenizer
-            self.tokenizer = None
-
-            # Clear CUDA cache
+            # IMPORTANT: Clean up any loaded model to free VRAM for lm_eval
+            if self.loaded and self.model is not None:
+                yield f"🔄 **Unloading model to free VRAM...**\n\nThis is necessary because lm_eval will load its own instance.\n\n"
+
+                if self.model is not None:
+                    del self.model
+                    self.model = None
+                if self.tokenizer is not None:
+                    del self.tokenizer
+                    self.tokenizer = None
+
+                self.loaded = False
+            else:
+                yield f"🔄 **Cleaning up memory...**\n\nPreparing environment for evaluation.\n\n"
 
+            # Aggressive memory cleanup
+            import gc
+            for _ in range(3):
+                gc.collect()
 
             if torch.cuda.is_available():
-                torch.cuda.…
-                …
+                for i in range(torch.cuda.device_count()):
+                    torch.cuda.empty_cache()
+                    torch.cuda.synchronize(device=i)
+                    torch.cuda.reset_peak_memory_stats(device=i)
+                    torch.cuda.reset_accumulated_memory_stats(device=i)
 
-            …
+            # Wait for memory to be fully released
+            yield f"⏳ **Waiting for memory cleanup...**\n\nGiving the system time to fully release VRAM.\n\n"
+            time.sleep(5)
 
-            …
+            # Final garbage collection
+            gc.collect()
 
             yield f"✅ **Memory cleared! Starting evaluation...**\n\nThis will take 30-60 minutes total.\n\n"
 
@@ -261,18 +276,19 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
     with gr.Tabs():
         # Tab 1: Controls (always visible)
         with gr.Tab("🎛️ Controls"):
-            gr.Markdown("### Load Model…
+            gr.Markdown("### Load Model (Optional)")
             load_btn = gr.Button("🚀 Load Model", variant="primary", size="lg")
             status = gr.Markdown("**Status:** Model not loaded")
 
             gr.Markdown("""
             ### ℹ️ Instructions
-            1. **Click "Load Model"…
-            2. **…
+            1. **(Optional)** Click "Load Model" to verify setup (takes 5-10 minutes)
+            2. **Go directly to Evaluation tab** to run benchmarks
 
             **Note:**
             - Chat/inference functionality is currently disabled. This Space focuses on model evaluation only.
-            - …
+            - Loading the model first is optional - you can go straight to the Evaluation tab
+            - Any loaded model will be automatically unloaded before evaluation starts to free VRAM for lm_eval.
             """)
 
         # Tab 2: Chat - DISABLED
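For context on the Controls tab above, this is the usual Gradio Blocks wiring for a button that updates a status field; the handler below is a hypothetical stand-in for the Space's real loading logic, which lives in app.py's ChatBot class:

```python
import gradio as gr

def load_model_handler() -> str:
    # Hypothetical placeholder: the real loading work happens in ChatBot.
    return "**Status:** Model loaded"

with gr.Blocks() as demo:
    load_btn = gr.Button("Load Model", variant="primary", size="lg")
    status = gr.Markdown("**Status:** Model not loaded")
    # Clicking the button runs the handler and writes its return value into `status`.
    load_btn.click(fn=load_model_handler, outputs=status)

demo.launch()
```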
@@ -339,9 +355,9 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Evaluation")
     gr.Markdown("""
     ---
     **Note:**
-    - …
-    - …
-    - lm_eval will load its own instance of the model for evaluation
+    - You can start evaluation immediately - no need to load the model first
+    - If you did load the model, it will be automatically unloaded before evaluation to free VRAM
+    - lm_eval will load its own fresh instance of the model for evaluation
     - Results will be saved to `/tmp/eval_results_[timestamp]/`
     """)
 
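Since the note above says results land under `/tmp/eval_results_[timestamp]/`, here is a hedged sketch for pulling per-task metrics back out. The directory name is a placeholder (the app substitutes a real timestamp), and the exact file layout depends on the harness version, so the sketch simply globs for JSON files:

```python
import glob
import json

# Placeholder path: the app fills in a real timestamp at run time.
results_dir = "/tmp/eval_results_20240101_120000"

for path in sorted(glob.glob(f"{results_dir}/**/*.json", recursive=True)):
    with open(path) as f:
        data = json.load(f)
    # Recent harness versions keep per-task metrics under the "results" key.
    for task, metrics in data.get("results", {}).items():
        print(task, metrics)
```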