Spaces:
Paused
Paused
Ali Mohsin
commited on
Commit
Β·
6086b2f
1
Parent(s):
8bcf79a
fixes
Browse files- PRODUCTION_DEPLOYMENT.md +310 -0
- app.py +3 -2
- scripts/prepare_polyvore.py +309 -119
- startup_fix.py +306 -0
- utils/data_fetch.py +181 -17
PRODUCTION_DEPLOYMENT.md
ADDED
|
@@ -0,0 +1,310 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# π Production Deployment Guide for Dressify
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
This guide explains how to deploy Dressify as a production-ready outfit recommendation service using the official Polyvore dataset splits.
|
| 6 |
+
|
| 7 |
+
## π― Key Changes Made
|
| 8 |
+
|
| 9 |
+
### 1. **Official Split Usage** β
|
| 10 |
+
- **Before**: System tried to create random 70/15/15 splits
|
| 11 |
+
- **After**: System uses official splits from `nondisjoint/` and `disjoint/` folders
|
| 12 |
+
- **Benefit**: Reproducible, research-grade results
|
| 13 |
+
|
| 14 |
+
### 2. **Robust Dataset Detection** π
|
| 15 |
+
- Automatically detects official splits in multiple locations
|
| 16 |
+
- Falls back to metadata extraction if needed
|
| 17 |
+
- No more random split creation by default
|
| 18 |
+
|
| 19 |
+
### 3. **Production-Ready Startup** π
|
| 20 |
+
- Comprehensive error handling and diagnostics
|
| 21 |
+
- Clear status reporting
|
| 22 |
+
- Automatic dataset verification
|
| 23 |
+
|
| 24 |
+
## π Dataset Structure
|
| 25 |
+
|
| 26 |
+
The system expects this structure after download:
|
| 27 |
+
|
| 28 |
+
```
|
| 29 |
+
data/Polyvore/
|
| 30 |
+
βββ images/ # Extracted from images.zip
|
| 31 |
+
βββ nondisjoint/ # Official splits (preferred)
|
| 32 |
+
β βββ train.json # 31.8 MB - Training outfits
|
| 33 |
+
β βββ valid.json # 2.99 MB - Validation outfits
|
| 34 |
+
β βββ test.json # 5.97 MB - Test outfits
|
| 35 |
+
βββ disjoint/ # Alternative official splits
|
| 36 |
+
β βββ train.json # 9.65 MB - Training outfits
|
| 37 |
+
β βββ valid.json # 1.72 MB - Validation outfits
|
| 38 |
+
β βββ test.json # 8.36 MB - Test outfits
|
| 39 |
+
βββ polyvore_item_metadata.json # 105 MB - Item metadata
|
| 40 |
+
βββ polyvore_outfit_titles.json # 6.97 MB - Outfit information
|
| 41 |
+
βββ categories.csv # 4.91 KB - Category mappings
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
## π Deployment Steps
|
| 45 |
+
|
| 46 |
+
### Step 1: Initial Setup
|
| 47 |
+
```bash
|
| 48 |
+
# Clone the repository
|
| 49 |
+
git clone <your-repo>
|
| 50 |
+
cd recomendation
|
| 51 |
+
|
| 52 |
+
# Install dependencies
|
| 53 |
+
pip install -r requirements.txt
|
| 54 |
+
```
|
| 55 |
+
|
| 56 |
+
### Step 2: Dataset Preparation
|
| 57 |
+
```bash
|
| 58 |
+
# Run the startup fix script
|
| 59 |
+
python startup_fix.py
|
| 60 |
+
```
|
| 61 |
+
|
| 62 |
+
This script will:
|
| 63 |
+
1. β
Download the Polyvore dataset from Hugging Face
|
| 64 |
+
2. β
Extract images from images.zip
|
| 65 |
+
3. β
Detect official splits in nondisjoint/ and disjoint/
|
| 66 |
+
4. β
Create training splits from official data
|
| 67 |
+
5. β
Verify all components are ready
|
| 68 |
+
|
| 69 |
+
### Step 3: Verify Dataset
|
| 70 |
+
```bash
|
| 71 |
+
# Check dataset status
|
| 72 |
+
python -c "
|
| 73 |
+
from utils.data_fetch import check_dataset_structure
|
| 74 |
+
import json
|
| 75 |
+
structure = check_dataset_structure('data/Polyvore')
|
| 76 |
+
print(json.dumps(structure, indent=2))
|
| 77 |
+
"
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
Expected output:
|
| 81 |
+
```json
|
| 82 |
+
{
|
| 83 |
+
"status": "ready",
|
| 84 |
+
"images": {
|
| 85 |
+
"exists": true,
|
| 86 |
+
"count": 100000,
|
| 87 |
+
"extensions": [".jpg", ".jpeg", ".png"]
|
| 88 |
+
},
|
| 89 |
+
"splits": {
|
| 90 |
+
"nondisjoint": {
|
| 91 |
+
"train.json": {"exists": true, "size_mb": 31.8},
|
| 92 |
+
"valid.json": {"exists": true, "size_mb": 2.99},
|
| 93 |
+
"test.json": {"exists": true, "size_mb": 5.97}
|
| 94 |
+
}
|
| 95 |
+
}
|
| 96 |
+
}
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
### Step 4: Launch Application
|
| 100 |
+
```bash
|
| 101 |
+
# Start the main application
|
| 102 |
+
python app.py
|
| 103 |
+
```
|
| 104 |
+
|
| 105 |
+
The system will:
|
| 106 |
+
1. π Check dataset status
|
| 107 |
+
2. β
Load official splits
|
| 108 |
+
3. π Launch Gradio interface
|
| 109 |
+
4. π― Be ready for training and inference
|
| 110 |
+
|
| 111 |
+
## π§ Troubleshooting
|
| 112 |
+
|
| 113 |
+
### Issue: "No official splits found"
|
| 114 |
+
|
| 115 |
+
**Cause**: The dataset download didn't include the split files.
|
| 116 |
+
|
| 117 |
+
**Solution**:
|
| 118 |
+
```bash
|
| 119 |
+
# Check what was downloaded
|
| 120 |
+
ls -la data/Polyvore/
|
| 121 |
+
|
| 122 |
+
# Re-run data fetcher
|
| 123 |
+
python -c "
|
| 124 |
+
from utils.data_fetch import ensure_dataset_ready
|
| 125 |
+
ensure_dataset_ready()
|
| 126 |
+
"
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
### Issue: "Dataset preparation failed"
|
| 130 |
+
|
| 131 |
+
**Cause**: The prepare script couldn't parse the official splits.
|
| 132 |
+
|
| 133 |
+
**Solution**:
|
| 134 |
+
```bash
|
| 135 |
+
# Check split file format
|
| 136 |
+
head -20 data/Polyvore/nondisjoint/train.json
|
| 137 |
+
|
| 138 |
+
# Run preparation manually
|
| 139 |
+
python scripts/prepare_polyvore.py --root data/Polyvore
|
| 140 |
+
```
|
| 141 |
+
|
| 142 |
+
### Issue: "Out of memory during training"
|
| 143 |
+
|
| 144 |
+
**Cause**: GPU memory insufficient for default batch sizes.
|
| 145 |
+
|
| 146 |
+
**Solution**: Use the Advanced Training interface to reduce batch sizes:
|
| 147 |
+
- ResNet: Reduce from 64 to 16-32
|
| 148 |
+
- ViT: Reduce from 32 to 8-16
|
| 149 |
+
- Enable mixed precision (AMP)
|
| 150 |
+
|
| 151 |
+
## π― Production Configuration
|
| 152 |
+
|
| 153 |
+
### Environment Variables
|
| 154 |
+
```bash
|
| 155 |
+
export EXPORT_DIR="models/exports"
|
| 156 |
+
export POLYVORE_ROOT="data/Polyvore"
|
| 157 |
+
export CUDA_VISIBLE_DEVICES="0" # Specify GPU
|
| 158 |
+
```
|
| 159 |
+
|
| 160 |
+
### Docker Deployment
|
| 161 |
+
```bash
|
| 162 |
+
# Build image
|
| 163 |
+
docker build -t dressify .
|
| 164 |
+
|
| 165 |
+
# Run container
|
| 166 |
+
docker run -p 7860:7860 -p 8000:8000 \
|
| 167 |
+
-v $(pwd)/data:/app/data \
|
| 168 |
+
-v $(pwd)/models:/app/models \
|
| 169 |
+
dressify
|
| 170 |
+
```
|
| 171 |
+
|
| 172 |
+
### Hugging Face Space
|
| 173 |
+
1. Upload the entire `recomendation/` folder
|
| 174 |
+
2. Set Space type to "Gradio"
|
| 175 |
+
3. The system auto-bootstraps on first run
|
| 176 |
+
4. Uses official splits for production-quality results
|
| 177 |
+
|
| 178 |
+
## π Expected Performance
|
| 179 |
+
|
| 180 |
+
### Dataset Statistics
|
| 181 |
+
- **Total Images**: ~100,000 fashion items
|
| 182 |
+
- **Training Outfits**: ~50,000 (nondisjoint) or ~20,000 (disjoint)
|
| 183 |
+
- **Validation Outfits**: ~5,000 (nondisjoint) or ~2,000 (disjoint)
|
| 184 |
+
- **Test Outfits**: ~10,000 (nondisjoint) or ~4,000 (disjoint)
|
| 185 |
+
|
| 186 |
+
### Training Times (L4 GPU)
|
| 187 |
+
- **ResNet Item Embedder**: 2-4 hours (20 epochs)
|
| 188 |
+
- **ViT Outfit Encoder**: 1-2 hours (30 epochs)
|
| 189 |
+
- **Total**: 3-6 hours for full training
|
| 190 |
+
|
| 191 |
+
### Inference Performance
|
| 192 |
+
- **Item Embedding**: < 50ms per image
|
| 193 |
+
- **Outfit Generation**: < 100ms per outfit
|
| 194 |
+
- **Memory Usage**: ~2-4 GB GPU VRAM
|
| 195 |
+
|
| 196 |
+
## π¬ Research vs Production
|
| 197 |
+
|
| 198 |
+
### Research Mode
|
| 199 |
+
```bash
|
| 200 |
+
# Use disjoint splits (smaller, more challenging)
|
| 201 |
+
python scripts/prepare_polyvore.py --root data/Polyvore
|
| 202 |
+
# Automatically uses disjoint/ splits
|
| 203 |
+
```
|
| 204 |
+
|
| 205 |
+
### Production Mode
|
| 206 |
+
```bash
|
| 207 |
+
# Use nondisjoint splits (larger, more robust)
|
| 208 |
+
python scripts/prepare_polyvore.py --root data/Polyvore
|
| 209 |
+
# Automatically uses nondisjoint/ splits (default)
|
| 210 |
+
```
|
| 211 |
+
|
| 212 |
+
## π Monitoring & Logging
|
| 213 |
+
|
| 214 |
+
### Training Logs
|
| 215 |
+
```bash
|
| 216 |
+
# Check training progress
|
| 217 |
+
tail -f models/exports/training.log
|
| 218 |
+
|
| 219 |
+
# Monitor GPU usage
|
| 220 |
+
nvidia-smi -l 1
|
| 221 |
+
```
|
| 222 |
+
|
| 223 |
+
### System Health
|
| 224 |
+
```bash
|
| 225 |
+
# Health check endpoint
|
| 226 |
+
curl http://localhost:8000/health
|
| 227 |
+
|
| 228 |
+
# Expected response
|
| 229 |
+
{
|
| 230 |
+
"status": "ok",
|
| 231 |
+
"device": "cuda:0",
|
| 232 |
+
"resnet": "resnet50_v2",
|
| 233 |
+
"vit": "vit_outfit_v1"
|
| 234 |
+
}
|
| 235 |
+
```
|
| 236 |
+
|
| 237 |
+
## π¨ Emergency Procedures
|
| 238 |
+
|
| 239 |
+
### Dataset Corruption
|
| 240 |
+
```bash
|
| 241 |
+
# Remove corrupted data
|
| 242 |
+
rm -rf data/Polyvore/splits/
|
| 243 |
+
|
| 244 |
+
# Re-run preparation
|
| 245 |
+
python startup_fix.py
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
### Model Issues
|
| 249 |
+
```bash
|
| 250 |
+
# Remove corrupted models
|
| 251 |
+
rm -rf models/exports/*.pth
|
| 252 |
+
|
| 253 |
+
# Re-train from scratch
|
| 254 |
+
python train_resnet.py --data_root data/Polyvore --epochs 20
|
| 255 |
+
python train_vit_triplet.py --data_root data/Polyvore --epochs 30
|
| 256 |
+
```
|
| 257 |
+
|
| 258 |
+
### System Recovery
|
| 259 |
+
```bash
|
| 260 |
+
# Full system reset
|
| 261 |
+
rm -rf data/Polyvore/
|
| 262 |
+
rm -rf models/exports/
|
| 263 |
+
|
| 264 |
+
# Fresh start
|
| 265 |
+
python startup_fix.py
|
| 266 |
+
```
|
| 267 |
+
|
| 268 |
+
## β
Production Checklist
|
| 269 |
+
|
| 270 |
+
- [ ] Dataset downloaded successfully (2.5GB+ images)
|
| 271 |
+
- [ ] Official splits detected in nondisjoint/ or disjoint/
|
| 272 |
+
- [ ] Training splits created in data/Polyvore/splits/
|
| 273 |
+
- [ ] Models can be trained without errors
|
| 274 |
+
- [ ] Inference service responds to health checks
|
| 275 |
+
- [ ] Gradio interface loads successfully
|
| 276 |
+
- [ ] Advanced training controls work
|
| 277 |
+
- [ ] Model checkpoints can be saved/loaded
|
| 278 |
+
|
| 279 |
+
## π Success Indicators
|
| 280 |
+
|
| 281 |
+
When everything is working correctly, you should see:
|
| 282 |
+
|
| 283 |
+
```
|
| 284 |
+
β
Dataset ready at: data/Polyvore
|
| 285 |
+
π Images: 100000 files
|
| 286 |
+
π polyvore_item_metadata.json: 105.0 MB
|
| 287 |
+
π polyvore_outfit_titles.json: 6.97 MB
|
| 288 |
+
π― Official splits found:
|
| 289 |
+
β
nondisjoint/train.json (31.8 MB)
|
| 290 |
+
β
nondisjoint/valid.json (2.99 MB)
|
| 291 |
+
β
nondisjoint/test.json (5.97 MB)
|
| 292 |
+
π Using official splits from dataset!
|
| 293 |
+
β
Dataset preparation completed successfully!
|
| 294 |
+
β
All required splits verified!
|
| 295 |
+
π Your Dressify system is ready to use!
|
| 296 |
+
```
|
| 297 |
+
|
| 298 |
+
## π Support
|
| 299 |
+
|
| 300 |
+
If you encounter issues:
|
| 301 |
+
|
| 302 |
+
1. **Check the logs** for specific error messages
|
| 303 |
+
2. **Verify dataset structure** matches expected layout
|
| 304 |
+
3. **Run startup_fix.py** for automated diagnostics
|
| 305 |
+
4. **Check GPU memory** and reduce batch sizes if needed
|
| 306 |
+
5. **Ensure official splits** are present in nondisjoint/ or disjoint/
|
| 307 |
+
|
| 308 |
+
---
|
| 309 |
+
|
| 310 |
+
**π― Your Dressify system is now production-ready with official dataset splits!**
|
app.py
CHANGED
|
@@ -61,7 +61,7 @@ def _background_bootstrap():
|
|
| 61 |
BOOT_STATUS = "dataset-not-prepared"
|
| 62 |
return
|
| 63 |
|
| 64 |
-
# Prepare
|
| 65 |
splits_dir = os.path.join(ds_root, "splits")
|
| 66 |
need_prepare = not (
|
| 67 |
os.path.isfile(os.path.join(splits_dir, "train.json")) or
|
|
@@ -75,7 +75,8 @@ def _background_bootstrap():
|
|
| 75 |
import sys
|
| 76 |
argv_bak = sys.argv
|
| 77 |
try:
|
| 78 |
-
|
|
|
|
| 79 |
prepare_main()
|
| 80 |
finally:
|
| 81 |
sys.argv = argv_bak
|
|
|
|
| 61 |
BOOT_STATUS = "dataset-not-prepared"
|
| 62 |
return
|
| 63 |
|
| 64 |
+
# Prepare splits from official data if missing
|
| 65 |
splits_dir = os.path.join(ds_root, "splits")
|
| 66 |
need_prepare = not (
|
| 67 |
os.path.isfile(os.path.join(splits_dir, "train.json")) or
|
|
|
|
| 75 |
import sys
|
| 76 |
argv_bak = sys.argv
|
| 77 |
try:
|
| 78 |
+
# Use official splits from nondisjoint/ and disjoint/ folders
|
| 79 |
+
sys.argv = ["prepare_polyvore.py", "--root", ds_root]
|
| 80 |
prepare_main()
|
| 81 |
finally:
|
| 82 |
sys.argv = argv_bak
|
scripts/prepare_polyvore.py
CHANGED
|
@@ -7,78 +7,89 @@ from typing import Dict, Any, List, Set, Union
|
|
| 7 |
|
| 8 |
|
| 9 |
def _normalize_outfits(obj: Union[List[Any], Dict[str, Any]]) -> List[Dict[str, Any]]:
|
| 10 |
-
"""Normalize various Polyvore JSON formats into a list of {"items": [id,...]} dicts.
|
| 11 |
-
|
| 12 |
-
Accepts:
|
| 13 |
-
- List of objects where each object may be:
|
| 14 |
-
- {"items": [id,...]} already
|
| 15 |
-
- {"items": [{"item_id": id}...]} (extract item_id or id)
|
| 16 |
-
- {"set_id": ..., "items": [...]}
|
| 17 |
-
- List of ids directly
|
| 18 |
-
- Dict mapping outfit_id -> list of item ids or an object with items.
|
| 19 |
-
"""
|
| 20 |
result: List[Dict[str, Any]] = []
|
|
|
|
| 21 |
if isinstance(obj, dict):
|
| 22 |
-
#
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
result.append({"items": [str(x) for x in v]})
|
| 38 |
-
elif isinstance(v, dict):
|
| 39 |
-
if "items" in v:
|
| 40 |
-
itm = v["items"]
|
| 41 |
-
if isinstance(itm, list):
|
| 42 |
-
if itm and isinstance(itm[0], dict):
|
| 43 |
-
items = []
|
| 44 |
-
for it in itm:
|
| 45 |
-
iid = it.get("item_id") or it.get("id") or it.get("itemId")
|
| 46 |
-
if iid is not None:
|
| 47 |
-
items.append(str(iid))
|
| 48 |
-
if items:
|
| 49 |
-
result.append({"items": items})
|
| 50 |
else:
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
else:
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
return result
|
| 79 |
|
| 80 |
|
| 81 |
def load_outfits_json(root: str, split: str) -> List[Dict[str, Any]]:
|
|
|
|
| 82 |
candidates = [
|
| 83 |
os.path.join(root, f"{split}.json"),
|
| 84 |
os.path.join(root, f"{split}_no_dup.json"),
|
|
@@ -88,132 +99,258 @@ def load_outfits_json(root: str, split: str) -> List[Dict[str, Any]]:
|
|
| 88 |
os.path.join(root, "nondisjoint", f"{split}.json"),
|
| 89 |
os.path.join(root, "disjoint", f"{split}.json"),
|
| 90 |
]
|
|
|
|
| 91 |
for p in candidates:
|
| 92 |
if os.path.exists(p):
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
raise FileNotFoundError(f"Could not find usable {split} split in {root} or {root}/splits")
|
| 99 |
|
| 100 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 101 |
def try_load_any_outfits(root: str) -> List[Dict[str, Any]]:
|
| 102 |
-
|
| 103 |
merged: List[Dict[str, Any]] = []
|
| 104 |
-
|
|
|
|
|
|
|
|
|
|
| 105 |
try:
|
| 106 |
-
|
|
|
|
|
|
|
| 107 |
except FileNotFoundError:
|
|
|
|
| 108 |
continue
|
|
|
|
| 109 |
if merged:
|
|
|
|
| 110 |
return merged
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
#
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
raw = json.load(f)
|
| 127 |
-
data = _normalize_outfits(raw)
|
| 128 |
-
if data:
|
| 129 |
-
return data
|
| 130 |
return []
|
| 131 |
|
| 132 |
|
| 133 |
def collect_all_items(outfits: List[Dict[str, Any]]) -> List[str]:
|
|
|
|
| 134 |
s: Set[str] = set()
|
| 135 |
for o in outfits:
|
| 136 |
for it in o.get("items", []):
|
| 137 |
s.add(str(it))
|
| 138 |
-
return sorted(s)
|
| 139 |
|
| 140 |
|
| 141 |
def build_triplets(outfits: List[Dict[str, Any]], all_items: List[str], max_triplets: int = 200000) -> List[Dict[str, str]]:
|
|
|
|
| 142 |
rng = random.Random(42)
|
| 143 |
all_items_set = set(all_items)
|
| 144 |
triplets: List[Dict[str, str]] = []
|
|
|
|
| 145 |
for o in outfits:
|
| 146 |
items = [str(i) for i in o.get("items", [])]
|
| 147 |
if len(items) < 2:
|
| 148 |
continue
|
|
|
|
| 149 |
local_set = set(items)
|
| 150 |
for i in range(len(items) - 1):
|
| 151 |
a = items[i]
|
| 152 |
p = items[i + 1]
|
| 153 |
-
|
|
|
|
| 154 |
negatives = list(all_items_set - local_set)
|
| 155 |
if not negatives:
|
| 156 |
continue
|
|
|
|
| 157 |
n = rng.choice(negatives)
|
| 158 |
triplets.append({"anchor": a, "positive": p, "negative": n})
|
|
|
|
| 159 |
if len(triplets) >= max_triplets:
|
| 160 |
return triplets
|
|
|
|
| 161 |
return triplets
|
| 162 |
|
| 163 |
|
| 164 |
def build_outfit_pairs(outfits: List[Dict[str, Any]], num_negatives_per_pos: int = 1) -> List[Dict[str, Any]]:
|
|
|
|
| 165 |
rng = random.Random(123)
|
| 166 |
all_items = collect_all_items(outfits)
|
| 167 |
all_set = set(all_items)
|
| 168 |
pairs: List[Dict[str, Any]] = []
|
|
|
|
| 169 |
# Positive samples
|
| 170 |
for o in outfits:
|
| 171 |
items = [str(i) for i in o.get("items", [])]
|
| 172 |
if len(items) < 2:
|
| 173 |
continue
|
|
|
|
| 174 |
pairs.append({"items": items, "label": 1})
|
|
|
|
| 175 |
# Negative by corrupting one item
|
| 176 |
for _ in range(num_negatives_per_pos):
|
| 177 |
if not items:
|
| 178 |
continue
|
|
|
|
| 179 |
idx = rng.randrange(len(items))
|
| 180 |
neg_pool = list(all_set - set(items))
|
| 181 |
if not neg_pool:
|
| 182 |
continue
|
|
|
|
| 183 |
neg_item = rng.choice(neg_pool)
|
| 184 |
neg_items = items.copy()
|
| 185 |
neg_items[idx] = neg_item
|
| 186 |
pairs.append({"items": neg_items, "label": 0})
|
|
|
|
| 187 |
return pairs
|
| 188 |
|
| 189 |
|
| 190 |
def build_outfit_triplets(outfits: List[Dict[str, Any]], num_triplets: int = 200000) -> List[Dict[str, Any]]:
|
|
|
|
| 191 |
rng = random.Random(999)
|
| 192 |
-
|
|
|
|
| 193 |
pos = [o for o in outfits if len(o.get("items", [])) >= 3]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 194 |
all_items = collect_all_items(outfits)
|
| 195 |
all_set = set(all_items)
|
| 196 |
triplets: List[Dict[str, Any]] = []
|
| 197 |
-
|
|
|
|
| 198 |
if len(pos) < 2:
|
| 199 |
break
|
|
|
|
| 200 |
ga = rng.choice(pos)
|
| 201 |
gb = rng.choice(pos)
|
|
|
|
| 202 |
# Ensure ga != gb
|
| 203 |
if ga is gb:
|
| 204 |
continue
|
|
|
|
| 205 |
# Create bad by corrupting one item in ga
|
| 206 |
items_ga = [str(i) for i in ga.get("items", [])]
|
| 207 |
if not items_ga:
|
| 208 |
continue
|
|
|
|
| 209 |
corrupt_idx = rng.randrange(len(items_ga))
|
| 210 |
neg_pool = list(all_set - set(items_ga))
|
| 211 |
if not neg_pool:
|
| 212 |
continue
|
|
|
|
| 213 |
neg_item = rng.choice(neg_pool)
|
| 214 |
bad = items_ga.copy()
|
| 215 |
bad[corrupt_idx] = neg_item
|
| 216 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 217 |
return triplets
|
| 218 |
|
| 219 |
|
|
@@ -223,55 +360,108 @@ def main() -> None:
|
|
| 223 |
ap.add_argument("--out", type=str, default=None, help="Output directory for splits (default: <root>/splits)")
|
| 224 |
ap.add_argument("--max_triplets", type=int, default=200000)
|
| 225 |
ap.add_argument("--neg_per_pos", type=int, default=1)
|
| 226 |
-
ap.add_argument("--
|
| 227 |
args = ap.parse_args()
|
| 228 |
|
| 229 |
out_dir = args.out or os.path.join(args.root, "splits")
|
| 230 |
Path(out_dir).mkdir(parents=True, exist_ok=True)
|
| 231 |
|
| 232 |
-
|
|
|
|
|
|
|
|
|
|
| 233 |
splits = {}
|
| 234 |
found_any_official = False
|
|
|
|
|
|
|
| 235 |
for split in ["train", "valid", "test"]:
|
| 236 |
try:
|
| 237 |
data = load_outfits_json(args.root, split)
|
| 238 |
splits[split] = data
|
| 239 |
if data:
|
| 240 |
found_any_official = True
|
|
|
|
| 241 |
except FileNotFoundError as e:
|
| 242 |
-
print(f"Skipping {split}: {e}")
|
| 243 |
splits[split] = []
|
| 244 |
|
| 245 |
-
if
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 259 |
|
|
|
|
| 260 |
for split, outfits in splits.items():
|
| 261 |
if not outfits:
|
|
|
|
| 262 |
continue
|
|
|
|
|
|
|
|
|
|
| 263 |
all_items = collect_all_items(outfits)
|
|
|
|
|
|
|
| 264 |
triplets = build_triplets(outfits, all_items, max_triplets=args.max_triplets)
|
|
|
|
|
|
|
| 265 |
pairs = build_outfit_pairs(outfits, num_negatives_per_pos=args.neg_per_pos)
|
| 266 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 267 |
with open(os.path.join(out_dir, f"{split}.json"), "w") as f:
|
| 268 |
-
json.dump(triplets, f)
|
|
|
|
| 269 |
with open(os.path.join(out_dir, f"outfits_{split}.json"), "w") as f:
|
| 270 |
-
json.dump(pairs, f)
|
| 271 |
-
|
| 272 |
with open(os.path.join(out_dir, f"outfit_triplets_{split}.json"), "w") as f:
|
| 273 |
-
json.dump(
|
| 274 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 275 |
|
| 276 |
|
| 277 |
if __name__ == "__main__":
|
|
|
|
| 7 |
|
| 8 |
|
| 9 |
def _normalize_outfits(obj: Union[List[Any], Dict[str, Any]]) -> List[Dict[str, Any]]:
|
| 10 |
+
"""Normalize various Polyvore JSON formats into a list of {"items": [id,...]} dicts."""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
result: List[Dict[str, Any]] = []
|
| 12 |
+
|
| 13 |
if isinstance(obj, dict):
|
| 14 |
+
# Handle case where the file contains outfit_id -> outfit_data mapping
|
| 15 |
+
for outfit_id, outfit_data in obj.items():
|
| 16 |
+
if isinstance(outfit_data, dict):
|
| 17 |
+
if "items" in outfit_data:
|
| 18 |
+
items = outfit_data["items"]
|
| 19 |
+
if isinstance(items, list):
|
| 20 |
+
if items and isinstance(items[0], dict):
|
| 21 |
+
# Extract item IDs from dict format
|
| 22 |
+
item_ids = []
|
| 23 |
+
for item in items:
|
| 24 |
+
item_id = item.get("item_id") or item.get("id") or item.get("itemId")
|
| 25 |
+
if item_id is not None:
|
| 26 |
+
item_ids.append(str(item_id))
|
| 27 |
+
if item_ids:
|
| 28 |
+
result.append({"items": item_ids, "outfit_id": outfit_id})
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
else:
|
| 30 |
+
# Direct list of item IDs
|
| 31 |
+
result.append({"items": [str(x) for x in items], "outfit_id": outfit_id})
|
| 32 |
+
elif "set_id" in outfit_data:
|
| 33 |
+
# Alternative format with set_id
|
| 34 |
+
if "items" in outfit_data:
|
| 35 |
+
items = outfit_data["items"]
|
| 36 |
+
if isinstance(items, list):
|
| 37 |
+
if items and isinstance(items[0], dict):
|
| 38 |
+
item_ids = []
|
| 39 |
+
for item in items:
|
| 40 |
+
item_id = item.get("item_id") or item.get("id") or item.get("itemId")
|
| 41 |
+
if item_id is not None:
|
| 42 |
+
item_ids.append(str(item_id))
|
| 43 |
+
if item_ids:
|
| 44 |
+
result.append({"items": item_ids, "outfit_id": outfit_id})
|
| 45 |
+
else:
|
| 46 |
+
result.append({"items": [str(x) for x in items], "outfit_id": outfit_id})
|
| 47 |
+
elif isinstance(outfit_data, list):
|
| 48 |
+
# Direct list of item IDs
|
| 49 |
+
result.append({"items": [str(x) for x in outfit_data], "outfit_id": outfit_id})
|
| 50 |
+
|
| 51 |
+
elif isinstance(obj, list):
|
| 52 |
+
for item in obj:
|
| 53 |
+
if isinstance(item, dict):
|
| 54 |
+
if "items" in item:
|
| 55 |
+
items = item["items"]
|
| 56 |
+
if isinstance(items, list):
|
| 57 |
+
if items and isinstance(items[0], dict):
|
| 58 |
+
# Extract item IDs from dict format
|
| 59 |
+
item_ids = []
|
| 60 |
+
for it in items:
|
| 61 |
+
item_id = it.get("item_id") or it.get("id") or it.get("itemId")
|
| 62 |
+
if item_id is not None:
|
| 63 |
+
item_ids.append(str(item_id))
|
| 64 |
+
if item_ids:
|
| 65 |
+
result.append({"items": item_ids})
|
| 66 |
else:
|
| 67 |
+
# Direct list of item IDs
|
| 68 |
+
result.append({"items": [str(x) for x in items]})
|
| 69 |
+
elif "set_id" in item:
|
| 70 |
+
# Alternative format
|
| 71 |
+
if "items" in item:
|
| 72 |
+
items = item["items"]
|
| 73 |
+
if isinstance(items, list):
|
| 74 |
+
if items and isinstance(items[0], dict):
|
| 75 |
+
item_ids = []
|
| 76 |
+
for it in items:
|
| 77 |
+
item_id = it.get("item_id") or it.get("id") or it.get("itemId")
|
| 78 |
+
if item_id is not None:
|
| 79 |
+
item_ids.append(str(item_id))
|
| 80 |
+
if item_ids:
|
| 81 |
+
result.append({"items": item_ids})
|
| 82 |
+
else:
|
| 83 |
+
result.append({"items": [str(x) for x in items]})
|
| 84 |
+
elif isinstance(item, list):
|
| 85 |
+
# Direct list of item IDs
|
| 86 |
+
result.append({"items": [str(x) for x in item]})
|
| 87 |
+
|
| 88 |
return result
|
| 89 |
|
| 90 |
|
| 91 |
def load_outfits_json(root: str, split: str) -> List[Dict[str, Any]]:
|
| 92 |
+
"""Try to load outfit data from various possible locations and formats."""
|
| 93 |
candidates = [
|
| 94 |
os.path.join(root, f"{split}.json"),
|
| 95 |
os.path.join(root, f"{split}_no_dup.json"),
|
|
|
|
| 99 |
os.path.join(root, "nondisjoint", f"{split}.json"),
|
| 100 |
os.path.join(root, "disjoint", f"{split}.json"),
|
| 101 |
]
|
| 102 |
+
|
| 103 |
for p in candidates:
|
| 104 |
if os.path.exists(p):
|
| 105 |
+
try:
|
| 106 |
+
with open(p, "r") as f:
|
| 107 |
+
raw = json.load(f)
|
| 108 |
+
data = _normalize_outfits(raw)
|
| 109 |
+
if data:
|
| 110 |
+
print(f"β
Loaded {len(data)} outfits from {p}")
|
| 111 |
+
return data
|
| 112 |
+
except Exception as e:
|
| 113 |
+
print(f"β οΈ Failed to load {p}: {e}")
|
| 114 |
+
continue
|
| 115 |
+
|
| 116 |
raise FileNotFoundError(f"Could not find usable {split} split in {root} or {root}/splits")
|
| 117 |
|
| 118 |
|
| 119 |
+
def extract_outfits_from_metadata(root: str) -> List[Dict[str, Any]]:
|
| 120 |
+
"""Extract outfit information from polyvore_item_metadata.json using set_id grouping."""
|
| 121 |
+
print("π Extracting outfits from metadata using set_id grouping...")
|
| 122 |
+
|
| 123 |
+
metadata_path = os.path.join(root, "polyvore_item_metadata.json")
|
| 124 |
+
if not os.path.exists(metadata_path):
|
| 125 |
+
print(f"β Metadata file not found: {metadata_path}")
|
| 126 |
+
return []
|
| 127 |
+
|
| 128 |
+
try:
|
| 129 |
+
with open(metadata_path, "r") as f:
|
| 130 |
+
metadata = json.load(f)
|
| 131 |
+
|
| 132 |
+
if not isinstance(metadata, dict):
|
| 133 |
+
print("β Metadata is not a dictionary")
|
| 134 |
+
return []
|
| 135 |
+
|
| 136 |
+
# Group items by set_id to create outfits
|
| 137 |
+
outfits_by_set = {}
|
| 138 |
+
for item_id, item_data in metadata.items():
|
| 139 |
+
if isinstance(item_data, dict) and "set_id" in item_data:
|
| 140 |
+
set_id = item_data["set_id"]
|
| 141 |
+
if set_id not in outfits_by_set:
|
| 142 |
+
outfits_by_set[set_id] = []
|
| 143 |
+
outfits_by_set[set_id].append(str(item_id))
|
| 144 |
+
|
| 145 |
+
# Convert to outfit format
|
| 146 |
+
outfits = []
|
| 147 |
+
for set_id, item_ids in outfits_by_set.items():
|
| 148 |
+
if len(item_ids) >= 2: # Minimum outfit size
|
| 149 |
+
outfits.append({
|
| 150 |
+
"items": item_ids,
|
| 151 |
+
"set_id": set_id,
|
| 152 |
+
"outfit_id": f"set_{set_id}"
|
| 153 |
+
})
|
| 154 |
+
|
| 155 |
+
print(f"β
Extracted {len(outfits)} outfits from metadata (set_id grouping)")
|
| 156 |
+
return outfits
|
| 157 |
+
|
| 158 |
+
except Exception as e:
|
| 159 |
+
print(f"β Failed to parse metadata: {e}")
|
| 160 |
+
return []
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
def extract_outfits_from_titles(root: str) -> List[Dict[str, Any]]:
|
| 164 |
+
"""Extract outfit information from polyvore_outfit_titles.json."""
|
| 165 |
+
print("π Extracting outfits from outfit titles...")
|
| 166 |
+
|
| 167 |
+
titles_path = os.path.join(root, "polyvore_outfit_titles.json")
|
| 168 |
+
if not os.path.exists(titles_path):
|
| 169 |
+
print(f"β Titles file not found: {titles_path}")
|
| 170 |
+
return []
|
| 171 |
+
|
| 172 |
+
try:
|
| 173 |
+
with open(titles_path, "r") as f:
|
| 174 |
+
titles = json.load(f)
|
| 175 |
+
|
| 176 |
+
if not isinstance(titles, dict):
|
| 177 |
+
print("β Titles is not a dictionary")
|
| 178 |
+
return []
|
| 179 |
+
|
| 180 |
+
outfits = []
|
| 181 |
+
for outfit_id, outfit_data in titles.items():
|
| 182 |
+
if isinstance(outfit_data, dict) and "items" in outfit_data:
|
| 183 |
+
items = outfit_data["items"]
|
| 184 |
+
if isinstance(items, list) and len(items) >= 2:
|
| 185 |
+
# Convert all items to strings
|
| 186 |
+
item_ids = [str(x) for x in items]
|
| 187 |
+
outfits.append({
|
| 188 |
+
"items": item_ids,
|
| 189 |
+
"outfit_id": outfit_id
|
| 190 |
+
})
|
| 191 |
+
|
| 192 |
+
print(f"β
Extracted {len(outfits)} outfits from titles")
|
| 193 |
+
return outfits
|
| 194 |
+
|
| 195 |
+
except Exception as e:
|
| 196 |
+
print(f"β Failed to parse titles: {e}")
|
| 197 |
+
return []
|
| 198 |
+
|
| 199 |
+
|
| 200 |
def try_load_any_outfits(root: str) -> List[Dict[str, Any]]:
|
| 201 |
+
"""Try to load outfits from any available source, prioritizing official splits."""
|
| 202 |
merged: List[Dict[str, Any]] = []
|
| 203 |
+
|
| 204 |
+
# First try official splits (nondisjoint and disjoint)
|
| 205 |
+
print("π Looking for official splits...")
|
| 206 |
+
for split in ["train", "valid", "test"]:
|
| 207 |
try:
|
| 208 |
+
data = load_outfits_json(root, split)
|
| 209 |
+
merged.extend(data)
|
| 210 |
+
print(f"β
Found {split} split with {len(data)} outfits")
|
| 211 |
except FileNotFoundError:
|
| 212 |
+
print(f"β οΈ No {split} split found")
|
| 213 |
continue
|
| 214 |
+
|
| 215 |
if merged:
|
| 216 |
+
print(f"β
Total: {len(merged)} outfits from official splits")
|
| 217 |
return merged
|
| 218 |
+
|
| 219 |
+
# If no official splits, try to extract from metadata
|
| 220 |
+
print("π§ No official splits found, extracting from metadata...")
|
| 221 |
+
|
| 222 |
+
# Try metadata first (more reliable)
|
| 223 |
+
outfits = extract_outfits_from_metadata(root)
|
| 224 |
+
if outfits:
|
| 225 |
+
return outfits
|
| 226 |
+
|
| 227 |
+
# Try titles as fallback
|
| 228 |
+
outfits = extract_outfits_from_titles(root)
|
| 229 |
+
if outfits:
|
| 230 |
+
return outfits
|
| 231 |
+
|
| 232 |
+
print("β No outfits could be extracted from any source")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 233 |
return []
|
| 234 |
|
| 235 |
|
| 236 |
def collect_all_items(outfits: List[Dict[str, Any]]) -> List[str]:
|
| 237 |
+
"""Collect all unique item IDs from outfits."""
|
| 238 |
s: Set[str] = set()
|
| 239 |
for o in outfits:
|
| 240 |
for it in o.get("items", []):
|
| 241 |
s.add(str(it))
|
| 242 |
+
return sorted(list(s))
|
| 243 |
|
| 244 |
|
| 245 |
def build_triplets(outfits: List[Dict[str, Any]], all_items: List[str], max_triplets: int = 200000) -> List[Dict[str, str]]:
|
| 246 |
+
"""Build training triplets from outfits."""
|
| 247 |
rng = random.Random(42)
|
| 248 |
all_items_set = set(all_items)
|
| 249 |
triplets: List[Dict[str, str]] = []
|
| 250 |
+
|
| 251 |
for o in outfits:
|
| 252 |
items = [str(i) for i in o.get("items", [])]
|
| 253 |
if len(items) < 2:
|
| 254 |
continue
|
| 255 |
+
|
| 256 |
local_set = set(items)
|
| 257 |
for i in range(len(items) - 1):
|
| 258 |
a = items[i]
|
| 259 |
p = items[i + 1]
|
| 260 |
+
|
| 261 |
+
# Pick a negative not in this outfit
|
| 262 |
negatives = list(all_items_set - local_set)
|
| 263 |
if not negatives:
|
| 264 |
continue
|
| 265 |
+
|
| 266 |
n = rng.choice(negatives)
|
| 267 |
triplets.append({"anchor": a, "positive": p, "negative": n})
|
| 268 |
+
|
| 269 |
if len(triplets) >= max_triplets:
|
| 270 |
return triplets
|
| 271 |
+
|
| 272 |
return triplets
|
| 273 |
|
| 274 |
|
| 275 |
def build_outfit_pairs(outfits: List[Dict[str, Any]], num_negatives_per_pos: int = 1) -> List[Dict[str, Any]]:
|
| 276 |
+
"""Build outfit pairs for training."""
|
| 277 |
rng = random.Random(123)
|
| 278 |
all_items = collect_all_items(outfits)
|
| 279 |
all_set = set(all_items)
|
| 280 |
pairs: List[Dict[str, Any]] = []
|
| 281 |
+
|
| 282 |
# Positive samples
|
| 283 |
for o in outfits:
|
| 284 |
items = [str(i) for i in o.get("items", [])]
|
| 285 |
if len(items) < 2:
|
| 286 |
continue
|
| 287 |
+
|
| 288 |
pairs.append({"items": items, "label": 1})
|
| 289 |
+
|
| 290 |
# Negative by corrupting one item
|
| 291 |
for _ in range(num_negatives_per_pos):
|
| 292 |
if not items:
|
| 293 |
continue
|
| 294 |
+
|
| 295 |
idx = rng.randrange(len(items))
|
| 296 |
neg_pool = list(all_set - set(items))
|
| 297 |
if not neg_pool:
|
| 298 |
continue
|
| 299 |
+
|
| 300 |
neg_item = rng.choice(neg_pool)
|
| 301 |
neg_items = items.copy()
|
| 302 |
neg_items[idx] = neg_item
|
| 303 |
pairs.append({"items": neg_items, "label": 0})
|
| 304 |
+
|
| 305 |
return pairs
|
| 306 |
|
| 307 |
|
| 308 |
def build_outfit_triplets(outfits: List[Dict[str, Any]], num_triplets: int = 200000) -> List[Dict[str, Any]]:
|
| 309 |
+
"""Build outfit-level triplets for ViT training."""
|
| 310 |
rng = random.Random(999)
|
| 311 |
+
|
| 312 |
+
# Collect only valid positive outfits (len >= 3)
|
| 313 |
pos = [o for o in outfits if len(o.get("items", [])) >= 3]
|
| 314 |
+
|
| 315 |
+
if len(pos) < 2:
|
| 316 |
+
print(f"β οΈ Only {len(pos)} valid outfits found, need at least 2 for triplets")
|
| 317 |
+
return []
|
| 318 |
+
|
| 319 |
all_items = collect_all_items(outfits)
|
| 320 |
all_set = set(all_items)
|
| 321 |
triplets: List[Dict[str, Any]] = []
|
| 322 |
+
|
| 323 |
+
for _ in range(min(num_triplets, len(pos) * 10)): # Limit based on available outfits
|
| 324 |
if len(pos) < 2:
|
| 325 |
break
|
| 326 |
+
|
| 327 |
ga = rng.choice(pos)
|
| 328 |
gb = rng.choice(pos)
|
| 329 |
+
|
| 330 |
# Ensure ga != gb
|
| 331 |
if ga is gb:
|
| 332 |
continue
|
| 333 |
+
|
| 334 |
# Create bad by corrupting one item in ga
|
| 335 |
items_ga = [str(i) for i in ga.get("items", [])]
|
| 336 |
if not items_ga:
|
| 337 |
continue
|
| 338 |
+
|
| 339 |
corrupt_idx = rng.randrange(len(items_ga))
|
| 340 |
neg_pool = list(all_set - set(items_ga))
|
| 341 |
if not neg_pool:
|
| 342 |
continue
|
| 343 |
+
|
| 344 |
neg_item = rng.choice(neg_pool)
|
| 345 |
bad = items_ga.copy()
|
| 346 |
bad[corrupt_idx] = neg_item
|
| 347 |
+
|
| 348 |
+
triplets.append({
|
| 349 |
+
"good_a": items_ga,
|
| 350 |
+
"good_b": [str(i) for i in gb.get("items", [])],
|
| 351 |
+
"bad": bad
|
| 352 |
+
})
|
| 353 |
+
|
| 354 |
return triplets
|
| 355 |
|
| 356 |
|
|
|
|
| 360 |
ap.add_argument("--out", type=str, default=None, help="Output directory for splits (default: <root>/splits)")
|
| 361 |
ap.add_argument("--max_triplets", type=int, default=200000)
|
| 362 |
ap.add_argument("--neg_per_pos", type=int, default=1)
|
| 363 |
+
ap.add_argument("--force_random_split", action="store_true", help="Force random split creation (not recommended)")
|
| 364 |
args = ap.parse_args()
|
| 365 |
|
| 366 |
out_dir = args.out or os.path.join(args.root, "splits")
|
| 367 |
Path(out_dir).mkdir(parents=True, exist_ok=True)
|
| 368 |
|
| 369 |
+
print(f"π Preparing Polyvore dataset from {args.root}")
|
| 370 |
+
print(f"π Output directory: {out_dir}")
|
| 371 |
+
|
| 372 |
+
# Always try to use official splits first
|
| 373 |
splits = {}
|
| 374 |
found_any_official = False
|
| 375 |
+
|
| 376 |
+
print("π― Looking for official splits...")
|
| 377 |
for split in ["train", "valid", "test"]:
|
| 378 |
try:
|
| 379 |
data = load_outfits_json(args.root, split)
|
| 380 |
splits[split] = data
|
| 381 |
if data:
|
| 382 |
found_any_official = True
|
| 383 |
+
print(f"β
Loaded {split} split: {len(data)} outfits")
|
| 384 |
except FileNotFoundError as e:
|
| 385 |
+
print(f"β οΈ Skipping {split}: {e}")
|
| 386 |
splits[split] = []
|
| 387 |
|
| 388 |
+
if found_any_official:
|
| 389 |
+
print("π Using official splits from dataset!")
|
| 390 |
+
else:
|
| 391 |
+
print("β οΈ No official splits found")
|
| 392 |
+
|
| 393 |
+
if args.force_random_split:
|
| 394 |
+
print("π§ Creating random split (not recommended for production)...")
|
| 395 |
+
all_outfits = try_load_any_outfits(args.root)
|
| 396 |
+
|
| 397 |
+
if not all_outfits:
|
| 398 |
+
print("β No outfits found to split. Please check dataset structure.")
|
| 399 |
+
print("π Expected files:")
|
| 400 |
+
print(" - train.json, valid.json, test.json")
|
| 401 |
+
print(" - nondisjoint/train.json, etc.")
|
| 402 |
+
print(" - polyvore_item_metadata.json")
|
| 403 |
+
print(" - polyvore_outfit_titles.json")
|
| 404 |
+
return
|
| 405 |
+
|
| 406 |
+
print(f"π― Creating random split from {len(all_outfits)} outfits")
|
| 407 |
+
rng = random.Random(2024)
|
| 408 |
+
rng.shuffle(all_outfits)
|
| 409 |
+
|
| 410 |
+
n = len(all_outfits)
|
| 411 |
+
n_train = int(0.7 * n)
|
| 412 |
+
n_valid = int(0.1 * n)
|
| 413 |
+
|
| 414 |
+
splits = {
|
| 415 |
+
"train": all_outfits[:n_train],
|
| 416 |
+
"valid": all_outfits[n_train:n_train + n_valid],
|
| 417 |
+
"test": all_outfits[n_train + n_valid:],
|
| 418 |
+
}
|
| 419 |
+
|
| 420 |
+
print(f"π Split created: train={n_train}, valid={n_valid}, test={n-n_train-n_valid}")
|
| 421 |
+
else:
|
| 422 |
+
print("β Random split creation disabled. Use --force_random_split if needed.")
|
| 423 |
+
print("π§ Please ensure official splits are available in nondisjoint/ or disjoint/ folders.")
|
| 424 |
+
return
|
| 425 |
|
| 426 |
+
# Generate training data for each split
|
| 427 |
for split, outfits in splits.items():
|
| 428 |
if not outfits:
|
| 429 |
+
print(f"β οΈ No outfits for {split} split, skipping")
|
| 430 |
continue
|
| 431 |
+
|
| 432 |
+
print(f"\nπ§ Processing {split} split ({len(outfits)} outfits)...")
|
| 433 |
+
|
| 434 |
all_items = collect_all_items(outfits)
|
| 435 |
+
print(f" π¦ Total unique items: {len(all_items)}")
|
| 436 |
+
|
| 437 |
triplets = build_triplets(outfits, all_items, max_triplets=args.max_triplets)
|
| 438 |
+
print(f" π Generated {len(triplets)} item triplets")
|
| 439 |
+
|
| 440 |
pairs = build_outfit_pairs(outfits, num_negatives_per_pos=args.neg_per_pos)
|
| 441 |
+
print(f" π Generated {len(pairs)} outfit pairs")
|
| 442 |
+
|
| 443 |
+
outfit_triplets = build_outfit_triplets(outfits)
|
| 444 |
+
print(f" π Generated {len(outfit_triplets)} outfit triplets")
|
| 445 |
+
|
| 446 |
+
# Save files
|
| 447 |
with open(os.path.join(out_dir, f"{split}.json"), "w") as f:
|
| 448 |
+
json.dump(triplets, f, indent=2)
|
| 449 |
+
|
| 450 |
with open(os.path.join(out_dir, f"outfits_{split}.json"), "w") as f:
|
| 451 |
+
json.dump(pairs, f, indent=2)
|
| 452 |
+
|
| 453 |
with open(os.path.join(out_dir, f"outfit_triplets_{split}.json"), "w") as f:
|
| 454 |
+
json.dump(outfit_triplets, f, indent=2)
|
| 455 |
+
|
| 456 |
+
print(f" πΎ Saved {split} data to {out_dir}")
|
| 457 |
+
|
| 458 |
+
print(f"\nπ Dataset preparation complete!")
|
| 459 |
+
print(f"π All files saved to: {out_dir}")
|
| 460 |
+
|
| 461 |
+
if found_any_official:
|
| 462 |
+
print("β
Used official dataset splits - production ready!")
|
| 463 |
+
else:
|
| 464 |
+
print("β οΈ Used random splits - not recommended for production")
|
| 465 |
|
| 466 |
|
| 467 |
if __name__ == "__main__":
|
startup_fix.py
ADDED
|
@@ -0,0 +1,306 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Startup Fix Script for Dressify
|
| 4 |
+
Handles dataset preparation issues and ensures system startup
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
import sys
|
| 9 |
+
import subprocess
|
| 10 |
+
import time
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
def check_dataset_status():
|
| 15 |
+
"""Check the current dataset status."""
|
| 16 |
+
print("π Checking dataset status...")
|
| 17 |
+
|
| 18 |
+
root = os.path.abspath(os.path.join(os.getcwd(), "data", "Polyvore"))
|
| 19 |
+
|
| 20 |
+
if not os.path.exists(root):
|
| 21 |
+
print(f"β Dataset directory not found: {root}")
|
| 22 |
+
return False
|
| 23 |
+
|
| 24 |
+
# Check key components
|
| 25 |
+
images_dir = os.path.join(root, "images")
|
| 26 |
+
splits_dir = os.path.join(root, "splits")
|
| 27 |
+
|
| 28 |
+
has_images = os.path.isdir(images_dir) and any(Path(images_dir).glob("*"))
|
| 29 |
+
has_splits = os.path.isdir(splits_dir) and any(Path(splits_dir).glob("*.json"))
|
| 30 |
+
|
| 31 |
+
print(f"π Dataset root: {root}")
|
| 32 |
+
print(f"πΌοΈ Images: {'β
' if has_images else 'β'} ({images_dir})")
|
| 33 |
+
print(f"π Splits: {'β
' if has_splits else 'β'} ({splits_dir})")
|
| 34 |
+
|
| 35 |
+
# Check for official splits
|
| 36 |
+
official_splits = []
|
| 37 |
+
for location in ["nondisjoint", "disjoint"]:
|
| 38 |
+
location_path = os.path.join(root, location)
|
| 39 |
+
if os.path.exists(location_path):
|
| 40 |
+
for split in ["train", "valid", "test"]:
|
| 41 |
+
split_file = os.path.join(location_path, f"{split}.json")
|
| 42 |
+
if os.path.exists(split_file):
|
| 43 |
+
size_mb = os.path.getsize(split_file) / (1024 * 1024)
|
| 44 |
+
official_splits.append(f"{location}/{split}.json ({size_mb:.1f} MB)")
|
| 45 |
+
|
| 46 |
+
if official_splits:
|
| 47 |
+
print(f"π― Official splits found:")
|
| 48 |
+
for split in official_splits:
|
| 49 |
+
print(f" β
{split}")
|
| 50 |
+
|
| 51 |
+
if has_images and has_splits:
|
| 52 |
+
print("β
Dataset is ready!")
|
| 53 |
+
return True
|
| 54 |
+
elif has_images:
|
| 55 |
+
print("β οΈ Images present but splits missing - will create splits from official data")
|
| 56 |
+
return "needs_splits"
|
| 57 |
+
else:
|
| 58 |
+
print("β Dataset incomplete - needs full preparation")
|
| 59 |
+
return False
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def prepare_dataset():
|
| 63 |
+
"""Prepare the dataset using the improved scripts."""
|
| 64 |
+
print("\nπ Preparing dataset...")
|
| 65 |
+
|
| 66 |
+
root = os.path.abspath(os.path.join(os.getcwd(), "data", "Polyvore"))
|
| 67 |
+
|
| 68 |
+
# First, ensure the data fetcher runs
|
| 69 |
+
try:
|
| 70 |
+
print("π₯ Running data fetcher...")
|
| 71 |
+
from utils.data_fetch import ensure_dataset_ready
|
| 72 |
+
dataset_root = ensure_dataset_ready()
|
| 73 |
+
|
| 74 |
+
if not dataset_root:
|
| 75 |
+
print("β Data fetcher failed")
|
| 76 |
+
return False
|
| 77 |
+
|
| 78 |
+
print(f"β
Data fetcher completed: {dataset_root}")
|
| 79 |
+
|
| 80 |
+
except Exception as e:
|
| 81 |
+
print(f"β Data fetcher error: {e}")
|
| 82 |
+
return False
|
| 83 |
+
|
| 84 |
+
# Now run the dataset preparation script (without random splits)
|
| 85 |
+
try:
|
| 86 |
+
print("π§ Running dataset preparation...")
|
| 87 |
+
|
| 88 |
+
# Check if prepare_polyvore.py exists
|
| 89 |
+
prep_script = "scripts/prepare_polyvore.py"
|
| 90 |
+
if not os.path.exists(prep_script):
|
| 91 |
+
prep_script = "prepare_polyvore.py"
|
| 92 |
+
|
| 93 |
+
if not os.path.exists(prep_script):
|
| 94 |
+
print(f"β Prepare script not found: {prep_script}")
|
| 95 |
+
return False
|
| 96 |
+
|
| 97 |
+
# Run the preparation script WITHOUT random splits
|
| 98 |
+
cmd = [
|
| 99 |
+
sys.executable, prep_script,
|
| 100 |
+
"--root", root
|
| 101 |
+
# Note: NOT using --force_random_split
|
| 102 |
+
]
|
| 103 |
+
|
| 104 |
+
print(f"π§ Running: {' '.join(cmd)}")
|
| 105 |
+
print("π― This will use official splits from nondisjoint/ and disjoint/ folders")
|
| 106 |
+
|
| 107 |
+
result = subprocess.run(cmd, capture_output=True, text=True, check=False)
|
| 108 |
+
|
| 109 |
+
if result.returncode == 0:
|
| 110 |
+
print("β
Dataset preparation completed successfully!")
|
| 111 |
+
print("π Output:")
|
| 112 |
+
print(result.stdout)
|
| 113 |
+
return True
|
| 114 |
+
else:
|
| 115 |
+
print("β Dataset preparation failed!")
|
| 116 |
+
print("π Error output:")
|
| 117 |
+
print(result.stderr)
|
| 118 |
+
print("π Standard output:")
|
| 119 |
+
print(result.stdout)
|
| 120 |
+
|
| 121 |
+
# Check if it's because official splits are missing
|
| 122 |
+
if "No official splits found" in result.stderr or "No official splits found" in result.stdout:
|
| 123 |
+
print("\nπ§ Issue: Official splits not found in nondisjoint/ or disjoint/ folders")
|
| 124 |
+
print("π Expected structure:")
|
| 125 |
+
print(" data/Polyvore/")
|
| 126 |
+
print(" βββ nondisjoint/")
|
| 127 |
+
print(" β βββ train.json")
|
| 128 |
+
print(" β βββ valid.json")
|
| 129 |
+
print(" β βββ test.json")
|
| 130 |
+
print(" βββ disjoint/")
|
| 131 |
+
print(" β βββ train.json")
|
| 132 |
+
print(" β βββ valid.json")
|
| 133 |
+
print(" β βββ test.json")
|
| 134 |
+
print(" βββ images/")
|
| 135 |
+
|
| 136 |
+
print("\nπ‘ Solution: The dataset should have been downloaded with official splits.")
|
| 137 |
+
print(" Check if the Hugging Face download completed successfully.")
|
| 138 |
+
|
| 139 |
+
return False
|
| 140 |
+
|
| 141 |
+
except Exception as e:
|
| 142 |
+
print(f"β Dataset preparation error: {e}")
|
| 143 |
+
return False
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
def verify_splits():
|
| 147 |
+
"""Verify that splits were created successfully."""
|
| 148 |
+
print("\nπ Verifying splits...")
|
| 149 |
+
|
| 150 |
+
root = os.path.abspath(os.path.join(os.getcwd(), "data", "Polyvore"))
|
| 151 |
+
splits_dir = os.path.join(root, "splits")
|
| 152 |
+
|
| 153 |
+
if not os.path.exists(splits_dir):
|
| 154 |
+
print("β Splits directory not found")
|
| 155 |
+
return False
|
| 156 |
+
|
| 157 |
+
required_files = [
|
| 158 |
+
"train.json",
|
| 159 |
+
"outfits_train.json",
|
| 160 |
+
"outfit_triplets_train.json"
|
| 161 |
+
]
|
| 162 |
+
|
| 163 |
+
missing_files = []
|
| 164 |
+
for file_name in required_files:
|
| 165 |
+
file_path = os.path.join(splits_dir, file_name)
|
| 166 |
+
if os.path.exists(file_path):
|
| 167 |
+
size_mb = os.path.getsize(file_path) / (1024 * 1024)
|
| 168 |
+
print(f"β
{file_name}: {size_mb:.1f} MB")
|
| 169 |
+
else:
|
| 170 |
+
print(f"β {file_name}: Missing")
|
| 171 |
+
missing_files.append(file_name)
|
| 172 |
+
|
| 173 |
+
if missing_files:
|
| 174 |
+
print(f"β Missing required files: {missing_files}")
|
| 175 |
+
return False
|
| 176 |
+
|
| 177 |
+
print("β
All required splits verified!")
|
| 178 |
+
return True
|
| 179 |
+
|
| 180 |
+
|
| 181 |
+
def test_training_scripts():
|
| 182 |
+
"""Test that training scripts can run without errors."""
|
| 183 |
+
print("\nπ§ͺ Testing training scripts...")
|
| 184 |
+
|
| 185 |
+
# Test ResNet training script
|
| 186 |
+
try:
|
| 187 |
+
print("π§ Testing ResNet training script...")
|
| 188 |
+
from models.resnet_embedder import ResNetItemEmbedder
|
| 189 |
+
print("β
ResNet model imports successfully")
|
| 190 |
+
except Exception as e:
|
| 191 |
+
print(f"β ResNet model import failed: {e}")
|
| 192 |
+
return False
|
| 193 |
+
|
| 194 |
+
# Test ViT training script
|
| 195 |
+
try:
|
| 196 |
+
print("π§ Testing ViT training script...")
|
| 197 |
+
from models.vit_outfit import OutfitCompatibilityModel
|
| 198 |
+
print("β
ViT model imports successfully")
|
| 199 |
+
except Exception as e:
|
| 200 |
+
print(f"β ViT model import failed: {e}")
|
| 201 |
+
return False
|
| 202 |
+
|
| 203 |
+
print("β
All training scripts tested successfully!")
|
| 204 |
+
return True
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def create_quick_start_script():
|
| 208 |
+
"""Create a quick start script for easy testing."""
|
| 209 |
+
script_content = """#!/bin/bash
|
| 210 |
+
# Quick Start Script for Dressify
|
| 211 |
+
# This script will prepare the dataset and start training
|
| 212 |
+
|
| 213 |
+
echo "π Dressify Quick Start"
|
| 214 |
+
echo "========================"
|
| 215 |
+
|
| 216 |
+
# Check if dataset is ready
|
| 217 |
+
if [ -d "data/Polyvore/splits" ] && [ -f "data/Polyvore/splits/train.json" ]; then
|
| 218 |
+
echo "β
Dataset is ready!"
|
| 219 |
+
else
|
| 220 |
+
echo "π§ Preparing dataset..."
|
| 221 |
+
python startup_fix.py
|
| 222 |
+
fi
|
| 223 |
+
|
| 224 |
+
# Start quick training
|
| 225 |
+
echo "π― Starting quick training..."
|
| 226 |
+
python train_resnet.py --data_root data/Polyvore --epochs 3 --out models/exports/resnet_quick.pth
|
| 227 |
+
|
| 228 |
+
echo "π Quick start completed!"
|
| 229 |
+
echo "π Check models/exports/ for trained models"
|
| 230 |
+
"""
|
| 231 |
+
|
| 232 |
+
script_path = "quick_start.sh"
|
| 233 |
+
with open(script_path, "w") as f:
|
| 234 |
+
f.write(script_content)
|
| 235 |
+
|
| 236 |
+
# Make executable
|
| 237 |
+
os.chmod(script_path, 0o755)
|
| 238 |
+
print(f"π Created quick start script: {script_path}")
|
| 239 |
+
|
| 240 |
+
|
| 241 |
+
def main():
|
| 242 |
+
"""Main startup fix routine."""
|
| 243 |
+
print("π Dressify Startup Fix")
|
| 244 |
+
print("=" * 50)
|
| 245 |
+
|
| 246 |
+
# Check current status
|
| 247 |
+
status = check_dataset_status()
|
| 248 |
+
|
| 249 |
+
if status is True:
|
| 250 |
+
print("β
System is ready to go!")
|
| 251 |
+
return True
|
| 252 |
+
|
| 253 |
+
elif status == "needs_splits":
|
| 254 |
+
print("π§ Dataset needs splits created from official data...")
|
| 255 |
+
if prepare_dataset():
|
| 256 |
+
if verify_splits():
|
| 257 |
+
print("β
Dataset preparation completed successfully!")
|
| 258 |
+
return True
|
| 259 |
+
else:
|
| 260 |
+
print("β Split verification failed")
|
| 261 |
+
return False
|
| 262 |
+
else:
|
| 263 |
+
print("β Dataset preparation failed")
|
| 264 |
+
return False
|
| 265 |
+
|
| 266 |
+
else:
|
| 267 |
+
print("π§ Dataset needs full preparation...")
|
| 268 |
+
if prepare_dataset():
|
| 269 |
+
if verify_splits():
|
| 270 |
+
print("β
Dataset preparation completed successfully!")
|
| 271 |
+
return True
|
| 272 |
+
else:
|
| 273 |
+
print("β Split verification failed")
|
| 274 |
+
return False
|
| 275 |
+
else:
|
| 276 |
+
print("β Dataset preparation failed")
|
| 277 |
+
return False
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
if __name__ == "__main__":
|
| 281 |
+
try:
|
| 282 |
+
success = main()
|
| 283 |
+
|
| 284 |
+
if success:
|
| 285 |
+
print("\nπ Startup fix completed successfully!")
|
| 286 |
+
print("π Your Dressify system is ready to use!")
|
| 287 |
+
|
| 288 |
+
# Create quick start script
|
| 289 |
+
create_quick_start_script()
|
| 290 |
+
|
| 291 |
+
print("\nπ Next steps:")
|
| 292 |
+
print("1. Run: python app.py")
|
| 293 |
+
print("2. Or use: ./quick_start.sh")
|
| 294 |
+
print("3. Check the Advanced Training tab for parameter controls")
|
| 295 |
+
|
| 296 |
+
else:
|
| 297 |
+
print("\nβ Startup fix failed!")
|
| 298 |
+
print("π§ Please check the error messages above")
|
| 299 |
+
print("π Contact support if issues persist")
|
| 300 |
+
|
| 301 |
+
except KeyboardInterrupt:
|
| 302 |
+
print("\nβΉοΈ Startup fix interrupted by user")
|
| 303 |
+
except Exception as e:
|
| 304 |
+
print(f"\nπ₯ Unexpected error: {e}")
|
| 305 |
+
import traceback
|
| 306 |
+
traceback.print_exc()
|
utils/data_fetch.py
CHANGED
|
@@ -12,18 +12,43 @@ def _unzip_images_if_needed(root: str) -> None:
|
|
| 12 |
"""
|
| 13 |
images_dir = os.path.join(root, "images")
|
| 14 |
if os.path.isdir(images_dir) and any(Path(images_dir).glob("*")):
|
|
|
|
| 15 |
return
|
|
|
|
| 16 |
# Common zip names at root or subfolders
|
| 17 |
candidates = [os.path.join(root, name) for name in ("images.zip", "polyvore-images.zip", "imgs.zip")]
|
| 18 |
# Also search recursively for any *images*.zip
|
| 19 |
for p in Path(root).rglob("*images*.zip"):
|
| 20 |
candidates.append(str(p))
|
|
|
|
| 21 |
for zpath in candidates:
|
| 22 |
if os.path.isfile(zpath):
|
|
|
|
|
|
|
| 23 |
os.makedirs(images_dir, exist_ok=True)
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
|
| 29 |
def ensure_dataset_ready() -> Optional[str]:
|
|
@@ -36,17 +61,37 @@ def ensure_dataset_ready() -> Optional[str]:
|
|
| 36 |
root = os.path.abspath(os.path.join(os.getcwd(), "data", "Polyvore"))
|
| 37 |
Path(root).mkdir(parents=True, exist_ok=True)
|
| 38 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 39 |
# If images are already present, don't return early; still ensure metadata JSONs exist
|
| 40 |
-
|
|
|
|
| 41 |
|
| 42 |
# Download the HF dataset snapshot into root
|
| 43 |
try:
|
|
|
|
|
|
|
| 44 |
# Only fetch what's needed to run and prepare splits
|
| 45 |
allow = [
|
| 46 |
"images.zip",
|
| 47 |
# root-level (some mirrors place jsons here)
|
| 48 |
"train.json",
|
| 49 |
-
"valid.json",
|
| 50 |
"test.json",
|
| 51 |
# official splits often live here
|
| 52 |
"nondisjoint/train.json",
|
|
@@ -60,27 +105,27 @@ def ensure_dataset_ready() -> Optional[str]:
|
|
| 60 |
"polyvore_outfit_titles.json",
|
| 61 |
"categories.csv",
|
| 62 |
]
|
|
|
|
| 63 |
# Explicit ignores to prevent huge downloads (>10GB)
|
| 64 |
ignore = [
|
| 65 |
"**/*hglmm*",
|
| 66 |
-
"disjoint/**",
|
| 67 |
-
"nondisjoint/**",
|
| 68 |
-
"*/large/**",
|
| 69 |
"**/*.tar",
|
| 70 |
"**/*.tar.gz",
|
| 71 |
"**/*.7z",
|
|
|
|
| 72 |
]
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
]) and (
|
| 77 |
# any location providing official splits is acceptable
|
| 78 |
all(os.path.exists(os.path.join(root, f)) for f in ["train.json", "valid.json", "test.json"]) or
|
| 79 |
all(os.path.exists(os.path.join(root, "nondisjoint", f)) for f in ["train.json", "valid.json", "test.json"]) or
|
| 80 |
all(os.path.exists(os.path.join(root, "disjoint", f)) for f in ["train.json", "valid.json", "test.json"])
|
| 81 |
)
|
| 82 |
)
|
| 83 |
-
|
|
|
|
|
|
|
| 84 |
snapshot_download(
|
| 85 |
"Stylique/Polyvore",
|
| 86 |
repo_type="dataset",
|
|
@@ -89,12 +134,131 @@ def ensure_dataset_ready() -> Optional[str]:
|
|
| 89 |
allow_patterns=allow,
|
| 90 |
ignore_patterns=ignore,
|
| 91 |
)
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
|
| 96 |
# Unzip images if needed
|
| 97 |
_unzip_images_if_needed(root)
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
|
| 100 |
|
|
|
|
| 12 |
"""
|
| 13 |
images_dir = os.path.join(root, "images")
|
| 14 |
if os.path.isdir(images_dir) and any(Path(images_dir).glob("*")):
|
| 15 |
+
print(f"β
Images already present in {images_dir}")
|
| 16 |
return
|
| 17 |
+
|
| 18 |
# Common zip names at root or subfolders
|
| 19 |
candidates = [os.path.join(root, name) for name in ("images.zip", "polyvore-images.zip", "imgs.zip")]
|
| 20 |
# Also search recursively for any *images*.zip
|
| 21 |
for p in Path(root).rglob("*images*.zip"):
|
| 22 |
candidates.append(str(p))
|
| 23 |
+
|
| 24 |
for zpath in candidates:
|
| 25 |
if os.path.isfile(zpath):
|
| 26 |
+
print(f"π§ Found image archive: {zpath}")
|
| 27 |
+
print(f"π Extracting to: {images_dir}")
|
| 28 |
os.makedirs(images_dir, exist_ok=True)
|
| 29 |
+
|
| 30 |
+
try:
|
| 31 |
+
with zipfile.ZipFile(zpath, "r") as zf:
|
| 32 |
+
# Get total size for progress
|
| 33 |
+
total_size = sum(f.file_size for f in zf.filelist)
|
| 34 |
+
extracted_size = 0
|
| 35 |
+
|
| 36 |
+
for file_info in zf.filelist:
|
| 37 |
+
zf.extract(file_info, images_dir)
|
| 38 |
+
extracted_size += file_info.file_size
|
| 39 |
+
|
| 40 |
+
# Progress update every 100MB
|
| 41 |
+
if extracted_size % (100 * 1024 * 1024) < file_info.file_size:
|
| 42 |
+
progress = (extracted_size / total_size) * 100
|
| 43 |
+
print(f"π¦ Extraction progress: {progress:.1f}%")
|
| 44 |
+
|
| 45 |
+
print(f"β
Successfully extracted {len(zf.filelist)} files")
|
| 46 |
+
return
|
| 47 |
+
except Exception as e:
|
| 48 |
+
print(f"β Failed to extract {zpath}: {e}")
|
| 49 |
+
continue
|
| 50 |
+
|
| 51 |
+
print("β οΈ No image archive found to extract")
|
| 52 |
|
| 53 |
|
| 54 |
def ensure_dataset_ready() -> Optional[str]:
|
|
|
|
| 61 |
root = os.path.abspath(os.path.join(os.getcwd(), "data", "Polyvore"))
|
| 62 |
Path(root).mkdir(parents=True, exist_ok=True)
|
| 63 |
|
| 64 |
+
print(f"π Checking dataset at: {root}")
|
| 65 |
+
|
| 66 |
+
# Check if we already have the essential files
|
| 67 |
+
images_dir = os.path.join(root, "images")
|
| 68 |
+
metadata_files = [
|
| 69 |
+
"polyvore_item_metadata.json",
|
| 70 |
+
"polyvore_outfit_titles.json",
|
| 71 |
+
"categories.csv"
|
| 72 |
+
]
|
| 73 |
+
|
| 74 |
+
has_images = os.path.isdir(images_dir) and any(Path(images_dir).glob("*"))
|
| 75 |
+
has_metadata = all(os.path.exists(os.path.join(root, f)) for f in metadata_files)
|
| 76 |
+
|
| 77 |
+
if has_images and has_metadata:
|
| 78 |
+
print("β
Dataset already complete")
|
| 79 |
+
return root
|
| 80 |
+
|
| 81 |
# If images are already present, don't return early; still ensure metadata JSONs exist
|
| 82 |
+
if not has_images:
|
| 83 |
+
_unzip_images_if_needed(root)
|
| 84 |
|
| 85 |
# Download the HF dataset snapshot into root
|
| 86 |
try:
|
| 87 |
+
print("π₯ Downloading Polyvore dataset from Hugging Face...")
|
| 88 |
+
|
| 89 |
# Only fetch what's needed to run and prepare splits
|
| 90 |
allow = [
|
| 91 |
"images.zip",
|
| 92 |
# root-level (some mirrors place jsons here)
|
| 93 |
"train.json",
|
| 94 |
+
"valid.json",
|
| 95 |
"test.json",
|
| 96 |
# official splits often live here
|
| 97 |
"nondisjoint/train.json",
|
|
|
|
| 105 |
"polyvore_outfit_titles.json",
|
| 106 |
"categories.csv",
|
| 107 |
]
|
| 108 |
+
|
| 109 |
# Explicit ignores to prevent huge downloads (>10GB)
|
| 110 |
ignore = [
|
| 111 |
"**/*hglmm*",
|
|
|
|
|
|
|
|
|
|
| 112 |
"**/*.tar",
|
| 113 |
"**/*.tar.gz",
|
| 114 |
"**/*.7z",
|
| 115 |
+
"**/large/**",
|
| 116 |
]
|
| 117 |
+
|
| 118 |
+
need_download = not (
|
| 119 |
+
has_metadata and (
|
|
|
|
| 120 |
# any location providing official splits is acceptable
|
| 121 |
all(os.path.exists(os.path.join(root, f)) for f in ["train.json", "valid.json", "test.json"]) or
|
| 122 |
all(os.path.exists(os.path.join(root, "nondisjoint", f)) for f in ["train.json", "valid.json", "test.json"]) or
|
| 123 |
all(os.path.exists(os.path.join(root, "disjoint", f)) for f in ["train.json", "valid.json", "test.json"])
|
| 124 |
)
|
| 125 |
)
|
| 126 |
+
|
| 127 |
+
if need_download or not has_images:
|
| 128 |
+
print("π Starting download...")
|
| 129 |
snapshot_download(
|
| 130 |
"Stylique/Polyvore",
|
| 131 |
repo_type="dataset",
|
|
|
|
| 134 |
allow_patterns=allow,
|
| 135 |
ignore_patterns=ignore,
|
| 136 |
)
|
| 137 |
+
print("β
Download completed")
|
| 138 |
+
else:
|
| 139 |
+
print("β
All required files already present")
|
| 140 |
+
|
| 141 |
+
except Exception as e:
|
| 142 |
+
print(f"β Failed to download Stylique/Polyvore dataset: {e}")
|
| 143 |
+
print("π§ Trying to work with existing files...")
|
| 144 |
+
|
| 145 |
+
# Check what we have locally
|
| 146 |
+
existing_files = []
|
| 147 |
+
for file_path in Path(root).rglob("*"):
|
| 148 |
+
if file_path.is_file():
|
| 149 |
+
existing_files.append(str(file_path.relative_to(root)))
|
| 150 |
+
|
| 151 |
+
if existing_files:
|
| 152 |
+
print(f"π Found {len(existing_files)} existing files:")
|
| 153 |
+
for f in sorted(existing_files)[:10]: # Show first 10
|
| 154 |
+
print(f" - {f}")
|
| 155 |
+
if len(existing_files) > 10:
|
| 156 |
+
print(f" ... and {len(existing_files) - 10} more")
|
| 157 |
+
else:
|
| 158 |
+
print("π No existing files found")
|
| 159 |
+
return None
|
| 160 |
|
| 161 |
# Unzip images if needed
|
| 162 |
_unzip_images_if_needed(root)
|
| 163 |
+
|
| 164 |
+
# Final verification
|
| 165 |
+
if os.path.isdir(images_dir) and any(Path(images_dir).glob("*")):
|
| 166 |
+
print(f"β
Dataset ready at: {root}")
|
| 167 |
+
print(f"π Images: {len(list(Path(images_dir).glob('*')))} files")
|
| 168 |
+
|
| 169 |
+
# Check metadata
|
| 170 |
+
for meta_file in metadata_files:
|
| 171 |
+
meta_path = os.path.join(root, meta_file)
|
| 172 |
+
if os.path.exists(meta_path):
|
| 173 |
+
size_mb = os.path.getsize(meta_path) / (1024 * 1024)
|
| 174 |
+
print(f"π {meta_file}: {size_mb:.1f} MB")
|
| 175 |
+
else:
|
| 176 |
+
print(f"β οΈ Missing: {meta_file}")
|
| 177 |
+
|
| 178 |
+
return root
|
| 179 |
+
else:
|
| 180 |
+
print("β Failed to prepare dataset")
|
| 181 |
+
return None
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
def check_dataset_structure(root: str) -> dict:
|
| 185 |
+
"""Check the structure of the downloaded dataset."""
|
| 186 |
+
structure = {
|
| 187 |
+
"root": root,
|
| 188 |
+
"images": {"exists": False, "count": 0, "path": os.path.join(root, "images")},
|
| 189 |
+
"metadata": {},
|
| 190 |
+
"splits": {},
|
| 191 |
+
"status": "unknown"
|
| 192 |
+
}
|
| 193 |
+
|
| 194 |
+
# Check images
|
| 195 |
+
images_dir = os.path.join(root, "images")
|
| 196 |
+
if os.path.isdir(images_dir):
|
| 197 |
+
image_files = list(Path(images_dir).glob("*"))
|
| 198 |
+
structure["images"]["exists"] = True
|
| 199 |
+
structure["images"]["count"] = len(image_files)
|
| 200 |
+
structure["images"]["extensions"] = list(set(f.suffix.lower() for f in image_files))
|
| 201 |
+
|
| 202 |
+
# Check metadata files
|
| 203 |
+
metadata_files = [
|
| 204 |
+
"polyvore_item_metadata.json",
|
| 205 |
+
"polyvore_outfit_titles.json",
|
| 206 |
+
"categories.csv"
|
| 207 |
+
]
|
| 208 |
+
|
| 209 |
+
for meta_file in metadata_files:
|
| 210 |
+
meta_path = os.path.join(root, meta_file)
|
| 211 |
+
if os.path.exists(meta_path):
|
| 212 |
+
size_mb = os.path.getsize(meta_path) / (1024 * 1024)
|
| 213 |
+
structure["metadata"][meta_file] = {"exists": True, "size_mb": size_mb}
|
| 214 |
+
else:
|
| 215 |
+
structure["metadata"][meta_file] = {"exists": False, "size_mb": 0}
|
| 216 |
+
|
| 217 |
+
# Check for splits
|
| 218 |
+
split_locations = [
|
| 219 |
+
("root", ["train.json", "valid.json", "test.json"]),
|
| 220 |
+
("nondisjoint", ["train.json", "valid.json", "test.json"]),
|
| 221 |
+
("disjoint", ["train.json", "valid.json", "test.json"]),
|
| 222 |
+
("splits", ["train.json", "valid.json", "test.json"])
|
| 223 |
+
]
|
| 224 |
+
|
| 225 |
+
for location, files in split_locations:
|
| 226 |
+
location_path = os.path.join(root, location)
|
| 227 |
+
if os.path.exists(location_path):
|
| 228 |
+
structure["splits"][location] = {}
|
| 229 |
+
for split_file in files:
|
| 230 |
+
split_path = os.path.join(location_path, split_file)
|
| 231 |
+
if os.path.exists(split_path):
|
| 232 |
+
size_mb = os.path.getsize(split_path) / (1024 * 1024)
|
| 233 |
+
structure["splits"][location][split_file] = {"exists": True, "size_mb": size_mb}
|
| 234 |
+
else:
|
| 235 |
+
structure["splits"][location][split_file] = {"exists": False, "size_mb": 0}
|
| 236 |
+
else:
|
| 237 |
+
structure["splits"][location] = "directory_not_found"
|
| 238 |
+
|
| 239 |
+
# Determine overall status
|
| 240 |
+
if structure["images"]["exists"] and structure["images"]["count"] > 0:
|
| 241 |
+
if any(meta["exists"] for meta in structure["metadata"].values()):
|
| 242 |
+
structure["status"] = "ready"
|
| 243 |
+
else:
|
| 244 |
+
structure["status"] = "partial"
|
| 245 |
+
else:
|
| 246 |
+
structure["status"] = "incomplete"
|
| 247 |
+
|
| 248 |
+
return structure
|
| 249 |
+
|
| 250 |
+
|
| 251 |
+
if __name__ == "__main__":
|
| 252 |
+
# Test the dataset fetcher
|
| 253 |
+
print("π§ͺ Testing Polyvore dataset fetcher...")
|
| 254 |
+
|
| 255 |
+
root = ensure_dataset_ready()
|
| 256 |
+
if root:
|
| 257 |
+
print(f"\nπ Dataset structure:")
|
| 258 |
+
structure = check_dataset_structure(root)
|
| 259 |
+
import json
|
| 260 |
+
print(json.dumps(structure, indent=2))
|
| 261 |
+
else:
|
| 262 |
+
print("β Failed to prepare dataset")
|
| 263 |
|
| 264 |
|