Camais03 committed on
Commit
9e360cf
verified
1 Parent(s): 64981cc

Update README.md

Files changed (1)
  1. README.md +185 -16
README.md CHANGED
@@ -1,9 +1,34 @@
  ---
  license: gpl-3.0
  ---
- # Anime Image Tagger

- An advanced deep learning model for automatically tagging anime/manga illustrations with relevant tags across multiple categories, achieving 61% F1 score over 70,000+ possible tags.

  ## Features

@@ -15,6 +40,72 @@ An advanced deep learning model for automatically tagging anime/manga illustrati
  - **Adjustable threshold profiles**: Overall, Weighted, Category-specific, High Precision, and High Recall profiles
  - **Fine-grained control**: Per-category threshold adjustments for precision-recall tradeoffs

  ## Dataset

  The model was trained on a carefully filtered subset of the [Danbooru 2024 dataset](https://huggingface.co/datasets/p1atdev/danbooru-2024), which contains a vast collection of anime/manga illustrations with comprehensive tagging.
@@ -104,22 +195,23 @@ This will automatically set up all necessary packages for the application.

  ### Requirements

- - Python 3.8+
  - PyTorch 1.10+
  - Streamlit
  - PIL/Pillow
  - NumPy
- - Flash Attention (automatically installed but may have issues on Windows)

- ## Usage

- After installation, run the application by executing `setup.bat`. This launches a web interface where you can:

- - Upload your own images or select from example images
- - Choose different threshold profiles
- - Adjust category-specific thresholds
- - View predictions organized by category
- - Filter and sort tags based on confidence

  ## Model Details
 
@@ -132,6 +224,7 @@ The model recognizes tags across these categories:
  - **Artist**: Creator of the artwork
  - **Meta**: Meta information about the image
  - **Rating**: Content rating

  ### Performance Notes
 
@@ -151,16 +244,14 @@ In benchmarks, the model achieved a 61% F1 score across all categories, which is

  ## Windows Compatibility

- The full model uses Flash Attention, which has installation challenges on Windows. For Windows users:

  - The application automatically defaults to the Initial-only model
- - Performance difference is minimal for most use cases (usually less than 3-5% F1 score reduction)
  - The Initial-only model still uses the same powerful EfficientNet backbone and initial classifier

  ## Web Interface Guide

- ![Application Interface](app_screenshot.png)
-
  The interface is divided into three main sections:

  1. **Model Selection** (Sidebar)
@@ -184,6 +275,12 @@ The interface is divided into three main sections:
  - **Minimum confidence**: Filter out low-confidence predictions
  - **Category selection**: Choose which categories to include in the summary

  ## Training Environment

  The model was trained using surprisingly modest hardware:
@@ -196,6 +293,73 @@ The model was trained using surprisingly modest hardware:
  - PyTorch with CUDA acceleration
  - Flash Attention for optimized attention computation

  ### Training Notes

  - Training notebooks require WSL and likely 32GB+ of RAM to handle the dataset
@@ -203,7 +367,12 @@ The model was trained using surprisingly modest hardware:
  - Despite hardware limitations, the model achieves impressive results
  - With more computational resources, the model could be trained longer on the full dataset

- If you'd like to support further training on the complete dataset, consider [buying me a coffee].

  ## Acknowledgments
 
  ---
  license: gpl-3.0
+ datasets:
+ - p1atdev/danbooru-2024
+ metrics:
+ - f1
+ tags:
+ - art
+ - code
  ---
+ # Anime Image Tagger
+
+ An advanced deep learning model for automatically tagging anime/manga illustrations with relevant tags across multiple categories, achieving **61% F1 score** across 70,000+ possible tags on a test set of 20,116 samples.
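+
+ To make the headline metric concrete, micro-averaged F1 over binarized multi-label predictions can be computed as in the sketch below; this is a hypothetical illustration, and the repository's exact evaluation code may differ:
+
+ ```python
+ import torch
+
+ # Hypothetical micro-F1 for multi-label tagging; `preds` and `targets` are
+ # boolean (batch, num_tags) tensors of predicted and ground-truth tags.
+ def micro_f1(preds: torch.Tensor, targets: torch.Tensor) -> float:
+     tp = (preds & targets).sum().item()     # true positives
+     fp = (preds & ~targets).sum().item()    # false positives
+     fn = (~preds & targets).sum().item()    # false negatives
+     precision = tp / (tp + fp) if tp + fp else 0.0
+     recall = tp / (tp + fn) if tp + fn else 0.0
+     return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
+
+ preds = torch.tensor([[1, 0, 1], [0, 1, 0]], dtype=torch.bool)
+ targets = torch.tensor([[1, 0, 0], [0, 1, 1]], dtype=torch.bool)
+ print(micro_f1(preds, targets))  # 0.667
+ ```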
+
+ ## Usage
+
+ After installation, run the application by executing `setup.bat`. This launches a web interface where you can:
+
+ - Upload your own images or select from example images
+ - Choose different threshold profiles
+ - Adjust category-specific thresholds
+ - View predictions organized by category
+ - Filter and sort tags based on confidence
+
+ ## Key Highlights
+
+ - **Efficient Training**: Completed on just a single RTX 3060 GPU (12GB VRAM)
+ - **Fast Convergence**: Trained on 7,024,392 samples (3.52 epochs) in 1,756,098 batches
+ - **Comprehensive Coverage**: 70,000+ tags across 7 categories (general, character, copyright, artist, meta, rating, year)
+ - **Innovative Architecture**: Two-stage prediction model with cross-attention for tag context (see the sketch below)
+ - **User-Friendly Interface**: Easy-to-use application with customizable thresholds
+
+ *This project demonstrates that high-quality anime image tagging models can be trained on consumer hardware with the right optimization techniques.*
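+
+ The exact architecture lives in the training notebooks; purely to illustrate the two-stage idea (initial tag logits, then refinement via cross-attention over embeddings of the top candidate tags), a hypothetical sketch might look like the following. All names, dimensions, and wiring here are assumptions, not the model's actual code:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class TwoStageTaggerSketch(nn.Module):
+     """Illustrative only: initial classifier, then cross-attention refinement."""
+     def __init__(self, backbone_dim=1280, num_tags=70000, top_k=128):
+         super().__init__()
+         self.initial_head = nn.Linear(backbone_dim, num_tags)
+         self.tag_embed = nn.Embedding(num_tags, backbone_dim)
+         self.cross_attn = nn.MultiheadAttention(backbone_dim, num_heads=8, batch_first=True)
+         self.refined_head = nn.Linear(backbone_dim, num_tags)
+         self.top_k = top_k
+
+     def forward(self, features):  # features: (B, backbone_dim) from the CNN backbone
+         initial_logits = self.initial_head(features)
+         top_tags = initial_logits.topk(self.top_k, dim=-1).indices   # candidate tags
+         tag_ctx = self.tag_embed(top_tags)                           # (B, top_k, D)
+         # Attend from the image representation to the candidate-tag context
+         attended, _ = self.cross_attn(features.unsqueeze(1), tag_ctx, tag_ctx)
+         refined_logits = self.refined_head(attended.squeeze(1))
+         return initial_logits, refined_logits
+ ```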
 
  ## Features

  - **Adjustable threshold profiles**: Overall, Weighted, Category-specific, High Precision, and High Recall profiles
  - **Fine-grained control**: Per-category threshold adjustments for precision-recall tradeoffs

+ ## Loss Function
+
+ The model employs a specialized `UnifiedFocalLoss` to address the extreme class imbalance inherent in multi-label tag prediction:
+
+ ```python
+ class UnifiedFocalLoss(nn.Module):
+     def __init__(self, device=None, gamma=2.0, alpha=0.25, lambda_initial=0.4):
+         # Implementation details...
+ ```
+
+ ### Key Components
+
+ 1. **Focal Loss Mechanism**:
+    - Down-weights well-classified examples (γ=2.0) to focus training on difficult tags
+    - Addresses the extreme imbalance between positive and negative examples (often 100:1 or worse)
+    - Uses α=0.25 to balance positive/negative examples across 70,000+ possible tags
+
+ 2. **Two-stage Weighting**:
+    - Combines losses from both prediction stages (`initial_predictions` and `refined_predictions`)
+    - Uses λ=0.4 to weight the initial prediction loss, giving more importance (0.6) to refined predictions
+    - This encourages the model to improve predictions in the refinement stage while still maintaining strong initial predictions
+
+ 3. **Per-sample Statistics**:
+    - Tracks separate metrics for positive and negative samples
+    - Provides detailed debugging information about prediction distributions
+    - Enables analysis of which tag categories are performing well/poorly
+
+ This loss function was essential for achieving high F1 scores across diverse tag categories despite the extreme classification challenge of 70,000+ possible tags.
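+
+ Since the implementation is elided above, here is a minimal, hypothetical sketch of the same idea: a per-tag binary focal loss blended across the two prediction stages. Class and variable names are illustrative, not the repository's actual code:
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class FocalLossSketch(nn.Module):
+     """Illustrative two-stage multi-label focal loss (gamma down-weights easy examples)."""
+     def __init__(self, gamma=2.0, alpha=0.25, lambda_initial=0.4):
+         super().__init__()
+         self.gamma, self.alpha, self.lambda_initial = gamma, alpha, lambda_initial
+
+     def _focal(self, logits, targets):
+         # Binary focal loss over (batch, num_tags) logits; targets are 0/1 floats.
+         bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
+         p = torch.sigmoid(logits)
+         p_t = targets * p + (1 - targets) * (1 - p)                # prob. of the true class
+         alpha_t = targets * self.alpha + (1 - targets) * (1 - self.alpha)
+         return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()
+
+     def forward(self, initial_logits, refined_logits, targets):
+         # lambda weights the initial stage; 1 - lambda weights the refined stage.
+         return (self.lambda_initial * self._focal(initial_logits, targets)
+                 + (1 - self.lambda_initial) * self._focal(refined_logits, targets))
+ ```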
+
+ ## DeepSpeed Configuration
+
+ Microsoft DeepSpeed was crucial for training this model on consumer hardware. The project uses a carefully tuned configuration to maximize efficiency:
+
+ ```python
+ def create_deepspeed_config(
+     config_path,
+     learning_rate=3e-4,
+     weight_decay=0.01,
+     num_train_samples=None,
+     micro_batch_size=4,
+     grad_accum_steps=8
+ ):
+     # Implementation details...
+ ```
+
+ ### Key Optimizations
+
+ 1. **Memory Efficiency**:
+    - **ZeRO Stage 2**: Partitions optimizer states and gradients, dramatically reducing memory requirements
+    - **Activation Checkpointing**: Trades computation for memory by recomputing activations during backpropagation
+    - **Contiguous Memory Optimization**: Reduces memory fragmentation
+
+ 2. **Mixed Precision Training**:
+    - **FP16 Mode**: Uses half-precision (16-bit) for most calculations, with automatic loss scaling
+    - **Initial Scale Power**: Set to 16 for stable convergence with large batch sizes
+
+ 3. **Gradient Accumulation**:
+    - Micro-batch size of 4 with 8 gradient accumulation steps
+    - Effective batch size of 32 while only requiring memory for 4 samples at once
+
+ 4. **Learning Rate Schedule**:
+    - WarmupLR scheduler with gradual increase from 3e-6 to 3e-4
+    - Warmup over 1/4 of an epoch to stabilize early training
+
+ This configuration allowed the model to train efficiently with only 12GB of VRAM while maintaining numerical stability across millions of training examples with 70,000+ output dimensions.
+
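+ For orientation, the kind of configuration dictionary such a helper produces might look roughly like the sketch below. The keys are standard DeepSpeed options, but the exact values and structure in the repository may differ:
+
+ ```python
+ # Hypothetical sketch of a DeepSpeed config with the optimizations described above.
+ ds_config = {
+     "train_micro_batch_size_per_gpu": 4,
+     "gradient_accumulation_steps": 8,        # effective batch size: 4 * 8 = 32
+     "zero_optimization": {
+         "stage": 2,                          # partition optimizer states and gradients
+         "contiguous_gradients": True,        # reduce memory fragmentation
+     },
+     "fp16": {
+         "enabled": True,                     # half precision with automatic loss scaling
+         "initial_scale_power": 16,           # starting loss scale of 2**16
+     },
+     "optimizer": {
+         "type": "AdamW",
+         "params": {"lr": 3e-4, "weight_decay": 0.01},
+     },
+     "scheduler": {
+         "type": "WarmupLR",
+         "params": {"warmup_min_lr": 3e-6, "warmup_max_lr": 3e-4},
+     },
+ }
+ ```
+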
  ## Dataset

  The model was trained on a carefully filtered subset of the [Danbooru 2024 dataset](https://huggingface.co/datasets/p1atdev/danbooru-2024), which contains a vast collection of anime/manga illustrations with comprehensive tagging.
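+
+ As a quick, hypothetical illustration (the repository's own preprocessing pipeline lives in the training notebooks), the source dataset can be browsed with the Hugging Face `datasets` library; the split and column names here are assumptions, so check the dataset page for its actual layout:
+
+ ```python
+ # Hypothetical: stream a few records from the source dataset without downloading it all.
+ from datasets import load_dataset
+
+ ds = load_dataset("p1atdev/danbooru-2024", split="train", streaming=True)
+ for i, example in enumerate(ds):
+     print(example)   # fields depend on the dataset's actual schema
+     if i == 2:
+         break
+ ```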
 

  ### Requirements

+ - **Python 3.11.9 specifically** (newer versions are incompatible)
  - PyTorch 1.10+
  - Streamlit
  - PIL/Pillow
  - NumPy
+ - Flash Attention (note: doesn't work properly on Windows)
 
+ ### Running the Application
+
+ The application is located in the `app` folder and can be launched via the setup script:
+
+ 1. Run `setup.bat` to install dependencies
+ 2. The Streamlit interface will automatically open in your browser
+ 3. If the browser doesn't open automatically, navigate to http://localhost:8501
+
+ ![Application Interface](app_screenshot.png)
+ ![Tag Results Example](tag_results_example.png)
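+
+ If dependencies are already installed and you want to launch the interface by hand, the equivalent is roughly the following; the entry-point path `app/app.py` is an assumption and may differ in the actual repository:
+
+ ```python
+ # Hypothetical manual launch; adjust the script path to the app folder's real entry point.
+ import subprocess
+
+ subprocess.run(["streamlit", "run", "app/app.py", "--server.port", "8501"])
+ ```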
 
  ## Model Details

  - **Artist**: Creator of the artwork
  - **Meta**: Meta information about the image
  - **Rating**: Content rating
+ - **Year**: Year of upload

  ### Performance Notes
 
 

  ## Windows Compatibility

+ The full model uses Flash Attention, which does not work properly on Windows. For Windows users:

  - The application automatically defaults to the Initial-only model
+ - Performance difference is minimal (0.2% absolute F1 score reduction, from 61.6% to 61.4%)
  - The Initial-only model still uses the same powerful EfficientNet backbone and initial classifier

  ## Web Interface Guide

  The interface is divided into three main sections:

  1. **Model Selection** (Sidebar)
 
  - **Minimum confidence**: Filter out low-confidence predictions
  - **Category selection**: Choose which categories to include in the summary
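+
+ To make these filtering controls concrete, here is a hypothetical sketch (not the app's actual code) of how per-category thresholds and a minimum-confidence floor can be applied to sigmoid outputs:
+
+ ```python
+ import torch
+
+ # Illustrative per-category thresholding; names and values are assumptions.
+ def filter_tags(probs, tag_names, tag_to_category, category_thresholds, min_confidence=0.1):
+     kept = []
+     for name, p in zip(tag_names, probs.tolist()):
+         threshold = category_thresholds.get(tag_to_category[name], 0.35)
+         if p >= max(threshold, min_confidence):
+             kept.append((name, tag_to_category[name], round(p, 3)))
+     return sorted(kept, key=lambda t: -t[2])   # highest confidence first
+
+ probs = torch.sigmoid(torch.tensor([3.0, 0.2, -0.5]))
+ print(filter_tags(probs,
+                   ["1girl", "smile", "hatsune_miku"],
+                   {"1girl": "general", "smile": "general", "hatsune_miku": "character"},
+                   {"general": 0.35, "character": 0.75}))
+ ```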
 
+ ### Interface Screenshots
+
+ ![Application Interface](app_screenshot.png)
+
+ ![Tag Results Example](tag_results_example.png)
+
  ## Training Environment

  The model was trained using surprisingly modest hardware:

  - PyTorch with CUDA acceleration
  - Flash Attention for optimized attention computation
 
+ ### Training Notebooks
+
+ The repository includes two main training notebooks:
+
+ 1. **CAMIE Tagger.ipynb**
+    - Main training notebook
+    - Dataset loading and preprocessing
+    - Model initialization
+    - Initial training loop with DeepSpeed integration
+    - Tag selection optimization
+    - Metric tracking and visualization
+
+ 2. **Camie Tagger Cont and Evals.ipynb**
+    - Continuation of training from checkpoints
+    - Comprehensive model evaluation
+    - Per-category performance metrics
+    - Threshold optimization
+    - Model conversion for deployment in the app
+    - Export functionality for the standalone application
+
+ ### Training Monitor
+
+ The project includes a real-time training monitor accessible via browser at `localhost:5000` during training:
+
+ ![Training Monitor Overview](training_monitor_overview.png)
+
+ #### Performance Tips
+
+ ⚠️ **Important**: For optimal training speed, keep VSCode minimized and the training monitor open in your browser. This can improve iteration speed by **3-5x** due to how the Windows/WSL graphics stack handles window focus and CUDA kernel execution.
+
+ #### Monitor Features
+
+ The training monitor provides three main views:
+
+ ##### 1. Overview Tab
+
+ ![Overview Tab](training_monitor_charts.png)
+
+ - **Training Progress**: Real-time metrics including epoch, batch, speed, and time estimates
+ - **Loss Chart**: Training and validation loss visualization
+ - **F1 Scores**: Initial and refined F1 metrics for both training and validation
+
+ ##### 2. Predictions Tab
+
+ ![Predictions Tab](training_monitor_predictions.png)
+
+ - **Image Preview**: Shows the current sample being analyzed
+ - **Prediction Controls**: Toggle between initial and refined predictions
+ - **Tag Analysis**:
+   - Color-coded tag results (correct, incorrect, missing)
+   - Confidence visualization with probability bars
+   - Category-based organization
+   - Filtering options for error analysis
+
+ ##### 3. Selection Analysis Tab
+
+ ![Selection Analysis Tab](training_monitor_selection.png)
+
+ - **Selection Metrics**: Statistics on tag selection quality
+   - Ground truth recall
+   - Average probability for ground truth vs. non-ground truth tags
+   - Unique tags selected
+ - **Selection Graph**: Trends in selection quality over time
+ - **Selected Tags Details**: Detailed view of model-selected tags with confidence scores
+
+ The monitor provides invaluable insights into how the two-stage prediction model is performing, particularly how the tag selection process is working between the initial and refined prediction stages.
+
  ### Training Notes

  - Training notebooks require WSL and likely 32GB+ of RAM to handle the dataset
  - Despite hardware limitations, the model achieves impressive results
  - With more computational resources, the model could be trained longer on the full dataset

+
+ ## Support
+
+ I plan to move on to LLMs after this project, as I have lots of ideas for improving them. I will keep updating this model based on community interest.
+
+ If you'd like to support further training on the complete dataset or my future projects, consider [buying me a coffee](https://www.buymeacoffee.com/yourusername).

  ## Acknowledgments