Update README.md
Browse files
README.md
CHANGED
---
license: gpl-3.0
datasets:
- p1atdev/danbooru-2024
metrics:
- f1
tags:
- art
- code
---

## Usage

After installation, run the application by executing `setup.bat`. This launches a web interface where you can:

- Upload your own images or select from example images
- Choose different threshold profiles
- Adjust category-specific thresholds
- View predictions organized by category
- Filter and sort tags based on confidence

# Anime Image Tagger

An advanced deep learning model for automatically tagging anime/manga illustrations with relevant tags across multiple categories, achieving **61% F1 score** across 70,000+ possible tags on a test set of 20,116 samples.

## Key Highlights

- **Efficient Training**: Completed on just a single RTX 3060 GPU (12GB VRAM)
- **Fast Convergence**: Trained on 7,024,392 samples (3.52 epochs) in 1,756,098 batches
- **Comprehensive Coverage**: 70,000+ tags across 7 categories (general, character, copyright, artist, meta, rating, year)
- **Innovative Architecture**: Two-stage prediction model with cross-attention for tag context
- **User-Friendly Interface**: Easy-to-use application with customizable thresholds

*This project demonstrates that high-quality anime image tagging models can be trained on consumer hardware with the right optimization techniques.*
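
To make the two-stage idea concrete, here is a minimal sketch of how such an architecture can be wired together. It is an illustration under assumed shapes and hyperparameters (`dim`, `top_k`, the attention setup), not the repository's actual model code:

```python
import torch.nn as nn

class TwoStageTaggerSketch(nn.Module):
    """Illustrative two-stage tagger: an initial pass proposes candidate tags,
    then cross-attention over candidate-tag embeddings refines the logits.
    A simplification for exposition, not the repository's model code."""

    def __init__(self, backbone, num_tags, dim=768, top_k=128):
        super().__init__()
        self.backbone = backbone                      # assumed to return (B, dim) features
        self.initial_head = nn.Linear(dim, num_tags)
        self.tag_embed = nn.Embedding(num_tags, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.refined_head = nn.Linear(dim, num_tags)
        self.top_k = top_k

    def forward(self, images):
        feats = self.backbone(images)                 # (B, dim) pooled image features
        initial_logits = self.initial_head(feats)     # first-stage predictions
        top_idx = initial_logits.topk(self.top_k, dim=-1).indices
        tag_ctx = self.tag_embed(top_idx)             # (B, top_k, dim) tag context
        q = feats.unsqueeze(1)                        # (B, 1, dim) query
        attended, _ = self.cross_attn(q, tag_ctx, tag_ctx)
        refined_logits = self.refined_head(attended.squeeze(1))
        return initial_logits, refined_logits
```

During training, both `initial_logits` and `refined_logits` are supervised, which is what the two-stage loss weighting described later refers to.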

## Features

- **Adjustable threshold profiles**: Overall, Weighted, Category-specific, High Precision, and High Recall profiles
- **Fine-grained control**: Per-category threshold adjustments for precision-recall tradeoffs (see the sketch below)
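
As a sketch of what per-category thresholding means in practice (the cutoff values and names here are hypothetical, not the app's profile data):

```python
# Hypothetical per-category cutoffs; the app's actual profiles will differ.
THRESHOLDS = {"general": 0.35, "character": 0.50, "artist": 0.60}

def filter_tags(probs, category_of, thresholds=THRESHOLDS, default=0.5):
    """Keep tags whose predicted probability clears their category's cutoff.

    probs: dict of tag -> sigmoid probability
    category_of: dict of tag -> category name
    """
    return {tag: p for tag, p in probs.items()
            if p >= thresholds.get(category_of.get(tag), default)}
```

Raising a category's cutoff trades recall for precision in that category only; the High Precision and High Recall profiles apply the same trade globally.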

## Loss Function

The model employs a specialized `UnifiedFocalLoss` to address the extreme class imbalance inherent in multi-label tag prediction:

```python
import torch.nn as nn

class UnifiedFocalLoss(nn.Module):
    def __init__(self, device=None, gamma=2.0, alpha=0.25, lambda_initial=0.4):
        super().__init__()
        # Implementation details...
```

### Key Components

1. **Focal Loss Mechanism**:
   - Down-weights well-classified examples (γ=2.0) to focus training on difficult tags
   - Addresses the extreme imbalance between positive and negative examples (often 100:1 or worse)
   - Uses α=0.25 to balance positive/negative examples across 70,000+ possible tags

2. **Two-stage Weighting**:
   - Combines losses from both prediction stages (`initial_predictions` and `refined_predictions`)
   - Uses λ=0.4 to weight the initial prediction loss, giving more importance (0.6) to refined predictions
   - This encourages the model to improve predictions in the refinement stage while still maintaining strong initial predictions

3. **Per-sample Statistics**:
   - Tracks separate metrics for positive and negative samples
   - Provides detailed debugging information about prediction distributions
   - Enables analysis of which tag categories are performing well/poorly

This loss function was essential for achieving high F1 scores across diverse tag categories despite the extreme classification challenge of 70,000+ possible tags.
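
For illustration, a minimal sketch of how these components can fit together is shown below, assuming both stages emit raw logits. This is a simplified stand-in, not the project's `UnifiedFocalLoss` implementation (the per-sample statistics are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedFocalLossSketch(nn.Module):
    """Simplified stand-in for the project's UnifiedFocalLoss (statistics omitted)."""

    def __init__(self, gamma=2.0, alpha=0.25, lambda_initial=0.4):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.lambda_initial = lambda_initial

    def _focal(self, logits, targets):
        # Binary focal loss: scale BCE by (1 - p_t)^gamma to down-weight easy examples.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)                         # model's probability for the true label
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

    def forward(self, initial_logits, refined_logits, targets):
        # lambda_initial weights the first stage; the remainder goes to the refined stage.
        return (self.lambda_initial * self._focal(initial_logits, targets)
                + (1 - self.lambda_initial) * self._focal(refined_logits, targets))
```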

## DeepSpeed Configuration

Microsoft DeepSpeed was crucial for training this model on consumer hardware. The project uses a carefully tuned configuration to maximize efficiency:

```python
def create_deepspeed_config(
    config_path,
    learning_rate=3e-4,
    weight_decay=0.01,
    num_train_samples=None,
    micro_batch_size=4,
    grad_accum_steps=8
):
    # Implementation details...
    ...
```

### Key Optimizations

1. **Memory Efficiency**:
   - **ZeRO Stage 2**: Partitions optimizer states and gradients, dramatically reducing memory requirements
   - **Activation Checkpointing**: Trades computation for memory by recomputing activations during backpropagation
   - **Contiguous Memory Optimization**: Reduces memory fragmentation

2. **Mixed Precision Training**:
   - **FP16 Mode**: Uses half-precision (16-bit) for most calculations, with automatic loss scaling
   - **Initial Scale Power**: Set to 16 for stable convergence with large batch sizes

3. **Gradient Accumulation**:
   - Micro-batch size of 4 with 8 gradient accumulation steps
   - Effective batch size of 32 while only requiring memory for 4 samples at once

4. **Learning Rate Schedule**:
   - WarmupLR scheduler with gradual increase from 3e-6 to 3e-4
   - Warmup over 1/4 of an epoch to stabilize early training

This configuration allowed the model to train efficiently with only 12GB of VRAM while maintaining numerical stability across millions of training examples with 70,000+ output dimensions.
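
Putting the options above together, the generated configuration plausibly looks like the following sketch. The key names are standard DeepSpeed config fields; the concrete values, and how `create_deepspeed_config` derives the warmup step count from `num_train_samples`, are assumptions here:

```python
# Sketch of the config dict; steps_per_epoch is illustrative, derived from an
# assumed epoch size rather than the project's real num_train_samples handling.
steps_per_epoch = 2_000_000 // (4 * 8)  # samples per epoch / effective batch of 32

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,   # effective batch size: 4 * 8 = 32
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4, "weight_decay": 0.01}},
    "scheduler": {"type": "WarmupLR", "params": {
        "warmup_min_lr": 3e-6,
        "warmup_max_lr": 3e-4,
        "warmup_num_steps": steps_per_epoch // 4,  # ~1/4 epoch of warmup
    }},
    "zero_optimization": {"stage": 2, "contiguous_gradients": True},
    "activation_checkpointing": {"partition_activations": True,
                                 "contiguous_memory_optimization": True},
    "fp16": {"enabled": True, "initial_scale_power": 16},
}
```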

## Dataset

The model was trained on a carefully filtered subset of the [Danbooru 2024 dataset](https://huggingface.co/datasets/p1atdev/danbooru-2024), which contains a vast collection of anime/manga illustrations with comprehensive tagging.

### Requirements

- **Python 3.11.9 specifically** (newer versions are incompatible)
- PyTorch 1.10+
- Streamlit
- PIL/Pillow
- NumPy
- Flash Attention (note: doesn't work properly on Windows)

### Running the Application

The application is located in the `app` folder and can be launched via the setup script:

1. Run `setup.bat` to install dependencies
2. The Streamlit interface will automatically open in your browser
3. If the browser doesn't open automatically, navigate to http://localhost:8501

![Application Interface](app_screenshot.png)

![Tag Results Example](tag_results_screenshot.png)

## Model Details

The model recognizes tags across these categories:

- **Artist**: Creator of the artwork
- **Meta**: Meta information about the image
- **Rating**: Content rating
- **Year**: Year of upload

### Performance Notes

## Windows Compatibility

The full model uses Flash Attention, which does not work properly on Windows. For Windows users:

- The application automatically defaults to the Initial-only model
- Performance difference is minimal (0.2% absolute F1 score reduction, from 61.6% to 61.4%)
- The Initial-only model still uses the same powerful EfficientNet backbone and initial classifier
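
A hypothetical sketch of that fallback decision (not the app's actual code):

```python
# Hypothetical: choose the model variant based on Flash Attention availability.
def pick_model_variant():
    try:
        import flash_attn  # noqa: F401 -- import fails on Windows installs
        return "full"          # EfficientNet backbone + refined attention stage
    except ImportError:
        return "initial-only"  # same backbone and initial classifier only
```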

## Web Interface Guide

The interface is divided into three main sections:

1. **Model Selection** (Sidebar)

- **Minimum confidence**: Filter out low-confidence predictions
- **Category selection**: Choose which categories to include in the summary

### Interface Screenshots

![Main Interface](interface_main.png)

![Threshold Adjustment](interface_thresholds.png)

## Training Environment

The model was trained using surprisingly modest hardware:

- PyTorch with CUDA acceleration
- Flash Attention for optimized attention computation

### Training Notebooks

The repository includes two main training notebooks:

1. **CAMIE Tagger.ipynb**
   - Main training notebook
   - Dataset loading and preprocessing
   - Model initialization
   - Initial training loop with DeepSpeed integration
   - Tag selection optimization
   - Metric tracking and visualization

2. **Camie Tagger Cont and Evals.ipynb**
   - Continuation of training from checkpoints
   - Comprehensive model evaluation
   - Per-category performance metrics
   - Threshold optimization
   - Model conversion for deployment in the app
   - Export functionality for the standalone application

### Training Monitor

The project includes a real-time training monitor accessible via browser at `localhost:5000` during training:

![Training Monitor](training_monitor.png)

#### Performance Tips

⚠️ **Important**: For optimal training speed, keep VSCode minimized and the training monitor open in your browser. This can improve iteration speed by **3-5x** due to how the Windows/WSL graphics stack handles window focus and CUDA kernel execution.

#### Monitor Features

The training monitor provides three main views:

##### 1. Overview Tab

![Overview Tab](overview_tab.png)

- **Training Progress**: Real-time metrics including epoch, batch, speed, and time estimates
- **Loss Chart**: Training and validation loss visualization
- **F1 Scores**: Initial and refined F1 metrics for both training and validation

##### 2. Predictions Tab

![Predictions Tab](predictions_tab.png)

- **Image Preview**: Shows the current sample being analyzed
- **Prediction Controls**: Toggle between initial and refined predictions
- **Tag Analysis**:
  - Color-coded tag results (correct, incorrect, missing)
  - Confidence visualization with probability bars
  - Category-based organization
  - Filtering options for error analysis

##### 3. Selection Analysis Tab

![Selection Analysis Tab](selection_analysis_tab.png)

- **Selection Metrics**: Statistics on tag selection quality
  - Ground truth recall
  - Average probability for ground truth vs. non-ground truth tags
  - Unique tags selected
- **Selection Graph**: Trends in selection quality over time
- **Selected Tags Details**: Detailed view of model-selected tags with confidence scores

The monitor provides invaluable insights into how the two-stage prediction model is performing, particularly how the tag selection process is working between the initial and refined prediction stages.
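
For intuition, the selection metrics above could be computed along these lines (a hypothetical sketch, not the monitor's code):

```python
def selection_stats(selected, ground_truth, probs):
    """selected, ground_truth: sets of tag names; probs: tag -> probability."""
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    hits = selected & ground_truth      # selected tags that are in the ground truth
    extras = selected - ground_truth    # selected tags that are not
    return {
        "ground_truth_recall": len(hits) / max(len(ground_truth), 1),
        "avg_prob_ground_truth": mean([probs[t] for t in hits]),
        "avg_prob_non_ground_truth": mean([probs[t] for t in extras]),
        "unique_tags_selected": len(selected),
    }
```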

### Training Notes

- Training notebooks require WSL and likely 32GB+ of RAM to handle the dataset
- Despite hardware limitations, the model achieves impressive results
- With more computational resources, the model could be trained longer on the full dataset

## Support

I plan to move on to LLMs after this project, as I have many ideas for improving them. I will update this model based on community attention.

If you'd like to support further training on the complete dataset or my future projects, consider [buying me a coffee](https://www.buymeacoffee.com/yourusername).

## Acknowledgments