<div align="center">
# Soldier-Officer Window self-Attention (SOWA)
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a>
<a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a>
<a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a>
<a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a><br>
[![Paper](http://img.shields.io/badge/paper-arxiv.2407.03634-B31B1B.svg)](https://arxiv.org/abs/2407.03634)
</div>
## Description
<div align="center">
<img src="https://github.com/huzongxiang/sowa/blob/resources/fig1.png" alt="concept" style="width: 50%;">
</div>
Visual anomaly detection is critical in industrial manufacturing, but traditional methods often rely on extensive normal datasets and custom models, limiting their scalability. Recent advances in large-scale vision-language models have significantly improved zero-/few-shot anomaly detection. However, these approaches may not fully exploit hierarchical features, potentially missing nuanced details. We introduce a window self-attention mechanism based on the CLIP model, combined with learnable prompts, to process multi-level features within a Soldier-Officer Window self-Attention (SOWA) framework. Our method has been evaluated on five benchmark datasets, demonstrating superior performance by leading on 18 out of 20 metrics compared with existing state-of-the-art techniques.
![architecture](https://github.com/huzongxiang/sowa/blob/resources/fig2.png)
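To make the core idea concrete, below is a minimal sketch of self-attention restricted to non-overlapping windows of a patch-feature grid, the kind of operation applied to each hierarchical CLIP feature level. The class name, window size, and tensor shapes are illustrative assumptions, not the repository's actual implementation (see `src/` for that).
```python
# Minimal sketch of window self-attention over a (B, H, W, C) patch grid.
# All names and hyperparameters here are illustrative assumptions; the
# actual SOWA modules live in src/ of this repository.
import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    def __init__(self, dim: int, window_size: int = 4, num_heads: int = 8):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) patch features from one CLIP level; H and W are
        # assumed divisible by the window size.
        B, H, W, C = x.shape
        s = self.window_size
        # Partition into non-overlapping s x s windows -> (B * n_windows, s*s, C).
        x = x.view(B, H // s, s, W // s, s, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, s * s, C)
        # Attend only within each window.
        x, _ = self.attn(x, x, x)
        # Reverse the partition back to (B, H, W, C).
        x = x.view(B, H // s, W // s, s, s, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```
Restricting attention to local windows keeps the cost roughly linear in the number of patches while still mixing information within each neighborhood, which makes attending over several feature levels affordable.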
## Installation
#### Pip
```bash
# clone project
git clone https://github.com/huzongxiang/sowa
cd sowa
# [OPTIONAL] create conda environment
conda create -n sowa python=3.9
conda activate sowa
# install pytorch according to instructions
# https://pytorch.org/get-started/
# install requirements
pip install -r requirements.txt
```
#### Conda
```bash
# clone project
git clone https://github.com/huzongxiang/sowa
cd sowa
# create conda environment and install dependencies
conda env create -f environment.yaml -n sowa
# activate conda environment
conda activate sowa
```
## How to run
Train the model with the default configuration
```bash
# train on CPU
python src/train.py trainer=cpu data=sowa_visa model=sowa_hfwa
# train on GPU
python src/train.py trainer=gpu data=sowa_visa model=sowa_hfwa
```
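Since the project is built on the lightning-hydra-template, individual config fields can also be overridden from the command line. The field names below (`trainer.max_epochs`, `data.batch_size`) are assumptions based on that template; check the files under `configs/` for the keys this repository actually uses.
```bash
# override individual Hydra config fields (names assume the
# lightning-hydra-template layout; see configs/ for the real keys)
python src/train.py trainer=gpu data=sowa_visa model=sowa_hfwa \
    trainer.max_epochs=20 data.batch_size=8
```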
## Results
Comparison with few-shot (K=4) anomaly detection methods on the MVTec-AD, VisA, BTAD, DAGM, and DTD-Synthetic datasets. AC denotes image-level anomaly classification and AS pixel-level anomaly segmentation.
| Metric | Dataset | WinCLIP | April-GAN | Ours |
|-----------|----------------|-------------|-------------|-------------|
| AC AUROC | MVTec-AD | 95.2±1.3 | 92.8±0.2 | 96.8±0.3 |
|           | VisA           | 87.3±1.8    | 92.6±0.4    | 92.9±0.2    |
| | BTAD | 87.0±0.2 | 92.1±0.2 | 94.8±0.2 |
| | DAGM | 93.8±0.2 | 96.2±1.1 | 98.9±0.3 |
| | DTD-Synthetic | 98.1±0.2 | 98.5±0.1 | 99.1±0.0 |
| AC AP | MVTec-AD | 97.3±0.6 | 96.3±0.1 | 98.3±0.3 |
|           | VisA           | 88.8±1.8    | 94.5±0.3    | 94.5±0.2    |
| | BTAD | 86.8±0.0 | 95.2±0.5 | 95.5±0.7 |
| | DAGM | 83.8±1.1 | 86.7±4.5 | 95.2±1.7 |
| | DTD-Synthetic | 99.1±0.1 | 99.4±0.0 | 99.6±0.0 |
| AS AUROC | MVTec-AD | 96.2±0.3 | 95.9±0.0 | 95.7±0.1 |
|           | VisA           | 97.2±0.2    | 96.2±0.0    | 97.1±0.0    |
| | BTAD | 95.8±0.0 | 94.4±0.1 | 97.1±0.0 |
| | DAGM | 93.8±0.1 | 88.9±0.4 | 96.9±0.0 |
| | DTD-Synthetic | 96.8±0.2 | 96.7±0.0 | 98.7±0.0 |
| AS AUPRO | MVTec-AD | 89.0±0.8 | 91.8±0.1 | 92.4±0.2 |
|           | VisA           | 87.6±0.9    | 90.2±0.1    | 91.4±0.0    |
| | BTAD | 66.6±0.2 | 78.2±0.1 | 81.2±0.2 |
| | DAGM | 82.4±0.3 | 77.8±0.9 | 94.4±0.1 |
| | DTD-Synthetic | 90.1±0.5 | 92.2±0.0 | 96.6±0.1 |
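In these tables, the AC metrics are computed from one anomaly score per image and the AS metrics from per-pixel score maps. As a hedged sketch, the AC numbers correspond to standard scikit-learn calls like the following (the labels and scores below are placeholders, not real model outputs):
```python
# Hedged sketch: image-level (AC) metrics via scikit-learn.
# labels/scores are placeholder values, not real model outputs.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

labels = np.array([0, 0, 1, 1, 1])            # 1 = anomalous image, 0 = normal
scores = np.array([0.1, 0.3, 0.7, 0.8, 0.6])  # model anomaly scores

print("AC AUROC:", roc_auc_score(labels, scores))
print("AC AP:", average_precision_score(labels, scores))
```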
<!-- spacer between tables -->
Performance comparison on the MVTec-AD and VisA datasets.
| Method        | Source                  | MVTec-AD AC AUROC | MVTec-AD AS AUROC | MVTec-AD AS AUPRO | VisA AC AUROC | VisA AS AUROC | VisA AS AUPRO |
|---------------|-------------------------|-------------------|-------------------|-----------------|---------------|---------------|-------------|
| SPADE | arXiv 2020 | 84.8±2.5 | 92.7±0.3 | 87.0±0.5 | 81.7±3.4 | 96.6±0.3 | 87.3±0.8 |
| PaDiM | ICPR 2021 | 80.4±2.4 | 92.6±0.7 | 81.3±1.9 | 72.8±2.9 | 93.2±0.5 | 72.6±1.9 |
| PatchCore | CVPR 2022 | 88.8±2.6 | 94.3±0.5 | 84.3±1.6 | 85.3±2.1 | 96.8±0.3 | 84.9±1.4 |
| WinCLIP | CVPR 2023 | 95.2±1.3 | 96.2±0.3 | 89.0±0.8 | 87.3±1.8 | 97.2±0.2 | 87.6±0.9 |
| April-GAN | CVPR 2023 VAND workshop | 92.8±0.2 | 95.9±0.0 | 91.8±0.1 | 92.6±0.4 | 96.2±0.0 | 90.2±0.1 |
| PromptAD | CVPR 2024 | 96.6±0.9 | 96.5±0.2 | - | 89.1±1.7 | 97.4±0.3 | - |
| InCTRL | CVPR 2024 | 94.5±1.8 | - | - | 87.7±1.9 | - | - |
| SOWA | Ours | 96.8±0.3 | 95.7±0.1 | 92.4±0.2 | 92.9±0.2 | 97.1±0.0 | 91.4±0.0 |
<!-- spacer between tables -->
Comparison with few-shot anomaly detection methods on the MVTec-AD, VisA, BTAD, DAGM, and DTD-Synthetic datasets.
<div align="center">
<img src="https://github.com/huzongxiang/sowa/blob/resources/fig5.png" alt="few-shot" style="width: 70%;">
</div>
## Visualization
Visualization results under the few-shot setting (K=4).
<div align="center">
<img src="https://github.com/huzongxiang/sowa/blob/resources/fig6.png" alt="visualization" style="width: 70%;">
</div>
## Mechanism
Hierarchical results on the MVTec-AD dataset. The images show actual model outputs, illustrating how the different layers (H1 to H4) respond to different feature modes. Each row is a different sample; the columns show the original image, the segmentation mask, the heatmap, the feature outputs from H1 to H4, and the fused result.
![mechanism](https://github.com/huzongxiang/sowa/blob/resources/fig7.png)
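The fusion column suggests the per-level maps are aggregated into a single anomaly map. As a hedged sketch only, a simple weighted combination looks like this; the uniform weighting is an assumption for exposition, not the paper's exact fusion scheme:
```python
# Hedged sketch: fusing per-level anomaly maps (H1..H4) into one map.
# Uniform weights are an assumption; the paper's fusion may differ.
import torch


def fuse_anomaly_maps(maps, weights=None):
    # maps: list of (B, H, W) anomaly maps, one per hierarchy level.
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)
    return torch.stack([w * m for w, m in zip(weights, maps)]).sum(dim=0)


# Example: four levels of 64x64 maps for a batch of 2 images.
h_maps = [torch.rand(2, 64, 64) for _ in range(4)]
fused = fuse_anomaly_maps(h_maps)  # (2, 64, 64)
```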
## Inference Speed
Inference performance comparison of different methods on a single NVIDIA RTX 3070 (8 GB) GPU.
<div align="center">
<img src="https://github.com/huzongxiang/sowa/blob/resources/fig9.png" alt="speed" style="width: 80%;">
</div>
## Citation
Please cite the following paper if this work helps your project:
```bibtex
@article{hu2024sowa,
title={SOWA: Adapting Hierarchical Frozen Window Self-Attention to Visual-Language Models for Better Anomaly Detection},
author={Hu, Zongxiang and Zhang, Zhaosheng},
journal={arXiv preprint arXiv:2407.03634},
year={2024}
}
```