CLIP-Based Product Defect Detection Model Card
Model Details
Model Description
This model detects product defects using a CLIP-based anomaly detection method. A pre-trained CLIP model is fine-tuned to identify defects in product images, automating quality control and defect detection on production lines.
- Developed by: ์ค์
- Funded by: 4INLAB INC.
- Shared by: zhou2023anomalyclip
- Model type: CLIP based Anomaly Detection
- Language(s): Python, PyTorch
- License: Apache 2.0, MIT, GPL-3.0
Technical Limitations
- The model requires sufficient and diverse training data for defect detection. If the training dataset is insufficient or imbalanced, model performance may degrade.
- Real-time defect detection performance may vary with hardware specifications, and detection accuracy may drop for high-resolution images.
- When defects are very subtle or products are highly similar to one another, the model may fail to detect defects accurately.
Training Details
Hardware
- CPU: Intel Core i9-13900K (24 Cores, 32 Threads)
- RAM: 64GB DDR5
- GPU: NVIDIA RTX 4090Ti 24GB
- Storage: 1TB NVMe SSD + 2TB HDD
- Operating System: Windows 11 Pro
Dataset Information
This model is trained on time-series inventory data. The data contains information on inventory levels, dates, and other related features. It is preprocessed and normalized with MinMax scaling to suit the Conv1D and BiLSTM layers.
Data sources: https://huggingface.co/datasets/quandao92/vision-inventory-prediction-data
Training size:
- Round 1: Few-shot learning with anomaly (10ea), good (4ea)
- Round 2: Few-shot learning with anomaly (10ea), good (10ea)
- Round 3: Few-shot learning with anomaly (10ea), good (110ea)
Time-step: within 5 seconds
Data Processing Techniques:
- normalization: description: "Standardize image pixel values using the mean and standard deviation" method: "'Normalize' from 'torchvision.transforms'"
- max_resize: description: "Resize while preserving the image's maximum dimension, keeping the aspect ratio and adding padding" method: "Custom 'ResizeMaxSize' class"
- random_resized_crop: description: "Randomly crop and resize images during training to add variation" method: "'RandomResizedCrop' from 'torchvision.transforms'"
- resize: description: "Resize images to a fixed size matching the model input" method: "'Resize' with BICUBIC interpolation"
- center_crop: description: "Crop the center of the image to the specified size" method: "'CenterCrop'"
- to_tensor: description: "Convert images to PyTorch tensors" method: "'ToTensor'"
- augmentation (optional): description: "Apply various random transforms for data augmentation; configurable via 'AugmentationCfg'" method: "Uses 'timm' library if specified"
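For reference, the steps above can be assembled into a single torchvision pipeline. The sketch below is illustrative only: the image size and the CLIP normalization constants are assumptions, not values taken from the training code.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Assumed CLIP normalization constants; the values actually used in training may differ.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def build_transform(image_size: int = 518, train: bool = True) -> transforms.Compose:
    """Illustrative preprocessing pipeline mirroring the steps listed above."""
    if train:
        # Random crop + resize for augmentation during training
        resize_ops = [transforms.RandomResizedCrop(image_size, interpolation=InterpolationMode.BICUBIC)]
    else:
        # Deterministic resize + center crop for evaluation
        resize_ops = [
            transforms.Resize(image_size, interpolation=InterpolationMode.BICUBIC),
            transforms.CenterCrop(image_size),
        ]
    return transforms.Compose(resize_ops + [
        transforms.ToTensor(),                      # convert PIL image to a PyTorch tensor
        transforms.Normalize(CLIP_MEAN, CLIP_STD),  # standardize with mean / std
    ])
```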
AD-CLIP Model Architecture
- model:
- input_layer:
- image_size: [640, 640, 3] # standard input image size
- backbone:
- name: CLIP (ViT-B-32) # uses the CLIP vision transformer as the backbone
- filters: [32, 64, 128, 256, 512] # filter sizes for each layer of the vision transformer
- neck:
- name: Anomaly Detection Module # additional module for defect detection
- method: Contrastive Learning # contrastive learning using CLIP features
- head:
- name: Anomaly Detection Head # final output layer for defect detection
- outputs:
- anomaly_score: 1 # anomaly score (normal vs. abnormal)
- class_probabilities: N # probability for each class (defective or not)
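As a rough sketch of how the backbone, neck, and head fit together, the hypothetical function below scores an image by comparing its CLIP image embedding with the normal and anomaly prompt embeddings. All names and the temperature value are illustrative, not the project's actual implementation.

```python
import torch
import torch.nn.functional as F

def anomaly_score(image_features: torch.Tensor,
                  normal_text_features: torch.Tensor,
                  anomaly_text_features: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical head: compare the image embedding with the normal and
    anomaly prompt embeddings and return the probability of 'anomaly'."""
    img = F.normalize(image_features, dim=-1)                        # [B, D]
    txt = F.normalize(torch.stack([normal_text_features,
                                   anomaly_text_features]), dim=-1)  # [2, D]
    logits = img @ txt.t() / temperature                             # [B, 2] similarity logits
    return logits.softmax(dim=-1)[:, 1]                              # P(anomaly) per image

# Example with random features (batch of 4, 512-dim embeddings), for illustration only.
scores = anomaly_score(torch.randn(4, 512), torch.randn(512), torch.randn(512))
print(scores)
```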
Optimizer and Loss Function
- training:
- optimizer:
- name: AdamW # AdamW optimizer (with weight decay)
- lr: 0.0001 # learning rate
- loss:
- classification_loss: 1.0 # classification loss (cross-entropy)
- anomaly_loss: 1.0 # anomaly detection loss (for the anomaly detection module)
- contrastive_loss: 1.0 # contrastive loss (similarity-based)
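A minimal sketch, assuming a placeholder model, of how the AdamW optimizer and the three weighted loss terms above could be combined. The loss weights and learning rate mirror the configuration; the weight-decay value and the sub-loss inputs are illustrative.

```python
import torch
from torch import nn

# Placeholder head standing in for the CLIP-based detector being fine-tuned.
model = nn.Linear(512, 2)
classification_loss_fn = nn.CrossEntropyLoss()

def total_loss(class_logits, labels, anomaly_loss, contrastive_loss,
               w_cls=1.0, w_anom=1.0, w_con=1.0):
    """Weighted sum of the three loss terms from the configuration above."""
    return (w_cls * classification_loss_fn(class_logits, labels)
            + w_anom * anomaly_loss
            + w_con * contrastive_loss)

# AdamW with weight decay, using the learning rate from the configuration above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```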
Metrics
- metrics:
- Precision
- Recall
- mAP # Mean Average Precision
- F1-Score # balanced measure combining precision and recall
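These metrics can be computed with scikit-learn, which is already in the dependency list below. The labels and scores here are toy values for illustration; average precision stands in for the per-class computation behind mAP.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

# Toy ground-truth labels and predictions (1 = anomaly, 0 = normal), for illustration only.
y_true   = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred   = [0, 1, 0, 0, 1, 0, 1, 1]
y_scores = [0.1, 0.9, 0.4, 0.2, 0.8, 0.3, 0.7, 0.95]  # anomaly scores from the model

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("AP (per-class basis of mAP):", average_precision_score(y_true, y_scores))
```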
Training Parameters
Hyperparameter Settings
- Learning Rate: 0.001
- Batch Size: 8
- Epochs: 200
- Pre-trained CLIP model
Evaluation Parameters
- F1-score: 95% or higher
Training Performance and Test Results
Test results:
- Anomaly Product
- Normal Product
Installation and Execution Guidelines
To run this model, Python and the following libraries are required:
- ftfy==6.2.0: Library for fixing text normalization and encoding issues.
- matplotlib==3.9.0: Library for data visualization and plotting.
- numpy==1.24.3: Core library for numerical computation.
- opencv_python==4.9.0.80: Library for image and video processing.
- pandas==2.2.2: Library for data analysis and manipulation.
- Pillow==10.3.0: Library for image file handling and conversion.
- PyQt5==5.15.10: Framework for GUI application development.
- PyQt5_sip==12.13.0: Library providing the interface between PyQt5 and Python.
- regex==2024.5.15: Library for regular expression processing.
- scikit_learn==1.2.2: Library for machine learning and data analysis.
- scipy==1.9.1: Library for scientific and technical computing.
- setuptools==59.5.0: Library for building and installing Python packages.
- scikit-image: Library for image processing and analysis.
- tabulate==0.9.0: Library for printing data in tabular form.
- thop==0.1.1.post2209072238: Tool for counting the FLOPs of PyTorch models.
- timm==0.6.13: Library providing a wide range of state-of-the-art image classification models.
- torch==2.0.0: The PyTorch deep learning framework.
- torchvision==0.15.1: PyTorch extension library for computer vision tasks.
- tqdm==4.65.0: Library for displaying progress bars.
- pyautogui: Library for GUI automation.
Model execution steps:
✅ Prompt generation
training_lib/prompt_ensemble.py
📌 Prompts Built in the Code
- Normal Prompt: '["{ }"]'
  → Normal Prompt Example: "object"
- Anomaly Prompt: '["damaged { }"]'
  → Anomaly Prompt Example: "damaged object"
📌 Construction Process
- 'prompts_pos (Normal)': Combines the class name with the normal template
- 'prompts_neg (Anomaly)': Combines the class name with the anomaly template
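A minimal sketch of this construction process, assuming the templates shown above; the helper name and the example class are illustrative, and the actual logic lives in training_lib/prompt_ensemble.py.

```python
# Templates mirroring the prompt formats above (written without the inner space
# so that str.format can fill in the class name).
NORMAL_TEMPLATES = ["{}"]
ANOMALY_TEMPLATES = ["damaged {}"]

def build_prompts(class_name: str):
    """Combine a class name with the normal / anomaly templates."""
    prompts_pos = [t.format(class_name) for t in NORMAL_TEMPLATES]   # normal prompts
    prompts_neg = [t.format(class_name) for t in ANOMALY_TEMPLATES]  # anomaly prompts
    return prompts_pos, prompts_neg

print(build_prompts("object"))  # (['object'], ['damaged object'])
```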
✅ Initial settings for training
- Define the paths to the training dataset and the directory for saving model checkpoints
parser.add_argument("--train_data_path", type=str, default="./data/", help="train dataset path")
parser.add_argument("--dataset", type=str, default='smoke_cloud', help="train dataset name")
parser.add_argument("--save_path", type=str, default='./checkpoint/', help='path to save results')
✅ Hyperparameter settings
- Set the depth parameter: the depth of the embedding learned during prompt training. This affects the model's ability to learn complex features from the data.
parser.add_argument("--depth", type=int, default=9, help="prompt embedding depth")
- Define the size of the input images used for training (in pixels)
parser.add_argument("--image_size", type=int, default=518, help="image size")
- Set the basic training parameters
parser.add_argument("--epoch", type=int, default=500, help="epochs")
parser.add_argument("--learning_rate", type=float, default=0.0001, help="learning rate")
parser.add_argument("--batch_size", type=int, default=8, help="batch size")
- Size/depth parameter for the DPAM (Deep Prompt Attention Mechanism)
parser.add_argument("--dpam", type=int, default=20, help="dpam size")
1. ViT-B/32 and ViT-B/16: --dpam should be around 10-13
2. ViT-L/14 and ViT-L/14@336px: --dpam should be around 20-24
→ DPAM is used to refine and enhance specific layers of a model, particularly in Vision Transformers (ViT).
→ It helps the model focus on important features within each layer through an attention mechanism.
→ Layers: DPAM is applied across multiple layers, allowing deeper and more detailed feature extraction.
→ The number of layers DPAM influences is adjustable via --dpam, controlling how much of the model is fine-tuned.
→ To refine the entire model, set --dpam to the number of layers in the model (e.g., 12 for ViT-B and 24 for ViT-L).
→ To focus only on the final layers (where the model usually learns complex features), choose fewer DPAM layers, as illustrated in the sketch below.
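To make the layer-count rule concrete, the hypothetical helper below maps --dpam to the set of transformer blocks that would be refined, assuming DPAM is applied to the last N layers of the backbone.

```python
def dpam_layer_indices(total_layers: int, dpam: int) -> list[int]:
    """Return the indices of the transformer blocks DPAM would refine,
    assuming it is applied to the last `dpam` layers of the backbone."""
    dpam = min(dpam, total_layers)  # --dpam is capped at the backbone depth
    return list(range(total_layers - dpam, total_layers))

# ViT-B has 12 transformer blocks, ViT-L has 24.
print(dpam_layer_indices(12, 12))  # refine the whole ViT-B backbone
print(dpam_layer_indices(24, 20))  # focus on the last 20 layers of ViT-L
```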
✅ Test process
📌 Load the pre-trained and fine-tuned (checkpoint) models
- Pre-trained models (./pre-trained model/):
→ Contains the pre-trained backbones (ViT-B, ViT-L, ...)
→ Used as the starting point for training the CLIP model
→ The pre-trained model helps speed up and improve training by leveraging previously learned features
- Fine-tuned models (./checkpoint/):
→ "epoch_N.pth" files in this folder store the model's states during the fine-tuning process.
→ Each ".pth" file represents a version of the model fine-tuned from the pre-trained model.
→ These checkpoints can be used to resume fine-tuning, evaluate the model at different stages, or select the best-performing version.
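A minimal sketch of loading one of these checkpoints for evaluation or resumed fine-tuning; the placeholder model, the epoch number, and the checkpoint dictionary format are assumptions about the project's code.

```python
import torch
from torch import nn

# Placeholder module standing in for the CLIP-based detector; in practice the
# model is first built from the pre-trained backbone in ./pre-trained model/.
model = nn.Module()

# Illustrative file name; the epoch number in "epoch_N.pth" depends on training.
checkpoint_path = "./checkpoint/epoch_200.pth"
state = torch.load(checkpoint_path, map_location="cpu")

# Checkpoints may be a raw state_dict or a dictionary wrapping one; handling
# both cases here is an assumption about the save format.
state_dict = state["state_dict"] if isinstance(state, dict) and "state_dict" in state else state
model.load_state_dict(state_dict, strict=False)
model.eval()
```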



