Spaces: mohakapoor/CaptchaOCR

mohakapoor committed on Aug 16

Commit

ada63c0

0 Parent(s):

Initial project setup on Dev branch


Files changed (7)

.gitattributes +14 -0
.gitignore +147 -0
README.md +143 -0
src/collate.py +34 -0
src/config.py +20 -0
src/data.py +63 -0
src/vocab.py +35 -0

.gitattributes ADDED

	@@ -0,0 +1,14 @@

+*.pth filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+checkpoints/** filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+**/*.png filter=lfs diff=lfs merge=lfs -text
+*.jpg filter=lfs diff=lfs merge=lfs -text
+**/*.jpg filter=lfs diff=lfs merge=lfs -text
+*.jpeg filter=lfs diff=lfs merge=lfs -text
+**/*.jpeg filter=lfs diff=lfs merge=lfs -text
+*.gif filter=lfs diff=lfs merge=lfs -text
+**/*.gif filter=lfs diff=lfs merge=lfs -text
+Metrics/** filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,147 @@

+```bash
+#!/usr/bin/env bash
+# Create a .gitignore that keeps the Dataset folder but ignores its contents,
+# plus common Python/ML ignores. Run this from your repo root.
+set -e
+cat > .gitignore << 'EOF'
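+# Keep the Dataset folder but ignore its contents.
+# Ignore everything inside, then re-include subdirectories and .gitkeep;
+# a file cannot be re-included if its parent directory itself is excluded.
+Dataset/**
+!Dataset/**/
+!Dataset/.gitkeep
+Dataset_test/**
+!Dataset_test/**/
+!Dataset_test/.gitkeep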
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.pyo
+*.pyd
+*.so
+*.egg-info/
+.eggs/
+dist/
+build/
+pip-wheel-metadata/
+wheels/
+.pytest_cache/
+.coverage
+#.coverage.*  # uncomment if you create multiple coverage files
+htmlcov/
+.cache/
+.mypy_cache/
+.pyre/
+.pytype/
+.dmypy.json
+.pyre_check/
+.ipynb_checkpoints/
+.site/
+# Virtual environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Logs and runtime
+*.log
+logs/
+*.pid
+*.seed
+*.out
+*.err
+# Jupyter
+.ipynb_checkpoints
+*.ipynb_checkpoints
+# IDE/editor
+.vscode/
+.history/
+.idea/
+*.code-workspace
+# OS-specific
+.DS_Store
+Thumbs.db
+desktop.ini
+# Images/artifacts (remove if you plan to commit images outside Dataset)
+*.png
+*.jpg
+*.jpeg
+*.bmp
+*.gif
+*.tiff
+*.webp
+# Models and checkpoints
+checkpoints/
+*.ckpt
+*.onnx
+*.tflite
+*.pth
+*.pt
+*.bin
+*.safetensors
+runs/
+outputs/
+artifacts/
+# Data/cache
+data/
+datasets/
+.input/
+.output/
+.cache/
+tmp/
+temp/
+*.tar
+*.tar.gz
+*.zip
+*.7z
+# Config/private
+*.env
+.env.*
+secrets.*
+*.key
+*.pem
+# Node/JS (if present)
+node_modules/
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+pnpm-lock.yaml
+# Rust (if present)
+target/
+# C/C++ build (if present)
+CMakeFiles/
+CMakeCache.txt
+cmake-build-*/
+*.o
+*.obj
+*.exe
+*.dll
+*.lib
+*.a
+*.out
+# Java (if present)
+*.class
+.gradle/
+build/
+EOF

README.md ADDED

	@@ -0,0 +1,143 @@

+# CAPTCHA OCR Project
+A PyTorch-based CAPTCHA recognition system using synthetic data generation and CTC-based sequence modeling.
+## 🎯 Project Overview
+This project implements an end-to-end CAPTCHA OCR system that can recognize text in CAPTCHA images. It uses:
+- **Synthetic CAPTCHA generation** for training data
+- **CRNN (CNN + RNN) architecture** for sequence recognition
+- **CTC (Connectionist Temporal Classification)** loss for training
+- **PyTorch** with CUDA support for GPU acceleration
+## 🏗️ Current Status
+### ✅ Completed Components
+- **Dataset Generation**: Synthetic CAPTCHA creation with train/val/test splits
+- **Configuration**: Centralized config with image dimensions and training parameters
+- **Vocabulary System**: Character encoding/decoding with CTC blank token support
+- **CTC Collate Function**: Proper batching for variable-length sequences
+- **CTC Decoding**: Greedy decode for inference
+### 🔧 In Progress / Next Steps
+- **PyTorch Dataset Class**: Image loading and preprocessing
+- **CRNN Model**: CNN encoder + BiLSTM + linear output
+- **Training Loop**: Complete training pipeline with validation
+- **Metrics**: CER (Character Error Rate) and exact match accuracy
+- **Inference Pipeline**: Model loading and prediction
+## 📁 Project Structure
+```
+CaptchaDetect/
+├── Dataset/                 # Full dataset (100k images) - for Colab training
+├── Dataset_test/           # Test dataset (1k images) - for local development
+│   └── captchas/
+│       ├── train/          # 80% of data
+│       ├── val/            # 10% of data
+│       └── test/           # 10% of data
+├── src/
+│   ├── config.py           # Configuration and hyperparameters
+│   ├── vocab.py            # Character vocabulary and CTC encoding
+│   ├── data.py             # Dataset generation script
+│   ├── collate.py          # CTC batching function
+│   └── [model files]       # Coming soon...
+├── .gitignore              # Ignores dataset contents, keeps structure
+└── README.md               # This file
+```
+## 🚀 Quick Start
+### 1. Environment Setup
+```bash
+# Install PyTorch with CUDA support (adjust version as needed)
+pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128
+# Install other dependencies
+pip install captcha pandas pillow
+```
+### 2. Generate Test Dataset
+```bash
+# Run from the repo root: data.py writes to the relative path Dataset_test/
+python src/data.py
+```
+This creates 1,000 synthetic CAPTCHAs in `Dataset_test/captchas/`, split 80/10/10 into `train/`, `val/`, and `test/`, each with its own `labels.csv`.
+### 3. Configuration
+Edit `src/config.py` to adjust:
+- Image dimensions (H=48, W_max=224)
+- Batch sizes (32 for local GTX 1650, 128 for Colab T4)
+- Training parameters
+## 🎮 Usage
+### Local Development (GTX 1650)
+- Use `Dataset_test` (1k images)
+- Batch size: 32-48
+- Good for rapid iteration and testing
+### Colab Training (Tesla T4)
+- Use `Dataset` (100k images)
+- Batch size: 128
+- Expected training time: 2-4 hours for 40 epochs
+## 🔬 Technical Details
+### Model Architecture
+- **CNN Encoder**: Reduces image to sequence representation
+- **BiLSTM**: Processes sequential features
+- **Linear Output**: Maps to vocabulary size (including blank token)
+### CTC Training
+- **Input**: Images resized to 48×224
+- **Output**: Character sequences (a-z, A-Z, 0-9)
+- **Loss**: CTCLoss with blank=0
+- **Decoding**: Greedy CTC decode
+### Data Format
+- **Images**: Grayscale, normalized tensors
+- **Labels**: CSV with filename and text label
+- **Batching**: Variable-length sequences handled by custom collate
+## 📊 Performance Expectations
+### GTX 1650 (4GB VRAM)
+- Training time: 3-8 hours for 100k images × 40 epochs
+- Batch size: 32-48
+- Memory efficient with H=48
+### Tesla T4 (16GB VRAM)
+- Training time: 2-4 hours for 100k images × 40 epochs
+- Batch size: 128
+- Mixed precision (AMP) enabled
+## 🛠️ Development Workflow
+1. **Implement Dataset class** - Load and preprocess images
+2. **Build CRNN model** - CNN + BiLSTM architecture
+3. **Create training loop** - With validation and checkpoints
+4. **Add metrics** - CER and accuracy tracking
+5. **Test on small dataset** - Verify everything works
+6. **Scale to full dataset** - Train on Colab
+## 🤝 Contributing
+This is a learning project! Feel free to:
+- Ask questions about implementation details
+- Experiment with different architectures
+- Improve the data generation or training pipeline
+## 📚 Resources
+- [CTC Paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
+- [CRNN Architecture](https://arxiv.org/abs/1507.05717)
+- [PyTorch CTCLoss Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html)
+## 📝 License
+This project is for educational purposes. Feel free to use and modify as needed.
+---
+**Happy coding! 🚀**
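
The README lists CER (Character Error Rate) and exact match accuracy as planned metrics, but this commit does not implement them yet. The following is a minimal pure-Python sketch of how they could be computed, assuming CER is defined as total edit distance divided by total reference length. The function names `edit_distance`, `cer`, and `exact_match` are hypothetical and do not exist in the repo.

```python
from typing import List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(refs: List[str], hyps: List[str]) -> float:
    """Total edits divided by total reference characters (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


def exact_match(refs: List[str], hyps: List[str]) -> float:
    """Fraction of predictions that equal their label exactly."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Because CAPTCHA labels here are case-sensitive (`a-z`, `A-Z`, `0-9`), the sketch compares characters exactly; a case-insensitive variant would lowercase both strings first.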

src/collate.py ADDED

	@@ -0,0 +1,34 @@

+from typing import List,Tuple
+import torch
+from src.config import cfg
+from src.vocab import encode_text
+def ctc_collate(batch: List[Tuple[torch.Tensor, str, str]]):
+    """
+    batch: list of (image_tensor [C,H,W_max], label_str, rel_path)
+    returns:
+      images: [B,C,H,W_max]
+      targets_flat: [sum(len(label_i))]
+      target_lengths: [B]
+      input_lengths: [B]  (all equal if same W_max/stride)
+      rel_paths: list[str]
+    """
+    # Stack images into one batch tensor; all share the same [C, H, W_max] shape
+    images = torch.stack([item[0] for item in batch], dim=0)
+    labels = [item[1] for item in batch]
+    # Encode each label to class indices (1-based; 0 is reserved for the CTC blank)
+    encoded = [torch.tensor(encode_text(t), dtype=torch.long) for t in labels]
+    target_lengths = torch.tensor([len(t) for t in encoded], dtype=torch.long)
+    if len(encoded) > 0:
+        targets_flat = torch.cat(encoded, dim=0)
+    else:
+        targets_flat = torch.empty(0, dtype=torch.long)
+    B, C, H, W = images.shape
+    input_len = W // cfg.total_stride
+    input_lengths = torch.full((B,), input_len, dtype=torch.long)
+    rel_paths = [item[2] for item in batch]
+    return images, targets_flat, target_lengths, input_lengths, rel_paths
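
`ctc_collate` derives `input_lengths` as `W // cfg.total_stride`. CTC can only align a label if the number of time steps is at least the label length plus one extra blank for every pair of adjacent identical characters. A small sketch of that feasibility check follows, assuming the future model really emits `W // total_stride` time steps (the model is not part of this commit). The helper names `min_ctc_input_length` and `input_length` are hypothetical.

```python
def min_ctc_input_length(label: str) -> int:
    """Smallest number of time steps CTC needs to emit `label`.

    Adjacent repeated characters must be separated by a blank,
    so each such pair costs one extra time step.
    """
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


def input_length(width: int, total_stride: int) -> int:
    """Time steps after horizontal downsampling, mirroring W // cfg.total_stride."""
    return width // total_stride


# With the defaults in this commit (W_max=224, total_stride=4) there are 56 steps,
# which comfortably covers the worst case for a 7-character label (13 steps).
steps = input_length(224, 4)
worst_case = min_ctc_input_length("aaaaaaa")
```

If the stride were later raised or the width reduced, a check like this in the dataset or collate code would catch labels that CTC cannot align before the loss becomes infinite.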

src/config.py ADDED

	@@ -0,0 +1,20 @@

+import os
+import string
+from dataclasses import dataclass
+@dataclass
+class Config:
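+    # Build the default path with os.path.join: the literal "Dataset_test\captchas"
+    # contains an invalid "\c" escape and only works as a path on Windows.
+    data_root: str = os.getenv("DATA_ROOT", os.path.join("Dataset_test", "captchas"))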
+    chars: str = string.ascii_letters + string.digits
+    H: int = 48
+    W_max: int = 224
+    grayscale: bool = True
+    total_stride: int = 4  # horizontal downsampling factor; collate computes input_len = W // total_stride
+    batch_size: int = 32
+    num_workers: int = 4
+    amp: bool = True
+cfg = Config()
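
One detail of this config worth noting: a dataclass field default written as `os.getenv(...)` is evaluated once, when the class body runs at import time, so changing `DATA_ROOT` afterwards has no effect. The sketch below shows one possible alternative using `dataclasses.field(default_factory=...)`, which reads the environment each time an instance is created. `LazyConfig` is a hypothetical name used only for illustration, and only a few of the original fields are repeated.

```python
import os
from dataclasses import dataclass, field


def _default_data_root() -> str:
    # Read DATA_ROOT at instantiation time, falling back to the local test dataset
    return os.getenv("DATA_ROOT", os.path.join("Dataset_test", "captchas"))


@dataclass
class LazyConfig:
    data_root: str = field(default_factory=_default_data_root)
    H: int = 48
    W_max: int = 224
    total_stride: int = 4
    batch_size: int = 32
```

With this variant, a Colab notebook could set `os.environ["DATA_ROOT"]` after importing the module and still get the full dataset path from a fresh `LazyConfig()`.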

src/data.py ADDED

	@@ -0,0 +1,63 @@

+from captcha.image import ImageCaptcha
+import random
+import string
+import os
+import csv
+import pandas as pd
+# config
+DATASET_DIR = "Dataset_test/captchas"
+LABELS = "Dataset_test/labels.csv"
+NUM_IMAGES = 1000
+CHARS = string.ascii_letters + string.digits
+CAPTCHA_LEN_LOWER_LIMIT = 5
+CAPTCHA_LEN_UPPER_LIMIT = 7
+# Order matters: the image loop and the label split below both index into this list.
+directories = [["train",0.8],["val",0.1],["test",0.1]]
+os.makedirs(DATASET_DIR, exist_ok=True)
+image = ImageCaptcha(width=160, height=60)
+# Integer split boundaries: [0, train_end) -> train, [train_end, val_end) -> val, rest -> test
+train_end = int(NUM_IMAGES * directories[0][1])
+val_end = train_end + int(NUM_IMAGES * directories[1][1])
+with open(LABELS,mode="w",newline="") as f:
+    writer = csv.writer(f)
+    writer.writerow(["filename","label"])
+    for i in range(NUM_IMAGES):
+        # Report progress roughly every 1% (guarded so small NUM_IMAGES cannot divide by zero)
+        if i % max(1, NUM_IMAGES // 100) == 0:
+            print(f"{i} images made")
+        # Pick the split directory from the image index
+        if i < train_end:
+            split = directories[0][0]
+        elif i < val_end:
+            split = directories[1][0]
+        else:
+            split = directories[2][0]
+        OUTPUT_DIR = os.path.join(DATASET_DIR, split)
+        os.makedirs(OUTPUT_DIR,exist_ok=True)
+        text = ''.join(random.choices(CHARS, k=random.randint(CAPTCHA_LEN_LOWER_LIMIT,CAPTCHA_LEN_UPPER_LIMIT)))
+        filename = f"{text}_{i}.png"
+        filepath = os.path.join(OUTPUT_DIR, filename)
+        image.write(text, filepath)
+        writer.writerow([filename,text])
+print("Data Generated!")
+# Read labels as strings so all-digit labels such as "00123" keep their exact text
+df = pd.read_csv(LABELS, dtype=str, keep_default_na=False)
+# Split datasets
+df_train = df.iloc[:train_end]
+df_val = df.iloc[train_end:val_end]
+df_test = df.iloc[val_end:]
+# Save
+df_train.to_csv(os.path.join(DATASET_DIR,"train/labels.csv"), index=False)
+df_val.to_csv(os.path.join(DATASET_DIR,"val/labels.csv"), index=False)
+df_test.to_csv(os.path.join(DATASET_DIR,"test/labels.csv"), index=False)
+print("Labels Generated")
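
The script assigns each image to a split directory by its index and then slices the label file by row position, so both steps have to agree on the same boundaries. One way to keep them in sync is a single pure function that maps an index to a split name. The sketch below is an illustration under that assumption; `split_for_index` and its `fractions` parameter are hypothetical names, not part of the repo.

```python
from collections import Counter
from typing import Sequence, Tuple

DEFAULT_FRACTIONS = (("train", 0.8), ("val", 0.1), ("test", 0.1))


def split_for_index(
    i: int,
    n: int,
    fractions: Sequence[Tuple[str, float]] = DEFAULT_FRACTIONS,
) -> str:
    """Return the split name for sample i out of n, using cumulative integer boundaries."""
    upper = 0
    for name, frac in fractions[:-1]:
        upper += int(n * frac)
        if i < upper:
            return name
    # Everything past the last boundary falls into the final split
    return fractions[-1][0]


# Count how many of 1,000 samples land in each split
counts = Counter(split_for_index(i, 1000) for i in range(1000))
```

Using the same function for both the output directory and the per-split `labels.csv` would guarantee that every label row points at a file in the same split folder.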

src/vocab.py ADDED

	@@ -0,0 +1,35 @@

+from typing import List
+from src.config import cfg
+itos = ["<blank>"] + list(cfg.chars)
+stoi = {c: i+1 for i,c in enumerate(cfg.chars)}
+def encode_text(text: str) -> List[int]:
+    return [stoi[c] for c in text]
+def decode_indices(indices: List[int]) -> str:
+    return "".join(itos[i] for i in indices if i != 0)
+def ctc_greedy_decode(logits) -> List[str]:
+    """
+    Greedy CTC decode for a batch.
+    logits: torch.Tensor of shape [T, B, V] (before softmax or log_softmax).
+    Returns: list of B decoded strings.
+    """
+    # Best class per time step, shape [T, B]; argmax is a tensor method, so no torch import is needed here
+    pred = logits.argmax(dim=-1)
+    B = pred.shape[1]
+    decoded = []
+    for b in range(B):
+        prev = -1
+        chars = []
+        for t in pred[:,b].tolist():
+            if t!=0 and t!= prev:
+                chars.append(itos[t])
+            prev = t
+        decoded.append("".join(chars))
+    return decoded
+def vocab_size() -> int:
+    return len(itos)
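
The collapse rule inside `ctc_greedy_decode` (drop blanks, merge consecutive repeats, and let a blank separate genuine double letters) can be checked without PyTorch. Below is a pure-Python sketch of the same rule applied to a single index path. It rebuilds the vocabulary locally so it is self-contained; `ITOS` and `collapse_path` are hypothetical names that mirror `itos` and the inner loop above.

```python
import string
from typing import List

# Same layout as vocab.py: index 0 is the CTC blank, characters start at 1
CHARS = string.ascii_letters + string.digits
ITOS = ["<blank>"] + list(CHARS)


def collapse_path(path: List[int], blank: int = 0) -> str:
    """Collapse one best-path index sequence into text using the greedy CTC rule."""
    out = []
    prev = None
    for idx in path:
        # Emit only when the symbol is not blank and differs from the previous step
        if idx != blank and idx != prev:
            out.append(ITOS[idx])
        prev = idx
    return "".join(out)


# 'a' repeated, a blank, then 'a' again and 'b' repeated: the blank keeps the double 'a'
decoded = collapse_path([1, 1, 0, 1, 2, 2, 0])
```

Here `decoded` is `"aab"`: the first two steps merge into one `a`, the blank resets the repeat check so the next `a` is emitted, and the two `b` steps merge into one.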