Commit 8b64495 · 1 Parent(s): 858aaf1
update Readme.md

README.md CHANGED
@@ -141,10 +141,48 @@ python inference.py
## 🔬 Technical Details

### Model Architecture (CRNN)

The model uses a **CNN + RNN + CTC** architecture specifically designed for sequence recognition:

#### **CNN Encoder (SmallCNN)**
```
Input: [B, 1, 60, 256] → Output: [64, B, 128]
```
- **Conv1 Block**: 3×3 conv → BatchNorm → ReLU → MaxPool(2×2)
  - Channels: 1 → 64
  - Spatial: 60×256 → 30×128
- **Conv2 Block**: 3×3 conv → BatchNorm → ReLU → MaxPool(1×2)
  - Channels: 64 → 128
  - Spatial: 30×128 → 30×64
- **Residual Block**: 3×3 conv → BatchNorm → ReLU → 3×3 conv → BatchNorm + Skip Connection
  - Maintains 128 channels and 30×64 spatial dimensions
- **Height Pooling**: AdaptiveAvgPool2d((1, None)) → squeeze(2)
  - Spatial: 30×64 → 1×64 → 64 timesteps
  - Final: [64, B, 128] after permuting to time-major order, where T=64, B=batch_size, C=128

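
The encoder shapes listed above can be checked with a short PyTorch sketch. This is an illustration rather than the repository's actual code: the class layout, `padding=1` on every 3×3 convolution (needed for the stated spatial sizes), and the absence of an activation after the skip addition are all assumptions.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN, plus an identity skip."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Assumption: no activation after the addition; the README does not say.
        return x + self.body(x)


class SmallCNN(nn.Module):
    """Hypothetical reconstruction of the encoder described in the README."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2)),  # 60x256 -> 30x128
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # 30x128 -> 30x64 (width only)
        )
        self.res = ResidualBlock(128)
        # Collapse height to 1 and leave the width (time axis) untouched.
        self.pool = nn.AdaptiveAvgPool2d((1, None))

    def forward(self, x):
        x = self.res(self.conv2(self.conv1(x)))  # [B, 128, 30, 64]
        x = self.pool(x).squeeze(2)              # [B, 128, 64]
        return x.permute(2, 0, 1)                # [T=64, B, C=128]
```

Feeding a `[B, 1, 60, 256]` tensor through `SmallCNN()` should return a `[64, B, 128]` tensor, matching the shapes in the list.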
#### **RNN Decoder (BiLSTM)**
```
Input: [64, B, 128] → Output: [64, B, 640]
```
- **Architecture**: 2-layer bidirectional LSTM
- **Hidden Size**: 320 per direction (640 total)
- **Dropout**: 0.05 between layers
- **Output**: [T, B, 2×hidden] = [64, B, 640]

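
As a rough sketch, the settings above map directly onto `torch.nn.LSTM`. The helper name `build_bilstm` is hypothetical, and the time-major layout (`batch_first=False`, the PyTorch default) is assumed from the `[T, B, C]` shapes shown.

```python
import torch
import torch.nn as nn


def build_bilstm() -> nn.LSTM:
    """2-layer bidirectional LSTM with the hyperparameters listed above."""
    return nn.LSTM(
        input_size=128,      # channels coming out of the CNN encoder
        hidden_size=320,     # per direction
        num_layers=2,
        bidirectional=True,  # forward and backward outputs are concatenated
        dropout=0.05,        # applied between the two layers only
    )


rnn = build_bilstm()
features = torch.zeros(64, 4, 128)  # [T, B, C] with a batch of 4
outputs, _ = rnn(features)
print(tuple(outputs.shape))         # (64, 4, 640)
```

The last dimension is 640 because the 320-unit forward and backward hidden states are concatenated at every timestep.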
#### **Output Layer**
```
Input: [64, B, 640] → Output: [64, B, 63]
```
- **LayerNorm**: Normalizes the 640-dimensional BiLSTM features to stabilize training
- **Linear**: Maps to vocabulary size (62 chars + 1 CTC blank token)
- **Final Shape**: [T=64, B=batch_size, V=63]

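
The extra blank class exists because the model is trained with CTC. One common way to turn per-timestep class indices into text is greedy CTC decoding: collapse repeated indices, then drop blanks. The sketch below is illustrative only. The alphabet (digits plus ASCII letters), the blank index of 0, and the function name are assumptions; the README states only that there are 62 characters and one blank.

```python
import string

# Assumption: 62-character alphabet, with the blank token at index 0.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BLANK = 0


def ctc_greedy_decode(best_path):
    """Collapse repeats and remove blanks from a list of argmax indices."""
    chars = []
    prev = BLANK
    for idx in best_path:
        # Emit a character only when it is not blank and differs from the
        # previous timestep; a blank in between allows a doubled letter.
        if idx != BLANK and idx != prev:
            chars.append(ALPHABET[idx - 1])
        prev = idx
    return "".join(chars)


print(ctc_greedy_decode([0, 11, 11, 0, 11, 2, 2, 0]))  # aa1
```

In practice `best_path` would come from taking the argmax over the 63-way output at each of the 64 timesteps for one sample.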
#### **Key Design Features**
- **Total Stride**: 4 along the width (256 → 64 timesteps)
- **Height Compression**: 60 → 1 (30 after the first max-pool, then 1 via adaptive average pooling)
- **Residual Connection**: Helps mitigate vanishing gradients
- **Bidirectional LSTM**: Captures context from both directions
- **LayerNorm**: Improves training stability before final classification
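
The stride and height figures follow from the standard output-size formula for convolution and pooling layers. The sketch below traces them in plain Python. It assumes `padding=1` on the 3×3 convolutions and pooling strides equal to the kernel size; the function names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height=60, width=256):
    """Return (stage, height, width) after each encoder stage."""
    stages = []
    h, w = height, width

    # Conv1: 3x3 conv (size-preserving with padding 1), then 2x2 max-pool.
    h, w = out_size(h, 3, padding=1), out_size(w, 3, padding=1)
    h, w = out_size(h, 2, stride=2), out_size(w, 2, stride=2)
    stages.append(("conv1", h, w))

    # Conv2: 3x3 conv, then 1x2 max-pool (halves the width only).
    h, w = out_size(h, 3, padding=1), out_size(w, 3, padding=1)
    h, w = out_size(h, 1, stride=1), out_size(w, 2, stride=2)
    stages.append(("conv2", h, w))

    # Residual block: two size-preserving 3x3 convs.
    stages.append(("residual", h, w))

    # Adaptive average pooling collapses the height to 1.
    stages.append(("height_pool", 1, w))
    return stages


for name, h, w in trace_shapes():
    print(f"{name}: {h}x{w}")
```

The final width of 64 for an input width of 256 gives the total stride of 4, and each remaining column becomes one timestep for the BiLSTM.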

### Training Optimizations
- **AdamW Optimizer**: lr=3e-4, weight_decay=1e-4