mohakapoor commited on
Commit
8b64495
Β·
1 Parent(s): 858aaf1

update Readme.md

Browse files
Files changed (1) hide show
  1. README.md +42 -4
README.md CHANGED
@@ -141,10 +141,48 @@ python inference.py
141
  ## πŸ”¬ Technical Details
142
 
143
  ### Model Architecture (CRNN)
144
- - **CNN Encoder**: SmallCNN with stride=4, reduces W=256β†’64 timesteps
145
- - **BiLSTM**: 2-layer bidirectional LSTM (320 hidden, dropout=0.05)
146
- - **LayerNorm**: Stabilizes training before output layer
147
- - **Linear Output**: Maps to 63 classes (62 chars + 1 blank token)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
148
 
149
  ### Training Optimizations
150
  - **AdamW Optimizer**: lr=3e-4, weight_decay=1e-4
 
141
  ## πŸ”¬ Technical Details
142
 
143
  ### Model Architecture (CRNN)
144
+
145
+ The model uses a **CNN + RNN + CTC** architecture specifically designed for sequence recognition:
146
+
147
+ #### **CNN Encoder (SmallCNN)**
148
+ ```
149
+ Input: [B, 1, 60, 256] β†’ Output: [64, B, 128]
150
+ ```
151
+ - **Conv1 Block**: 3Γ—3 conv β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
152
+ - Channels: 1 β†’ 64
153
+ - Spatial: 60Γ—256 β†’ 30Γ—128
154
+ - **Conv2 Block**: 3Γ—3 conv β†’ BatchNorm β†’ ReLU β†’ MaxPool(1Γ—2)
155
+ - Channels: 64 β†’ 128
156
+ - Spatial: 30Γ—128 β†’ 30Γ—64
157
+ - **Residual Block**: 3Γ—3 conv β†’ BatchNorm β†’ ReLU β†’ 3Γ—3 conv β†’ BatchNorm + Skip Connection
158
+ - Maintains 128 channels and 30Γ—64 spatial dimensions
159
+ - **Height Pooling**: AdaptiveAvgPool2d(1, None) β†’ squeeze(2)
160
+ - Spatial: 30Γ—64 β†’ 1Γ—64 β†’ 64 timesteps
161
+ - Final: [64, B, 128] where T=64, B=batch_size, C=128
162
+
163
+ #### **RNN Decoder (BiLSTM)**
164
+ ```
165
+ Input: [64, B, 128] β†’ Output: [64, B, 640]
166
+ ```
167
+ - **Architecture**: 2-layer bidirectional LSTM
168
+ - **Hidden Size**: 320 per direction (total 640)
169
+ - **Dropout**: 0.05 between layers
170
+ - **Output**: [T, B, 2Γ—hidden] = [64, B, 640]
171
+
172
+ #### **Output Layer**
173
+ ```
174
+ Input: [64, B, 640] β†’ Output: [64, B, 63]
175
+ ```
176
+ - **LayerNorm**: Stabilizes 640-dimensional features
177
+ - **Linear**: Maps to vocabulary size (62 chars + 1 blank token)
178
+ - **Final Shape**: [T=64, B=batch_size, V=63]
179
+
180
+ #### **Key Design Features**
181
+ - **Total Stride**: 4 (256 β†’ 64 timesteps)
182
+ - **Height Compression**: 60 β†’ 1 (via pooling)
183
+ - **Residual Connections**: Prevents gradient vanishing
184
+ - **Bidirectional LSTM**: Captures context from both directions
185
+ - **LayerNorm**: Training stability before final classification
186
 
187
  ### Training Optimizations
188
  - **AdamW Optimizer**: lr=3e-4, weight_decay=1e-4