lordhet committed (verified) · Commit 39b73db · Parent: 1bfbba4

Update README.md

Files changed (1): README.md (+281 −3)
---
license: mit
language:
- en
tags:
- autonomous-driving
- self-driving-car
- robotics
- imitation-learning
- behavioral-cloning
- pilotnet
- esp8266
- esp32-cam
- pytorch
- classification
library_name: pytorch
pipeline_tag: image-classification
model-index:
- name: ActionNet
  results: []
---

# ActionNet – Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning: a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.

Part of the [OpenBot PC Server Project](https://github.com/loki-smip/openbot-pc-server-projuct-).

---

## Model Description

ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data only contains a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average everything toward zero.

- **Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)
- **Output:** probability distribution over 9 driving actions

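To make that failure mode concrete, here is a toy illustration (hypothetical numbers, not from this project's data) of why MSE collapses on discrete keyboard labels:

```python
import torch

# The same visual situation is labeled TURN LEFT (-49) on some frames and
# TURN RIGHT (+49) on others, as happens with discrete keyboard commands.
# The constant prediction that minimizes MSE is the mean of the targets:
targets = torch.tensor([-49.0, 49.0, -49.0, 49.0])
pred = targets.mean()                     # 0.0 -> "go straight"
mse = ((targets - pred) ** 2).mean()      # loss is minimized, behavior is wrong
print(pred.item(), mse.item())            # 0.0 2401.0

# A classifier instead keeps LEFT and RIGHT as two separate high-probability
# classes and never invents the undemonstrated in-between action.
```
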
### The 9 Actions

| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |

Motor values are shown at speed=70 and scale proportionally with the speed setting.

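The table as code, sketched under the assumption that this is what `action_to_command()` in `model.py` does (the helper name comes from the Quickstart below; the body here is an illustration):

```python
# Base (left, right) motor pairs are defined at speed=70 and scaled linearly
# for other speed settings.
BASE_COMMANDS = {
    0: (0, 0),      # STOP
    1: (70, 70),    # FORWARD
    2: (-70, -70),  # BACKWARD
    3: (-49, 49),   # TURN LEFT (pivot in place)
    4: (49, -49),   # TURN RIGHT (pivot in place)
    5: (21, 70),    # FORWARD+LEFT (arc)
    6: (70, 21),    # FORWARD+RIGHT (arc)
    7: (-21, -70),  # BACKWARD+LEFT (arc)
    8: (-70, -21),  # BACKWARD+RIGHT (arc)
}

def action_to_command(action: int, speed: int = 70) -> tuple[int, int]:
    left, right = BASE_COMMANDS[action]
    scale = speed / 70.0
    return round(left * scale), round(right * scale)
```
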
---

## Architecture

The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.

```
Layer                           Output Shape       Parameters
──────────────────────────────────────────────────────────────
Input                           (B, 3, 66, 200)    –

Conv2d(3→24, 5×5, stride=2)     (B, 24, 31, 98)    1,824
BatchNorm2d(24)                                    48
ELU                                                –

Conv2d(24→36, 5×5, stride=2)    (B, 36, 14, 47)    21,636
BatchNorm2d(36)                                    72
ELU                                                –

Conv2d(36→48, 5×5, stride=2)    (B, 48, 5, 22)     43,248
BatchNorm2d(48)                                    96
ELU                                                –

Conv2d(48→64, 3×3, stride=1)    (B, 64, 3, 20)     27,712
BatchNorm2d(64)                                    128
ELU                                                –

Conv2d(64→64, 3×3, stride=1)    (B, 64, 1, 18)     36,928
BatchNorm2d(64)                                    128
ELU                                                –

Dropout2d(0.15)                                    –

Flatten                         (B, 1152)          –
Dropout(0.35)                                      –
Linear(1152→64)                 (B, 64)            73,792
ELU                                                –
Dropout(0.35)                                      –
Linear(64→9)                    (B, 9)             585

──────────────────────────────────────────────────────────────
Total trainable parameters: ~206,000 (sum of the Parameters column)
Model file size: ~1–2 MB (.pth)
```

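A minimal PyTorch sketch that reproduces the table above, including the Kaiming (fan-out) initialization described under Design Decisions. Layer grouping and names are illustrative; the reference implementation lives in `model.py`:

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """PilotNet-style backbone with BatchNorm, ELU, and a 9-way classifier head."""
    def __init__(self, num_actions: int = 9):
        super().__init__()
        def block(cin, cout, k, s):
            return [nn.Conv2d(cin, cout, k, stride=s), nn.BatchNorm2d(cout), nn.ELU()]
        self.features = nn.Sequential(
            *block(3, 24, 5, 2), *block(24, 36, 5, 2), *block(36, 48, 5, 2),
            *block(48, 64, 3, 1), *block(64, 64, 3, 1),
            nn.Dropout2d(0.15),                      # spatial dropout on feature maps
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.35),
            nn.Linear(64 * 1 * 18, 64), nn.ELU(),    # 1152 -> 64 bottleneck
            nn.Dropout(0.35),
            nn.Linear(64, num_actions),
        )
        # He initialization (fan-out mode) for all conv and linear layers
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, 66, 200)
        return self.classifier(self.features(x))         # logits: (B, 9)
```

A quick shape check: `ActionNet()(torch.randn(1, 3, 66, 200)).shape` should give `torch.Size([1, 9])`.
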
### Design Decisions

- **BatchNorm after every conv layer** – stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU** – avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)** – drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout** – the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization** – all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)** – prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization; see the quick check after this list

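The smoothed target from the last bullet can be reproduced directly (a one-off check, not project code):

```python
import torch
import torch.nn.functional as F

# Label smoothing mixes the one-hot target with a uniform distribution:
# y = (1 - eps) * onehot + eps / K, with eps = 0.2 and K = 9 classes.
eps, K = 0.2, 9
onehot = F.one_hot(torch.tensor(0), K).float()   # a STOP frame (class 0)
smoothed = (1 - eps) * onehot + eps / K
print(smoothed)  # tensor([0.8222, 0.0222, 0.0222, ..., 0.0222])
```
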
---

## Preprocessing

The full pipeline from raw camera frame to model input:

```
Raw 800×600 BGR frame from ESP32-CAM
        │
        ▼
Crop top 40% of the image
(removes ceiling, sky, and upper walls)
        │
        ▼
Convert BGR → RGB
        │
        ▼
Resize to 200×66 pixels
(using INTER_AREA interpolation)
        │
        ▼
ToTensor → normalize to [0, 1] float32
        │
        ▼
Final shape: [batch, 3, 66, 200]
```

The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.

Why crop the top 40%? Because the camera is mounted on a low car pointing forward. The top portion of every frame shows ceiling, walls, or sky, none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.

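A sketch of what that function likely looks like, following the diagram above (the authoritative version is in `trainer.py`):

```python
import cv2
import numpy as np
import torch
from torchvision import transforms

def crop_and_resize(frame_bgr: np.ndarray) -> torch.Tensor:
    """Raw BGR camera frame -> normalized [3, 66, 200] float32 tensor."""
    h = frame_bgr.shape[0]
    cropped = frame_bgr[int(h * 0.40):, :]                 # drop the top 40%
    rgb = cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB)         # BGR -> RGB
    resized = cv2.resize(rgb, (200, 66),                   # cv2 takes (width, height)
                         interpolation=cv2.INTER_AREA)
    return transforms.ToTensor()(resized)                  # uint8 HWC -> float CHW in [0, 1]
```
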
---

## Training Configuration

| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 (decoupled weight decay) for regularization |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |

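A condensed sketch of how these settings fit together in a PyTorch loop. `train_dataset` (yielding `(image_tensor, label)` pairs) and `train_labels` (a LongTensor of all labels) are placeholders, and the epoch count is an assumption; the real loop lives in `trainer.py`:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from model import ActionNet

model = ActionNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.2)

# Inverse-frequency sampling so rare actions are drawn as often as common ones
counts = torch.bincount(train_labels, minlength=9).clamp(min=1)
sample_weights = (1.0 / counts)[train_labels].double()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(train_labels))
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

epochs = 100  # assumed cap; early stopping (patience 30 on val accuracy) usually ends sooner
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(loader),
    pct_start=0.1, anneal_strategy="cos", div_factor=10,
)

for images, labels in loader:            # one epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                     # OneCycleLR steps once per batch
```
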
### Data Augmentation

Applied on the fly during training:

| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3×3 or 5×5 |
| Random translation | 40% | Shift ±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |

The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.

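A sketch of that mirror table, derived from the 9-action table earlier (the actual table lives in the training code):

```python
import random
import cv2

# Flipping the frame left-right turns LEFT-type actions into RIGHT-type ones:
# 3<->4, 5<->6, 7<->8; STOP, FORWARD, and BACKWARD (0, 1, 2) are symmetric.
MIRROR_ACTION = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}

def random_hflip(image, action, p=0.5):
    """Flip the frame and its label together so they stay consistent."""
    if random.random() < p:
        image = cv2.flip(image, 1)       # 1 = flip around the vertical axis
        action = MIRROR_ACTION[action]
    return image, action
```
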
---

## Inference

At runtime, the autopilot module:

1. Reads the latest camera frame from the MJPEG stream
2. Runs `crop_and_resize()` → converts to tensor
3. Forward pass through ActionNet → gets 9 logits
4. Applies softmax → picks the action with highest probability
5. Uses a 3-frame majority vote to smooth out flickering predictions (see the sketch below)
6. Maps the smoothed action to (left, right) motor commands at the configured speed
7. Sends the command to the ESP8266 over WebSocket

The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.

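A minimal sketch of the 3-frame majority vote from step 5 (a plausible implementation, not a copy of `autopilot.py`):

```python
from collections import Counter, deque

class ActionSmoother:
    """Majority vote over the last N predicted actions to suppress flicker."""
    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, action: int) -> int:
        self.history.append(action)
        # Most common action in the window; ties resolve by first occurrence
        return Counter(self.history).most_common(1)[0][0]

# Usage: smoother = ActionSmoother(); stable = smoother.update(raw_action)
```
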
---

## Hardware Requirements

This model is designed for a specific hardware setup:

| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |

The PC does all the heavy lifting. The microcontrollers are just I/O: one for video, one for motors. Total hardware cost is around $25–30 USD.

---

## How to Use This Model

### Quickstart

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load the checkpoint
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Prepare a 66x200 RGB image as a tensor
# (your_66x200_rgb_image is a placeholder: a 66x200 RGB PIL image or HxWxC
# uint8 array, i.e. a frame that has already been cropped and resized)
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to a motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```

### Within the Full System

The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."

See the full [README](https://github.com/YOUR_USERNAME/openbot-pc-server-project) for step-by-step instructions including hardware assembly, firmware upload, and data collection.

---

## Training Your Own Model

1. Assemble the hardware (ESP8266 + ESP32-CAM + motors)
2. Flash firmware to both microcontrollers
3. Start the PC server: `python app.py`
4. Drive the car manually while recording data
5. Click "Train" in the dashboard, or trigger training through the API
6. The best checkpoint saves automatically to `trained_models/autopilot.pth`

Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. A GPU is used if available but not required.

---

## Limitations

- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.

---

## Citation

If you use this project in your work, a mention is appreciated but not required:

```
OpenBot PC Server Project – Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
```

---

## License

MIT License: use it, modify it, ship it.