lordhet committed (verified) · Commit 39b73db · Parent: 1bfbba4

Update README.md

Files changed (1): README.md (+281 −3)
---
license: mit
language:
- en
tags:
- autonomous-driving
- self-driving-car
- robotics
- imitation-learning
- behavioral-cloning
- pilotnet
- esp8266
- esp32-cam
- pytorch
- classification
library_name: pytorch
pipeline_tag: image-classification
model-index:
- name: ActionNet
  results: []
---

# ActionNet – Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning: a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.

Part of the [OpenBot PC Server Project](https://github.com/loki-smip/openbot-pc-server-projuct-).

---

## Model Description

ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data only contains a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average everything toward zero.

- **Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)
- **Output:** probability distribution over 9 driving actions

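To make that failure mode concrete, here is a toy illustration (hypothetical numbers, not from this project's data) of why MSE collapses on discrete keyboard labels:

```python
import torch

# The same visual situation is labeled TURN LEFT (-49) on some frames and
# TURN RIGHT (+49) on others, as happens with discrete keyboard commands.
# The constant prediction that minimizes MSE is the mean of the targets:
targets = torch.tensor([-49.0, 49.0, -49.0, 49.0])
pred = targets.mean()                     # 0.0 -> "go straight"
mse = ((targets - pred) ** 2).mean()      # loss is minimized, behavior is wrong
print(pred.item(), mse.item())            # 0.0 2401.0

# A classifier instead keeps LEFT and RIGHT as two separate high-probability
# classes and never invents the undemonstrated in-between action.
```
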
### The 9 Actions

| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |

Motor values are shown at speed=70 and scale proportionally with the speed setting.

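The table as code, sketched under the assumption that this is what `action_to_command()` in `model.py` does (the helper name comes from the Quickstart below; the body here is an illustration):

```python
# Base (left, right) motor pairs are defined at speed=70 and scaled linearly
# for other speed settings.
BASE_COMMANDS = {
    0: (0, 0),      # STOP
    1: (70, 70),    # FORWARD
    2: (-70, -70),  # BACKWARD
    3: (-49, 49),   # TURN LEFT (pivot in place)
    4: (49, -49),   # TURN RIGHT (pivot in place)
    5: (21, 70),    # FORWARD+LEFT (arc)
    6: (70, 21),    # FORWARD+RIGHT (arc)
    7: (-21, -70),  # BACKWARD+LEFT (arc)
    8: (-70, -21),  # BACKWARD+RIGHT (arc)
}

def action_to_command(action: int, speed: int = 70) -> tuple[int, int]:
    left, right = BASE_COMMANDS[action]
    scale = speed / 70.0
    return round(left * scale), round(right * scale)
```
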
---

## Architecture

The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.

```
Layer                           Output Shape       Parameters
──────────────────────────────────────────────────────────────
Input                           (B, 3, 66, 200)    –

Conv2d(3→24, 5×5, stride=2)     (B, 24, 31, 98)    1,824
BatchNorm2d(24)                                    48
ELU                                                –

Conv2d(24→36, 5×5, stride=2)    (B, 36, 14, 47)    21,636
BatchNorm2d(36)                                    72
ELU                                                –

Conv2d(36→48, 5×5, stride=2)    (B, 48, 5, 22)     43,248
BatchNorm2d(48)                                    96
ELU                                                –

Conv2d(48→64, 3×3, stride=1)    (B, 64, 3, 20)     27,712
BatchNorm2d(64)                                    128
ELU                                                –

Conv2d(64→64, 3×3, stride=1)    (B, 64, 1, 18)     36,928
BatchNorm2d(64)                                    128
ELU                                                –

Dropout2d(0.15)                                    –

Flatten                         (B, 1152)          –
Dropout(0.35)                                      –
Linear(1152→64)                 (B, 64)            73,792
ELU                                                –
Dropout(0.35)                                      –
Linear(64→9)                    (B, 9)             585

──────────────────────────────────────────────────────────────
Total trainable parameters: ~206,000 (sum of the Parameters column)
Model file size: ~1–2 MB (.pth)
```

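A minimal PyTorch sketch that reproduces the table above, including the Kaiming (fan-out) initialization described under Design Decisions. Layer grouping and names are illustrative; the reference implementation lives in `model.py`:

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """PilotNet-style backbone with BatchNorm, ELU, and a 9-way classifier head."""
    def __init__(self, num_actions: int = 9):
        super().__init__()
        def block(cin, cout, k, s):
            return [nn.Conv2d(cin, cout, k, stride=s), nn.BatchNorm2d(cout), nn.ELU()]
        self.features = nn.Sequential(
            *block(3, 24, 5, 2), *block(24, 36, 5, 2), *block(36, 48, 5, 2),
            *block(48, 64, 3, 1), *block(64, 64, 3, 1),
            nn.Dropout2d(0.15),                      # spatial dropout on feature maps
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.35),
            nn.Linear(64 * 1 * 18, 64), nn.ELU(),    # 1152 -> 64 bottleneck
            nn.Dropout(0.35),
            nn.Linear(64, num_actions),
        )
        # He initialization (fan-out mode) for all conv and linear layers
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, 66, 200)
        return self.classifier(self.features(x))         # logits: (B, 9)
```

A quick shape check: `ActionNet()(torch.randn(1, 3, 66, 200)).shape` should give `torch.Size([1, 9])`.
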
### Design Decisions

- **BatchNorm after every conv layer** – stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU** – avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)** – drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout** – the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization** – all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)** – prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization; see the quick check after this list

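The smoothed target from the last bullet can be reproduced directly (a one-off check, not project code):

```python
import torch
import torch.nn.functional as F

# Label smoothing mixes the one-hot target with a uniform distribution:
# y = (1 - eps) * onehot + eps / K, with eps = 0.2 and K = 9 classes.
eps, K = 0.2, 9
onehot = F.one_hot(torch.tensor(0), K).float()   # a STOP frame (class 0)
smoothed = (1 - eps) * onehot + eps / K
print(smoothed)  # tensor([0.8222, 0.0222, 0.0222, ..., 0.0222])
```
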
---

## Preprocessing

The full pipeline from raw camera frame to model input:

```
Raw 800×600 BGR frame from ESP32-CAM
        │
        ▼
Crop top 40% of the image
(removes ceiling, sky, and upper walls)
        │
        ▼
Convert BGR → RGB
        │
        ▼
Resize to 200×66 pixels
(using INTER_AREA interpolation)
        │
        ▼
ToTensor → normalize to [0, 1] float32
        │
        ▼
Final shape: [batch, 3, 66, 200]
```

The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.

Why crop the top 40%? Because the camera is mounted on a low car pointing forward. The top portion of every frame shows ceiling, walls, or sky, none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.

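A sketch of what that function likely looks like, following the diagram above (the authoritative version is in `trainer.py`):

```python
import cv2
import numpy as np
import torch
from torchvision import transforms

def crop_and_resize(frame_bgr: np.ndarray) -> torch.Tensor:
    """Raw BGR camera frame -> normalized [3, 66, 200] float32 tensor."""
    h = frame_bgr.shape[0]
    cropped = frame_bgr[int(h * 0.40):, :]                 # drop the top 40%
    rgb = cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB)         # BGR -> RGB
    resized = cv2.resize(rgb, (200, 66),                   # cv2 takes (width, height)
                         interpolation=cv2.INTER_AREA)
    return transforms.ToTensor()(resized)                  # uint8 HWC -> float CHW in [0, 1]
```
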
---

## Training Configuration

| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 (decoupled weight decay) for regularization |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |

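A condensed sketch of how these settings fit together in a PyTorch loop. `train_dataset` (yielding `(image_tensor, label)` pairs) and `train_labels` (a LongTensor of all labels) are placeholders, and the epoch count is an assumption; the real loop lives in `trainer.py`:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from model import ActionNet

model = ActionNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.2)

# Inverse-frequency sampling so rare actions are drawn as often as common ones
counts = torch.bincount(train_labels, minlength=9).clamp(min=1)
sample_weights = (1.0 / counts)[train_labels].double()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(train_labels))
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

epochs = 100  # assumed cap; early stopping (patience 30 on val accuracy) usually ends sooner
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(loader),
    pct_start=0.1, anneal_strategy="cos", div_factor=10,
)

for images, labels in loader:            # one epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                     # OneCycleLR steps once per batch
```
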
### Data Augmentation

Applied on the fly during training:

| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3×3 or 5×5 |
| Random translation | 40% | Shift ±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |

The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.

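A sketch of that mirror table, derived from the 9-action table earlier (the actual table lives in the training code):

```python
import random
import cv2

# Flipping the frame left-right turns LEFT-type actions into RIGHT-type ones:
# 3<->4, 5<->6, 7<->8; STOP, FORWARD, and BACKWARD (0, 1, 2) are symmetric.
MIRROR_ACTION = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}

def random_hflip(image, action, p=0.5):
    """Flip the frame and its label together so they stay consistent."""
    if random.random() < p:
        image = cv2.flip(image, 1)       # 1 = flip around the vertical axis
        action = MIRROR_ACTION[action]
    return image, action
```
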
---

## Inference

At runtime, the autopilot module:

1. Reads the latest camera frame from the MJPEG stream
2. Runs `crop_and_resize()` → converts to tensor
3. Forward pass through ActionNet → gets 9 logits
4. Applies softmax → picks the action with highest probability
5. Uses a 3-frame majority vote to smooth out flickering predictions (see the sketch below)
6. Maps the smoothed action to (left, right) motor commands at the configured speed
7. Sends the command to the ESP8266 over WebSocket

The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.

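A minimal sketch of the 3-frame majority vote from step 5 (a plausible implementation, not a copy of `autopilot.py`):

```python
from collections import Counter, deque

class ActionSmoother:
    """Majority vote over the last N predicted actions to suppress flicker."""
    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, action: int) -> int:
        self.history.append(action)
        # Most common action in the window; ties resolve by first occurrence
        return Counter(self.history).most_common(1)[0][0]

# Usage: smoother = ActionSmoother(); stable = smoother.update(raw_action)
```
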
---

## Hardware Requirements

This model is designed for a specific hardware setup:

| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |

The PC does all the heavy lifting. The microcontrollers are just I/O: one for video, one for motors. Total hardware cost is around $25–30 USD.

---

## How to Use This Model

### Quickstart

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load the checkpoint
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Prepare a 66x200 RGB image as a tensor
# (your_66x200_rgb_image is a placeholder: a 66x200 RGB PIL image or HxWxC
# uint8 array, i.e. a frame that has already been cropped and resized)
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to a motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```

### Within the Full System

The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."

See the full [README](https://github.com/YOUR_USERNAME/openbot-pc-server-project) for step-by-step instructions including hardware assembly, firmware upload, and data collection.

---

## Training Your Own Model

1. Assemble the hardware (ESP8266 + ESP32-CAM + motors)
2. Flash firmware to both microcontrollers
3. Start the PC server: `python app.py`
4. Drive the car manually while recording data
5. Click "Train" in the dashboard, or trigger training through the API
6. The best checkpoint saves automatically to `trained_models/autopilot.pth`

Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. A GPU is used if available but not required.

---

## Limitations

- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.

---

## Citation

If you use this project in your work, a mention is appreciated but not required:

```
OpenBot PC Server Project – Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
```

---

## License

MIT License: use it, modify it, ship it.