omar-ah
/

vil-tracker

omar-ah committed 9 days ago

Commit

59fd921

verified ·

1 Parent(s): 7e7f067

Update README with full documentation

Files changed (1)

README.md +107 -41


@@ -16,17 +16,20 @@ A lightweight single-object tracker (SOT) using Vision-LSTM (ViL) as backbone, d
 ### Core Design
 - **Backbone**: Vision-LSTM (ViL-S) with 24 mLSTM blocks, bidirectional scanning
-- **Temporal Modulation**: FiLM (Feature-wise Linear Modulation) for temporal context
 - **Prediction Heads**: Center-based heatmap + size regression + offset refinement
 - **Uncertainty**: Aleatoric uncertainty estimation for adaptive tracking
 - **TMoE**: Temporal Mixture-of-Experts MLP in last 2 blocks
 ### Key Innovations
 1. **LinearHeadwiseExpand Q/K/V projections**: Block-diagonal projections (192×4×4 = 3K params each vs 589K for full linear), matching the official NX-AI ViL-S architecture
 2. **No separate MLP/FFN**: Following ViL-S, the gated output inside the mLSTM cell serves as the MLP (SwiGLU-style gating via proj_up → split → z-gate → proj_down)
 3. **Bidirectional scanning**: Even blocks L→R, odd blocks R→L via `torch.flip`
-4. **FiLM temporal modulation**: Replaces DTPTrack temporal tokens (broken in R→L scan) with channel-wise affine modulation
 5. **TMoE in last 2 blocks**: Dense routing with frozen shared expert + 4 specialized experts for temporal dynamics
 ### Constraint Compliance
@@ -73,44 +76,61 @@ Input x (B, S, D=384)
 ```
 ### Training Pipeline
-- **Phase 1** (300 epochs): Full supervised training with focal + GIoU + size losses, ACL curriculum
-- **Phase 2** (100 epochs): Fine-tuning with frozen shared TMoE experts, contrastive loss
-## File Structure
-```
-vil_tracker/
-├── models/
-│   ├── mlstm.py          # LinearHeadwiseExpand, mLSTMCell, mLSTMBlock, SwiGLUMLP
-│   ├── backbone.py        # ViLBackbone, PatchEmbed, TMoEMLP, mLSTMBlockWithTMoE
-│   ├── film_temporal.py   # FiLM modulation, TemporalReliabilityCalibrator
-│   ├── heads.py           # CenterHead, UncertaintyHead, decode_predictions
-│   └── tracker.py         # ViLTracker, build_tracker, get_default_config
-├── training/
-│   ├── losses.py          # FocalLoss, GIoULoss, UncertaintyNLLLoss, CombinedTrackingLoss
-│   └── train.py           # Phase 1/2 training, ACL curriculum, AMP
-├── data/
-│   └── dataset.py         # TrackingDataset with synthetic fallback, ACL difficulty
-├── inference/
-│   ├── kalman.py          # 8-state Kalman filter with adaptive noise
-│   └── online_tracker.py  # OnlineTracker inference pipeline
-├── evaluation/
-│   └── evaluate.py        # BenchmarkEvaluator for LaSOT/UAV123/DTB70/VisDrone
-├── utils/
-│   └── helpers.py         # count_parameters, estimate_flops, print_model_summary
-└── configs/
-    └── default.json       # Full configuration
-```
 ## Quick Start
 ```python
 from vil_tracker.models.tracker import build_tracker
-# Build model with default config (36.33M params)
 tracker = build_tracker()
-# Forward pass
 import torch
 template = torch.randn(1, 3, 128, 128)
 search = torch.randn(1, 3, 256, 256)
@@ -120,19 +140,65 @@ print(output['boxes'])    # (1, 4) predicted [cx, cy, w, h]
 print(output['scores'])   # (1,) confidence scores
 ```
-## References
-### Seed Papers
-- **UETrack**: arXiv:2603.01412 — Uncertainty-aware tracker
-- **SGLATrack**: arXiv:2503.06625 — Structure-guided attention tracking
-- **SUTrack**: arXiv:2412.19138 — Unified tracking framework
-### Architecture References
 - **Vision-LSTM (ViL)**: Alkin et al., arXiv:2406.04303
 - **xLSTM**: Beck et al., arXiv:2405.04517
-- **FiLM**: Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer"
-- **MCITrack**: Distillation teacher (B256 variant)
 ## License
-MIT

 ### Core Design
 - **Backbone**: Vision-LSTM (ViL-S) with 24 mLSTM blocks, bidirectional scanning
+- **Temporal Modulation**: FiLM (Feature-wise Linear Modulation) integrated BETWEEN backbone blocks
 - **Prediction Heads**: Center-based heatmap + size regression + offset refinement
 - **Uncertainty**: Aleatoric uncertainty estimation for adaptive tracking
 - **TMoE**: Temporal Mixture-of-Experts MLP in last 2 blocks
+- **Online Tracking**: Kalman filter with uncertainty-adaptive noise + confidence-based template update
 ### Key Innovations
 1. **LinearHeadwiseExpand Q/K/V projections**: Block-diagonal projections (192×4×4 = 3K params each vs 589K for full linear), matching the official NX-AI ViL-S architecture
 2. **No separate MLP/FFN**: Following ViL-S, the gated output inside the mLSTM cell serves as the MLP (SwiGLU-style gating via proj_up → split → z-gate → proj_down)
 3. **Bidirectional scanning**: Even blocks L→R, odd blocks R→L via `torch.flip`
+4. **FiLM temporal modulation**: Replaces DTPTrack temporal tokens (broken in R→L scan) with channel-wise affine modulation, integrated between backbone blocks (not post-hoc)
 5. **TMoE in last 2 blocks**: Dense routing with frozen shared expert + 4 specialized experts for temporal dynamics
+6. **ACL curriculum**: Progressive difficulty ramp-up (sample jitter + temporal gap + loss weighting)
+7. **8-state Kalman filter**: Chi-squared gating for outlier rejection, uncertainty-adaptive measurement noise
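The parameter figures quoted in item 1 can be checked with a short calculation. The sketch below assumes an inner dimension of 768 split into 192 blocks of size 4; these values are inferred from the 192×4×4 and 589K figures, not confirmed by the source. The helper names are hypothetical, and this is a plain-Python illustration rather than the repository's `LinearHeadwiseExpand`.

```python
# Sketch only: block sizes are inferred from the README figures (assumption).

def headwise_param_count(num_blocks: int, block_size: int) -> int:
    """Weights in a block-diagonal projection: one square matrix per block."""
    return num_blocks * block_size * block_size


def dense_param_count(dim: int) -> int:
    """Weights in a full (dense) square linear projection without bias."""
    return dim * dim


def headwise_project(x, weights):
    """Apply a block-diagonal projection to a flat vector.

    weights[b] is a block_size x block_size matrix (list of rows) that only
    sees the b-th chunk of x, so channels never mix across blocks.
    """
    block_size = len(weights[0])
    out = []
    for b, w in enumerate(weights):
        chunk = x[b * block_size:(b + 1) * block_size]
        for row in w:
            out.append(sum(r * c for r, c in zip(row, chunk)))
    return out


NUM_BLOCKS, BLOCK_SIZE = 192, 4
INNER_DIM = NUM_BLOCKS * BLOCK_SIZE  # 768

headwise_params = headwise_param_count(NUM_BLOCKS, BLOCK_SIZE)
dense_params = dense_param_count(INNER_DIM)

# Identity blocks leave the input unchanged, which makes the layout easy to verify.
identity = [[1.0 if i == j else 0.0 for j in range(BLOCK_SIZE)]
            for i in range(BLOCK_SIZE)]
x = [float(i) for i in range(INNER_DIM)]
y = headwise_project(x, [identity] * NUM_BLOCKS)

print(headwise_params, dense_params)
```

Under these assumptions the block-diagonal form needs 3,072 weights against 589,824 for a dense 768×768 projection, which matches the "3K vs 589K" comparison.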
 ### Constraint Compliance
 ```
 ### Training Pipeline
+- **Phase 1** (300 epochs): Full supervised training with focal + GIoU + size losses
+  - ACL curriculum: difficulty ramp 0→1 over 50 epochs (controls temporal gap, spatial jitter, loss weighting)
+  - FiLM temporal modulation activated after epoch 30
+  - Datasets: GOT-10k + LaSOT + TrackingNet + COCO (with synthetic fallback)
+- **Phase 2** (100 epochs): Fine-tuning with frozen shared TMoE experts
+  - Contrastive loss on template/search temporal features
+  - Optional AFKD distillation from MCITrack-B256 teacher
+  - FiLM temporal modulation always active
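The schedule described above can be sketched as a pair of small functions. The linear shape of the ramp, the strict "after epoch 30" comparison, and the `max_frame_gap` / `max_jitter` ceilings are all assumptions made for illustration; the README only states the 0→1 range, the 50-epoch ramp, and the epoch-30 FiLM switch.

```python
# Sketch only: ramp shape and the gap/jitter ceilings are hypothetical.

def acl_difficulty(epoch: int, ramp_epochs: int = 50) -> float:
    """Curriculum difficulty in [0, 1], assumed to rise linearly over the ramp."""
    return min(1.0, max(0.0, epoch / ramp_epochs))


def film_active(epoch: int, phase: int, warmup_epochs: int = 30) -> bool:
    """Phase 1 enables FiLM after the warmup; Phase 2 keeps it always on."""
    if phase == 2:
        return True
    return epoch > warmup_epochs


def phase1_settings(epoch: int, max_frame_gap: int = 100, max_jitter: float = 0.5):
    """Derive per-epoch sampling settings from the difficulty value."""
    d = acl_difficulty(epoch)
    return {
        "difficulty": d,
        # Temporal gap between template and search frames grows with difficulty.
        "frame_gap": 1 + int(d * (max_frame_gap - 1)),
        # Spatial jitter applied to the search crop grows with difficulty.
        "jitter": d * max_jitter,
        "film": film_active(epoch, phase=1),
    }


for epoch in (0, 25, 31, 50):
    print(epoch, phase1_settings(epoch))
```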
+### Loss Functions
+- **FocalLoss**: Center heatmap prediction (CornerNet-style, handles 1/256 positive ratio)
+- **GIoULoss**: Bounding box regression
+- **L1Loss**: Size regression
+- **UncertaintyNLLLoss**: Uncertainty-aware regression
+- **MemoryContrastiveLoss**: Temporal feature consistency (Phase 2)
+- **AFKDDistillationLoss**: Attention-free knowledge distillation (optional teacher)
+- **ADWLoss**: Adaptive dynamic weighting (homoscedastic uncertainty)
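To make the "1/256 positive ratio" concrete, here is a minimal pure-Python sketch of the penalty-reduced focal loss from CornerNet on a 16×16 heatmap with a single positive cell. The exponents α=2 and β=4 are the CornerNet defaults; the Gaussian target, the heatmap size, and the function names are assumptions for illustration and may differ from the repository's `FocalLoss`.

```python
import math


def gaussian_heatmap(size: int, cx: int, cy: int, sigma: float = 1.0):
    """Target heatmap: exactly 1.0 at the object center, Gaussian falloff elsewhere."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(size)] for y in range(size)]


def cornernet_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss, normalized by the number of positive cells."""
    pos_loss, neg_loss, num_pos = 0.0, 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            if t == 1.0:
                pos_loss += (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                # Cells near the center (t close to 1) are penalized less.
                neg_loss += (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    if num_pos == 0:
        return -neg_loss
    return -(pos_loss + neg_loss) / num_pos


target = gaussian_heatmap(16, 8, 8)
num_positive = sum(1 for row in target for v in row if v == 1.0)

uniform_pred = [[0.5] * 16 for _ in range(16)]
loss_good = cornernet_focal_loss(target, target)    # prediction equals the target
loss_bad = cornernet_focal_loss(uniform_pred, target)
print(num_positive, loss_good < loss_bad)
```

Because the loss is divided by the number of positives rather than the number of cells, the 255 easy negatives do not swamp the single positive location.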
+### Inference Pipeline (OnlineTracker)
+1. Kalman filter predict → estimated position
+2. Crop search region (4x context) around prediction
+3. Model forward: template + search → heatmap + size + offset
+4. Decode predictions → candidate bounding box
+5. Map predictions back to frame coordinates
+6. Confidence check → update Kalman filter (with uncertainty-adaptive noise)
+7. Conditional template update (high confidence, every 10th frame)
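Steps 2 and 5 above amount to a coordinate transform between the full frame and the resized search crop. The sketch below assumes that "4x context" means a square crop whose side is 4 × sqrt(w·h), that the crop is resized to 256 pixels (the search size used in the Quick Start), and that decoded boxes are `[cx, cy, w, h]` in crop pixels. None of these details is confirmed by the source, and the function names are hypothetical.

```python
import math


def search_region(cx, cy, w, h, context_factor=4.0):
    """Square crop centered on the Kalman-predicted position.

    Returns (x0, y0, side) in frame pixels. The side length rule is an assumption.
    """
    side = context_factor * math.sqrt(w * h)
    return cx - side / 2.0, cy - side / 2.0, side


def crop_to_frame(box_crop, region, crop_size=256):
    """Map a [cx, cy, w, h] box from resized-crop pixels back to frame pixels."""
    x0, y0, side = region
    scale = side / crop_size  # frame pixels per crop pixel
    cx, cy, w, h = box_crop
    return (x0 + cx * scale, y0 + cy * scale, w * scale, h * scale)


# A 20x20 target predicted at (100, 80) yields an 80-pixel crop.
region = search_region(100.0, 80.0, 20.0, 20.0)
# A box found at the crop center with size 64x64 (in crop pixels).
frame_box = crop_to_frame((128.0, 128.0, 64.0, 64.0), region)
print(region, frame_box)
```

A detection at the exact center of the crop maps back to the predicted position, so any offset from the crop center is the correction the model applies to the Kalman prediction.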
+## Dataset Support
+### Training Datasets
+- **GOT-10k**: `root/train/GOT-10k_Train_NNNNNN/` (10K sequences)
+- **LaSOT**: `root/{category}/{seq_name}/img/` + `groundtruth.txt` (1120 sequences)
+- **TrackingNet**: `root/TRAIN_N/frames/{video}/` + `anno/{video}.txt` (30K sequences)
+- **COCO**: Pseudo-sequences from detection annotations (static pair pretraining)
+- **Synthetic**: Colored rectangles on noise backgrounds (no external data needed)
+### Evaluation Datasets
+- **LaSOT** (test): 280 sequences, AUC metric
+- **UAV123**: 123 sequences at 123fps
+- **DTB70**: 70 drone tracking sequences
+- **VisDrone-SOT**: Drone-perspective tracking
 ## Quick Start
+### Build and Inspect Model
 ```python
 from vil_tracker.models.tracker import build_tracker
+from vil_tracker.utils.helpers import print_model_summary
 tracker = build_tracker()
+print_model_summary(tracker)
+```
+### Forward Pass
+```python
 import torch
 template = torch.randn(1, 3, 128, 128)
 search = torch.randn(1, 3, 256, 256)
 print(output['scores'])   # (1,) confidence scores
 ```
+### Online Tracking
+```python
+from vil_tracker.inference.online_tracker import OnlineTracker
+online = OnlineTracker(tracker, device='cuda')
+online.initialize(first_frame, init_bbox)
+for frame in video_frames[1:]:
+    bbox = online.track(frame)
+```
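The README describes `OnlineTracker` as pairing the Kalman filter with chi-squared gating and uncertainty-adaptive measurement noise. A minimal sketch of that decision rule follows, assuming a 4-dimensional measurement `[cx, cy, w, h]`, a diagonal innovation covariance, and an additive variance inflation; the repository's actual filter may use full covariance matrices and a different inflation rule. All names here are hypothetical.

```python
# 95% quantile of the chi-squared distribution with 4 degrees of freedom.
CHI2_95_4DOF = 9.488


def adaptive_measurement_noise(base_var, sigma, gain=1.0):
    """Inflate each measurement variance by the model's predicted std (assumed form)."""
    return [b + gain * s * s for b, s in zip(base_var, sigma)]


def mahalanobis_sq(z, z_pred, pred_var, meas_var):
    """Squared Mahalanobis distance with diagonal innovation covariance S = P + R."""
    return sum((zi - pi) ** 2 / (pv + mv)
               for zi, pi, pv, mv in zip(z, z_pred, pred_var, meas_var))


def accept_measurement(z, z_pred, pred_var, meas_var, threshold=CHI2_95_4DOF):
    """Gate: reject measurements that are implausibly far from the prediction."""
    return mahalanobis_sq(z, z_pred, pred_var, meas_var) <= threshold


z_pred = [100.0, 80.0, 20.0, 20.0]   # predicted measurement
pred_var = [4.0, 4.0, 1.0, 1.0]      # predicted measurement variance (diagonal of H P H^T)
base_var = [1.0, 1.0, 1.0, 1.0]      # baseline measurement noise

confident = adaptive_measurement_noise(base_var, [0.0, 0.0, 0.0, 0.0])
uncertain = adaptive_measurement_noise(base_var, [20.0, 20.0, 20.0, 20.0])

near = [102.0, 81.0, 20.0, 20.0]
far = [130.0, 80.0, 20.0, 20.0]

print(accept_measurement(near, z_pred, pred_var, confident))
print(accept_measurement(far, z_pred, pred_var, confident))
print(accept_measurement(far, z_pred, pred_var, uncertain))
```

With low predicted uncertainty the distant measurement fails the gate and would be treated as an outlier. With high predicted uncertainty the same measurement passes, but in a full filter it would receive a smaller Kalman gain and so move the state less.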
+### Training
+```python
+from vil_tracker.models.tracker import build_tracker, get_default_config
+from vil_tracker.data.dataset import build_tracking_dataset
+from vil_tracker.training.train import train_phase1, train_phase2
+config = get_default_config()
+model = build_tracker(config)
+dataset = build_tracking_dataset({
+    'got10k_root': '/data/GOT-10k',
+    'lasot_root': '/data/LaSOT',
+    'trackingnet_root': '/data/TrackingNet',
+})
+model = train_phase1(model, dataset, config, device='cuda',
+                     push_to_hub=True, hub_model_id='user/vil-tracker')
+model = train_phase2(model, dataset, config, device='cuda',
+                     push_to_hub=True, hub_model_id='user/vil-tracker')
+```
+### Evaluation
+```python
+from vil_tracker.inference.online_tracker import OnlineTracker
+from vil_tracker.evaluation.evaluate import BenchmarkEvaluator
+online = OnlineTracker(model, device='cuda')
+evaluator = BenchmarkEvaluator(online)
+results = evaluator.evaluate_dataset('/data/LaSOT', 'lasot')
+print(f"LaSOT AUC: {results['mean_seq_auc']:.3f}")
+```
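The `mean_seq_auc` key suggests a per-sequence success AUC averaged over sequences. The sketch below shows the standard success-plot computation (fraction of frames whose IoU exceeds each of 21 thresholds from 0 to 1, then the mean of that curve). Whether `BenchmarkEvaluator` uses exactly these thresholds or a strict comparison is an assumption, and the function names are hypothetical.

```python
def iou_cxcywh(a, b):
    """IoU of two boxes given as [cx, cy, w, h]."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Area under the success curve for one sequence."""
    ious = [iou_cxcywh(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    success = [sum(iou > t for iou in ious) / len(ious) for t in thresholds]
    return sum(success) / len(success)


def mean_sequence_auc(sequences):
    """Average the per-sequence AUC over (pred_boxes, gt_boxes) pairs."""
    aucs = [success_auc(pred, gt) for pred, gt in sequences]
    return sum(aucs) / len(aucs)


gt = [(50.0, 50.0, 20.0, 20.0), (55.0, 50.0, 20.0, 20.0)]
perfect = list(gt)
missed = [(200.0, 200.0, 20.0, 20.0), (200.0, 200.0, 20.0, 20.0)]

auc_perfect = success_auc(perfect, gt)
auc_missed = success_auc(missed, gt)
auc_mean = mean_sequence_auc([(perfect, gt), (missed, gt)])
print(auc_perfect, auc_missed, auc_mean)
```

With a strict `>` comparison, even a perfect tracker scores 20/21 rather than 1.0, because no IoU can exceed the final threshold of 1.0.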
+## Tests
+Run the full test suite (16 tests):
+```bash
+python test_all.py
+```
+## References
 - **Vision-LSTM (ViL)**: Alkin et al., arXiv:2406.04303
 - **xLSTM**: Beck et al., arXiv:2405.04517
+- **UETrack**: arXiv:2603.01412
+- **SGLATrack**: arXiv:2503.06625
+- **SUTrack**: arXiv:2412.19138
+- **FiLM**: Perez et al.
+- **MCITrack**: Distillation teacher
 ## License
+MIT