# Oculus 0.1 Architecture

## Overview
Oculus is a ~3.3B-parameter multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon and implemented in MLX.
## Architecture Components
### 1. DINOv3 Encoder (ViT-L/16)
- Model: DINOv3 ViT-L/16 (pretrained)
- Parameters: ~1.7B
- Input: 224×224 images
- Output: 197 tokens (1 CLS + 196 patches)
- Patch Grid: 14×14
- Feature Dimension: 1024D
- Capabilities: Universal vision backbone, dense prediction
### 2. SigLIP2 Encoder (SO400M)
- Model: SigLIP2 SO400M (pretrained)
- Parameters: ~400M
- Input: 384×384 images
- Output: 576 patch tokens
- Patch Grid: 24×24
- Feature Dimension: 1152D
- Capabilities: Vision-language understanding, fine-grained features
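Both token counts follow directly from the input resolution and a 16-pixel patch size; a quick arithmetic check in plain Python:

```python
# Patch-grid arithmetic for the two encoders (16-pixel patches assumed for both).
def patch_grid(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    side = image_size // patch_size
    return side, side * side

assert patch_grid(224) == (14, 196)   # DINOv3: 196 patches + 1 CLS token = 197
assert patch_grid(384) == (24, 576)   # SigLIP2: 576 patch tokens
```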
### 3. Feature Fusion
- Method: Concatenation
- Input: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- Output: 2176D per spatial location
- Note: SigLIP2 features resampled to 14×14 to match DINOv3
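The note above leaves the resampling method open; below is a minimal MLX sketch using nearest-neighbor index selection (bilinear interpolation or average pooling would be drop-in alternatives, and the function name is illustrative):

```python
import mlx.core as mx

def resample_tokens(tokens: mx.array, src_side: int, dst_side: int) -> mx.array:
    """Nearest-neighbor resample of patch tokens from an src_side x src_side grid
    to a dst_side x dst_side grid. tokens: (B, src_side * src_side, D)."""
    B, _, D = tokens.shape
    grid = tokens.reshape(B, src_side, src_side, D)
    # Nearest source cell for each destination cell (center-aligned sampling).
    idx = mx.floor((mx.arange(dst_side) + 0.5) * (src_side / dst_side)).astype(mx.int32)
    grid = mx.take(grid, idx, axis=1)  # resample rows:    (B, dst_side, src_side, D)
    grid = mx.take(grid, idx, axis=2)  # resample columns: (B, dst_side, dst_side, D)
    return grid.reshape(B, dst_side * dst_side, D)

# DINOv3 patches (B, 196, 1024) + SigLIP2 patches resampled to the same grid (B, 196, 1152).
dino = mx.random.normal((2, 196, 1024))
siglip = mx.random.normal((2, 576, 1152))
siglip_14 = resample_tokens(siglip, src_side=24, dst_side=14)   # (2, 196, 1152)
fused = mx.concatenate([dino, siglip_14], axis=-1)              # (2, 196, 2176)
print(fused.shape)
```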
### 4. Vision-Language Projector
- Type: 2-layer MLP with GELU
- Input: 2176D
- Hidden: 4352D
- Output: 1536D (LFM2.5 embedding dimension)
- Parameters: ~5M
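A minimal MLX sketch of the projector with the dimensions listed above (class and attribute names are illustrative, not the repository's):

```python
import mlx.core as mx
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP with GELU: 2176 -> 4352 -> 1536 (dimensions from the spec above)."""

    def __init__(self, in_dim: int = 2176, hidden_dim: int = 4352, out_dim: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x: mx.array) -> mx.array:
        return self.fc2(nn.gelu(self.fc1(x)))

projector = VisionProjector()
fused = mx.random.normal((2, 196, 2176))
print(projector(fused).shape)  # (2, 196, 1536)
```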
### 5. LFM2.5-1.2B Language Model
- Model: LFM2.5-1.2B-Base (pretrained)
- Parameters: ~1.2B
- Architecture: Hybrid transformer (full_attention + conv layers)
- Embedding Dimension: 1536D
- Depth: 16 layers
- Attention Heads: 24
- Vocab Size: 131072
- Context Length: 32768 tokens
- Why LFM2.5: 3x faster training, 2x faster inference than Qwen3 on CPU
### 6. Task-Specific Heads

All heads consume the fused 2176D patch features; a code sketch of two representative heads follows the individual specs below.
#### Segmentation Head
- Type: MLP
- Input: 2176D
- Hidden: 256D
- Output: num_classes (e.g., 150 for ADE20K)
- Output Shape: (batch, 14, 14, num_classes)
#### Classification Head
- Type: MLP
- Input: 2176D
- Hidden: 256D
- Output: num_classes (e.g., 1000 for ImageNet)
- Uses: CLS token from fused features
#### Detection Head
- Type: MLP
- Input: 2176D
- Hidden: 256D
- Outputs:
  - Class logits: (batch, 196, anchors, num_classes)
  - Box predictions: (batch, 196, anchors, 4)
#### OCR Head
- Type: CNN + MLP
- Input: 2176D
- Outputs:
  - Text logits: (batch, 14, 14, max_seq_len)
  - Geometry: (batch, 196, 4) [x, y, w, h]
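The heads share the same shallow MLP-over-fused-features pattern. Here is a sketch of the segmentation and detection variants in MLX; the GELU activation, 9 anchors, and 80 detection classes are assumptions, and only the input/output shapes come from the specs above:

```python
import mlx.core as mx
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """Fused patch features (B, 196, 2176) -> per-patch class logits (B, 14, 14, num_classes)."""

    def __init__(self, in_dim: int = 2176, hidden: int = 256, num_classes: int = 150):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)
        self.num_classes = num_classes

    def __call__(self, x: mx.array) -> mx.array:
        B = x.shape[0]
        logits = self.fc2(nn.gelu(self.fc1(x)))              # (B, 196, num_classes)
        return logits.reshape(B, 14, 14, self.num_classes)   # (B, 14, 14, num_classes)

class DetectionHead(nn.Module):
    """Fused patch features -> per-patch anchor class logits and box predictions."""

    def __init__(self, in_dim: int = 2176, hidden: int = 256,
                 num_anchors: int = 9, num_classes: int = 80):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.cls = nn.Linear(hidden, num_anchors * num_classes)
        self.box = nn.Linear(hidden, num_anchors * 4)
        self.num_anchors, self.num_classes = num_anchors, num_classes

    def __call__(self, x: mx.array) -> tuple[mx.array, mx.array]:
        B, N, _ = x.shape
        h = nn.gelu(self.fc(x))
        cls = self.cls(h).reshape(B, N, self.num_anchors, self.num_classes)  # (B, 196, 9, 80)
        box = self.box(h).reshape(B, N, self.num_anchors, 4)                 # (B, 196, 9, 4)
        return cls, box

fused = mx.random.normal((2, 196, 2176))
print(SegmentationHead()(fused).shape)  # (2, 14, 14, 150)
```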
## Model Flow

1. The input image is preprocessed at two resolutions: 224×224 for DINOv3 and 384×384 for SigLIP2.
2. DINOv3 produces 196 patch tokens on a 14×14 grid (1024D each); SigLIP2 produces 576 patch tokens on a 24×24 grid (1152D each).
3. The SigLIP2 tokens are resampled to the 14×14 grid and concatenated with the DINOv3 tokens, giving 2176D fused features per spatial location.
4. The fused features feed the task-specific heads: segmentation (14×14 class map), classification (class ID), detection (boxes + classes), and OCR (text + geometry).
5. In parallel, the vision projector maps the fused features to 1536D embeddings, which are passed to LFM2.5 (1.2B) to generate text for captioning and VQA.
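Step 5 is purely a sequence-level operation: the 196 projected vision tokens are placed ahead of the text tokens and the combined sequence runs through LFM2.5. A shape-only sketch (the embeddings are random stand-ins; whether Oculus prepends or interleaves image tokens is not specified above, so prepending is assumed):

```python
import mlx.core as mx

B, num_patches, d_model, text_len = 2, 196, 1536, 32

vision_embeds = mx.random.normal((B, num_patches, d_model))  # projector output
text_embeds = mx.random.normal((B, text_len, d_model))       # stand-in for LFM2.5 token embeddings

# Prepend the image tokens so the LM attends to them while decoding text.
inputs_embeds = mx.concatenate([vision_embeds, text_embeds], axis=1)
print(inputs_embeds.shape)  # (2, 228, 1536)
```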
## Parameter Count

| Component | Parameters |
|---|---|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | 5,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| Total | ~3,306,600,000 |
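Once the model is instantiated, the table can be cross-checked against the actual weights with a generic MLX parameter counter (the stand-in module below is for illustration only):

```python
import mlx.nn as nn
from mlx.utils import tree_flatten

def count_parameters(module: nn.Module) -> int:
    """Total number of scalars across all parameter arrays in a module."""
    return sum(v.size for _, v in tree_flatten(module.parameters()))

# Illustrative stand-in: a head-sized MLP. With the full Oculus model this
# should land close to the totals listed above.
head = nn.Sequential(nn.Linear(2176, 256), nn.GELU(), nn.Linear(256, 150))
print(f"{count_parameters(head):,}")  # roughly 0.6M for this stand-in
```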
## Training Strategy
### Stage 1: Connector Pretraining
- Freeze: All vision encoders, LFM2.5
- Train: Projector only
- Data: Image-caption pairs (CC3M, LAION)
- Goal: Align vision and language representations
- Batch Size: 8-16
- Learning Rate: 1e-3
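In MLX this freeze/train split comes down to freezing the whole module tree and unfreezing the projector before building the optimizer. A sketch, assuming the model exposes a `projector` attribute, `loss_fn` takes `(model, batch)`, and AdamW is the optimizer (all three are assumptions):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def setup_stage1(model: nn.Module) -> optim.Optimizer:
    """Stage 1: everything frozen except the projector."""
    model.freeze()               # recursively freezes all parameters
    model.projector.unfreeze()   # `projector` attribute name is illustrative
    return optim.AdamW(learning_rate=1e-3)

def train_step(model, optimizer, loss_fn, batch):
    # nn.value_and_grad differentiates only trainable (unfrozen) parameters,
    # so the gradients and optimizer state cover just the projector here.
    loss, grads = nn.value_and_grad(model, loss_fn)(model, batch)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    return loss
```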
### Stage 2: Head Training
- Freeze: Encoders, LFM2.5, Projector
- Train: Task heads only
- Data: Task-specific datasets
- Goal: Learn task-specific heads
- Batch Size: 8-16
- Learning Rate: 1e-3
### Stage 3: Full Fine-tuning
- Freeze: None
- Train: All components
- Data: Multi-task or specific task
- Goal: End-to-end optimization
- Learning Rate: 1e-5 (encoders), 1e-4 (heads)
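One simple way to realize the split learning rates is two optimizers, each fed the gradient subtree for its parameter group; this relies on `update()` accepting a gradient tree that mirrors only part of the model, which is the same mechanism used when parameters are frozen. The submodule names and the grouping of the projector and LM are assumptions:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

# Illustrative grouping by top-level submodule name.
ENCODER_KEYS = {"dino", "siglip", "lm"}

opt_encoders = optim.AdamW(learning_rate=1e-5)  # encoders + LM (assumed grouping)
opt_heads = optim.AdamW(learning_rate=1e-4)     # projector + task heads (assumed grouping)

def train_step(model, loss_fn, batch):
    loss, grads = nn.value_and_grad(model, loss_fn)(model, batch)
    # Split the gradient tree by top-level key; each optimizer only updates
    # the parameters present in the subtree it receives.
    enc_grads = {k: v for k, v in grads.items() if k in ENCODER_KEYS}
    head_grads = {k: v for k, v in grads.items() if k not in ENCODER_KEYS}
    opt_encoders.update(model, enc_grads)
    opt_heads.update(model, head_grads)
    mx.eval(model.parameters(), opt_encoders.state, opt_heads.state)
    return loss
```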
## Memory Requirements

| Mode | Memory |
|---|---|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |
## Why LFM2.5?
- 3x faster training than Qwen3 on CPU
- 2x faster decode/prefill than Qwen3 on CPU
- Optimized for edge deployment: the base LM runs in under 1 GB of memory
- Native MLX support
- Hybrid architecture: a mix of attention and convolution layers
## Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|---|---|---|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |
## Supported Tasks

| Task | Input | Output |
|---|---|---|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per pixel |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |
## Input/Output Shapes

| Input | Shape |
|---|---|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|---|---|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |