Orion Atlas 7B
Custom Mamba-2 Hybrid architecture built from scratch by Avery Palermini / Cendrix LLC.
Architecture
- Type: Mamba-2 Hybrid (SSM + Differential Attention)
- Parameters: ~7.1B transformer-equivalent (8.77B actual hybrid)
- Layers: 32 total
  - 25 Mamba-2 (SSD) layers
  - 7 Differential Attention layers at indices [3, 7, 11, 15, 19, 23, 27]
  - SwiGLU FFN after every layer (all 32)
- Attention heads: 32 Q / 8 KV (GQA 4:1)
- Attention type: Microsoft Differential Attention (ICLR 2025) -- full causal, no sliding window
- SSM: Mamba-2 / SSD, expand=2, d_state=128, d_conv=4
- FFN hidden dim: 14,336 (SwiGLU)
- Model dim: 4,096
- Normalization: RMSNorm
- Position encoding: RoPE (theta=500,000) on attention layers only; Mamba-2 is position-free
- Context window: 128K tokens (131,072)
- Tokenizer: Custom SentencePiece (32K vocab)
- Weight tying: embedding and output head shared
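The layer layout above can be sketched in a few lines of plain Python. This is only an illustration of the spec, not code from model_7b.py: the names build_layer_schedule, "mamba2", and "diff_attn" are hypothetical, the indices are assumed to be 0-based, and d_inner = expand * model dim follows the usual Mamba-2 convention rather than anything stated here.

```python
from collections import Counter

# Values copied from the architecture list above.
N_LAYERS = 32
ATTN_INDICES = [3, 7, 11, 15, 19, 23, 27]  # assumed 0-based
N_Q_HEADS, N_KV_HEADS = 32, 8
D_MODEL, SSM_EXPAND = 4096, 2


def build_layer_schedule(n_layers=N_LAYERS, attn_indices=ATTN_INDICES):
    """Return the (hypothetical) mixer name used at each layer index."""
    attn = set(attn_indices)
    return ["diff_attn" if i in attn else "mamba2" for i in range(n_layers)]


schedule = build_layer_schedule()
counts = Counter(schedule)

# Derived quantities implied by the spec.
gqa_group = N_Q_HEADS // N_KV_HEADS  # query heads sharing one KV head
d_inner = SSM_EXPAND * D_MODEL       # assumed Mamba-2 inner width

print(counts["mamba2"], counts["diff_attn"], gqa_group, d_inner)
```

Note that the listed indices follow i % 4 == 3 but stop at 27, so under this reading the final four layers (28 to 31) are all Mamba-2.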
Why Hybrid?
Hybrid SSM + attention designs have been validated by NVIDIA (arXiv:2406.07887), AI21 (Jamba), and Zyphra (Zamba):
- Mamba-2 SSM layers handle sequence modeling efficiently: O(T) in sequence length T, versus O(T^2) for self-attention
- Full-causal attention layers (7/32) provide global recall for long-context tasks
- Result: better perplexity than a pure transformer at the same parameter count
- 128K context without the quadratic cost of all-attention
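Keeping attention in only 7 of 32 layers also shrinks the KV cache at long context. The estimate below is a back-of-the-envelope sketch, not a measurement: it assumes a head dim of 4096 / 32 = 128, a bf16 cache (2 bytes per element), and batch size 1. The actual Differential Attention implementation may lay out its heads differently, and the fixed-size Mamba-2 recurrent state is not counted. The helper kv_cache_bytes is hypothetical.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence (the factor 2 is K plus V)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3
HEAD_DIM = 4096 // 32  # assumption: model dim / number of Q heads
SEQ_LEN = 131_072      # 128K context

hybrid = kv_cache_bytes(7, 8, HEAD_DIM, SEQ_LEN)     # 7 attention layers
all_attn = kv_cache_bytes(32, 8, HEAD_DIM, SEQ_LEN)  # hypothetical 32-layer all-attention model

print(f"hybrid:        {hybrid / GIB:.1f} GiB")    # 3.5 GiB
print(f"all-attention: {all_attn / GIB:.1f} GiB")  # 16.0 GiB
```

Under these assumptions the cache at full context is 3.5 GiB instead of 16 GiB, a reduction by the same 7/32 factor as the attention layer count.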
Design Goals
Built for agentic tasks: tool calling, structured JSON output, multi-step reasoning.
Part of the Orion Atlas model family (1B -> 3B -> 7B -> 14B -> 37B).
Files
- model_7b.py -- full architecture, pure PyTorch (no mamba-ssm package required)
- model.py -- original 1B Llama-style transformer (reference)
Training
- Pre-training: FineWeb-Edu, SlimPajama, StarCoder, OpenWebMath + custom datasets
- Fine-tuning: OpenClaw tool-calling traces (in progress)
- Framework: Custom PyTorch training loop
Status
Architecture released. Weights in training.
License
Apache 2.0