---
library_name: deepseek-moe
tags:
- mixture-of-experts
- transformers
- pytorch
- moe
- efficient-transformer
pipeline_tag: text-generation
language: en
license: apache-2.0
---

# DeepSeek MoE Implementation
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

*Note: This repository contains a modular implementation of the DeepSeek MoE architecture, not trained model weights.*

A clean, efficient implementation of DeepSeek's Mixture of Experts (MoE) architecture in PyTorch. This repository provides a simplified version of the architecture described in the DeepSeek paper, focusing on the core innovations that make their MoE approach unique.

This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the 'Related Implementations' section for the complete series.

<p align="center">
  <img src="./assets/moe_architecture.png" alt="DeepSeek MoE Architecture" width="600"/>
</p>

## Overview

Mixture of Experts (MoE) architectures enable dramatic scaling of model parameters while maintaining computational efficiency by activating only a subset of parameters for any given input. DeepSeek's approach introduces several key innovations to the MoE architecture that improve performance and efficiency.

Key features of this implementation:

- **Hybrid Expert Structure**: Combines shared experts (processing all tokens) with routed experts (processing specific tokens)
- **Efficient Top-K Routing**: Token-to-expert affinity calculation based on dot product similarity
- **Multi-Level Load Balancing**: Cascading auxiliary losses at expert, device, and communication levels
- **Device-Limited Routing**: Bounds communication costs in distributed training scenarios
- **Token Dropping Strategy**: Reduces computation by dropping tokens with low routing affinity

## Quick Start

```python
import torch
from moe import MixtureOfExperts

# Create input tensor
batch_size = 8
seq_length = 16
d_model = 512
inputs = torch.randn(batch_size, seq_length, d_model)

# Create MoE layer
moe = MixtureOfExperts(
    d_model=512,       # Input dimension
    d_expert=1024,     # Expert hidden dimension
    K=2,               # Top-K experts per token
    N_s=2,             # Number of shared experts
    N_r=8,             # Number of routed experts
    alpha1=0.01,       # Expert balance factor
    alpha2=0.01,       # Device balance factor 
    alpha3=0.01,       # Communication balance factor
    D=4,               # Number of devices
    M=3                # Device limit for routing
)

# Forward pass
outputs, expert_loss, device_loss, commu_loss = moe(inputs)
```
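
The MoE layer returns its three auxiliary balance losses separately so they can be added to the task loss during training. Below is a minimal training-step sketch; the placeholder objective is illustrative only, and it assumes the returned losses are already scaled by the `alpha1`/`alpha2`/`alpha3` factors passed to the constructor above.

```python
# Hypothetical training step continuing the snippet above (placeholder objective).
task_loss = outputs.pow(2).mean()   # stand-in for a real task loss (e.g. cross-entropy)
total_loss = task_loss + expert_loss + device_loss + commu_loss
total_loss.backward()
```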

## Architecture Details

For a detailed explanation of the architecture, see [architecture.md](insights/architecture.md).

### DeepSeek MoE Key Innovations

The DeepSeek MoE architecture introduces several elegant design choices:

1. **Hybrid Expert Structure**: Using both shared experts and routed experts with residual connections maintains global information flow while allowing for specialization.

2. **Token-Expert Affinity**: Calculating token-to-expert similarity through a dot product with expert centroids, similar to attention mechanisms (a minimal routing sketch follows this list).

3. **Multi-Level Balancing**: Cascading auxiliary losses that enforce balance at expert, device, and communication levels, creating a holistic approach to load distribution.

4. **Device-Limited Routing**: Constraining each token to experts on at most M devices to bound communication costs.
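
To make the affinity calculation described in item 2 concrete, here is a minimal, self-contained routing sketch. The function and variable names are illustrative assumptions, not the repository's API: each token's affinity to a routed expert is its dot product with that expert's centroid, a softmax converts affinities to gate scores, and only the top-K experts per token are kept.

```python
import torch
import torch.nn.functional as F

def route_tokens(x: torch.Tensor, centroids: torch.Tensor, K: int):
    """Illustrative top-K routing via dot-product affinity with expert centroids.

    x:         (batch, seq, d_model) token representations
    centroids: (N_r, d_model), one centroid per routed expert
    Returns (batch, seq, K) expert indices and their gate values.
    """
    affinity = torch.einsum("bsd,ed->bse", x, centroids)  # token-to-expert affinity scores
    scores = F.softmax(affinity, dim=-1)                  # normalise across routed experts
    gates, experts = scores.topk(K, dim=-1)               # keep the K highest-affinity experts
    return experts, gates

# Example: route a batch of tokens to the top-2 of 8 routed experts
experts, gates = route_tokens(torch.randn(1, 4, 512), torch.randn(8, 512), K=2)
```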

## Implementation Details

The implementation consists of two main classes:

### 1. Expert

A feed-forward network with two linear transformations and a ReLU activation in between.

```python
Expert(x) = max(0, xW1 + b1)W2 + b2
```
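
In PyTorch terms this is a small two-layer MLP. A minimal sketch is shown below; it is illustrative, and the repository's `Expert` class may differ in details.

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """Position-wise feed-forward expert: max(0, x @ W1 + b1) @ W2 + b2."""

    def __init__(self, d_model: int, d_expert: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_expert)   # x @ W1 + b1
        self.w2 = nn.Linear(d_expert, d_model)   # (...) @ W2 + b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))
```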

### 2. MixtureOfExperts

The main MoE implementation that:
- Combines shared and routed experts
- Calculates token-to-expert affinities
- Applies top-K routing
- Calculates auxiliary balance losses

```python
MoE(x) = x + ∑_i Expert^s_i(x) + ∑_i g_i(x; K) · Expert^r_i(x)
```
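
Here the gate g_i is nonzero only for each token's top-K routed experts. The snippet below sketches this combination in a deliberately simple, dense (mask-based) form with small illustrative dimensions; all names are assumptions, and the repository's `MixtureOfExperts` performs the same computation more efficiently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_expert, N_s, N_r, K = 16, 32, 2, 4, 2

def ffn():
    return nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(), nn.Linear(d_expert, d_model))

shared = nn.ModuleList(ffn() for _ in range(N_s))    # shared experts: process every token
routed = nn.ModuleList(ffn() for _ in range(N_r))    # routed experts: process selected tokens
centroids = torch.randn(N_r, d_model)                # expert centroids (learnable in practice)

x = torch.randn(3, 5, d_model)
scores = F.softmax(torch.einsum("bsd,ed->bse", x, centroids), dim=-1)  # token-expert affinity
topk_gates, topk_idx = scores.topk(K, dim=-1)                          # top-K routing

out = x + sum(e(x) for e in shared)                                    # residual + shared experts
for i, e in enumerate(routed):                                         # gated routed experts
    weight = (topk_gates * (topk_idx == i)).sum(-1, keepdim=True)      # gate where expert i was chosen
    out = out + weight * e(x)
```

In practice, tokens would be dispatched only to their selected experts rather than running every routed expert on every token as above; the dense form is just easier to read.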

## Testing

Unit tests are provided to verify the correct functioning of:
- Expert computations
- MoE routing mechanisms
- Load balancing losses
- Residual connections

Run the tests with:

```bash
python -m src.tests.test_moe
```

## Related Implementations

This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:

1. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)** (This Repository): Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.

2. **[DeepSeek Multi-head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)**: Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.

3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing transformer architecture with explanations of key components.

Together, these implementations cover the core architectural innovations described in the DeepSeek paper. By combining the MoE architecture with Multi-head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.

## Contributing

Contributions are welcome! Feel free to:
- Report bugs and issues
- Submit pull requests for improvements
- Add additional test cases
- Provide documentation clarifications

Please ensure all tests pass before submitting pull requests.


## Citation

If you use this implementation in your research, please cite:

```bibtex
@misc{deepseek-moe-2025,
  author = {Jen Wei},
  title = {DeepSeek MoE Implementation},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/bird-of-paradise/deepseek-moe}}
}
```

## License

This project is licensed under the Apache License 2.0.


## Acknowledgements

This implementation is inspired by the DeepSeek paper and other open-source MoE implementations:

- [DeepSeek](https://github.com/deepseek-ai)
- [Switch Transformers](https://arxiv.org/abs/2101.03961)
- [GShard](https://arxiv.org/abs/2006.16668)