---
license: apache-2.0
---

<h1 align="center">
    <img alt="Drop-Upcycling" src="images/drop-upcycling.png"><br>
<b>Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization</b><br>
</h1>

<p align="center">
  📄 <a href="https://openreview.net/forum?id=gx1wHnf5Vp">[Paper]</a> |
  🤗 <a href="https://huggingface.co/collections/llm-jp/drop-upcycling-674dc5be7bbb45e12a476b80">[Hugging Face]</a> |
  📁 <a href="https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3">[Dataset]</a> |
  💻 <a href="https://github.com/Taishi-N324/Drop-Upcycling">[Code]</a> |
  📊 <a href="https://wandb.ai/taishi-nakamura/Drop-Upcycling">[Log]</a>
</p>

# Model Index

We provide model checkpoints for all experiments to ensure reproducibility of the results presented in Tables 1 and 2.
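
As a starting point, the snippet below is a minimal loading sketch using the standard Hugging Face Transformers API; it assumes the checkpoint is stored in a Transformers-compatible format, and any repository ID from the tables below can be substituted for the one shown.

```python
# Minimal sketch: load a released checkpoint and generate a short continuation.
# Assumes a Transformers-compatible format; adjust dtype/device as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm-jp/Dense-152M"  # any repository ID from the tables below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Drop-Upcycling is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```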

## Table 1

|Model|Link|
|---|---|
|1 Dense 152M| [Link](https://huggingface.co/llm-jp/Dense-152M) |
|2 MoE FS 8x152M| [Link](https://huggingface.co/llm-jp/FS-8x152M) |
|3 MoE BTX 8x152M| [Link](https://huggingface.co/llm-jp/BTX-8x152M) |
|4 MoE NU 8x152M| [Link](https://huggingface.co/llm-jp/NU-8x152M) |
|5 MoE RNU (r=0.5) 8x152M| [Link](https://huggingface.co/llm-jp/RNU-0.5-8x152M) |
|6 MoE DU (r=0.5) 8x152M| [Link](https://huggingface.co/llm-jp/DU-0.5-8x152M) |
|7 MoE DU (r=1.0) 8x152M| [Link](https://huggingface.co/llm-jp/DU-1.0-8x152M) |
|8 Dense 1.5B| [Link](https://huggingface.co/llm-jp/Dense-1.5B) |
|9 MoE FS 8x1.5B| [Link](https://huggingface.co/llm-jp/FS-8x1.5B) |
|10 MoE BTX 8x1.5B| [Link](https://huggingface.co/llm-jp/BTX-8x1.5B) |
|11 MoE NU 8x1.5B| [Link](https://huggingface.co/llm-jp/NU-8x1.5B) |
|12 MoE RNU (r=0.5) 8x1.5B| [Link](https://huggingface.co/llm-jp/RNU-0.5-8x1.5B) |
|13 MoE DU (r=0.5) 8x1.5B| [Link](https://huggingface.co/llm-jp/DU-0.5-8x1.5B) |
|14 MoE DU (r=1.0) 8x1.5B| [Link](https://huggingface.co/llm-jp/DU-1.0-8x1.5B) |

## Table 2

|Model|Link|
|---|---|
|1 Dense 3.7B| [Link](https://huggingface.co/llm-jp/Dense-3.7B) |
|2 MoE FS 8x3.7B| [Link](https://huggingface.co/llm-jp/FS-8x3.7B) |
|3 MoE DU (r=0.5) 8x3.7B| [Link](https://huggingface.co/llm-jp/DU-0.5-8x3.7B) |
|4 Dense 13B| [Link](https://huggingface.co/llm-jp/Dense-13B) |
|5 Dense 3.7B| [Link](https://huggingface.co/llm-jp/llm-jp-3-3.7b) |

## BTX Experts

|Model|Link|
|---|---|
|Japanese expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-152M) |
|English expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-152M) |
|Code expert 152M| [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-152M) |
|Japanese expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-japanese-expert-1.5B) |
|English expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-english-expert-1.5B) |
|Code expert 1.5B| [Link](https://huggingface.co/llm-jp/Dense-btx-code-expert-1.5B) |

## How to cite

If you find our work helpful, please cite it as follows:

```bibtex
@inproceedings{
    nakamura2025dropupcycling,
    title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
    author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=gx1wHnf5Vp}
}
```