---
license: apache-2.0
tags:
- mixtral
- dense
- mistral
- expert
---

# Unmixtraled 22B 8x linear merge

> [!WARNING]
> This model outputs gibberish, as it was never trained in its dense configuration. Fine-tuning or merging is needed to make it useful.

This is a 22B Mistral model recycling weights from [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1).
The model was adapted from the Mixtral architecture to a dense Mistral architecture with the same number of layers, attention heads and hidden dimensions.
Embedding, attention, layer norm and LM head weights were taken directly from the 8x22B model, while the MLP weights are a linear merge of the weights of experts 0 to 7.
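
Since the two architectures share everything except the MoE blocks, the dense config can be derived mechanically from the Mixtral one. A minimal sketch using the `transformers` config classes (this is an illustration, not the exact script used for this model):

```python
# Sketch: derive a dense Mistral config from the Mixtral 8x22B config,
# keeping the same layer count, head counts and hidden dimensions.
from transformers import AutoConfig, MistralConfig

moe_cfg = AutoConfig.from_pretrained("mistral-community/Mixtral-8x22B-v0.1")

dense_cfg = MistralConfig(
    vocab_size=moe_cfg.vocab_size,
    hidden_size=moe_cfg.hidden_size,
    intermediate_size=moe_cfg.intermediate_size,
    num_hidden_layers=moe_cfg.num_hidden_layers,
    num_attention_heads=moe_cfg.num_attention_heads,
    num_key_value_heads=moe_cfg.num_key_value_heads,
    max_position_embeddings=moe_cfg.max_position_embeddings,
    rms_norm_eps=moe_cfg.rms_norm_eps,
    rope_theta=moe_cfg.rope_theta,
)
# The router (num_local_experts, num_experts_per_tok) is simply dropped.
```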

The following weight-name correspondence was used:

| Mistral weight | Mixtral weight                |
|----------------|-------------------------------|
| `gate_proj`    | `experts.{expert_num}.w1`     |
| `down_proj`    | `experts.{expert_num}.w2`     |
| `up_proj`      | `experts.{expert_num}.w3`     |
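
Concretely, the conversion copies every non-MoE tensor as-is and replaces each layer's eight expert MLPs with their mean. A hypothetical sketch in plain PyTorch, operating on an in-memory state dict (a real conversion of a model this size would need to stream shards instead):

```python
# Sketch of the expert-averaging step; tensor names follow the
# correspondence table above. This is an illustration, not the
# author's actual conversion script.
import torch

def lerp_experts(mixtral_sd: dict, num_layers: int, num_experts: int = 8) -> dict:
    """Average a Mixtral state dict's per-expert MLPs into dense Mistral MLPs."""
    dense_sd = {}
    for name, tensor in mixtral_sd.items():
        # Embeddings, attention, layer norms and lm_head share names between
        # Mixtral and Mistral, so they copy over unchanged. The router
        # (block_sparse_moe.gate) and expert weights are handled below.
        if "block_sparse_moe" not in name:
            dense_sd[name] = tensor
    mapping = {"w1": "gate_proj", "w2": "down_proj", "w3": "up_proj"}
    for layer in range(num_layers):
        for moe_name, dense_name in mapping.items():
            stacked = torch.stack([
                mixtral_sd[f"model.layers.{layer}.block_sparse_moe.experts.{e}.{moe_name}.weight"]
                for e in range(num_experts)
            ])
            # An equal-weight linear merge is just the mean over experts.
            dense_sd[f"model.layers.{layer}.mlp.{dense_name}.weight"] = stacked.mean(dim=0)
    return dense_sd
```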

This mergekit configuration was used to merge the experts:

```yaml
models:
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-0
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-1
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-2
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-3
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-4
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-5
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-6
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-7
merge_method: linear
dtype: float16
```
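
To reproduce the merge, the config above can be fed to mergekit, either via the `mergekit-yaml` CLI or through its Python API. A sketch of the latter, assuming the YAML is saved as `config.yaml` (option names may vary slightly across mergekit versions):

```python
# Sketch: run the linear merge with mergekit's Python API.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("config.yaml", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./Unmixtraled-22B-v0.1-lerp",  # hypothetical output path
    options=MergeOptions(copy_tokenizer=True, lazy_unpickle=True),
)
```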

## Unmixtraled models
| Expert | Source | Wikitext perplexity |
|--------|-----------------|---------------------|
| [Unmixtraled-22B-v0.1-expert-0](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-0) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 0 MLPs | 696.69 |
| [Unmixtraled-22B-v0.1-expert-1](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-1) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 1 MLPs | 6853.04 |
| [Unmixtraled-22B-v0.1-expert-2](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-2) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 2 MLPs | 4689.18 |
| [Unmixtraled-22B-v0.1-expert-3](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-3) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 3 MLPs | 782.38 |
| [Unmixtraled-22B-v0.1-expert-4](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-4) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 4 MLPs | 2844.94 |
| [Unmixtraled-22B-v0.1-expert-5](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-5) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 5 MLPs | 1099.32 |
| [Unmixtraled-22B-v0.1-expert-6](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-6) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 6 MLPs | 341.53 |
| [Unmixtraled-22B-v0.1-expert-7](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-7) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 7 MLPs | 2099.64 |
| [**Unmixtraled-22B-v0.1-lerp**](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-lerp) | **Mixtral 8x22B embed, attn, layernorm, lm_head + linear merge of expert 0-7 MLPs** | **1873.99** |
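
The card does not specify the exact evaluation script, but a standard sliding-window Wikitext-2 perplexity run along the following lines (after the widely used `transformers` recipe; context length and stride here are assumptions) would be one way to reproduce comparable numbers:

```python
# Sketch: stride-based perplexity on Wikitext-2, following the standard
# transformers recipe. Not the card author's exact setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thomasgauthier/Unmixtraled-22B-v0.1-lerp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 4096, 512  # assumed window and stride
nlls, n_tokens, prev_end = [], 0, 0
seq_len = encodings.input_ids.size(1)
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens newly scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlap so each token counts once
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss.float() * trg_len)
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())
```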