---
library_name: transformers
license: apache-2.0
---

# Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF layers, 15 dense FF layers

## Model Details

### Model Description

This is an MoE layer pruning experiment based on Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations. 15 of the MoE layers are merged into normal feed-forward layers (17/32 layers remain MoE), reducing the total parameter count from 47B to 14B. A rough sketch of this kind of merge is included at the end of this card. The indices of the pruned layers are:

```
[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
```

- **Developed by:** MistralAI, NousResearch, theblackcat
- **Model type:** Modified Mixtral architecture for dynamic MoE
- **License:** apache-2.0

### Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

This model is still at the experimental stage; we are still looking for the sweet spot that runs in just under 24 GB of memory with a 4-bit quantization config.

```python
import torch
from transformers import AutoTokenizer

# CustomMixtralForCausalLM is the modified Mixtral class shipped with this
# repository's custom modeling code (loaded via trust_remote_code=True).
# model_path should point to this repository (local path or Hub id).
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    trust_remote_code=True,
)

# Report the total parameter count in billions.
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

max_length = 100
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|><|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))

output = model.generate(
    input_ids,
    max_length=max_length,
    temperature=0.7,
    repetition_penalty=1.1,
    do_sample=True,
)
print(tokenizer.decode(output[0]))
```
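
Note that `load_in_4bit=True` is the legacy quantization flag; recent versions of `transformers` prefer passing an explicit `BitsAndBytesConfig`. A roughly equivalent setup, assuming the same `model_path` and custom class as above, might look like this:

```python
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute, aimed at the ~24 GB budget above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```

If the tokenizer ships a chat template, `tokenizer.apply_chat_template(...)` can also build the ChatML prompt instead of writing the `<|im_start|>` markers by hand.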
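
## MoE-to-Dense Merge (Sketch)

For reference, the snippet below is a minimal, illustrative sketch of one way an MoE block can be collapsed into a dense feed-forward layer, here by simply averaging the expert weights of the stock `transformers` Mixtral implementation. It is not the repository's actual merging code, and the exact recipe used to produce this checkpoint may differ; the attribute names (`block_sparse_moe`, `experts`, `w1`/`w2`/`w3`) follow `transformers`' `modeling_mixtral.py` and may change across versions.

```python
import torch
import torch.nn as nn

# Indices taken from the pruned-layer list above.
PRUNED_LAYERS = [3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]


class DenseFFN(nn.Module):
    """Dense replacement for a MixtralSparseMoeBlock, built by averaging its experts."""

    def __init__(self, moe_block):
        super().__init__()
        experts = list(moe_block.experts)
        ref = experts[0]
        dtype = ref.w1.weight.dtype
        self.w1 = nn.Linear(ref.w1.in_features, ref.w1.out_features, bias=False, dtype=dtype)
        self.w2 = nn.Linear(ref.w2.in_features, ref.w2.out_features, bias=False, dtype=dtype)
        self.w3 = nn.Linear(ref.w3.in_features, ref.w3.out_features, bias=False, dtype=dtype)
        self.act_fn = nn.SiLU()  # Mixtral experts use SiLU
        with torch.no_grad():
            for name in ("w1", "w2", "w3"):
                avg = torch.stack([getattr(e, name).weight for e in experts]).mean(dim=0)
                getattr(self, name).weight.copy_(avg)

    def forward(self, hidden_states):
        out = self.w2(self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states))
        # MixtralDecoderLayer expects (hidden_states, router_logits) from this block;
        # returning None for the router logits is fine for plain inference.
        return out, None


def merge_moe_layers(model, layer_indices=PRUNED_LAYERS):
    """Replace the MoE block of the given decoder layers with a dense FFN."""
    for idx in layer_indices:
        layer = model.model.layers[idx]
        layer.block_sparse_moe = DenseFFN(layer.block_sparse_moe)
    return model


# Example on a stock Mixtral checkpoint (not this repository):
# from transformers import AutoModelForCausalLM
# base = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
# )
# base = merge_moe_layers(base)
```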