---
library_name: transformers
license: apache-2.0
---

# Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw: 17 MoE FF layers, 15 dense FF layers

## Model Details

### Model Description

This is an MoE layer pruning experiment based on Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations. 15 of the MoE layers are merged into normal feed-forward layers (17/32 layers remain MoE), reducing the total parameter count from 47B to 14B. A rough sketch of this kind of merge is included at the end of this card. The indices of the pruned layers are:

```
[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
```

- **Developed by:** MistralAI, NousResearch, theblackcat
- **Model type:** Modified Mixtral architecture for dynamic MoE
- **License:** apache-2.0

### Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

This model is still at the experimental stage; we are still looking for the sweet spot that runs in just under 24 GB of memory with a 4-bit quantization config.

```python
import torch
from transformers import AutoTokenizer

# CustomMixtralForCausalLM is the modified Mixtral class shipped with this
# repository's custom modeling code (loaded via trust_remote_code=True).
# model_path should point to this repository (local path or Hub id).
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    trust_remote_code=True,
)

# Report the total parameter count in billions.
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

max_length = 100
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|><|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))

output = model.generate(
    input_ids,
    max_length=max_length,
    temperature=0.7,
    repetition_penalty=1.1,
    do_sample=True,
)
print(tokenizer.decode(output[0]))
```
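
Note that `load_in_4bit=True` is the legacy quantization flag; recent versions of `transformers` prefer passing an explicit `BitsAndBytesConfig`. A roughly equivalent setup, assuming the same `model_path` and custom class as above, might look like this:

```python
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute, aimed at the ~24 GB budget above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```

If the tokenizer ships a chat template, `tokenizer.apply_chat_template(...)` can also build the ChatML prompt instead of writing the `<|im_start|>` markers by hand.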
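
## MoE-to-Dense Merge (Sketch)

For reference, the snippet below is a minimal, illustrative sketch of one way an MoE block can be collapsed into a dense feed-forward layer, here by simply averaging the expert weights of the stock `transformers` Mixtral implementation. It is not the repository's actual merging code, and the exact recipe used to produce this checkpoint may differ; the attribute names (`block_sparse_moe`, `experts`, `w1`/`w2`/`w3`) follow `transformers`' `modeling_mixtral.py` and may change across versions.

```python
import torch
import torch.nn as nn

# Indices taken from the pruned-layer list above.
PRUNED_LAYERS = [3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]


class DenseFFN(nn.Module):
    """Dense replacement for a MixtralSparseMoeBlock, built by averaging its experts."""

    def __init__(self, moe_block):
        super().__init__()
        experts = list(moe_block.experts)
        ref = experts[0]
        dtype = ref.w1.weight.dtype
        self.w1 = nn.Linear(ref.w1.in_features, ref.w1.out_features, bias=False, dtype=dtype)
        self.w2 = nn.Linear(ref.w2.in_features, ref.w2.out_features, bias=False, dtype=dtype)
        self.w3 = nn.Linear(ref.w3.in_features, ref.w3.out_features, bias=False, dtype=dtype)
        self.act_fn = nn.SiLU()  # Mixtral experts use SiLU
        with torch.no_grad():
            for name in ("w1", "w2", "w3"):
                avg = torch.stack([getattr(e, name).weight for e in experts]).mean(dim=0)
                getattr(self, name).weight.copy_(avg)

    def forward(self, hidden_states):
        out = self.w2(self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states))
        # MixtralDecoderLayer expects (hidden_states, router_logits) from this block;
        # returning None for the router logits is fine for plain inference.
        return out, None


def merge_moe_layers(model, layer_indices=PRUNED_LAYERS):
    """Replace the MoE block of the given decoder layers with a dense FFN."""
    for idx in layer_indices:
        layer = model.model.layers[idx]
        layer.block_sparse_moe = DenseFFN(layer.block_sparse_moe)
    return model


# Example on a stock Mixtral checkpoint (not this repository):
# from transformers import AutoModelForCausalLM
# base = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
# )
# base = merge_moe_layers(base)
```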