---
library_name: transformers
tags:
- moe
- moah
- mod
license: apache-2.0
datasets:
- Locutusque/UltraTextbooks
language:
- en
---

# Model Card for MoM: Mixture of Mixture

## Model Details

### Model Description

MoM: Mixture of Mixture

This model is a test combining the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with 1.58-bit linear layers (**except for the attention layers**), mixture of attention heads, and mixture of depth.

The goal is to develop this kind of architecture and test whether it delivers fast inference without too much loss in quality.

Only 17.8M of the 1025M parameters are in bf16 precision, which is ~1.7% of the total number of parameters.
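As a quick check, the bf16 share follows directly from the two parameter counts. The footprint estimate in the second half is only a back-of-the-envelope sketch: it assumes that every non-bf16 weight is ternary and ideally packed at log2(3) bits, which is not stated in this card and which real storage formats will not reach exactly.

```python
import math

TOTAL_PARAMS = 1025e6  # total parameter count quoted in this card
BF16_PARAMS = 17.8e6   # parameters kept in bf16 precision

bf16_share = BF16_PARAMS / TOTAL_PARAMS
print(f"bf16 share: {bf16_share:.2%}")

# Assumption (not from the card): all remaining weights are ternary and
# stored at the information-theoretic minimum of log2(3) bits each.
BITS_PER_TERNARY = math.log2(3)
ternary_params = TOTAL_PARAMS - BF16_PARAMS

bf16_mb = BF16_PARAMS * 16 / 8 / 1e6
ternary_mb = ternary_params * BITS_PER_TERNARY / 8 / 1e6
print(f"bf16 weights:    ~{bf16_mb:.1f} MB")
print(f"ternary weights: ~{ternary_mb:.1f} MB (idealized packing)")
```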


- **Model type:** Mixture of attention heads, mixture of depth, and mixture of experts with 1.58-bit linear layers (**except for the attention layers**)
- **License:** Apache License 2.0
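To illustrate what a "1.58-bit" linear layer stores, here is a minimal sketch of ternary weight quantization. Each weight takes one of three values (-1, 0, +1), which carries log2(3) ≈ 1.58 bits of information. The sketch uses the absmean scaling scheme described for BitNet b1.58; whether the linked repository uses exactly this scheme is an assumption, and the function names are hypothetical.

```python
import math


def absmean_ternary_quantize(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1} with one shared scale."""
    # Scale by the mean absolute value of the weights (absmean).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map ternary values back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, -1.2, 0.4]
quantized, scale = absmean_ternary_quantize(weights)
print("ternary weights:", quantized)
print("shared scale:", scale)
print("bits per ternary value:", math.log2(3))
```

Only the ternary values and one scale per weight tensor need to be kept, which is where the potential inference speedup and memory saving come from.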

### Model Sources


- **Repository:** https://github.com/ostix360/optimized-LLM


## How to Get Started with the Model


To test this model, see the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/04cae61fb252a5927756c86ec0efde32d0dd3794).


## Training Details

  - **wandb**: [training details](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/68hieuwt)

### Training Data

We train this model on the first 100k examples of Locutusque/UltraTextbooks.
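A subset like this can be selected with the split-slicing syntax of the `datasets` library. The sketch below is one possible way to do it, not necessarily how the training script does it: the helper names are hypothetical, and it assumes the data lives in a split named `train`.

```python
def first_n_split(n, split="train"):
    """Build a split-slicing string such as 'train[:100000]'."""
    return f"{split}[:{n}]"


def load_first_examples(n=100_000, repo_id="Locutusque/UltraTextbooks"):
    """Load the first n examples of the dataset from the Hugging Face Hub."""
    # Imported lazily so first_n_split works without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset exposes a split named "train".
    return load_dataset(repo_id, split=first_n_split(n))
```

Calling `load_first_examples()` downloads the dataset on first use, so it needs network access.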

### Training Procedure

We use the 8-bit Adam optimizer with default beta and epsilon values.
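One common way to get an 8-bit Adam optimizer is `bitsandbytes`. The sketch below assumes that library, which this card does not confirm; check `train.py` in the repository for the actual setup. The helper name is hypothetical, and the learning rate is left as a required argument because this card does not state it.

```python
# Standard Adam defaults, spelled out for clarity.
DEFAULT_BETAS = (0.9, 0.999)
DEFAULT_EPS = 1e-8


def build_adam8bit(params, lr):
    """Create an 8-bit Adam optimizer over the given parameters."""
    # Imported lazily: bitsandbytes generally needs a CUDA-capable setup.
    import bitsandbytes as bnb

    # Assumption: bitsandbytes is the 8-bit Adam implementation in use.
    return bnb.optim.Adam8bit(params, lr=lr, betas=DEFAULT_BETAS, eps=DEFAULT_EPS)
```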

#### Preprocessing


The data is fitted to the model's maximum length, i.e. 512 tokens.
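This card does not say whether longer documents are truncated or split into several sequences, so the sketch below shows both options on a plain list of token ids. The function names are hypothetical and the token ids are dummy values.

```python
MAX_LENGTH = 512  # model maximum sequence length, from this card


def truncate(token_ids, max_length=MAX_LENGTH):
    """Keep only the first max_length tokens of a sequence."""
    return token_ids[:max_length]


def chunk(token_ids, max_length=MAX_LENGTH):
    """Split a sequence into consecutive windows of at most max_length tokens."""
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


token_ids = list(range(1300))  # dummy ids standing in for a tokenized document
print(len(truncate(token_ids)))
print([len(window) for window in chunk(token_ids)])
```

Truncation discards everything past the limit, while chunking keeps all tokens at the cost of a shorter final window.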


#### Training Hyperparameters

See the wandb metadata file or the `train.py` file in the repo for the hyperparameters.


## Technical Specifications

### Compute Infrastructure

#### Hardware

- One NVIDIA RTX 4070 Ti GPU

#### Software

- PyTorch, Transformers, etc.