---
library_name: transformers
tags:
- moe
- moah
- mod
license: apache-2.0
datasets:
- Locutusque/UltraTextbooks
language:
- en
---

# Model Card for Model ID

## Model Details

### Model Description

MoM: Mixture of Mixture

This model is a test that combines the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with 1.58-bit linear layers (**except for the attention layers**), mixture of attention heads, and mixture of depths.

The goal is to develop this kind of architecture and test whether it allows fast inference without too much quality loss.

Only 17.8M of the 1025M parameters are in bf16 precision, which is ~1.7% of the total number of parameters.

- **Model type:** Mixture of attention heads, mixture of depths, and mixture of experts, with 1.58-bit linear layers **except for the attention layers**
- **License:** Apache License 2.0

### Model Sources [optional]

- **Repository:** https://github.com/ostix360/optimized-LLM

## How to Get Started with the Model

If you want to test this model, please look at the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/04cae61fb252a5927756c86ec0efde32d0dd3794).

## Training Details

- **wandb**: [training detail](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/68hieuwt)

### Training Data

We use the first 100k examples of Locutusque/UltraTextbooks to train this model.

### Training Procedure

We use the 8-bit Adam optimizer with the default betas and epsilon values.

#### Preprocessing [optional]

The data is fitted to the model's maximum length, i.e. 512 tokens.

#### Training Hyperparameters

Please look at the wandb metadata file or the `train.py` file in the repository to see the hyperparameters.

## Technical Specifications [optional]

### Compute Infrastructure

#### Hardware

- One 4070 Ti GPU

#### Software

- PyTorch, transformers, etc.
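
To make the "1.58 bits" figure from the Model Description more concrete, here is a minimal, dependency-free sketch of ternary weight quantization. It assumes an absmean scheme in the style of BitNet b1.58 (scale by the mean absolute weight, round, then clip to {-1, 0, 1}); this card does not state which scheme the repository actually uses, so the function name `ternary_quantize` and the rounding details are illustrative assumptions only. The sketch also reproduces the bf16 parameter fraction quoted above.

```python
import math

# Each ternary weight takes one of 3 values, so it carries log2(3) bits,
# which is where the "1.58 bits" name comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternary_quantize(weights, eps=1e-8):
    """Quantize a 2-D list of floats to {-1, 0, 1} with one shared scale.

    ASSUMPTION: absmean scaling as in BitNet b1.58; the repository linked
    in this card may implement its 1.58-bit layers differently.
    """
    flat = [abs(w) for row in weights for w in row]
    # Shared scale: mean absolute value of the weights (eps avoids a zero scale).
    scale = max(sum(flat) / len(flat), eps)
    # Scale, round to the nearest integer, then clip into the ternary range.
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]
        for row in weights
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [[q * scale for q in row] for row in quantized]


# Toy weight matrix (hypothetical values, for illustration only).
example_weights = [[0.9, -0.05, -1.2], [0.4, 0.0, -0.6]]
q, scale = ternary_quantize(example_weights)

# Fraction of parameters kept in bf16, using the numbers from this card.
bf16_params = 17.8e6
total_params = 1025e6
bf16_fraction = bf16_params / total_params

print(q, scale)
print(f"{BITS_PER_TERNARY_WEIGHT:.3f} bits per ternary weight")
print(f"{bf16_fraction:.1%} of parameters in bf16")
```

In a real layer the quantization is typically applied inside the forward pass, with full-precision weights kept for the optimizer; that training-time detail is omitted here.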