Ostixe360 / MoMv3-M-A-mixed-precision

	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- moe
	- moah
	- mod
	datasets:
	- Locutusque/UltraTextbooks
	---

	# Model Card for MoMv3-M-A-mixed-precision

	## Model Details

	### Model Description

	MoM: Mixture of Mixture

	This model is a first attempt at combining the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with mixture of attention heads (MoAH) and mixture of depths (MoD).
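
To make the mixture-of-depths idea concrete, here is a minimal sketch of one common formulation: a router scores every token, only the highest-scoring tokens go through the block, and the rest skip it via the residual path. It is not taken from the repository. The names `mixture_of_depths_step`, `router_scores`, `block`, and `capacity` are hypothetical, and plain floats stand in for token hidden states.

```python
from typing import Callable, List


def mixture_of_depths_step(
    tokens: List[float],
    router_scores: List[float],
    block: Callable[[float], float],
    capacity: int,
) -> List[float]:
    """Apply `block` only to the `capacity` highest-scoring tokens.

    Floats stand in for hidden states (an illustrative simplification).
    Tokens that are not selected are passed through unchanged.
    """
    k = min(capacity, len(tokens))
    # Indices of the k tokens with the largest router scores.
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:k])

    output = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Residual update, scaled by the router score of the token.
            output.append(x + router_scores[i] * block(x))
        else:
            # Skipped token: the block is bypassed entirely.
            output.append(x)
    return output
```

With a capacity below the sequence length, only part of the sequence pays for the block, which is where the inference savings are expected to come from.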

	The Mamba and attention layers are kept in bf16 precision; the rest of the model uses 1.58-bit precision.

	107M of the 1025M total parameters (about 10%) are in bf16 precision.
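
The arithmetic behind these figures can be checked with a short sketch. The parameter counts come from this card. The storage estimate is an idealized assumption (16 bits per bf16 parameter, 1.58 bits per remaining parameter, perfect packing, no scales or metadata), not a measurement of the actual checkpoint.

```python
# Parameter counts as stated in this model card, in millions.
BF16_PARAMS_M = 107
TOTAL_PARAMS_M = 1025


def bf16_fraction(bf16_m: float, total_m: float) -> float:
    """Fraction of all parameters that are kept in bf16."""
    return bf16_m / total_m


def ideal_weight_megabytes(bf16_m: float, total_m: float, low_bits: float = 1.58) -> float:
    """Idealized weight storage in MB (10**6 bytes).

    Assumes 16 bits per bf16 parameter and `low_bits` bits for every other
    parameter, with perfect packing (an assumption, not a measurement).
    """
    low_m = total_m - bf16_m
    total_megabits = bf16_m * 16 + low_m * low_bits
    return total_megabits / 8


if __name__ == "__main__":
    print(f"bf16 share: {bf16_fraction(BF16_PARAMS_M, TOTAL_PARAMS_M):.1%}")
    print(f"ideal size: {ideal_weight_megabytes(BF16_PARAMS_M, TOTAL_PARAMS_M):.0f} MB")
```

Under these assumptions the bf16 share is about 10.4%, and the weights would take roughly 395 MB if packed ideally.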

	The goal is to develop this kind of architecture and test whether it allows fast inference without too much quality loss.


	- Model type: Mixture of attention heads, mixture of depths, and mixture of experts, with 1.58-bit linear layers for the MLPs
	- License: Apache License 2.0
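
For readers unfamiliar with 1.58-bit layers: the name comes from ternary weights, since log2(3) is about 1.58. The sketch below assumes the absmean scheme described for BitNet b1.58 (scale by the mean absolute weight, round, clip to -1, 0, +1). The quantizer actually used in the repository may differ, and the function names here are hypothetical.

```python
import math
from typing import List, Tuple

# Three possible values per weight carry log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def absmean_ternary(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Quantize weights to {-1, 0, +1} with an absmean scale.

    Returns the ternary weights and the scale needed to dequantize them.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map ternary weights back to approximate real values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    q, s = absmean_ternary([0.9, -0.05, -1.2, 0.3])
    print(q, s, dequantize(q, s))
```

Small weights collapse to 0 and large ones saturate at ±1, so a matrix multiply reduces to additions and subtractions plus one rescale.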

	### Model Sources


	- Repository: https://github.com/ostix360/optimized-LLM


	## How to Get Started with the Model


	To test this model, use the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/d266bc404346b71ea237c0744be0f8928f6b3217).


	## Training Details

	- wandb: [training details](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/wtoujazq)

	### Training Data

	We trained this model on the first 100k examples of Locutusque/UltraTextbooks.
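
A subset like this can be selected with the split-slicing syntax of the `datasets` library. This is a sketch, not the loading code from the repository: the split name `train` is an assumption, and `first_n_split` and `load_first_examples` are hypothetical helpers.

```python
def first_n_split(n: int, split: str = "train") -> str:
    """Build a split-slicing string such as 'train[:100000]'."""
    if n <= 0:
        raise ValueError("n must be positive")
    return f"{split}[:{n}]"


def load_first_examples(n: int = 100_000):
    """Load the first `n` examples of Locutusque/UltraTextbooks.

    Requires the `datasets` package and network access, so the import is
    done lazily inside the function. The 'train' split is an assumption.
    """
    from datasets import load_dataset

    return load_dataset("Locutusque/UltraTextbooks", split=first_n_split(n))
```

Calling `load_first_examples()` downloads the dataset on first use; nothing is fetched when the helpers are only defined.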

	### Training Procedure

	We use the 8-bit Adam optimizer with the default beta and epsilon values.
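
One way to set up such an optimizer is `bitsandbytes`, sketched below. Whether the repository uses `Adam8bit` or a variant such as `AdamW8bit` is not stated in this card, so treat the choice as an assumption. The defaults shown are the usual Adam defaults (betas of 0.9 and 0.999, epsilon of 1e-8). The learning rate is a hypothetical argument, since the card does not give one.

```python
from typing import Any, Dict

# Usual Adam defaults, assumed here to match "default betas and epsilon".
DEFAULT_BETAS = (0.9, 0.999)
DEFAULT_EPS = 1e-8


def adam8bit_kwargs(lr: float) -> Dict[str, Any]:
    """Keyword arguments for an 8-bit Adam optimizer with default betas and eps."""
    return {"lr": lr, "betas": DEFAULT_BETAS, "eps": DEFAULT_EPS}


def build_optimizer(params, lr: float):
    """Create a bitsandbytes 8-bit Adam optimizer.

    Requires the `bitsandbytes` package and a supported GPU, so the import
    is done lazily inside the function.
    """
    import bitsandbytes as bnb

    return bnb.optim.Adam8bit(params, **adam8bit_kwargs(lr))
```

Keeping optimizer state in 8 bits cuts its memory footprint substantially, which matters when training on a single consumer GPU.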

	#### Preprocessing


	The data is sized to fit the model's maximum length of 512 tokens.
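
This card does not say whether longer documents are truncated, chunked, or filtered out. As one possible approach, the sketch below splits a tokenized document into consecutive chunks of at most 512 token ids. The helper `chunk_token_ids` is hypothetical and is not taken from the repository.

```python
from typing import List

MAX_LENGTH = 512  # model maximum length stated in this card


def chunk_token_ids(token_ids: List[int], max_length: int = MAX_LENGTH) -> List[List[int]]:
    """Split token ids into consecutive chunks of at most `max_length` ids."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


if __name__ == "__main__":
    chunks = chunk_token_ids(list(range(1200)))
    print([len(c) for c in chunks])
```

Truncation would instead keep only the first chunk of each document; chunking keeps all of the text at the cost of splitting documents at arbitrary points.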


	#### Training Hyperparameters

	See the wandb metadata or `train.py` in the repository for the hyperparameters.


	## Technical Specifications

	### Compute Infrastructure

	#### Hardware

	- One 4070 Ti GPU

	#### Software

	- PyTorch, transformers, etc.