
Hyperion-3.0-Mixtral-3x7B

Model Details

This is an experimental first attempt at creating a Mixture of Experts (MoE) language model by combining several Mistral expert models. The model uses hyperion-3.0-beta as its base architecture, with a bfloat16 output dtype. The gate mode is set to hidden, and two experts are consulted per token (experts_per_token: 2).
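
As a rough illustration of what experts_per_token: 2 means, the sketch below scores each expert with a linear gate computed from the token's hidden state, keeps the top two scores, and mixes those two experts' outputs with softmax weights. This is a minimal standalone sketch, not code from this repository; the class name TopTwoMoELayer and all dimensions are hypothetical.

```python
# Illustrative sketch of top-2 routing over three experts; names and
# dimensions are hypothetical, not taken from the released checkpoint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 3, experts_per_token: int = 2):
        super().__init__()
        self.experts_per_token = experts_per_token
        # One gate score per expert, computed from the token's hidden state.
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.SiLU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim)
        scores = self.gate(hidden_states)                         # (num_tokens, num_experts)
        top_scores, top_idx = scores.topk(self.experts_per_token, dim=-1)
        weights = F.softmax(top_scores, dim=-1)                   # renormalize over the two chosen experts
        output = torch.zeros_like(hidden_states)
        for slot in range(self.experts_per_token):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                      # tokens whose slot-th choice is expert e
                if mask.any():
                    output[mask] += weights[mask, slot].unsqueeze(-1) * expert(hidden_states[mask])
        return output
```

In a Mixtral-style architecture this routing happens inside each MoE feed-forward block rather than as a single standalone layer, but the top-2 selection and weighted mixing follow the same idea.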

The model incorporates three expert models:

  1. hyperion-3.0-beta: Focused on science, math, and coding tasks
  2. dibt-mistral-7b: Handles open-ended questions, summarization, and stream-of-consciousness writing
  3. rp-mistral-7b: Specializes in roleplaying and character-based conversations

Each expert is paired with a set of positive and negative prompts that guide routing toward its specialization.
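
The gating details for this merge are not published, so the following is only one plausible sketch of how per-expert positive and negative prompts could be turned into router weights under a hidden gate mode: hidden states of the prompts are averaged with the base model, and the difference between the positive and negative averages forms that expert's gate row. The base model id, the example prompts, and the averaging scheme are all assumptions made for illustration.

```python
# Hypothetical sketch: deriving one expert's router weight vector from its
# positive/negative prompts via hidden-state averages of a base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-v0.1"  # stand-in for the actual base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model.eval()

def mean_hidden_state(prompts):
    """Average the final-layer hidden state over all tokens of all prompts."""
    reps = []
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
        reps.append(hidden.mean(dim=(0, 1)))   # (hidden_dim,)
    return torch.stack(reps).mean(dim=0)

# Placeholder prompts for a science/math/coding expert.
positive = ["Solve the following calculus problem step by step.",
            "Write a Python function that parses a CSV file."]
negative = ["Write a short story about a dragon.",
            "Roleplay as a ship's captain in the 1700s."]

# The resulting vector points toward the expert's positive prompts and away from
# its negative ones; one such row per expert could form the router's weight matrix.
gate_row = mean_hidden_state(positive) - mean_hidden_state(negative)
```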

Intended Use and Limitations

This MoE model is an early prototype and may not exhibit optimal performance. It is intended for research and experimentation purposes only, and should not be used in production environments or for critical applications.
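
For experimentation, the model can be loaded through the standard transformers path. The snippet below assumes a GPU with enough memory for roughly 18.5B parameters in bfloat16 and the accelerate package for device_map="auto"; the prompt is only an example.

```python
# Minimal loading and generation example for experimentation purposes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Locutusque/Hyperion-3.0-Mixtral-3x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the card's BF16 tensor type
    device_map="auto",            # requires the accelerate package
)

prompt = "Explain the difference between a list and a tuple in Python."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```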

Please note that the expert models mentioned in the configuration have not been publicly released yet. They are expected to be made available in the near future, at which point this MoE model can be fully instantiated and evaluated.

Training Details

The base model and experts were trained using QLoRA and supervised fine-tuning (SFT). However, the specific details of the training data, hyperparameters, and optimization techniques used for this MoE model are not available at this time.
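
For context, a typical QLoRA setup quantizes the frozen base weights to 4-bit and trains low-rank adapters on top. The sketch below shows such a setup with bitsandbytes and peft; all values (base model, rank, alpha, target modules) are placeholders, not the configuration actually used for these experts.

```python
# Generic QLoRA setup sketch; every value below is a placeholder, not the
# configuration used to train this model's experts.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_id = "mistralai/Mistral-7B-v0.1"  # stand-in for one of the experts
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # frozen base weights stored in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the card's bfloat16 dtype
)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                # placeholder hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)                 # only the LoRA adapters are trainable
model.print_trainable_parameters()

# Supervised fine-tuning would then proceed with a standard trainer (e.g. trl's
# SFTTrainer) on instruction-style data; those details are not published here.
```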

Feedback and Future Updates

As this is an experimental model, feedback and suggestions are welcome. Future updates may include improvements to the gating mechanism, fine-tuning of the expert models, and the incorporation of additional experts to enhance the model's performance and breadth of knowledge.
