MoD. The readme is the same for both, with more detail below

Hey, I'm Lucas

I'm excited to share an early release of a project that has kept me busy for the last couple of weeks. Mixtral's release propelled me into a deep dive into MoEs.

With the release of Qwen1.5, I was curious to see how it would compare to Mixtral.

Coming from a background as an acting teacher and coach, I saw parallels between high-quality scripts' impact on performances and the importance of curating high-quality data for training models. This led me to explore data curation, especially for training Mixture of Experts (MoE) models. I looked into Teknium's OpenHermes dataset, Jon Durbin's collections on GitHub, and Eric Hartford's methods for achieving specific outcomes with models.

I curated a dataset, named Mixture of Data (MoD), from various sources, including Bagel, OpenHermes, and many more, totaling about 780,000 distinct ShareGPT conversations. This dataset aims to encourage MoE models to develop their own distinct experts.

After training Qwen1.5-7b on 100k random samples from MoD over four epochs and merging the fine-tuned model 8x, I used an approach utilizing a random gate, without specialized fine-tuning done to any of the 8 experts. The result was a model that initially made no sense, lacking a base model and clear guidance on expert usage.

Despite challenges, such as training interruptions via cuda errors with Runpod , the model showed promising adaptability to the rest of the MoD dataset, even with limited training (0.45/4 planned epochs were completed before my compute budget ran out). It performs comparably to Mixtral in (admittedly naive) preliminary reasoning tests.

These weeks have been incredibly rewarding and educational, thanks to the contributions of Jon Durbin, Maxime Labonne, Teknium, Eric Hartford, and Charles Goddard. Their work has made these technologies accessible and inspired my project. A special thank you to Teknium and Eric Hartford, who have been generous with their time - answering my questions with kindness and humility.

I am currently training a 2.0 model - that I expect to beat Mixtral on most benchmarks. Thank you for your interest and support. Let's push the boundaries of what's possible together.

Lucas

Crystalcareai
/

Qwen1.5-8x7b

Please note this is the model that accompanies the dataset; https://huggingface.co/datasets/Crystalcareai/MoD. The readme is the same for both, with more detail below

Hey, I'm Lucas

Dataset used to train Crystalcareai/Qwen1.5-8x7b