Crystalcareai committed
Commit dbcbc09
1 Parent(s): 3d3abb4

Update README.md

Files changed (1): README.md (+44 -33)
README.md CHANGED
@@ -1,54 +1,65 @@
- # GemMoE: Sharing Tools and Improved Base Models

- I'm excited to share the tools I used to create GemMoE and release improved base models for the community to explore and build upon.

- ## Updates to GemMoE-Beta-1

- GemMoE-Beta-1 will continue to serve as the repository for the `modeling_files` required to run the Mixture of Experts (MoE) models. However, I will be removing the PyTorch files from this repository.

- ## Sponsors

- I am currently looking for compute sponsors to continue the post-training of GemMoE and bring it to its full potential. I've reached the limit of what I can personally spend on this project. If you're interested in helping out, please reach out.

- ## New Models

- I'm introducing two new models:

- 1. [Crystalcareai/GemMoE-Base-Hidden](https://huggingface.co/Crystalcareai/GemMoE-Base-Hidden)
-    - This is a new MoE created using an improved method that I will explain below.
-    - It utilizes a hidden gate and shows strong potential.
-    - The model has not been altered and requires finetuning to reach its full potential.
-    - If you're looking to achieve great performance with relatively minimal training, this is an excellent starting point.

- 2. [Crystalcareai/GemMoE-Base-Random](https://huggingface.co/Crystalcareai/GemMoE-Base-Random)
-    - This model was created using the same merge method as GemMoE-Base-Hidden, but with a RANDOM gate.
-    - It selects the experts randomly during the merging process.
-    - With finetuning, the model learns to choose the appropriate experts naturally, potentially leading to better results than GemMoE-Base-Hidden.
-    - This method offers an intriguing mix between clown-car and Mixtral-style approaches (see the gate sketch below).

- The new merge method and modeling files also reduce VRAM usage, making the models easier to finetune.
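
To make the hidden-gate versus random-gate distinction concrete, here is a minimal, illustrative top-2 gate in PyTorch. This is not GemMoE's actual modeling code: the class name, the `expert_stats` tensor, and the initialization scheme are placeholder assumptions that merely stand in for "a gate seeded from hidden-state statistics" versus "a randomly initialized gate that finetuning must learn".

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchTop2Gate(nn.Module):
    """Illustrative top-2 gate; not GemMoE's actual modeling code."""

    def __init__(self, hidden_size: int, num_experts: int = 8,
                 expert_stats: Optional[torch.Tensor] = None):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        if expert_stats is not None:
            # "Hidden" gate: seed each expert's router row from hidden-state
            # statistics gathered for that expert (shape [num_experts, hidden_size]).
            # This is only a stand-in for the real initialization.
            with torch.no_grad():
                self.router.weight.copy_(expert_stats)
        else:
            # "Random" gate: keep a random init and let finetuning learn the routing.
            nn.init.normal_(self.router.weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor):
        logits = self.router(hidden_states)                  # [batch, seq, num_experts]
        weights, experts = torch.topk(logits, k=2, dim=-1)   # 2 experts per token
        weights = F.softmax(weights, dim=-1)                 # renormalize over the top-2
        return weights, experts
```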

- ## Training Experiences and Challenges

- I have successfully trained the models on a single A100 using QLoRA, although it required careful monitoring and posed some difficulties; there currently appears to be an issue with QLoRA and GemMoE. I observed better VRAM usage when finetuning with DoRA (no quantization) on 4 A6000 cards with DeepSpeed ZeRO-3.
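
For reference, a QLoRA-style single-GPU setup with transformers, peft, and bitsandbytes might look roughly like the sketch below. This is not the exact configuration I used; the checkpoint name, LoRA rank, and target modules are assumptions you should adjust for your own run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Crystalcareai/GemMoE-Base-Hidden"  # assumption: any GemMoE checkpoint

# 4-bit NF4 quantization so the base weights fit on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # required while GemMoE is in beta
)

# LoRA adapters on the attention projections; rank and targets are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```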

- ## Creating Your Own Merges

- You can create your own merges using my modified branch of mergekit:

- ```bash
- git clone -b gemmoe https://github.com/Crystalcareai/mergekit.git
- ```

- To create an exact replica of Crystalcareai/GemMoE-Base-Hidden, use the following command:

- ```bash
- mergekit-moe examples/gemmoe.yml ./merged --cuda --lazy-unpickle --allow-crimes
- ```

- Feel free to modify the `/examples/gemmoe.yml` file to customize the merge according to your preferences.

- Alternatively, you can use my modified lazymergekit available on Colab: [lazymergekit-Gemmoe](https://colab.research.google.com/drive/1WWxCE4NYvJNZkjFhkL79cf-dRc3xTpGn?usp=drive_link)

- Happy experimenting and building!

  -Lucas
 
+ ---
+ license: other
+ license_name: gemma-terms-of-use
+ license_link: https://ai.google.dev/gemma/terms
+ ---

+ <p align="center">
+   <a href="https://ibb.co/wgSH65b">
+     <img src="https://i.ibb.co/JtvLkDb/Gemmoe-Logo.png" alt="Gemmoe-Logo" border="0">
+   </a>
+ </p>

+ # Using GemMoE (Beta)

+ > **Note:** If you wish to use GemMoE while it is in beta, you must add the flag `trust_remote_code=True` in your training/inference config.
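
For example, a minimal inference setup in Python might look like this (the checkpoint name below is an assumption; substitute whichever GemMoE repository you intend to use):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Crystalcareai/GemMoE-Base-Hidden"  # or another GemMoE checkpoint

# trust_remote_code=True is required while GemMoE is in beta, since the custom
# MoE modeling files live in the model repository rather than in transformers.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```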
 
+ Please read the [howto.md](https://huggingface.co/Crystalcareai/GemMoE-Base-Random/blob/main/howto.md) below for a description of these newer models.

+ ## Creating Your Own GemMoE Model

+ I've shared instructions on how you can create your own GemMoE model. You can find the guide [here](https://huggingface.co/Crystalcareai/GemMoE-Base-Random/blob/main/howto.md).

+ # GemMoE: An 8x8 Mixture Of Experts based on Gemma.

+ I am delighted to finally be able to share the beta release of GemMoE, a project that has been a labor of love and a testament to the power of collaboration within the AI community. GemMoE is a Mixture of Experts (MoE) model that combines the strength of Deepmind's Gemma architecture with a custom-tailored approach to enable easy training and inference.

+ At its core, GemMoE comprises 8 separately fine-tuned Gemma models, with 2 experts per token. I have meticulously incorporated bug fixes for Gemma, ensuring that you can train and perform inference with GemMoE just as you would with any other model in the transformers library. My goal was to create a model that is not only powerful but also user-friendly and accessible to researchers, developers, and users alike.
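
As a rough illustration of what "8 experts, 2 experts per token" means mechanically, the sketch below routes each token to its top-2 of 8 expert MLPs and blends their outputs by the gate weights. It is a simplified stand-in rather than GemMoE's actual modeling code, and the layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchMoEBlock(nn.Module):
    """Simplified 8-expert, top-2 MoE feed-forward block (illustrative only)."""

    def __init__(self, hidden_size: int = 256, intermediate_size: int = 1024, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pick the top-2 experts for every token and renormalize their weights.
        logits = self.gate(hidden_states)                       # [batch, seq, num_experts]
        weights, indices = torch.topk(logits, k=2, dim=-1)
        weights = F.softmax(weights, dim=-1)

        # For clarity, each selected expert runs on the full sequence and its output
        # is kept only where the mask selects it; real implementations dispatch only
        # the assigned tokens to each expert.
        output = torch.zeros_like(hidden_states)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = (indices[..., slot] == e).unsqueeze(-1)  # tokens routed to expert e
                if mask.any():
                    output = output + mask * weights[..., slot:slot + 1] * expert(hidden_states)
        return output


x = torch.randn(1, 4, 256)
print(SketchMoEBlock()(x).shape)  # torch.Size([1, 4, 256])
```

In GemMoE the experts come from the 8 separately fine-tuned Gemma models rather than these toy MLPs; the sketch only shows the routing mechanics.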

+ ## The Journey: Challenges and Gratitude

+ The development of GemMoE was challenging yet incredibly rewarding. Although I encountered numerous obstacles along the way, with the help and support of the AI community, I was able to overcome them and create something truly special.

+ One of the most significant challenges I faced was my lack of experience in developing a custom MoE architecture. As a self-taught developer, I had to learn many hard lessons from my previous project, Qwen1.5-8x7b. Although the methodology was there, the execution was lacking. I soon realized that I would need to create a completely new model class within the transformers library and implement a custom MoE architecture tailored specifically for Gemma. Through countless iterations and a tremendous amount of trial and error, I was able to refine the architecture and optimize its performance.

+ I cannot stress enough how grateful I am for the support and contributions of the AI community throughout this journey. Hugging Face, in particular, played a crucial role in making GemMoE a reality. Their generous compute grant allowed me to refine the architecture and optimize the model before committing resources to the full training. Victor M and Apolinaro from the Hugging Face team were instrumental in getting this project off the ground, and their swift support and dedication were truly inspiring.

+ I also want to express my heartfelt thanks to Eric Hartford, who provided invaluable assistance in troubleshooting Gemma's unstable loss curve during the early stages of development. Daniel Han from Unsloth deserves a special mention for his tireless work in identifying and fixing many of the bugs in Gemma's transformers implementation. His fixes enabled me to fine-tune 8 different versions of Gemma and combine them using a hidden gate with a heavily modified version of mergekit, a tool developed by the brilliant Charles Goddard.

+ Adapting mergekit to support Gemma was no small feat, and I had to make significant alterations to ensure compatibility with GemMoE's architecture. This process involved a great deal of trial and error, but with persistence and the guidance of others in the community, I was able to integrate the modified mergekit into GemMoE successfully.

+ I want to extend my gratitude to Maxime Labonne for his incredibly useful LLM course and Colab notebooks, which helped me level up my fine-tuning skills. Jon Durbin's bagel GitHub repository was a crash course in what makes good data, and it played a crucial role in informing my data selection process. The transparency and example set by Teknium inspired me to turn my AI side hustle into a full-time gig, and for that, I am deeply grateful.

+ Locutusque's datasets were a prime example of aggregating data like a pro. Justin Lin from Alibaba research supported my previous project, Qwen1.5-8x7b, which laid the foundation for GemMoE. I also want to thank Deepmind for releasing Gemma and acknowledge the hard work and dedication of everyone who contributed to its development.

+ The Deepseek team's DeepseekMoE paper was a game-changer for me, providing critical insights into what makes an MoE as good as possible. I am also incredibly grateful to the entire Perplexity team, whose answer engine accelerated my education and understanding of AI by a factor of five (source: vibes).

+ ## The Future: Continuous Improvement and Community Collaboration

+ GemMoE is currently in beta, and I am actively working on refining the model, architecture, and implementation for full integration into the transformers library. I will release a comprehensive technical report detailing the development process and the solutions employed in GemMoE.

+ I am incredibly excited about GemMoE's future and potential. By collaborating with the community, we can continue to refine and improve GemMoE, pushing the boundaries of what is possible with MoE architectures and limited amounts of compute.

+ To facilitate this collaboration, I will release all 8 of the Gemma fine-tunes that went into GemMoE, along with a detailed technical breakdown of the model's architecture and training process. My hope is that by sharing this information, I can inspire and empower others to build upon GemMoE and create even more powerful models.

+ ## A Heartfelt Thank You

+ Finally, I want to express my deepest gratitude to every member of the open-source community who has contributed to and supported this project. Your dedication, expertise, and willingness to share knowledge have been instrumental in bringing GemMoE to life.

+ To the users who will be working with GemMoE, thank you for your interest and support. Your enthusiasm humbles me, and I cannot wait to see what incredible applications and use cases you will find with this model.

+ GemMoE is a testament to the collaboration, perseverance, and unwavering commitment of the AI community to push the boundaries of what is possible for the GPU poor. I am honored to be a part of this community and look forward to continuing this journey of discovery with all of you.
  -Lucas