---
license: apache-2.0
language:
- en
pipeline_tag: fill-mask
inference: false
---

# Monarch Mixer-BERT

An 80M-parameter checkpoint of M2-BERT, pretrained with sequence length 32768.
**This is a BERT-style model that has not been fine-tuned. We recommend fine-tuning it for specific use cases before using it.**

Check out the paper [Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture](https://arxiv.org/abs/2310.12109) and our [blog post]() on retrieval for more on how we trained this model for long sequences.

This model was trained by Jon Saad-Falcon, Dan Fu, and Simran Arora.

Check out our [GitHub](https://github.com/HazyResearch/m2/tree/main) for instructions on how to download and fine-tune it!

## How to use

You can load this model using Hugging Face `AutoModelForMaskedLM`:

```python
from transformers import AutoModelForMaskedLM

# trust_remote_code=True is required because the M2-BERT architecture
# is defined by custom code shipped with the checkpoint.
model = AutoModelForMaskedLM.from_pretrained(
    "togethercomputer/m2-bert-80M-32k-retrieval",
    trust_remote_code=True
)
```

You should expect to see a large error message about unused parameters for FlashFFTConv.
If you'd like to load the model with FlashFFTConv, you can check out our [GitHub](https://github.com/HazyResearch/m2/tree/main).

## Acknowledgments

Alycia Lee helped with AutoModel support.

## Citation

If you use this model, or otherwise find our work valuable, you can cite us as follows:

```
@inproceedings{fu2023monarch,
  title={Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture},
  author={Fu, Daniel Y and Arora, Simran and Grogan, Jessica and Johnson, Isys and Eyuboglu, Sabri and Thomas, Armin W and Spector, Benjamin and Poli, Michael and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
```
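
## Example: tokenizing to the pretraining length

The loading snippet under "How to use" stops before any input is prepared. The following is a minimal sketch, not the authors' documented recipe, of how one might pad or truncate text to the 32768-token pretraining length and pass it to the loaded model. Several things here are assumptions that this card does not state: the `bert-base-uncased` tokenizer, dropping `token_type_ids`, and that the checkpoint's remote code accepts the tokenizer's `input_ids` and `attention_mask` as keyword arguments. The helper names `build_tokenizer_kwargs` and `run_example` are hypothetical.

```python
MODEL_ID = "togethercomputer/m2-bert-80M-32k-retrieval"  # ID as given in "How to use"
TOKENIZER_ID = "bert-base-uncased"  # ASSUMPTION: tokenizer is not named in this card
MAX_SEQ_LENGTH = 32768  # pretraining sequence length stated in this card


def build_tokenizer_kwargs(max_length=MAX_SEQ_LENGTH):
    """Tokenizer call arguments that pad or truncate every input to max_length."""
    return {
        "return_tensors": "pt",           # PyTorch tensors
        "padding": "max_length",          # pad short inputs up to max_length
        "truncation": True,               # cut long inputs down to max_length
        "max_length": max_length,
        "return_token_type_ids": False,   # ASSUMPTION: segment IDs are not needed
    }


def run_example(text="Every morning, I make a cup of [MASK] to start my day."):
    """Download tokenizer and weights, then run one forward pass (needs network)."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        TOKENIZER_ID, model_max_length=MAX_SEQ_LENGTH
    )
    model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    model.eval()

    batch = tokenizer([text], **build_tokenizer_kwargs())
    with torch.no_grad():
        # ASSUMPTION: the remote model code accepts these keyword arguments.
        outputs = model(**batch)
    return outputs
```

Calling `run_example()` downloads the tokenizer and the checkpoint, so it is left uncalled here. Because this checkpoint has not been fine-tuned, treat the result as a smoke test of the loading path rather than as a usable prediction.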