4M: Massively Multimodal Masked Modeling

David Mizrahi*, Roman Bachmann*, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir

Official implementation and pre-trained models for "4M: Massively Multimodal Masked Modeling" (NeurIPS 2023).

Website | Paper | GitHub

4M is a framework for training "any-to-any" foundation models, using tokenization and masking to scale to many diverse modalities. Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models.

Installation

For install instructions, please see https://github.com/apple/ml-4m.

Usage

The depth tokenizer can be loaded from Hugging Face Hub as follows:

# Load the depth tokenizer checkpoint from the Hugging Face Hub
from fourm.vq.vqvae import DiVAE
tok_depth = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_depth_8k_224-448')
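A minimal sketch of a tokenize/detokenize round trip is shown below. The method names, input shape, and normalization are assumptions based on the ml-4m tokenization utilities, not a verified API; please refer to README_TOKENIZATION.md for the exact calls and preprocessing.

import torch

# Hypothetical input: a batch of single-channel depth maps at 224x224,
# preprocessed/normalized as expected by the tokenizer (see README_TOKENIZATION.md).
depth = torch.randn(1, 1, 224, 224)

with torch.no_grad():
    tokens = tok_depth.tokenize(depth)           # assumed: depth map -> discrete token indices
    depth_rec = tok_depth.decode_tokens(tokens)  # assumed: token indices -> reconstructed depth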

Please see https://github.com/apple/ml-4m/blob/main/README_TOKENIZATION.md for more detailed instructions and https://github.com/apple/ml-4m for other tokenizer and 4M model checkpoints.

Safetensors checkpoints are hosted under https://huggingface.co/EPFL-VILAB/4M.

Citation

If you find this repository helpful, please consider citing our work:

@inproceedings{mizrahi20234m,
    title={{4M}: Massively Multimodal Masked Modeling},
    author={David Mizrahi and Roman Bachmann and O{\u{g}}uzhan Fatih Kar and Teresa Yeo and Mingfei Gao and Afshin Dehghan and Amir Zamir},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
}

License

The model weights in this repository are released under the Sample Code license as found in the LICENSE file.
