This is a demo for our preprint, Grounding Language Models to Images for Multimodal Generation (https://arxiv.org/abs/2301.13823), done at Carnegie Mellon University. The model takes image and text inputs, and generates image and text outputs.
Hi @jykoh, we have assigned a GPU to this Space. Note that GPU Grants are provided temporarily and may be removed later if usage is very low.
To learn more about GPUs in Spaces, please check out https://huggingface.co/docs/hub/spaces-gpus