---
license: mit
tags:
- lumos
- image to image
- text to image
- novel view synthesis
- image to video
---

<p align="center">
<img src="asset/logo.gif" height=20>
</p>

<div style="display:flex;justify-content: center">
<a href="https://arxiv.org/pdf/2412.07767"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:Lumos&color=red&logo=arxiv"></a>
<a href="https://xiaomabufei.github.io/lumos/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a>
</div>

# 🥳 What is Lumos?

<b>TL;DR: <font color="purple">Lumos</font> is a pure vision-based generative framework that confirms the feasibility and scalability of learning visual generative priors. It can be efficiently adapted to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.</b>

<details><summary>CLICK for the full abstract</summary>

Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive.
We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling.
Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner.
We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models.
We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 of the text-image pairs for fine-tuning.
We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video.

</details>

# 🪄✨ Lumos Model Card

![row01](asset/teaser.png)

## Model Structure

![pipeline](asset/method.png)

[Lumos](https://arxiv.org/pdf/2412.07767) is built from transformer blocks for latent diffusion and can be applied to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.

Source code is available at https://github.com/xiaomabufei/lumos.

## Model Description

- **Developed by:** Lumos
- **Model type:** Diffusion-Transformer-based generative model
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md)
- **Model Description:** **Lumos-I2I** generates images from image prompts. It uses a [Transformer Latent Diffusion architecture](https://arxiv.org/abs/2310.00426) with a fixed, pretrained vision encoder ([DINO](https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/dino_vitbase16_pretrain.pth)). **Lumos-T2I** generates images from text prompts. It is a [Transformer Latent Diffusion Model](https://arxiv.org/abs/2310.00426) with a single fixed, pretrained text encoder ([T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl)).
- **Resources for more information:** Check out our [GitHub Repository](https://github.com/xiaomabufei/lumos) and the [Lumos report on arXiv](https://arxiv.org/pdf/2412.07767).
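
The two variants described above differ mainly in what they are conditioned on and which fixed encoder produces that condition. The sketch below restates this as a small lookup table. It is illustrative only: `LumosVariant`, `VARIANTS`, and `variant_for` are hypothetical names that do not come from the Lumos codebase, and the snippet does not load any model weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LumosVariant:
    """Hypothetical record summarizing one variant from this model card."""
    name: str
    prompt_modality: str  # what the model is conditioned on: "image" or "text"
    encoder: str          # fixed, pretrained encoder named in the model card
    encoder_url: str      # checkpoint link given in the model card
    encoder_trainable: bool = False  # the card describes both encoders as fixed


# Values copied from the Model Description entry; nothing here is measured.
VARIANTS = (
    LumosVariant(
        "Lumos-I2I", "image", "DINO",
        "https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/"
        "dino_vitbase16_pretrain.pth",
    ),
    LumosVariant(
        "Lumos-T2I", "text", "T5",
        "https://huggingface.co/DeepFloyd/t5-v1_1-xxl",
    ),
)


def variant_for(prompt_modality: str) -> LumosVariant:
    """Return the variant conditioned on the given prompt modality."""
    for variant in VARIANTS:
        if variant.prompt_modality == prompt_modality:
            return variant
    raise ValueError(f"no Lumos variant takes {prompt_modality!r} prompts")


if __name__ == "__main__":
    for modality in ("image", "text"):
        v = variant_for(modality)
        print(f"{modality} prompt -> {v.name} (frozen {v.encoder} encoder)")
```

For actual inference and training entry points, refer to the GitHub repository linked above.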