	---
	license: mit
	tags:
	- lumos
	- image to image
	- text to image
	- novel view synthesis
	- image to video
	---
	<p align="center">
	<img src="asset/logo.gif" height=20>
	</p>

	<div style="display:flex;justify-content: center">
	<a href="https://arxiv.org/pdf/2412.07767"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:Lumos&color=red&logo=arxiv"></a> &ensp;
	<a href="https://xiaomabufei.github.io/lumos/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a> &ensp;
	</div>

	# 🥳 What is Lumos?
	<b>TL;DR: <font color="purple">Lumos</font> is a pure vision-based generative framework that confirms the feasibility and scalability of learning visual generative priors. It can be efficiently adapted to visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.</b>
	<details><summary>CLICK for the full abstract</summary>
	Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive.
	We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling.
	Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner.
	We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models.
	We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 of the text-image pairs for fine-tuning.
	We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video.
	</details>

	# 🪄✨ Lumos Model Card
	![row01](asset/teaser.png)

	## 🚀 Model Structure
	![pipeline](asset/method.png)

	[Lumos](https://arxiv.org/pdf/2412.07767) is built from transformer blocks for latent diffusion and can be applied to various visual generative tasks such as text-to-image, image-to-3D, and image-to-video generation.

	Source code is available at https://github.com/xiaomabufei/lumos.
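
To fetch the files hosted alongside this model card, something like the following should work. It is a minimal sketch that assumes the Hub repository id is `Xiaomabufei/lumos` and that the `huggingface_hub` package is installed; the helper name `fetch_lumos_files` and the default target directory are made up for illustration.

```python
from pathlib import Path

# Assumption: the Hub repository id of this model card.
REPO_ID = "Xiaomabufei/lumos"


def fetch_lumos_files(local_dir="lumos_weights"):
    """Download every file in the Hub repository into `local_dir`.

    `fetch_lumos_files` is a hypothetical helper, not part of the Lumos code.
    The import is done lazily so this module can be imported even when
    `huggingface_hub` is not installed.
    """
    from huggingface_hub import snapshot_download

    # snapshot_download returns the path of the downloaded snapshot folder.
    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))
```

Calling `fetch_lumos_files()` downloads the repository contents and returns the local folder; how the individual checkpoints are then loaded is defined by the source code on GitHub, not by this sketch.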

	## 📋 Model Description

	- Developed by: Lumos
	- Model type: Diffusion-Transformer-based generative model
	- License: [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md)
	- Model Description: Lumos-I2I generates images from image prompts. It uses a [Transformer Latent Diffusion architecture](https://arxiv.org/abs/2310.00426) with a fixed, pretrained vision encoder ([DINO](https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/dino_vitbase16_pretrain.pth)). Lumos-T2I generates images from text prompts. It is a [Transformer Latent Diffusion Model](https://arxiv.org/abs/2310.00426) that uses a fixed, pretrained text encoder ([T5](https://huggingface.co/DeepFloyd/t5-v1_1-xxl)).
	- Resources for more information: Check out our [GitHub Repository](https://github.com/xiaomabufei/lumos) and the [Lumos report on arXiv](https://arxiv.org/pdf/2412.07767).
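
The pairing of a fixed, pretrained encoder with a diffusion transformer described above can be illustrated roughly as follows. This is a minimal PyTorch sketch, not the Lumos implementation: the class name `ConditionedBlock`, all dimensions, the stand-in linear "encoder", and the choice of cross-attention to inject the condition tokens are assumptions made for the example. The actual architecture is in the GitHub repository.

```python
import torch
from torch import nn


class ConditionedBlock(nn.Module):
    """Hypothetical transformer block: self-attention over latent tokens,
    cross-attention to condition tokens, then an MLP."""

    def __init__(self, dim=64, cond_dim=32, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # kdim/vdim let the condition tokens have a different width.
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, latents, cond):
        h = self.norm1(latents)
        latents = latents + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(latents)
        latents = latents + self.cross_attn(h, cond, cond, need_weights=False)[0]
        return latents + self.mlp(self.norm3(latents))


# Stand-in for a fixed, pretrained encoder (DINO or T5 in the model card).
# Here it is just a random linear layer, frozen so it gets no gradient updates.
encoder = nn.Linear(16, 32)
for p in encoder.parameters():
    p.requires_grad_(False)

block = ConditionedBlock()
latents = torch.randn(2, 10, 64)       # (batch, latent tokens, dim)
cond = encoder(torch.randn(2, 5, 16))  # (batch, condition tokens, cond_dim)
out = block(latents, cond)             # same shape as `latents`
```

In this sketch, swapping the image encoder for a text encoder only changes what produces `cond`; the block itself is unchanged.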