arxiv:2402.15391

Genie: Generative Interactive Environments

Published on Feb 23
· Featured in Daily Papers on Feb 26

Abstract

We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Furthermore, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
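To picture how the three components named above fit together at inference time, here is a minimal, hypothetical sketch of the frame-by-frame interaction loop the abstract describes. Genie itself is not publicly released, so every class, method, and shape below is an assumption used only to illustrate the data flow, not a real API.

```python
import numpy as np

# Hypothetical stand-ins for Genie's three components. The abstract names the
# components, but no public API exists; interfaces and shapes are assumptions.
class VideoTokenizer:
    def encode(self, frame: np.ndarray) -> np.ndarray:
        """Map an RGB frame to a grid of discrete spatiotemporal tokens."""
        return np.zeros((16, 16), dtype=np.int64)  # placeholder token grid

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        """Map a token grid back to an RGB frame."""
        return np.zeros((64, 64, 3), dtype=np.uint8)  # placeholder frame


class DynamicsModel:
    def predict_next_tokens(self, token_history, action_history) -> np.ndarray:
        """Autoregressively predict the next frame's tokens from past tokens
        and past latent actions."""
        return np.zeros((16, 16), dtype=np.int64)  # placeholder prediction


def interactive_rollout(prompt_frame, user_actions, tokenizer, dynamics):
    """Frame-by-frame generation: the user picks a discrete latent action at
    each step and the dynamics model predicts the next frame conditioned on it."""
    token_history = [tokenizer.encode(prompt_frame)]
    action_history = []
    frames = [prompt_frame]
    for action in user_actions:  # small integer ids from the learned latent action vocabulary
        action_history.append(action)
        next_tokens = dynamics.predict_next_tokens(token_history, action_history)
        token_history.append(next_tokens)
        frames.append(tokenizer.decode(next_tokens))
    return frames


# Usage: prompt with a single (here blank) image and "play" three latent actions.
frames = interactive_rollout(
    prompt_frame=np.zeros((64, 64, 3), dtype=np.uint8),
    user_actions=[0, 3, 1],
    tokenizer=VideoTokenizer(),
    dynamics=DynamicsModel(),
)
print(len(frames))  # 4: the prompt frame plus one generated frame per action
```

The key point from the abstract is that the action vocabulary is latent and learned without ground-truth labels, so the `user_actions` here would be small discrete ids whose meaning emerges from training rather than from annotation.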

Community

Looks really interesting!

Great work!

Incredible!

What.

Can this model be broken apart Lego-style and recombined to do other things? Especially since the LAM already seems so capable with a bit of additional fine-tuning.

For example, I'd imagine you could predict the next (near-optimal) action, assuming the dynamics model was trained on videos of the game being played well, just by reconfiguring things a bit, without any additional training or fine-tuning:

Send a sequence of video frame tokens $(z_1, \ldots, z_{t-1})$ and latent actions $(a_1, \ldots, a_{t-1})$ into the dynamics model to get the next frame tokens $z_t$, then use the LAM encoder to label the transition from $z_{t-1}$ to $z_t$ and output that as the next action to take.

You can do this autoregressively to "hallucinate" the gameplay, or you can use it as a frame-by-frame agent to actually play the game (see the sketch below).
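Purely as a thought experiment, a minimal sketch of that recombination might look like the following. None of these interfaces exist publicly (Genie's weights and API are not released), so every name and shape is a hypothetical stand-in, and whether the LAM-labelled action is actually a good policy is exactly the open question raised above.

```python
import numpy as np

# Hypothetical interfaces for the proposed "imagine, then read off the action"
# agent; the method names and shapes below are assumptions, not a real API.
class DynamicsModel:
    def predict_next_tokens(self, token_history, action_history) -> np.ndarray:
        return np.zeros((16, 16), dtype=np.int64)  # placeholder next-frame tokens


class LatentActionEncoder:
    def infer_action(self, prev_tokens, next_tokens) -> int:
        """Label the latent action that best explains prev_tokens -> next_tokens."""
        return 0  # placeholder latent action id


def act_from_imagination(token_history, action_history, dynamics, lam_encoder):
    """One step of the proposed agent: imagine the next frame with the dynamics
    model, then use the LAM encoder to read off the action that would produce it."""
    z_next = dynamics.predict_next_tokens(token_history, action_history)
    a_next = lam_encoder.infer_action(token_history[-1], z_next)
    return a_next, z_next


# Usage: start from a single tokenized frame and pick the first action.
token_history = [np.zeros((16, 16), dtype=np.int64)]
action_history = []
a_next, z_next = act_from_imagination(
    token_history, action_history, DynamicsModel(), LatentActionEncoder()
)
```

Feeding the predicted tokens back into the histories as if they were observed gives the fully "hallucinated" rollout; swapping in the real game's observed frame at each step gives the frame-by-frame agent variant.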

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 21