arxiv:2406.08862

Cognitively Inspired Energy-Based World Models

Published on Jun 13 · Submitted by alexiglad on Jun 14

Abstract

One of the predominant methods for training world models is autoregressive prediction in the output space of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), this takes the form of autoregressive models predicting the next frame/token/pixel. However, this approach differs from human cognition in several respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions regarding future states. Third, building on this capability, humans allocate a dynamic amount of time to a prediction by assessing when it is sufficient. This adaptive process is analogous to System 2 thinking in psychology. All of these capabilities are fundamental to the success of humans at high-level reasoning and planning. Therefore, to address the limitations of traditional autoregressive models lacking these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described. Moreover, we develop a variant of the traditional autoregressive transformer tailored for Energy-Based Models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU hours than traditional autoregressive transformers in CV, and that EBWM offers promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and of intelligently searching across state spaces.
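
As a rough illustration of the mechanism the abstract describes, the following is a minimal PyTorch-style sketch (not the authors' implementation): an energy network scores the compatibility of a context with a candidate future state, and the candidate is refined by gradient descent on that energy for a variable number of steps, which is where the dynamic, System 2-style allocation of prediction-time compute comes from. All names here (`EnergyModel`, `refine_prediction`, `n_steps`, `step_size`) are hypothetical.

```python
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    """Hypothetical energy network: maps (context, candidate future state)
    to a scalar energy, where lower energy means higher compatibility."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, context, future):
        return self.net(torch.cat([context, future], dim=-1)).squeeze(-1)

def refine_prediction(model, context, n_steps=10, step_size=0.1):
    """Start from a random guess of the future state and take n_steps of
    gradient descent on its energy; more steps = more 'thinking' time."""
    future = torch.randn_like(context, requires_grad=True)
    for _ in range(n_steps):
        energy = model(context, future).sum()
        grad, = torch.autograd.grad(energy, future)
        future = (future - step_size * grad).detach().requires_grad_(True)
    # The final energy doubles as a plausibility score for the prediction.
    return future.detach(), model(context, future).detach()

# Usage: the number of refinement steps is a knob for prediction-time compute.
model = EnergyModel(dim=64)
context = torch.randn(8, 64)   # stand-in for encoded past frames/tokens
prediction, energy = refine_prediction(model, context, n_steps=20)
```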

Community

Paper author Paper submitter

[Figure: ebwm_comparison.png]

I'm always happy to see another energy-based world modelling paper, and I love the direction this one takes in drawing analogies between human psychology and various predictive architectures (I've personally started down this rabbit hole many times; I think there's a lot of value in seeing where current methods fall short of capturing the various capabilities we associate with human intelligence). I'm not convinced that traditional autoregressive structures are inherently unable to model System 2 thinking, nor that MCMC facilitates it completely. I've felt for a while now that there should exist a way of proving some kind of guarantees relating the two structures, possibly showing their equivalence under N steps of MCMC versus N steps of autoregressive rollout, CoT style, but I haven't been able to get there myself yet. I look forward to the code release, and hopefully a follow-up report with a more refined/competitive NLP method.
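
For concreteness, the "N steps of MCMC vs. N steps of autoregressive rollout" contrast raised above can be sketched as two inference loops. This is illustrative only; `ar_model` and `energy_model` are hypothetical stand-ins, not the paper's code, and the MCMC loop here is a generic Langevin-style refinement.

```python
import torch

def autoregressive_steps(ar_model, tokens, n_steps):
    """N steps of rollout: emit N intermediate tokens (CoT style) before
    reading off an answer; compute scales with the tokens emitted."""
    for _ in range(n_steps):
        next_tok = ar_model(tokens)[:, -1].argmax(-1, keepdim=True)  # (batch, 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

def mcmc_steps(energy_model, context, state, n_steps, step_size=0.1, noise=0.01):
    """N steps of Langevin-style MCMC: repeatedly nudge a single predicted
    state toward lower energy; compute scales with the refinement steps."""
    state = state.clone().requires_grad_(True)
    for _ in range(n_steps):
        e = energy_model(context, state).sum()
        g, = torch.autograd.grad(e, state)
        state = (state - step_size * g + noise * torch.randn_like(state)).detach().requires_grad_(True)
    return state.detach()
```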

Energy models for images are not new, and the part relevant to "Text" (causal language modelling) is very vague on details. Looking forward to the code someday.

Paper author

I agree that energy-based models for images are not new. But I have not yet seen them employed for world modeling over video as we did! (If you find a paper that has done this, please let me know and I'll add it. :) ) We plan to release the source code in a couple of months.

