arxiv:2507.02092

Energy-Based Transformers are Scalable Learners and Thinkers

Published on Jul 2 · Submitted by amanchadha on Jul 4
Abstract

Energy-Based Transformers, trained via unsupervised learning, outperform existing models in both scaling and inference across text and image tasks by re-framing predictions as optimization problems.

AI-generated summary

Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
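
To make the abstract's "prediction as optimization" idea concrete, here is a minimal sketch of the inference loop it describes. This is illustrative only, not the authors' released code; the `energy_model(x, y)` callable is a hypothetical stand-in for a trained EBT that returns a scalar energy for an input/candidate-prediction pair.

```python
# Minimal sketch of prediction via energy minimization (illustrative only).
# Assumes a hypothetical energy_model(x, y) returning a scalar energy that
# scores how compatible a candidate prediction y is with the input x.
import torch

def predict_by_energy_minimization(energy_model, x, y_init, steps=10, lr=0.1):
    y = y_init.clone().requires_grad_(True)                # candidate prediction
    for _ in range(steps):                                  # "thinking" iterations
        energy = energy_model(x, y).sum()                   # scalar compatibility score
        (grad,) = torch.autograd.grad(energy, y)            # dE/dy
        y = (y - lr * grad).detach().requires_grad_(True)   # descend the energy
    return y.detach()
```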

Community

Paper author · Paper submitter


Energy-Based Transformers (EBTs) generalize System 2 Thinking to arbitrary modalities and problem types using a scalable, unsupervised energy-based optimization framework that combines verification, uncertainty modeling, and dynamic compute allocation.

  • Unified System 2 Thinking via Energy-Based Optimization: EBTs treat inference as iterative energy minimization over a learned verifier function, enabling dynamic computation, uncertainty modeling, and explicit prediction verification across both discrete and continuous modalities, entirely from unsupervised pretraining.

  • Scalable Transformer-Based EBM Architecture: EBTs implement autoregressive (GPT-style) and bidirectional (BERT/DiT-style) Transformer variants, achieving superior pretraining scaling across parameters, depth, data, batch size, and FLOPs—surpassing the Transformer++ recipe.

  • Inference-Time Thinking via Gradient Descent and Best-of-N Sampling: EBTs support reasoning-like behavior at inference using two methods: more gradient descent steps ("thinking longer") and selecting the lowest-energy prediction from multiple candidates ("self-verification"), both yielding significant gains, especially on out-of-distribution data.
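
As a rough illustration of the two thinking mechanisms in the last bullet: "thinking longer" amounts to running more steps of the gradient-descent loop sketched above, while "self-verification" generates several candidates and keeps the lowest-energy one. The helper below is a hedged sketch with hypothetical names, not the paper's released API.

```python
# Sketch of best-of-N self-verification (illustrative, hypothetical helpers).
# Each candidate prediction is scored by the learned verifier (the energy),
# and the most compatible (lowest-energy) candidate is kept.
import torch

def self_verify(energy_model, x, candidates):
    energies = torch.stack([energy_model(x, y).sum() for y in candidates])
    return candidates[int(energies.argmin())]

# "Thinking longer" corresponds to calling the minimization loop sketched
# earlier with a larger step budget, e.g. steps=50 instead of steps=10,
# trading extra forward/backward passes for a better prediction.
```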

Very interesting work!! Congrats!
I have a question: how does this approach compare with deep equilibrium models (DEQs)? Can DEQ models be seen as energy-based models? I would love to hear your thoughts on the similarities and differences between them!

Paper author

Thanks for the question @SSamDav -- yes, EBMs are a generalization of DEQs, where a DEQ can be seen as minimizing an implicit energy function until convergence! Having a more explicit EBM, as we do, allows for capabilities such as self-verification (generating n samples and choosing the best, i.e. minimum-energy, sample). We have a section in the paper on implicit vs. explicit EBMs, called Energy-Based Model Types (https://arxiv.org/pdf/2507.02092#page=41.09).

The biggest difference lies in the dynamics formulation: DEQs use a fixed-point solver to find the local minimum, whereas the EBMs we train just use gradient descent. While we didn't explicitly compare to DEQs, my general intuition is that this simpler gradient-descent approach is less prone to instability and more flexible, which allows EBTs to scale well, as we demonstrate.
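
For readers unfamiliar with DEQs, the contrast in dynamics can be sketched roughly as below. This is an illustrative comparison under an assumed layer callable `f(z, x)`, not code from either paper; the EBT side is simply the explicit gradient-descent loop sketched earlier, which also exposes an energy value usable for self-verification.

```python
# Schematic DEQ-style inference (illustrative only): iterate the layer f until
# the hidden state stops changing, i.e. z* = f(z*, x) approximately. No explicit
# energy value is exposed along the way, unlike explicit gradient descent on E.
import torch

def deq_fixed_point(f, x, z_init, iters=50, tol=1e-4):
    z = z_init
    for _ in range(iters):
        z_next = f(z, x)
        if torch.norm(z_next - z) < tol:   # converged to a fixed point
            return z_next
        z = z_next
    return z
```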

Hi, great work and super exciting; it's always nice to see alternative approaches to Transformer++. I have a question on the bidirectional EBT: did you have a chance to run any experiments on bidirectional language modeling (BERT-style MLM), and did it perform well there?

While the paper covers both architectures, the focus for the bidirectional model seems to be image-related (reflected in the code as well). I tried a quick pass at a text bidirectional EBT over the weekend, but despite trying several variants based on your code I was consistently seeing model collapse (predicting 'the' for all masked tokens, etc.). I'm unsure if I'm simply missing some critical modeling component, or if you saw the same in the text domain. Any insights would be appreciated!


Hi @pszemraj, this likely relates to one of the limitations discussed: EBTs currently struggle to learn data distributions that are highly multimodal. My guess is that you can't do the same thing as diffusion, where the entire sequence is generated at once, and instead have to be careful about masking/restricting the number of modes within the learned distributions.

In the past we did briefly explore BERT-style bidirectional EBTs and had some success, although we didn't get to spend too much time exploring hyperparameters (which are likely very important here). But if the model is always outputting 'the' and you're doing BERT-style training (only some tokens are masked), that does seem strange.
