Why ResNet is Explicit Euler, and What That Tells Us About Deep Learning

Community Article Published June 13, 2026

Zixi "Oz" Li

An open-source monograph reframes architectures as dynamical systems and training as motion on a terrain.

You've trained a ResNet. You've run Adam. You've watched the loss curve creep downward.

But have you ever asked: what kind of mathematical object is a ResNet, really?

It's not just a "network with skip connections." It's an explicit Euler integrator on a vector field. And once you see it, everything in deep learning clicks into a single geometric story.

The Reframe

Here's a ResNet block:

$h_{l+1} = h_l + f_\theta(h_l)$

Now here's the explicit Euler method for solving an ODE $\frac{dh}{dt} = v(h)$ with step size $\eta$:

$h_{t+1} = h_t + \eta \cdot v(h_t)$

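Set $\eta = 1$ and $v = f_\theta$, and they are the same equation. ResNet is not "a network that adds the input to the output." ResNet is a numerical ODE solver. The residual branch $f_\theta$ is a vector field, and each layer takes one Euler step along it. (In a real ResNet each layer has its own parameters $\theta_l$, so the vector field depends on depth, which plays the role of time.)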

This is not a metaphor. It's an identity.
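A minimal sketch of the identity, in plain Python. The residual branch `f_theta` here is a hypothetical toy function (a scaled `tanh` with one shared weight), not a trained network, and the same branch is reused at every layer, which a real ResNet would not do. Under those assumptions, stacking residual blocks and running explicit Euler with $\eta = 1$ give the same numbers:

```python
import math

def f_theta(h, w=0.5):
    """Toy residual branch; a hypothetical stand-in for a learned f_theta."""
    return [w * math.tanh(x) for x in h]

def resnet_forward(h, num_layers):
    """Apply the residual update h <- h + f_theta(h) once per layer."""
    for _ in range(num_layers):
        h = [x + dx for x, dx in zip(h, f_theta(h))]
    return h

def euler_integrate(h, v, eta, num_steps):
    """Explicit Euler for dh/dt = v(h): h <- h + eta * v(h)."""
    for _ in range(num_steps):
        h = [x + eta * dx for x, dx in zip(h, v(h))]
    return h

h0 = [0.3, -1.2, 2.0]
resnet_out = resnet_forward(h0, num_layers=4)
euler_out = euler_integrate(h0, f_theta, eta=1.0, num_steps=4)

print(resnet_out == euler_out)  # True
```

Choosing a smaller `eta` with proportionally more steps would approximate the same continuous flow with finer resolution, which is the usual way the depth-as-time reading is motivated.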

What Else Hides in Plain Sight

Once you accept that architectures are dynamical systems in disguise, the standard toolbox of deep learning starts to reveal its hidden geometry:

GPT autoregression is implicit Euler

A GPT model predicts the next token by feeding its own output back as input. This is not "just recurrence." It's an implicit-state Euler iteration, from the same family of numerical methods that stays stable where explicit Euler explodes. That's why transformers can handle long-range dependencies: implicit methods remain stable at step sizes where explicit methods diverge.
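The stability contrast between explicit and implicit Euler can be seen on a standard stiff test problem, $\frac{dh}{dt} = -\lambda h$. The sketch below is only about the two numerical methods; it does not model a transformer, and the values of `lam`, `eta`, and `steps` are illustrative assumptions. For this linear problem the implicit update can be solved in closed form:

```python
def explicit_euler(h, lam, eta, steps):
    """h_next = h + eta * v(h), with v(h) = -lam * h."""
    for _ in range(steps):
        h = h + eta * (-lam * h)
    return h

def implicit_euler(h, lam, eta, steps):
    """h_next = h + eta * v(h_next); solved exactly as h / (1 + eta * lam)."""
    for _ in range(steps):
        h = h / (1.0 + eta * lam)
    return h

lam, eta, steps = 50.0, 0.1, 20   # eta * lam = 5, far outside explicit Euler's stable range
h_explicit = explicit_euler(1.0, lam, eta, steps)
h_implicit = implicit_euler(1.0, lam, eta, steps)

# The true solution decays to zero; only the implicit iterate does.
print(abs(h_explicit) > 1e6, abs(h_implicit) < 1e-6)  # True True
```

Explicit Euler multiplies the state by $1 - \eta\lambda = -4$ each step, so it oscillates and grows. Implicit Euler multiplies by $1/(1 + \eta\lambda) = 1/6$, so it decays for any positive step size, although accuracy still depends on $\eta$.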

DEQ (Deep Equilibrium Models) is a fixed-point iteration

A DEQ solves $h^* = f_\theta(h^*, x)$: find the hidden state that equals its own transformation. This is the Banach fixed-point theorem in production: when $f_\theta$ is a contraction in $h$, a unique fixed point exists and plain iteration converges to it. The forward pass is not a forward pass. It's root-finding. And the backward pass uses implicit differentiation, not backprop through layers, because there are no layers to backprop through.
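A scalar sketch of both halves of that description, under stated assumptions: the layer is a hypothetical toy map $f(h, x) = \tanh(w h + x)$ with $|w| < 1$, which makes it a contraction in $h$, and the solver is naive fixed-point iteration rather than the faster root-finders a practical DEQ would likely use. The gradient comes from differentiating $h^* = f(h^*, x)$ implicitly, giving $\frac{dh^*}{dx} = \frac{\partial f / \partial x}{1 - \partial f / \partial h}$, and is checked against a finite difference:

```python
import math

def f(h, x, w=0.5):
    """Toy DEQ layer; a contraction in h because |w| < 1 and |tanh'| <= 1."""
    return math.tanh(w * h + x)

def solve_fixed_point(x, w=0.5, tol=1e-12, max_iter=1000):
    """Iterate h <- f(h, x) until successive iterates agree to within tol."""
    h = 0.0
    for _ in range(max_iter):
        h_next = f(h, x, w)
        if abs(h_next - h) < tol:
            return h_next
        h = h_next
    return h

def implicit_grad(h_star, w=0.5):
    """dh*/dx from the fixed-point equation, with no unrolling of the solver."""
    s = 1.0 - h_star ** 2          # tanh'(z) at the fixed point, since tanh(z) = h_star
    return s / (1.0 - w * s)       # (df/dx) / (1 - df/dh)

x = 0.7
h_star = solve_fixed_point(x)
residual = abs(h_star - f(h_star, x))

# Central finite difference through the solver, used only as a check.
eps = 1e-6
fd_grad = (solve_fixed_point(x + eps) - solve_fixed_point(x - eps)) / (2 * eps)

print(residual < 1e-10, abs(fd_grad - implicit_grad(h_star)) < 1e-4)
```

The gradient formula only needs the converged state `h_star`, not the sequence of iterates that produced it, which is the sense in which there are no layers to backpropagate through.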

KL divergence measures distance on curved space

You minimize KL divergence every day. But KL divergence is the Bregman divergence generated by negative entropy. Your belief space, the simplex of probability distributions, is not flat. The curvature of that space is what forces your optimizer to take small steps. You feel the curvature in your loss curve. Now you know what it is.
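The Bregman reading can be checked numerically. A Bregman divergence generated by a convex function $\varphi$ is $D_\varphi(p, q) = \varphi(p) - \varphi(q) - \langle \nabla\varphi(q),\, p - q \rangle$. Taking $\varphi(p) = \sum_i p_i \log p_i$ (negative entropy) and restricting to distributions that sum to 1 recovers $\mathrm{KL}(p \,\|\, q)$. The two distributions below are arbitrary illustrative values:

```python
import math

def neg_entropy(p):
    """phi(p) = sum_i p_i * log(p_i), the generator of the divergence."""
    return sum(pi * math.log(pi) for pi in p)

def grad_neg_entropy(q):
    """Gradient of phi at q: log(q_i) + 1 in each coordinate."""
    return [math.log(qi) + 1.0 for qi in q]

def bregman(p, q):
    """phi(p) - phi(q) - <grad phi(q), p - q>."""
    g = grad_neg_entropy(q)
    linear_term = sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))
    return neg_entropy(p) - neg_entropy(q) - linear_term

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

print(abs(bregman(p, q) - kl(p, q)) < 1e-12)   # the two definitions agree
print(abs(kl(p, q) - kl(q, p)) > 1e-3)         # and the divergence is not symmetric
```

The asymmetry is one visible sign that this is not a Euclidean distance: swapping the arguments changes the value.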

Chain-of-thought reasoning is a trajectory on a reasoning field

When a model generates a chain of thought, you see tokens. But the hidden states are moving. Each reasoning step is a hidden state taking one Euler step along a reasoning vector field toward an attractor basin. The correct answer is a wide basin. An incorrect answer is a narrow groove carved by training data. The number of reasoning steps is determined by the terrain—not by the problem's "difficulty."

Diffusion is flow along a score vector field

Forward diffusion adds noise. Reverse diffusion removes it. The reverse process follows the score function $\nabla_x \log p_t(x)$ of the noised data distribution at noise level $t$, a vector field that points toward "more data-like" regions. Diffusion models are systems flowing downhill on an entropy landscape, from disorder to order, from high energy to low energy.
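To see a score field move samples, here is a deliberately simplified sketch: Langevin dynamics with the exact, time-independent score of a one-dimensional Gaussian. This is not a full reverse-diffusion sampler (there is no noise schedule and no learned network), and the target parameters, step size, and particle count are illustrative assumptions. Particles start spread out and drift toward the target distribution:

```python
import math
import random

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_sample(x, mu, sigma, rng, step=0.01, num_steps=500):
    """Unadjusted Langevin: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * gaussian_score(x, mu, sigma) + math.sqrt(2.0 * step) * noise
    return x

rng = random.Random(0)                     # fixed seed for repeatability
mu, sigma = 3.0, 1.0                       # hypothetical target distribution
starts = [rng.uniform(-5.0, 5.0) for _ in range(1000)]
samples = [langevin_sample(x0, mu, sigma, rng) for x0 in starts]

sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

# Both should land close to the target mu and sigma.
print(round(sample_mean, 2), round(sample_std, 2))
```

The drift term follows the score toward higher density, while the injected noise keeps the particles from collapsing onto the mode. In a diffusion model the analytic score is replaced by a learned estimate that changes with the noise level.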

One Idea, 337 Years

There is a single thread running through all of this:

F = ma  →  H = T + V  →  loss landscape + gradient field

Newton (1687) analyzed forces one by one. You draw a free-body diagram. You enumerate every force. You sum them. It works, but it doesn't tell you why the system moves the way it does.

Hamilton (1833) showed that the entire system is a single point moving on an energy surface. The geometry of that surface—its ridges, valleys, saddle points—determines everything. One geometric object replaces a catalog of forces. You stop counting forces and start reading the terrain.

This book does the same for deep learning.

Instead of enumerating tricks—skip connections, attention, KL regularization, chain-of-thought, diffusion schedules—we draw the energy landscape and read its geometry. All of these techniques turn out to be the same thing: motion on a terrain.

What the Book Contains

The Terrain of Learning: 4 volumes, 12 chapters, bilingual (Chinese/English), 30+ print-grade SVG figures. Completely free and open-source (CC BY-NC-SA 4.0).

| Volume | Content |
| --- | --- |
| I: The Terrain of Learning | Parameter space, representation space, loss landscapes, gradient fields. From Newtonian to Hamiltonian mechanics. |
| II: The Dynamics of Intelligence | Optimizers as walking styles. Bregman divergence, KL geometry. Dynamical systems, fixed points, attractors. |
| III: The Geometry of Reasoning | Chain-of-thought as trajectory projection. Reasoning fields, attractor basins, verifiers. Long reasoning geomorphology. |
| IV: Algorithmic Landscapes | Geometric rereading of linear regression, PCA, SVM, Attention, LoRA, diffusion models. |

Math prerequisite: undergraduate calculus and linear algebra. Every concept gets a spatial intuition before any symbol appears. If you've trained models and felt that "gradient descent finds a local minimum" doesn't explain anything—this book gives you the language to say what's actually happening.

Read It

📖 Online: datawhalechina.github.io/learning-terrain

📂 Source: github.com/datawhalechina/learning-terrain

💬 Discussion: GitHub Discussions

The book is complete. If you find errors or have ideas, open an issue or join the discussion. I'm here.

Convergence is not hope. Convergence is geometry. You see.
