# Introduction to State Space Models (SSM)

**A French version of this article is available on my blog.**

## Changelog

- 2023-12-14: article release.
- 2024-04-08: typo corrections (my English isn't perfect 😅).
- 2024-06-11: added links to the second article in my SSM blog post series.

## Foreword

I'd like to extend my warmest thanks to Boris ALBAR, Pierre BEDU and Nicolas PREVOT for agreeing to set up a working group on the subject of SSMs and thus accompanying me in my discovery of this type of model. A special thanks to the former for taking the time to proofread this blog post.

## Introduction

*State Space Models* (SSMs) are traditionally used in control theory to model a dynamic system via state variables.

In the context of deep learning, when we speak of SSMs, we are referring to a subset of existing representations, namely linear time-invariant (or stationary) systems.

These models showed impressive performance as early as October 2021 with the paper *Efficiently Modeling Long Sequences with Structured State Spaces* by Albert GU et al., to the point of positioning themselves as an alternative to *transformers*.

In this article, we will define the basics of a deep learning SSM based on S4. Like the paper *Attention Is All You Need* by Ashish VASWANI et al. (2017) for *transformers*, S4 is the foundation of a new type of neural network architecture that needs to be known, but it is not a model used as such in practice (other SSMs with better performance, or that are easier to implement, are now available). Released a week earlier than S4, LSSL, by the same authors, is also an important source of information on the subject. We'll take a look at the various developments arising from S4 in a future blog post (now available here). Before that, let's delve into the basics of SSMs.

## Definition of an SSM in deep learning

Let's use the image below to define an SSM:

Figure 1: View of a continuous, time-invariant SSM (source: https://en.wikipedia.org/wiki/State-space_representation)

It can be seen that an SSM is based on three variables that depend on time $t$:

- $x(t) \in \mathbb{C}^n$ represents the $n$ state variables,
- $u(t) \in \mathbb{C}^m$ represents the $m$ inputs,
- $y(t) \in \mathbb{C}^p$ represents the $p$ outputs.

We can also see that it's made up of four learnable matrices: $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$.

- $\mathbf{A} \in \mathbb{C}^{n \times n}$ is the state matrix (controlling the latent state $x$),
- $\mathbf{B} \in \mathbb{C}^{n \times m}$ is the control (input) matrix,
- $\mathbf{C} \in \mathbb{C}^{p \times n}$ is the output matrix,
- $\mathbf{D} \in \mathbb{C}^{p \times m}$ is the feedthrough matrix.

The above picture can be reduced to the following system of equations:

$$\begin{aligned} x'(t) &= \mathbf{A}x(t) + \mathbf{B}u(t) \\ y(t) &= \mathbf{C}x(t) + \mathbf{D}u(t) \end{aligned}$$

Note: here we use the notation $x'$ to designate the derivative of $x$. You may also encounter the notation $\dot{x}$ in the literature.

Similarly, since it is implicit that the variables depend on time, the preceding equation is generally written in the following form for the sake of simplicity:

$$\begin{aligned} x' &= \mathbf{A}x + \mathbf{B}u \\ y &= \mathbf{C}x + \mathbf{D}u \end{aligned}$$

This system can be made even lighter because, in deep learning SSMs, $\mathbf{D}$ is set to $0$: the term $\mathbf{D}u$ is seen as an easily computable *skip connection*.

$$\begin{aligned} x' &= \mathbf{A}x + \mathbf{B}u \\ y &= \mathbf{C}x \end{aligned}$$

This system is continuous. It must therefore first be discretized before it can be supplied to a computer.

## Discretization

Discretization is one of the most important points about SSMs, if not the most important. All the efficiency of this architecture lies in this step, since it enables us to pass from the continuous view of the SSM to its two other views: the **recursive view** and the **convolutional view**.

If there's one thing to remember from this article, it's this.

Figure 2: Image from the blog post *Structured State Spaces: Combining Continuous-Time, Recurrent, and Convolutional Models* by Albert GU et al. (2022)

We'll see in a later article that there are several possible discretizations. This is one of the main differences between the various existing SSM architectures.

For this first article, let's apply the "original" discretization proposed in S4 to illustrate the two additional views of an SSM.

## Recursive view of an SSM

To discretize the continuous case, let's use the trapezoid method, whose principle is to approximate the region under the curve of a function $f$ defined on a segment $[t_n, t_{n+1}]$ by a trapezoid and to compute its area $T$: $T = (t_{n+1} - t_n)\frac{f(t_n) + f(t_{n+1})}{2}$.

We then have: $x_{n+1} - x_n = \frac{\Delta}{2}(f(t_n) + f(t_{n+1}))$, with $\Delta = t_{n+1} - t_n$.
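As a quick numerical sanity check of this trapezoid approximation (purely illustrative; the function name is mine, not from the article):

```python
import numpy as np

def trapezoid_area(f, t_n, t_n1):
    """Area of the trapezoid under f between t_n and t_{n+1}."""
    return (t_n1 - t_n) * (f(t_n) + f(t_n1)) / 2

# Compare with the exact integral of sin on [0, 0.1], which is 1 - cos(0.1).
approx = trapezoid_area(np.sin, 0.0, 0.1)
exact = 1.0 - np.cos(0.1)
print(abs(approx - exact) < 1e-4)  # True: accurate on a small step
```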

If $x'_n = \mathbf{A}x_n + \mathbf{B}u_n$ (first line of the SSM system of equations) plays the role of $f$, then:

$$\begin{aligned}
x_{n+1} &= x_n + \frac{\Delta}{2}(\mathbf{A}x_n + \mathbf{B}u_n + \mathbf{A}x_{n+1} + \mathbf{B}u_{n+1}) \\
\Longleftrightarrow x_{n+1} - \frac{\Delta}{2}\mathbf{A}x_{n+1} &= x_n + \frac{\Delta}{2}\mathbf{A}x_n + \frac{\Delta}{2}\mathbf{B}(u_{n+1} + u_n) \\
(\ast) \Longleftrightarrow \left(\mathbf{I} - \frac{\Delta}{2}\mathbf{A}\right)x_{n+1} &= \left(\mathbf{I} + \frac{\Delta}{2}\mathbf{A}\right)x_n + \Delta\mathbf{B}u_{n+1} \\
\Longleftrightarrow x_{n+1} &= \left(\mathbf{I} - \frac{\Delta}{2}\mathbf{A}\right)^{-1}\left(\mathbf{I} + \frac{\Delta}{2}\mathbf{A}\right)x_n + \left(\mathbf{I} - \frac{\Delta}{2}\mathbf{A}\right)^{-1}\Delta\mathbf{B}u_{n+1}
\end{aligned}$$

$(\ast)$: $u_{n+1} \simeq u_n$ (the control vector is assumed to be constant over a small $\Delta$).

We've just obtained our discretized SSM!

To make this completely explicit, let's set:

$$\begin{aligned}
\mathbf{\bar{A}} &= \left(\mathbf{I} - \frac{\Delta}{2}\mathbf{A}\right)^{-1}\left(\mathbf{I} + \frac{\Delta}{2}\mathbf{A}\right) \\
\mathbf{\bar{B}} &= \left(\mathbf{I} - \frac{\Delta}{2}\mathbf{A}\right)^{-1}\Delta\mathbf{B} \\
\mathbf{\bar{C}} &= \mathbf{C}
\end{aligned}$$

We then have:

$$\begin{aligned}
x_k &= \mathbf{\bar{A}}x_{k-1} + \mathbf{\bar{B}}u_k \\
y_k &= \mathbf{\bar{C}}x_k
\end{aligned}$$

The notation of matrices with a bar was introduced in S4 to designate matrices in the discrete case and has since become a convention in the field of SSM applied to deep learning.
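These formulas translate directly into code. Below is a minimal NumPy sketch (function names, sizes and the random matrices are illustrative, not taken from the official S4 implementation): we discretize a toy continuous SSM with the trapezoid (bilinear) rule, then unroll the recurrence over an input sequence.

```python
import numpy as np

def discretize(A, B, C, delta):
    """Trapezoid (bilinear) discretization of a continuous SSM (A, B, C)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - delta / 2 * A)
    A_bar = inv @ (I + delta / 2 * A)
    B_bar = inv @ (delta * B)
    return A_bar, B_bar, C  # C_bar = C

def run_recurrence(A_bar, B_bar, C_bar, u):
    """Recursive view: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C_bar x_k."""
    x = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_k in u:                    # one constant-cost state update per step
        x = A_bar @ x + B_bar * u_k
        ys.append((C_bar @ x).item())
    return np.array(ys)

rng = np.random.default_rng(0)
n = 4                                # state size (toy value)
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))
u = rng.normal(size=16)              # scalar input sequence of length 16

A_bar, B_bar, C_bar = discretize(A, B, C, delta=0.1)
y = run_recurrence(A_bar, B_bar, C_bar, u)
print(y.shape)  # (16,)
```

Note the constant cost per step at inference time: each new input only requires updating the state $x$ once.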

## Convolutional view of an SSM

This recurrence can be written as a convolution. To do this, simply iterate the equations of the system

$$\begin{aligned}
x_k &= \mathbf{\bar{A}}x_{k-1} + \mathbf{\bar{B}}u_k \\
y_k &= \mathbf{\bar{C}}x_k
\end{aligned}$$

Let's start with the first line of the system:

Step 0: $x_0 = \mathbf{\bar{B}}u_0$

Step 1: $x_1 = \mathbf{\bar{A}}x_0 + \mathbf{\bar{B}}u_1 = \mathbf{\bar{A}}\mathbf{\bar{B}}u_0 + \mathbf{\bar{B}}u_1$

Step 2: $x_2 = \mathbf{\bar{A}}x_1 + \mathbf{\bar{B}}u_2 = \mathbf{\bar{A}}(\mathbf{\bar{A}}\mathbf{\bar{B}}u_0 + \mathbf{\bar{B}}u_1) + \mathbf{\bar{B}}u_2 = \mathbf{\bar{A}}^2\mathbf{\bar{B}}u_0 + \mathbf{\bar{A}}\mathbf{\bar{B}}u_1 + \mathbf{\bar{B}}u_2$

We see that $x_k$ can be written as a function $f$ parametrized by $(u_0, u_1, \dots, u_k)$.

Let's move on to the second line of the system, where we can now inject the $x_k$ values computed above:

Step 0: $y_0 = \mathbf{\bar{C}}x_0 = \mathbf{\bar{C}}\mathbf{\bar{B}}u_0$

Step 1: $y_1 = \mathbf{\bar{C}}x_1 = \mathbf{\bar{C}}(\mathbf{\bar{A}}\mathbf{\bar{B}}u_0 + \mathbf{\bar{B}}u_1) = \mathbf{\bar{C}}\mathbf{\bar{A}}\mathbf{\bar{B}}u_0 + \mathbf{\bar{C}}\mathbf{\bar{B}}u_1$

Step 2: $y_2 = \mathbf{\bar{C}}x_2 = \mathbf{\bar{C}}(\mathbf{\bar{A}}^2\mathbf{\bar{B}}u_0 + \mathbf{\bar{A}}\mathbf{\bar{B}}u_1 + \mathbf{\bar{B}}u_2) = \mathbf{\bar{C}}\mathbf{\bar{A}}^2\mathbf{\bar{B}}u_0 + \mathbf{\bar{C}}\mathbf{\bar{A}}\mathbf{\bar{B}}u_1 + \mathbf{\bar{C}}\mathbf{\bar{B}}u_2$

We can observe the convolution kernel $\mathbf{\bar{K}}_k = (\mathbf{\bar{C}}\mathbf{\bar{B}}, \mathbf{\bar{C}}\mathbf{\bar{A}}\mathbf{\bar{B}}, \dots, \mathbf{\bar{C}}\mathbf{\bar{A}}^k\mathbf{\bar{B}})$ applicable to $u$, hence $y = \mathbf{\bar{K}} \ast u$.

As with the matrices, we put a bar on $\mathbf{\bar{K}}$ to specify that it is the convolution kernel obtained after discretization. It is generally referred to as the **SSM convolution kernel** in the literature, and its size is that of the entire input sequence.

This convolution kernel is computed via the Fast Fourier Transform (FFT), which will be explained in future articles (do you like the Flash Attention of *transformers*? You'll love the Flash FFT Convolution, which we'll look at in the third blog post).
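To make the recurrence/convolution equivalence concrete, here is a small self-contained sketch (toy sizes; already-discretized random matrices are assumed, with names of my choosing). It builds the kernel $\mathbf{\bar{K}}$, applies it with an FFT (zero-padding to length $2L$ so the circular convolution matches the causal one), and checks that the result agrees with the recursive view.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C_bar, L):
    """SSM convolution kernel K = (C.B, C.A.B, ..., C.A^{L-1}.B)."""
    K, p = [], B_bar                 # p holds A_bar^i @ B_bar
    for _ in range(L):
        K.append((C_bar @ p).item())
        p = A_bar @ p
    return np.array(K)

def causal_conv_fft(K, u):
    """Causal convolution y = K * u via FFT, zero-padded to 2L to avoid wrap-around."""
    L = len(u)
    spec = np.fft.rfft(K, 2 * L) * np.fft.rfft(u, 2 * L)
    return np.fft.irfft(spec, 2 * L)[:L]

def run_recurrence(A_bar, B_bar, C_bar, u):
    """Reference computation: the recursive view, step by step."""
    x = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append((C_bar @ x).item())
    return np.array(ys)

rng = np.random.default_rng(0)
n, L = 4, 16                          # toy state size and sequence length
A_bar = 0.9 * rng.normal(size=(n, n)) / np.sqrt(n)  # kept small for stability
B_bar = rng.normal(size=(n, 1))
C_bar = rng.normal(size=(1, n))
u = rng.normal(size=L)

y_conv = causal_conv_fft(ssm_kernel(A_bar, B_bar, C_bar, L), u)
y_rec = run_recurrence(A_bar, B_bar, C_bar, u)
print(np.allclose(y_conv, y_rec))  # True: the two views produce the same output
```

The FFT route processes the whole sequence at once, which is what makes training parallelizable.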

## Advantages and limitations of each of the three views

Figure 3: Image from the paper *Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers* by Albert GU et al., released a week before S4

The different views of SSM each have their advantages and disadvantages - let's take a closer look.

For the __continuous view__, the advantages and disadvantages are as follows:

✓ Automatically handles continuous data (audio signals, time series, for example). This represents a huge practical advantage when processing data with irregular or time-shifted sampling.

✓ Mathematically feasible analysis, e.g. by calculating exact trajectories or building memory systems (HiPPO).

✗ Extremely slow for both training and inference.

For the __recursive view__, these are the well-known advantages and disadvantages of recurrent neural networks, namely:

✓ Natural inductive bias for sequential data, and in principle unbounded context.

✓ Efficient inference (constant-time state updates).

✗ Slow learning (lack of parallelism).

✗ Vanishing or exploding gradients when training on very long sequences.

For the __convolutional view__, these are the well-known advantages and disadvantages of convolutional neural networks (here in their one-dimensional version), namely:

✓ Local, interpretable features.

✓ Efficient (parallelizable) training.

✗ Slow in online or autoregressive contexts (the entire input must be recomputed for each new data point).

✗ Fixed context size.

So, depending on the stage of the process (training or inference) or the type of data at our disposal, it is possible to switch from one view to another in order to fall back on a favorable framework for getting the most out of the model.

We prefer the convolutional view for fast training via parallelization, the recursive view for efficient inference, and the continuous view for handling continuous data.

## Learning matrices

In the convolution kernel developed above, $\mathbf{\bar{C}}$ and $\mathbf{\bar{B}}$ are learnable parameters.

Concerning $\mathbf{\bar{A}}$, we've seen that in our convolution kernel it appears raised to the power $k$ at step $k$. These powers can be very time-consuming to compute, so we're looking for a form of $\mathbf{\bar{A}}$ whose powers are cheap. The best option is for it to be diagonal:

$$\mathbf{A} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \Rightarrow \mathbf{A}^k = \begin{bmatrix} \lambda_1^k & 0 & \cdots & 0 \\ 0 & \lambda_2^k & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^k \end{bmatrix}$$

By the spectral theorem of linear algebra, this is exactly the class of normal matrices.
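A quick numerical illustration (a toy sketch, not from the article) of why diagonal matrices are attractive: their $k$-th power is just the element-wise $k$-th power of the eigenvalues, costing $O(n)$ instead of repeated matrix products.

```python
import numpy as np

lam = np.array([0.9, 0.5, -0.3, 0.1])  # eigenvalues of a toy diagonal A
A = np.diag(lam)
k = 10

# Dense route: k-th matrix power via repeated matrix products.
A_k_dense = np.linalg.matrix_power(A, k)
# Diagonal route: element-wise powers of the eigenvalues only.
A_k_diag = np.diag(lam ** k)

print(np.allclose(A_k_dense, A_k_diag))  # True
```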

In addition to the choice of discretization mentioned above, the way in which $\mathbf{\bar{A}}$ is defined and initialized is one of the points that differentiates the various SSM architectures developed in the literature, which we'll cover in the next blog post. Indeed, empirically, it appears that an SSM initialized with a random $\mathbf{A}$ matrix leads to poor results, whereas an initialization based on the **HiPPO** matrix (for *High-Order Polynomial Projection Operator*) gives very good results (from 60% to 98% on the sequential MNIST benchmark).

The **HiPPO** matrix was introduced by the S4 authors in a previous paper (2020). It is included in the LSSL paper (2021), also by the S4 authors, as well as in the S4 appendix.
Its formula is as follows:

$$\mathbf{A} = \begin{bmatrix} 1 \\ -1 & 2 \\ 1 & -3 & 3 \\ -1 & 3 & -5 & 4 \\ 1 & -3 & 5 & -7 & 5 \\ -1 & 3 & -5 & 7 & -9 & 6 \\ 1 & -3 & 5 & -7 & 9 & -11 & 7 \\ -1 & 3 & -5 & 7 & -9 & 11 & -13 & 8 \\ \vdots & & & & & & & & \ddots \end{bmatrix} \Rightarrow \mathbf{A}_{nk} = \begin{cases} (-1)^{n-k}(2k+1) & n > k \\ k+1 & n = k \\ 0 & n < k \end{cases}$$
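The piecewise formula translates directly into code. A minimal sketch (0-based indexing, so the diagonal entries $k+1$ read 1, 2, 3, ...; `hippo_matrix` is an illustrative name, not the official implementation):

```python
import numpy as np

def hippo_matrix(N):
    """HiPPO matrix built from the piecewise formula (0-based indices n, k)."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = (-1) ** (n - k) * (2 * k + 1)
            elif n == k:
                A[n, k] = k + 1   # diagonal: 1, 2, 3, ...
            # n < k: stays 0 (upper triangle)
    return A

A = hippo_matrix(4)
print(A)
# [[ 1.  0.  0.  0.]
#  [-1.  2.  0.  0.]
#  [ 1. -3.  3.  0.]
#  [-1.  3. -5.  4.]]
```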

This matrix is not normal, but it can be decomposed as a normal matrix plus a matrix of lower rank (summarized in the paper as NPLR for *Normal Plus Low Rank*). The authors prove in their paper that this type of matrix can be computed efficiently via three techniques (see Algorithm 1 in the paper): truncated generating series, Cauchy kernels and Woodbury identity.

Details of the demonstration showing that an NPLR matrix can be computed as efficiently as a diagonal matrix can be found in appendices B and C of the paper.

The authors of S4 subsequently made modifications to the **HiPPO** matrix (on how to initialize it) in their paper *How to Train Your HiPPO* (2022). The model resulting from this paper is generally referred to as "S4 V2" or "S4 updated" in the literature, as opposed to the "original S4" or "S4 V1".

In the next article, we'll see that other authors (notably Ankit GUPTA) have proposed using a diagonal matrix instead of an NPLR matrix, an approach that is now preferred as it is simpler to implement.

## Experimental results

Let's end this blog post by analyzing a selection of S4's results on various tasks and benchmarks, to get a feel for the potential of SSMs.

Let's start with an audio task and the benchmark *Speech Commands* by WARDEN (2018).

Figure 4: Image from the paper *On the Parameterization and Initialization of Diagonal State Space Models* by Albert GU et al. (2022), also known as S4D, published after S4 but reproducing S4's results for this benchmark in a more structured form (the S4D results have been removed from the image so as not to spoil the next article ;)

Several things can be observed in this table.

Firstly, for a more or less equivalent number of parameters, the S4 performs much better (at least +13%) than the other models, here of the ConvNet type.

Secondly, to achieve equivalent performance, a ConvNet requires 85 times more parameters.

Thirdly, a ConvNet trained on 16kHz data gives very poor results when applied to 8kHz data. In contrast, S4 retains 95% of its performance under this resampling. This can be explained by the continuous view of the SSM: it was sufficient to double the $\Delta$ value at test time to match the halved sampling rate.

Let's continue with a time series task (introduced in a revision of S4).

The authors of the paper take up the methodology of the Informer model by ZHOU et al. (2020) and show that their model outperforms this *transformer* on 40 of the 50 configurations. The results in the table are shown in a univariate framework, but the same is observable for a multivariate framework (table 14 in the appendix).

Let's continue with a vision task and the benchmark *sCIFAR-10* by KRIZHEVSKY (2009).

S4 establishes SoTA on sCIFAR-10 with just 100,000 parameters (the authors don't specify the number for the other methods).

Let's conclude with a textual task and the benchmark *Long Range Arena* (LRA) by TAY et al. (2020).

The LRA consists of 6 tasks, including Path-X with a length of 16K tokens, for which S4 was the first model to succeed, demonstrating its performance on very-long-sequence tasks.

It would take more than two years before AMOS et al. showed in their paper *Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors* (2023) that *transformers* (not hybridized with an SSM) could also solve this task. However, unlike SSMs, they are still unable to pass the 65K-token Path-X-256.

Note, however, a negative point concerning text for S4: it obtains a higher perplexity than a standard *transformer* on WikiText-103 by MERITY et al. (2016), and more optimized *transformer* variants have an even lower perplexity.

This is probably due to the non-continuous nature of text (it has not been sampled from an underlying physical process such as speech or time series). We'll see in the article devoted to developments in SSMs in 2023 that this point has been the subject of a great deal of work, and that SSMs have now succeeded in bridging this gap.

## Conclusion

SSMs are models with three views: a continuous view and, once discretized, a recurrent view as well as a convolutional view.

The challenge with this type of architecture is to know when to favor one view over another, depending on the stage of the process (training or inference) and the type of data being processed.

This type of model is highly versatile, since it can be applied to text, vision, audio and time-series tasks (or even graphs).

One of its strengths is its ability to handle very long sequences, generally with a lower number of parameters than other models (ConvNet or *transformers*), while still being very fast.

As we'll see in a later article, the main differences between the various existing SSM architectures lie in the way the basic SSM equation is discretized, or in the definition of the $\mathbf{A}$ matrix.

## To dig deeper

For S4, please consult the following resources:

- Videos:
  - Efficiently Modeling Long Sequences with Structured State Spaces - Albert Gu - Stanford MLSys #46 by Albert GU
  - MedAI #41: Efficiently Modeling Long Sequences with Structured State Spaces by Albert GU (a little longer, as more examples are covered)
  - JAX Talk: Generating Extremely Long Sequences with S4 by Sasha RUSH + the slides used in the video
- Code:
  - The Annotated S4 (in JAX) by Sasha RUSH and Sidd KARAMCHETI
  - The GitHub repository of the official S4 implementation (in PyTorch)

- Blog posts:
- Papers:
  - S4's predecessor, *Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks* (LMU) by VOELKER et al. (2019)

For more information on the **HiPPO** matrix, please consult the following resources:

- The Hazy Research blog post on the subject
- The paper *How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections* by Albert GU et al. (2022)

To find out more about SSMs, take a look at:

- The course (in French) on dynamic systems by Ion HAZYUK, Maître de Conférences at INSA Toulouse (the part on state-space models starts at section 5.2)
- The doctoral thesis of Albert GU

## References

- Learning Multiple Layers of Features from Tiny Images by Alex KRIZHEVSKY (2009)
- Pointer Sentinel Mixture Models by Stephen MERITY, Caiming XIONG, James BRADBURY, Richard SOCHER (2016)
- Attention Is All You Need by Ashish VASWANI, Noam SHAZEER, Niki PARMAR, Jakob USZKOREIT, Llion JONES, Aidan N. GOMEZ, Lukasz KAISER, Illia POLOSUKHIN (2017)
- Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition by Pete WARDEN (2018)
- Long Range Arena: A Benchmark for Efficient Transformers by Yi TAY, Mostafa DEHGHANI, Samira ABNAR, Yikang SHEN, Dara BAHRI, Philip PHAM, Jinfeng RAO, Liu YANG, Sebastian RUDER, Donald METZLER (2020)
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting by Haoyi ZHOU, Shanghang ZHANG, Jieqi PENG, Shuai ZHANG, Jianxin LI, Hui XIONG, Wancai ZHANG (2020)
- HiPPO: Recurrent Memory with Optimal Polynomial Projections by Albert GU, Tri DAO, Stefano ERMON, Atri RUDRA, Christopher RÉ (2020)
- Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers by Albert GU, Isys JOHNSON, Karan GOEL, Khaled SAAB, Tri DAO, Atri RUDRA, Christopher RÉ (2021)
- Efficiently Modeling Long Sequences with Structured State Spaces by Albert GU, Karan GOEL, and Christopher RÉ (2021)
- On the Parameterization and Initialization of Diagonal State Space Models by Albert GU, Ankit GUPTA, Karan GOEL, Christopher RÉ (2022)
- Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors by Ido AMOS, Jonathan BERANT, Ankit GUPTA (2023)