---
license: bsd-3-clause
library_name: braindecode
pipeline_tag: feature-extraction
tags:
- eeg
- biosignal
- pytorch
- neuroscience
- braindecode
- foundation-model
- convolutional
- transformer
---

# MEDFormer

Medformer from Wang et al. (2024).

> **Architecture-only repository.** This repo documents the
> `braindecode.models.MEDFormer` class. **No pretrained weights are
> distributed here** — instantiate the model and train it on your own
> data, or fine-tune from a published foundation-model checkpoint
> separately.

## Quick start

```bash
pip install braindecode
```

```python
from braindecode.models import MEDFormer

model = MEDFormer(
    n_chans=22,
    sfreq=250,
    input_window_seconds=4.0,
    n_outputs=4,
)
```

The signal-shape arguments above are example defaults — adjust them
to match your recording.
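
As a quick sanity check you can run a dummy forward pass. This is a minimal
sketch assuming the standard braindecode convention of
`(batch, n_chans, n_times)` input tensors and a `(batch, n_outputs)` output:

```python
import torch

# 8 windows, 22 channels, 4 s at 250 Hz = 1000 samples per window.
x = torch.randn(8, 22, 1000)
y = model(x)
print(y.shape)  # expected: torch.Size([8, 4])
```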

## Documentation

- Full API reference (parameters, references, architecture figure):
  <https://braindecode.org/stable/generated/braindecode.models.MEDFormer.html>
- Interactive browser with live instantiation:
  <https://huggingface.co/spaces/braindecode/model-explorer>
- Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/medformer.py#L20>

## Architecture description

The section below is adapted from the class docstring (parameters,
references, architecture figure where available).

Medformer from Wang et al. (2024) [Medformer2024].

*Keywords: Convolution, Foundation Model*

![MEDFormer Architecture.](https://raw.githubusercontent.com/DL4mHealth/Medformer/refs/heads/main/figs/medformer_architecture.png)

*Figure: a) Workflow. b) For the input sample $x_\text{in}$, the authors apply $n$ different patch lengths in parallel to create patched features $x_p^{(i)}$, where $i$ ranges from 1 to $n$. Each patch length represents a different granularity. These patched features are linearly transformed into $x_e^{(i)}$ and augmented into $\tilde{x}_e^{(i)}$. c) The final patch embedding $x^{(i)}$ fuses the augmented $\tilde{x}_e^{(i)}$ with the positional embedding $W_\text{pos}$ and the granularity embedding $W_\text{gr}^{(i)}$. Each granularity employs a router $u^{(i)}$ to capture aggregated information. Intra-granularity attention focuses within individual granularities, and inter-granularity attention leverages the routers to integrate information across granularities.*

The **MedFormer** is a multi-granularity patching transformer tailored to medical
time-series (MedTS) classification, with an emphasis on EEG and ECG signals. It captures
local temporal dynamics, inter-channel correlations, and multi-scale temporal structure
through cross-channel patching, multi-granularity embeddings, and two-stage attention
[Medformer2024].

**Architecture Overview**

MedFormer integrates three mechanisms to enhance representation learning [Medformer2024]:

1. **Cross-channel patching.** Leverages inter-channel correlations by forming patches
   across multiple channels and timestamps, capturing multi-timestamp and cross-channel
   patterns (see the sketch after this list).
2. **Multi-granularity embedding.** Extracts features at different temporal scales from
   `patch_len_list`, emulating frequency-band behavior without hand-crafted filters.
3. **Two-stage multi-granularity self-attention.** Learns intra- and inter-granularity
   correlations to fuse information across temporal scales.
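
To make the patching concrete, the sketch below (an illustration of the idea,
not braindecode's internal code) segments a `(T, C)` window into cross-channel
patches of length `L_i` and counts the $N_i = \lceil T / L_i \rceil$ tokens
each granularity produces:

```python
import math
import torch
import torch.nn.functional as F

T, C = 1000, 22                # e.g., 4 s at 250 Hz, 22 channels
patch_len_list = [14, 44, 45]  # the default granularities

x_in = torch.randn(T, C)
for L_i in patch_len_list:
    N_i = math.ceil(T / L_i)
    # Zero-pad so T divides evenly, then fold into N_i patches, each
    # flattening L_i consecutive timestamps across all C channels.
    pad = N_i * L_i - T
    x_pad = F.pad(x_in, (0, 0, 0, pad))   # pad the time dimension
    x_p = x_pad.reshape(N_i, L_i * C)     # (N_i, L_i * C)
    print(f"L_i={L_i:2d} -> N_i={N_i:2d} patches of dim {L_i * C}")
```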

**Macro Components**

**`MEDFormer.enc_embedding` (Embedding Layer)**

**Operations.** `_ListPatchEmbedding` implements cross-channel multi-granularity
patching. For each patch length $L_i$, the input
$\mathbf{x}_\text{in} \in \mathbb{R}^{T \times C}$ is segmented into $N_i$
cross-channel non-overlapping patches
$\mathbf{x}_p^{(i)} \in \mathbb{R}^{N_i \times (L_i \cdot C)}$, where
$N_i = \lceil T / L_i \rceil$. Each patch is linearly projected via
`_CrossChannelTokenEmbedding` to obtain
$\mathbf{x}_e^{(i)} \in \mathbb{R}^{N_i \times D}$. Data augmentations
(masking, jittering) produce augmented embeddings $\tilde{\mathbf{x}}_e^{(i)}$.
The final embedding combines augmented patches, fixed positional embeddings
(`_PositionalEmbedding`), and learnable granularity embeddings
$\mathbf{W}_\text{gr}^{(i)}$:

$$\mathbf{x}^{(i)} = \tilde{\mathbf{x}}_e^{(i)} + \mathbf{W}_\text{pos}[1 : N_i] + \mathbf{W}_\text{gr}^{(i)}$$

Additionally, a router token is initialized for each granularity:

$$\mathbf{u}^{(i)} = \mathbf{W}_\text{pos}[N_i + 1] + \mathbf{W}_\text{gr}^{(i)}$$

**Role.** Converts raw input into granularity-specific patch embeddings
$\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}\}$ and router embeddings
$\{\mathbf{u}^{(1)}, \ldots, \mathbf{u}^{(n)}\}$ for multi-scale processing.

**`MEDFormer.encoder` (Transformer Encoder Stack)**

**Operations.** A stack of `_EncoderLayer` modules, each containing a
`_MedformerLayer` that implements two-stage self-attention. The two-stage
mechanism splits self-attention into:

**(a) Intra-Granularity Self-Attention.** For granularity $i$, the patch
embedding $\mathbf{x}^{(i)} \in \mathbb{R}^{N_i \times D}$ and router embedding
$\mathbf{u}^{(i)} \in \mathbb{R}^{1 \times D}$ are concatenated:

$$\mathbf{z}^{(i)} = [\mathbf{x}^{(i)} \,\|\, \mathbf{u}^{(i)}] \in \mathbb{R}^{(N_i + 1) \times D}$$

Self-attention is applied to update both embeddings:

$$\begin{aligned}
\mathbf{x}^{(i)} &\leftarrow \operatorname{Attn}_\text{intra}(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}, \mathbf{z}^{(i)}) \\
\mathbf{u}^{(i)} &\leftarrow \operatorname{Attn}_\text{intra}(\mathbf{u}^{(i)}, \mathbf{z}^{(i)}, \mathbf{z}^{(i)})
\end{aligned}$$

This captures temporal features within each granularity independently.

**(b) Inter-Granularity Self-Attention.** All router embeddings are concatenated:

$$\mathbf{U} = [\mathbf{u}^{(1)} \,\|\, \mathbf{u}^{(2)} \,\|\, \cdots \,\|\, \mathbf{u}^{(n)}] \in \mathbb{R}^{n \times D}$$

Self-attention among routers exchanges information across granularities:

$$\mathbf{u}^{(i)} \leftarrow \operatorname{Attn}_\text{inter}(\mathbf{u}^{(i)}, \mathbf{U}, \mathbf{U})$$

**Role.** Learns representations and correlations within and across temporal
scales while reducing complexity from $O\big((\sum_i N_i)^2\big)$ to
$O\big(\sum_i N_i^2 + n^2\big)$ through the router mechanism. A toy sketch of
this two-stage scheme follows.
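
The two-stage scheme can be sketched with off-the-shelf
`torch.nn.MultiheadAttention`. This is an illustration under assumed shapes,
not the `_MedformerLayer` implementation:

```python
import torch
import torch.nn as nn

D, n_heads = 128, 8
attn_intra = nn.MultiheadAttention(D, n_heads, batch_first=True)
attn_inter = nn.MultiheadAttention(D, n_heads, batch_first=True)

# Patch embeddings for three granularities (batch of 1, N_i tokens each)
# and one router token per granularity.
xs = [torch.randn(1, N_i, D) for N_i in (72, 23, 23)]
us = [torch.randn(1, 1, D) for _ in xs]

# (a) Intra-granularity: patches and router attend within z = [x || u].
for i, (x, u) in enumerate(zip(xs, us)):
    z = torch.cat([x, u], dim=1)       # (1, N_i + 1, D)
    xs[i], _ = attn_intra(x, z, z)     # queries: the patches
    us[i], _ = attn_intra(u, z, z)     # query: the router

# (b) Inter-granularity: routers exchange information among themselves.
U = torch.cat(us, dim=1)               # (1, n, D)
U, _ = attn_inter(U, U, U)
us = list(U.split(1, dim=1))
```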

**Temporal, Spatial, and Spectral Encoding**

- **Temporal:** Multiple patch lengths in `patch_len_list` capture features at several
  temporal granularities, while intra-granularity attention supports long-range temporal
  dependencies.
- **Spatial:** Cross-channel patching embeds inter-channel dependencies by applying
  kernels that span every input channel.
- **Spectral:** Differing patch lengths simulate multiple sampling frequencies analogous
  to clinically relevant bands (e.g., alpha, beta, gamma).

**Additional Mechanisms**

- **Granularity router:** Each granularity $i$ receives a dedicated router token
  $\mathbf{u}^{(i)}$. Intra-attention updates the token, and inter-attention exchanges
  aggregated information across scales.
- **Complexity:** Router-mediated two-stage attention maintains $O(T^2)$ complexity for
  suitable patch lengths (e.g., power series), preserving transformer-like efficiency
  while modeling multiple granularities (worked example below).
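
As a worked example of the saving, take the default
`patch_len_list = [14, 44, 45]` on a 1000-sample window (illustrative numbers,
assuming $N_i = \lceil T / L_i \rceil$): the patch counts are
$N = (72, 23, 23)$, so single-pool attention over all tokens would scale with
$(72 + 23 + 23)^2 = 13924$, while the two-stage scheme scales with
$72^2 + 23^2 + 23^2 + 3^2 = 6251$, roughly a 2.2x reduction.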

### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `patch_len_list` | list of int, optional | `[14, 44, 45]` | Patch lengths for multi-granularity patching; each entry selects a temporal scale. |
| `embed_dim` | int, optional | `128` | Embedding dimensionality. |
| `num_heads` | int, optional | `8` | Number of attention heads; must divide `embed_dim`. |
| `drop_prob` | float, optional | `0.1` | Dropout probability. |
| `no_inter_attn` | bool, optional | `False` | If `True`, disables inter-granularity attention. |
| `num_layers` | int, optional | `6` | Number of encoder layers. |
| `dim_feedforward` | int, optional | `256` | Feedforward dimensionality. |
| `activation_trans` | `nn.Module`, optional | `nn.ReLU` | Activation module used in transformer encoder layers. |
| `single_channel` | bool, optional | `False` | If `True`, processes each channel independently, increasing capacity and cost. |
| `output_attention` | bool, optional | `True` | If `True`, returns attention weights for interpretability. |
| `activation_class` | `nn.Module`, optional | `nn.GELU` | Activation used in the final classification layer. |
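
For reference, a customized instantiation might look like the following. The
values are illustrative, not recommended settings; parameter names follow the
table above:

```python
from braindecode.models import MEDFormer

model = MEDFormer(
    n_chans=22,
    sfreq=250,
    input_window_seconds=4.0,
    n_outputs=4,
    patch_len_list=[8, 32, 64],
    embed_dim=64,
    num_heads=4,   # must divide embed_dim
    num_layers=4,
    drop_prob=0.2,
)
```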

### Notes

- MedFormer outperforms strong baselines across six metrics on five MedTS datasets in a
  subject-independent evaluation [Medformer2024].
- Cross-channel patching provides the largest F1 improvement in ablation studies (average
  +6.10%), highlighting its importance for MedTS tasks [Medformer2024].
- Setting `no_inter_attn` to `True` disables inter-granularity attention while retaining
  intra-granularity attention.

### References

**[Medformer2024]** Wang, Y., Huang, N., Li, T., Yan, Y., & Zhang, X. (2024).
Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series
Classification. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet,
J. Tomczak, & C. Zhang (Eds.), Advances in Neural Information Processing
Systems (Vol. 37, pp. 36314-36341). doi:10.52202/079017-1145.

**Hugging Face Hub integration**

When the optional `huggingface_hub` package is installed, all models
automatically gain the ability to be pushed to and loaded from the
Hugging Face Hub. Install with:

```bash
pip install braindecode[hub]
```

**Pushing a model to the Hub:**
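
The card omits the original example code, so the following is a minimal
sketch. It assumes braindecode exposes the standard `huggingface_hub`
mixin API (`push_to_hub` / `from_pretrained`) and uses a placeholder repo id:

```python
from braindecode.models import MEDFormer

model = MEDFormer(n_chans=22, sfreq=250, input_window_seconds=4.0, n_outputs=4)
# "your-username/medformer-example" is a placeholder repo id.
model.push_to_hub("your-username/medformer-example")
```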

**Loading a model from the Hub:**
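
Again a sketch under the same assumed mixin API:

```python
from braindecode.models import MEDFormer

# Placeholder repo id; architecture and configuration are rebuilt from the
# checkpoint's stored hyperparameters.
model = MEDFormer.from_pretrained("your-username/medformer-example")
```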

**Extracting features and replacing the head:**
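
A sketch of head replacement. The `final_layer` attribute name is an
assumption here, based on the convention braindecode models use for their
classification head:

```python
import torch.nn as nn

# Replace the classification head with identity so forward() yields features.
model.final_layer = nn.Identity()  # `final_layer` is assumed, see above
```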

**Saving and restoring full configuration:**
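
A sketch of a save/restore round trip, under the same assumptions as above:

```python
from braindecode.models import MEDFormer

model = MEDFormer(
    n_chans=22, sfreq=250, input_window_seconds=4.0, n_outputs=4,
    drop_prob=0.2,  # model-specific settings travel with the checkpoint
)
model.push_to_hub("your-username/medformer-example")  # placeholder repo id

restored = MEDFormer.from_pretrained("your-username/medformer-example")
```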

All model parameters (both EEG-specific and model-specific such as
dropout rates, activation functions, number of filters) are automatically
saved to the Hub and restored when loading.

See the braindecode tutorial on loading pretrained models for a complete
walkthrough.

## Citation

Please cite both the original paper for this architecture (see the
*References* section above) and braindecode:

```bibtex
@article{aristimunha2025braindecode,
  title   = {Braindecode: a deep learning library for raw electrophysiological data},
  author  = {Aristimunha, Bruno and others},
  journal = {Zenodo},
  year    = {2025},
  doi     = {10.5281/zenodo.17699192},
}
```

## License

BSD-3-Clause for the model code (matching braindecode).
Pretraining-derived weights, if you fine-tune from a checkpoint,
inherit the license of that checkpoint and its training corpus.