---
license: bsd-3-clause
library_name: braindecode
pipeline_tag: feature-extraction
tags:
- eeg
- biosignal
- pytorch
- neuroscience
- braindecode
- foundation-model
- convolutional
- transformer
---

# PBT

Patched Brain Transformer (PBT) model from Klein et al. (2025).

> **Architecture-only repository.** This repo documents the
> `braindecode.models.PBT` class. **No pretrained weights are
> distributed here** — instantiate the model and train it on your own
> data, or fine-tune from a published foundation-model checkpoint
> separately.

## Quick start

```bash
pip install braindecode
```

```python
from braindecode.models import PBT

model = PBT(
    n_chans=22,
    sfreq=250,
    input_window_seconds=4.0,
    n_outputs=4,
)
```

The signal-shape arguments above are example defaults — adjust them
to match your recording.

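As a quick sanity check, you can pass a dummy batch through the freshly
instantiated model. This is a minimal sketch assuming the example
configuration above; braindecode models take input shaped
`(batch, n_chans, n_times)`, here `n_times = sfreq * input_window_seconds = 1000`.

```python
import torch

x = torch.randn(8, 22, 1000)   # 8 trials, 22 channels, 4 s at 250 Hz
with torch.no_grad():
    scores = model(x)
print(scores.shape)            # expected: torch.Size([8, 4]), one score per output
```
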
## Documentation

- Full API reference (parameters, references, architecture figure):
  <https://braindecode.org/stable/generated/braindecode.models.PBT.html>
- Interactive browser with live instantiation:
  <https://huggingface.co/spaces/braindecode/model-explorer>
- Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/patchedtransformer.py#L17>

## Architecture description

The block below is the rendered class docstring (parameters,
references, architecture figure where available).

<div class='bd-doc'><main>
<p>Patched Brain Transformer (PBT) model from Klein et al. (2025) [pbt]_.</p>
<span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#d9534f;color:white;font-size:11px;font-weight:600;margin-right:4px;">Foundation Model</span>

This implementation is based on https://github.com/timonkl/PatchedBrainTransformer/

.. figure:: https://raw.githubusercontent.com/timonkl/PatchedBrainTransformer/refs/heads/main/PBT_sketch.png
   :align: center
   :alt: Patched Brain Transformer Architecture
   :width: 680px

PBT tokenizes EEG trials into per-channel patches, linearly projects each
patch to a model embedding dimension, prepends a classification token and
adds channel-aware positional embeddings. The token sequence is processed
by a Transformer encoder stack and classification is performed from the
classification token.

.. rubric:: Macro Components

- ``PBT.tokenization`` **(patch extraction)**

  *Operations.* The pre-processed EEG signal :math:`X \in \mathbb{R}^{C \times T}`
  (with :math:`C = \text{n_chans}` and :math:`T = \text{n_times}`) is divided into
  non-overlapping patches of size :math:`d_{\text{input}}` along the time axis.
  This process yields :math:`N` total patches, calculated as
  :math:`N = C \left\lfloor \frac{T}{D} \right\rfloor` (where :math:`D = d_{\text{input}}`).
  When time shifts are applied, :math:`N` decreases to
  :math:`N = C \left\lfloor \frac{T - T_{\text{aug}}}{D} \right\rfloor`.

  *Role.* Tokenizes EEG trials into fixed-size, per-channel patches so the model
  remains adaptive to different numbers of channels and recording lengths. The
  process is inspired by Vision Transformers [visualtransformer]_ and adapted
  for the GPT context from [efficient-batchpacking]_. An end-to-end sketch of
  all macro components follows this list.

- ``PBT.patch_projection`` **(patch embedding)**

  *Operations.* The linear layer ``PBT.patch_projection`` maps the tokens from dimension
  :math:`d_{\text{input}}` to the Transformer embedding dimension :math:`d_{\text{model}}`.
  Patches :math:`X_P` are projected as :math:`X_E = X_P W_E^\top`, where
  :math:`W_E \in \mathbb{R}^{d_{\text{model}} \times D}`. In this configuration
  :math:`d_{\text{model}} = 2D` with :math:`D = d_{\text{input}}`.

  *Interpretability.* Learns periodic structures similar to frequency filters in
  the first convolutional layers of CNNs (for example :class:`~braindecode.models.EEGNet`).
  The learned filters frequently focus on the high-frequency range (20-40 Hz),
  which correlates with beta and gamma waves linked to higher concentration levels.

- ``PBT.cls_token`` **(classification token)**

  *Operations.* A classification token :math:`[c_{\text{ls}}] \in \mathbb{R}^{1 \times d_{\text{model}}}`
  is prepended to the projected patch sequence :math:`X_E`. The CLS token can optionally
  be learnable (see ``learnable_cls``).

  *Role.* Acts as a dedicated readout token that aggregates information through the
  Transformer encoder stack.

- ``PBT.pos_embedding`` **(positional embedding)**

  *Operations.* Positional indices are generated by ``PBT.linear_projection``, an instance
  of :class:`~braindecode.models.patchedtransformer._ChannelEncoding`, and mapped to vectors
  through :class:`~torch.nn.Embedding`. The embedding table
  :math:`W_{\text{pos}} \in \mathbb{R}^{(N+1) \times d_{\text{model}}}` is added to the token
  sequence, yielding :math:`X_{\text{pos}} = [c_{\text{ls}}, X_E] + W_{\text{pos}}`.

  *Role/Interpretability.* Introduces spatial and temporal dependence to counter the
  position invariance of the Transformer encoder. The learned positional embedding
  exposes spatial relationships, often revealing a symmetric pattern in central regions
  (C1-C6) associated with the motor cortex.

- ``PBT.transformer_encoder`` **(sequence processing and attention)**

  *Operations.* The token sequence passes through :math:`n_{\text{blocks}}` Transformer
  encoder layers. Each block combines a Multi-Head Self-Attention (MHSA) module with
  ``num_heads`` attention heads and a Feed-Forward Network (FFN). Both MHSA
  and FFN use parallel residual connections with Layer Normalization inside the blocks
  and apply dropout (``drop_prob``) within the Transformer components.

  *Role/Robustness.* Self-attention enables every token to consider all others, capturing
  global temporal and spatial dependencies immediately and adaptively. This architecture
  accommodates arbitrary numbers of patches and channels, supporting pre-training across
  diverse datasets.

- ``PBT.final_layer`` **(readout)**

  *Operations.* A linear layer operates on the processed CLS token only, and the model
  predicts class probabilities as :math:`y = \operatorname{softmax}([c_{\text{ls}}] W_{\text{class}}^\top + b_{\text{class}})`.

  *Role.* Performs the final classification from the information aggregated into the CLS
  token after the Transformer encoder stack.

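The following is a minimal, self-contained sketch of the macro components
listed above, written in plain PyTorch for illustration only. It is **not**
the braindecode implementation: variable names such as ``patch_projection``
merely mirror the prose, the positional embedding is simplified to a plain
index-based lookup rather than the channel-aware ``_ChannelEncoding``, and
all sizes are example values.

.. code::

    import torch
    import torch.nn as nn

    C, T = 22, 1000            # n_chans, n_times
    D = 64                     # d_input: patch size in samples
    d_model = 2 * D            # the "d_model = 2D" configuration above

    # 1) Tokenization: non-overlapping per-channel patches, N = C * floor(T / D)
    x = torch.randn(8, C, T)                         # (batch, channels, time)
    n_per_chan = T // D
    x = x[:, :, : n_per_chan * D]                    # drop the incomplete tail
    patches = x.reshape(8, C * n_per_chan, D)        # (batch, N, D)
    N = patches.shape[1]

    # 2) Patch embedding: linear projection from D to d_model
    patch_projection = nn.Linear(D, d_model)
    tokens = patch_projection(patches)               # (batch, N, d_model)

    # 3) CLS token and (simplified) positional embedding
    cls_token = torch.zeros(1, 1, d_model)
    pos_embedding = nn.Embedding(N + 1, d_model)
    tokens = torch.cat([cls_token.expand(8, -1, -1), tokens], dim=1)
    tokens = tokens + pos_embedding(torch.arange(N + 1))

    # 4) Transformer encoder stack
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=8, dropout=0.1, batch_first=True
    )
    encoder = nn.TransformerEncoder(layer, num_layers=4)
    tokens = encoder(tokens)

    # 5) Readout from the CLS token only; softmax of these logits gives class probabilities
    final_layer = nn.Linear(d_model, 4)              # 4 = n_outputs
    logits = final_layer(tokens[:, 0])               # (batch, n_outputs)
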
.. rubric:: Convolutional Details

PBT omits convolutional layers; equivalent feature extraction is carried out by the patch
pipeline and attention stack.

* **Temporal.** Tokenization slices the EEG into fixed windows of size :math:`D = d_{\text{input}}`
  (for the default configuration, :math:`D=64` samples :math:`\approx 0.256\,\text{s}` at
  :math:`250\,\text{Hz}`), while ``PBT.patch_projection`` learns periodic patterns within each
  patch. The Transformer encoder then models long- and short-range temporal dependencies through
  self-attention.

* **Spatial.** Patches are channel-specific, keeping the architecture adaptive to any electrode
  montage. Channel-aware positional encodings :math:`W_{\text{pos}}` capture relationships between
  nearby sensors; learned embeddings often form symmetric motifs across motor cortex electrodes
  (C1–C6), and self-attention propagates information across all channels jointly.

* **Spectral.** ``PBT.patch_projection`` acts similarly to the first convolutional layer in
  :class:`~braindecode.models.EEGNet`, learning frequency-selective filters without an explicit
  Fourier transform. The highest-energy filters typically reside between :math:`20` and
  :math:`40\,\text{Hz}`, aligning with beta/gamma rhythms tied to focused motor imagery.

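To illustrate the spectral point above, one way to inspect frequency selectivity
is to look at the learned patch-projection weights in the Fourier domain. This is
a hypothetical sketch: how the trained weight matrix is accessed depends on the
actual module layout, so a random stand-in tensor is used here.

.. code::

    import torch

    sfreq, d_input = 250, 64
    weights = torch.randn(128, d_input)                 # stand-in for trained (d_model, d_input) weights
    spectrum = torch.fft.rfft(weights, dim=-1).abs()    # magnitude spectrum of each filter
    freqs = torch.fft.rfftfreq(d_input, d=1.0 / sfreq)  # frequency axis in Hz
    peak_freqs = freqs[spectrum.argmax(dim=-1)]         # dominant frequency per filter
    # For a trained PBT, many of these peaks reportedly fall in the 20-40 Hz band.
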
.. rubric:: Attention / Sequential Modules

* **Attention Details.** ``PBT.transformer_encoder`` stacks :math:`n_{\text{blocks}}` Transformer
  encoder layers with Multi-Head Self-Attention. Every token attends to all others, enabling
  immediate global integration across time and channels and supporting heterogeneous datasets.
  Attention rollout visualisations highlight strong activations over motor cortex electrodes
  (C3, C4, Cz) during motor imagery decoding.

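The attention rollout mentioned above can be computed from per-layer attention
maps. The sketch below only shows the rollout arithmetic (following Abnar &
Zuidema, 2020); it does not show how to extract attention maps from PBT, so
random stand-ins are used.

.. code::

    import torch

    n_layers, n_tokens = 4, 331                  # e.g. 1 CLS token + N patch tokens
    attn_maps = [torch.softmax(torch.randn(n_tokens, n_tokens), dim=-1)
                 for _ in range(n_layers)]       # stand-ins for head-averaged attention

    rollout = torch.eye(n_tokens)
    for attn in attn_maps:
        attn = 0.5 * attn + 0.5 * torch.eye(n_tokens)  # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)   # re-normalise rows
        rollout = attn @ rollout

    cls_relevance = rollout[0, 1:]   # contribution of each patch token to the CLS token
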
.. warning::

    **Important:** Like the other foundation models in Braindecode, :class:`PBT` is
    designed for large-scale pre-training and fine-tuning. Training from
    scratch on small datasets may lead to suboptimal results. Cross-dataset
    pre-training and subsequent fine-tuning are recommended to leverage the
    full potential of this architecture.

Parameters
----------
d_input : int, optional
    Size (in samples) of each patch (token) extracted along the time axis.
embed_dim : int, optional
    Transformer embedding dimensionality.
num_layers : int, optional
    Number of Transformer encoder layers.
num_heads : int, optional
    Number of attention heads.
drop_prob : float, optional
    Dropout probability used in Transformer components.
learnable_cls : bool, optional
    Whether the classification token is learnable.
bias_transformer : bool, optional
    Whether to use bias in Transformer linear layers.
activation : nn.Module, optional
    Activation function class to use in Transformer feed-forward layers.

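As a sketch of how the documented parameters map onto a constructor call, the
snippet below combines the signal-shape arguments used elsewhere in this card
with the architecture arguments listed above. The values are illustrative only;
consult the braindecode API reference for the actual defaults and constraints.

.. code::

    from braindecode.models import PBT

    model = PBT(
        n_chans=22,          # signal shape
        n_outputs=4,
        n_times=1000,
        d_input=64,          # patch size in samples
        embed_dim=128,       # d_model = 2 * d_input in the configuration above
        num_layers=4,        # Transformer encoder depth
        num_heads=8,
        drop_prob=0.1,
        learnable_cls=True,
    )
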
References
----------
.. [pbt] Klein, T., Minakowski, P., & Sager, S. (2025).
   Flexible Patched Brain Transformer model for EEG decoding.
   Scientific Reports, 15(1), 1-12.
   https://www.nature.com/articles/s41598-025-86294-3
.. [visualtransformer] Dosovitskiy, A., Beyer, L., Kolesnikov, A.,
   Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
   Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby,
   N. (2021). An Image is Worth 16x16 Words: Transformers for Image
   Recognition at Scale. International Conference on Learning
   Representations (ICLR).
.. [efficient-batchpacking] Krell, M. M., Kosec, M., Perez, S. P., &
   Fitzgibbon, A. (2021). Efficient sequence packing without
   cross-contamination: Accelerating large language models without
   impacting performance. arXiv preprint arXiv:2107.02027.

.. rubric:: Hugging Face Hub integration

When the optional ``huggingface_hub`` package is installed, all models
automatically gain the ability to be pushed to and loaded from the
Hugging Face Hub. Install with::

    pip install braindecode[hub]

**Pushing a model to the Hub:**

.. code::

    from braindecode.models import PBT

    # Train your model
    model = PBT(n_chans=22, n_outputs=4, n_times=1000)
    # ... training code ...

    # Push to the Hub
    model.push_to_hub(
        repo_id="username/my-pbt-model",
        commit_message="Initial model upload",
    )

**Loading a model from the Hub:**

.. code::

    from braindecode.models import PBT

    # Load pretrained model
    model = PBT.from_pretrained("username/my-pbt-model")

    # Load with a different number of outputs (head is rebuilt automatically)
    model = PBT.from_pretrained("username/my-pbt-model", n_outputs=4)

**Extracting features and replacing the head:**

.. code::

    import torch

    x = torch.randn(1, model.n_chans, model.n_times)
    # Extract encoder features (consistent dict across all models)
    out = model(x, return_features=True)
    features = out["features"]

    # Replace the classification head
    model.reset_head(n_outputs=10)

**Saving and restoring full configuration:**

.. code::

    import json

    config = model.get_config()  # all __init__ params
    with open("config.json", "w") as f:
        json.dump(config, f)

    model2 = PBT.from_config(config)  # reconstruct (no weights)

All model parameters (both EEG-specific and model-specific, such as
dropout rates, activation functions, and number of filters) are automatically
saved to the Hub and restored when loading.

See :ref:`load-pretrained-models` for a complete tutorial.
</main>
</div>

## Citation

Please cite both the original paper for this architecture (see the
*References* section above) and braindecode:

```bibtex
@article{aristimunha2025braindecode,
  title   = {Braindecode: a deep learning library for raw electrophysiological data},
  author  = {Aristimunha, Bruno and others},
  journal = {Zenodo},
  year    = {2025},
  doi     = {10.5281/zenodo.17699192},
}
```

## License

BSD-3-Clause for the model code (matching braindecode).
Weights derived from pre-training, if you fine-tune from a checkpoint,
inherit the license of that checkpoint and its training corpus.