@@ -10,18 +10,16 @@ tags:
   - braindecode
   - foundation-model
   - convolutional
-  - transformer
 ---
 # BENDR
-BENDR (BErt-inspired Neural Data Representations) from Kostas et al. (2021).
-> **Architecture-only repository.** This repo documents the
 > `braindecode.models.BENDR` class. **No pretrained weights are
-> distributed here** — instantiate the model and train it on your own
-> data, or fine-tune from a published foundation-model checkpoint
-> separately.
 ## Quick start
@@ -40,277 +38,52 @@ model = BENDR(
 )
 ```
-The signal-shape arguments above are example defaults — adjust them
-to match your recording.
 ## Documentation
-- Full API reference (parameters, references, architecture figure):
-  <https://braindecode.org/stable/generated/braindecode.models.BENDR.html>
-- Interactive browser with live instantiation:
   <https://huggingface.co/spaces/braindecode/model-explorer>
 - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/bendr.py#L60>
-## Architecture description
-The block below is the rendered class docstring (parameters,
-references, architecture figure where available).
-<div class='bd-doc'><main>
-<p>BENDR (BErt-inspired Neural Data Representations) from Kostas et al (2021) [bendr]_.</p>
-<span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#5cb85c;color:white;font-size:11px;font-weight:600;margin-right:4px;">Convolution</span><span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#d9534f;color:white;font-size:11px;font-weight:600;margin-right:4px;">Foundation Model</span>
- .. figure:: https://www.frontiersin.org/files/Articles/653659/fnhum-15-653659-HTML/image_m/fnhum-15-653659-g001.jpg
-     :align: center
-     :alt: BENDR Architecture
-     :width: 1000px
- The **BENDR** architecture adapts techniques used for language modeling (LM) toward the
- development of encephalography modeling (EM) [bendr]_. It utilizes a self-supervised
- training objective to learn compressed representations of raw EEG signals [bendr]_. The
- model is capable of modeling completely novel raw EEG sequences recorded with differing
- hardware and subjects, aiming for transferable performance across a variety of downstream
- BCI and EEG classification tasks [bendr]_.
- .. rubric:: Architectural Overview
- BENDR is adapted from wav2vec 2.0 [wav2vec2]_ and is composed of two main stages: a
- feature extractor (Convolutional stage) that produces BErt-inspired Neural Data
- Representations (BENDR), followed by a transformer encoder (Contextualizer) [bendr]_.
- .. rubric:: Macro Components
- - `BENDR.encoder` **(Convolutional Stage/Feature Extractor)**
-     - *Operations.* A stack of six short-receptive field 1D convolutions [bendr]_. Each
-       block consists of 1D convolution, GroupNorm, and GELU activation.
-     - *Role.* Takes raw data :math:`X_{raw}` and dramatically downsamples it to a new
-       sequence of vectors (BENDR) [bendr]_. Each resulting vector has a length of 512.
- - `BENDR.contextualizer` **(Transformer Encoder)**
-     - *Operations.* A transformer encoder that uses layered, multi-head self-attention
-       [bendr]_. It employs T-Fixup weight initialization [tfixup]_ and uses 8 layers
-       and 8 heads.
-     - *Role.* Maps the sequence of BENDR vectors to a contextualized sequence. The output
-       of a fixed start token is typically used as the aggregate representation for
-       downstream classification [bendr]_.
- - `Contextualizer.position_encoder` **(Positional Encoding)**
-     - *Operations.* An additive (grouped) convolution layer with a receptive field of 25
-       and 16 groups [bendr]_.
-     - *Role.* Encodes position information before the input enters the transformer.
- .. rubric:: How the information is encoded temporally, spatially, and spectrally
- * **Temporal.**
-   The convolutional encoder uses a stack of blocks where the stride matches the receptive
-   field (e.g., 3 for the first block, 2 for subsequent blocks) [bendr]_. This process
-   downsamples the raw data by a factor of 96, resulting in an effective sampling frequency
-   of approximately 2.67 Hz.
- * **Spatial.**
-   To maintain simplicity and reduce complexity, the convolutional stage uses **1D
-   convolutions** and elects not to mix EEG channels across the first stage [bendr]_. The
-   input includes 20 channels (19 EEG channels and one relative amplitude channel).
- * **Spectral.**
-   The convolution operations implicitly extract features from the raw EEG signal [bendr]_.
-   The representations (BENDR) are derived from the raw waveform using convolutional
-   operations followed by sequence modeling [wav2vec2]_.
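The temporal arithmetic above can be checked with a few lines of Python. This is only a sketch: the strides are the documented defaults, while the 256 Hz input rate is an assumption chosen because it reproduces the quoted 2.67 Hz figure, and the helper names are hypothetical rather than part of braindecode.

```python
import math

# Default encoder strides as documented for `enc_downsample`.
ENC_DOWNSAMPLE = (3, 2, 2, 2, 2, 2)


def downsampling_factor(strides):
    """Total temporal downsampling: the product of all block strides."""
    return math.prod(strides)


def effective_sampling_rate(sfreq_hz, strides):
    """Rate of the encoded sequence for an input sampled at `sfreq_hz`."""
    return sfreq_hz / downsampling_factor(strides)


factor = downsampling_factor(ENC_DOWNSAMPLE)
# Assumption: 256 Hz input, which matches the ~2.67 Hz quoted above.
rate = effective_sampling_rate(256.0, ENC_DOWNSAMPLE)
print(factor, round(rate, 2))  # 96 2.67
```

With these defaults each encoded vector summarizes roughly 96 raw samples, so a recording at a different sampling rate yields a proportionally different effective rate.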
- .. rubric:: Additional Mechanisms
- - **Self-Supervision (Pre-training).** Uses a masked sequence learning approach (adapted
-   from wav2vec 2.0 [wav2vec2]_) where contiguous spans of BENDR sequences are masked, and
-   the model attempts to reconstruct the original underlying encoded vector based on the
-   transformer output and a set of negative distractors [bendr]_.
- - **Regularization.** LayerDrop [layerdrop]_ and Dropout (at probabilities 0.01 and 0.15,
-   respectively) are used during pre-training [bendr]_. The implementation also uses T-Fixup
-   scaling for parameter initialization [tfixup]_.
- - **Input Conditioning.** A fixed token (a vector filled with the value **-5**) is
-   prepended to the BENDR sequence before input to the transformer, serving as the aggregate
-   representation token [bendr]_.
- .. important::
-    **Pre-trained Weights Available**
-    This model has pre-trained weights available on the Hugging Face Hub.
-    You can load them using:
-    .. code:: python
-        from braindecode.models import BENDR
-        # Load pre-trained model from Hugging Face Hub
-        # you can specify `n_outputs` for your downstream task
-        model = BENDR.from_pretrained("braindecode/braindecode-bendr", n_outputs=2)
-    To push your own trained model to the Hub:
-    .. code:: python
-        # After training your model
-        model.push_to_hub(
-            repo_id="username/my-bendr-model", commit_message="Upload trained BENDR model"
-        )
-    Requires installing ``braindecode[hug]`` for Hub integration.
- Notes
- -----
- * The full BENDR architecture contains a large number of parameters; configuration (1)
-   involved training over **one billion parameters** [bendr]_.
- * Randomly initialized full BENDR architecture was generally ineffective at solving
-   downstream tasks without prior self-supervised training [bendr]_.
- * The pre-training task (contrastive predictive coding via masking) is generalizable,
-   exhibiting strong uniformity of performance across novel subjects, hardware, and
-   tasks [bendr]_.
- .. warning::
-     **Important:** To utilize the full potential of BENDR, the model requires
-     **self-supervised pre-training** on large, unlabeled EEG datasets (like TUEG) followed
-     by subsequent fine-tuning on the specific downstream classification task [bendr]_.
- References
- ----------
- .. [bendr] Kostas, D., Aroca-Ouellette, S., & Rudzicz, F. (2021).
-    BENDR: Using transformers and a contrastive self-supervised learning task to learn from
-    massive amounts of EEG data.
-    Frontiers in Human Neuroscience, 15, 653659.
-    https://doi.org/10.3389/fnhum.2021.653659
- .. [wav2vec2] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020).
-    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
-    In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds),
-    Advances in Neural Information Processing Systems (Vol. 33, pp. 12449-12460).
-    https://dl.acm.org/doi/10.5555/3495724.3496768
- .. [tfixup] Huang, T. K., Liang, S., Jha, A., & Salakhutdinov, R. (2020).
-    Improving Transformer Optimization Through Better Initialization.
-    In International Conference on Machine Learning (pp. 4475-4483). PMLR.
-    https://dl.acm.org/doi/10.5555/3524938.3525354
- .. [layerdrop] Fan, A., Grave, E., & Joulin, A. (2020).
-    Reducing Transformer Depth on Demand with Structured Dropout.
-    International Conference on Learning Representations.
-    Retrieved from https://openreview.net/forum?id=SylO2yStDr
- Parameters
- ----------
- encoder_h : int, default=512
-     Hidden size (number of output channels) of the convolutional encoder. This determines
-     the dimensionality of the BENDR feature vectors produced by the encoder.
- contextualizer_hidden : int, default=3076
-     Hidden size of the feedforward layer within each transformer block. The paper uses
-     approximately 2x the transformer dimension (3076 ~ 2 x 1536).
- projection_head : bool, default=False
-     If True, adds a projection layer at the end of the encoder to project back to the
-     input feature size. This is used during self-supervised pre-training but typically
-     disabled during fine-tuning.
- drop_prob : float, default=0.1
-     Dropout probability applied throughout the model. The paper recommends 0.15 for
-     pre-training and 0.0 for fine-tuning. Default is 0.1 as a compromise.
- layer_drop : float, default=0.0
-     Probability of dropping entire transformer layers during training (LayerDrop
-     regularization [layerdrop]_). The paper uses 0.01 for pre-training and 0.0 for
-     fine-tuning.
- activation : :class:`torch.nn.Module`, default=:class:`torch.nn.GELU`
-     Activation function used in the encoder convolutional blocks. The paper uses GELU
-     activation throughout.
- transformer_layers : int, default=8
-     Number of transformer encoder layers in the contextualizer. The paper uses 8 layers.
- transformer_heads : int, default=8
-     Number of attention heads in each transformer layer. The paper uses 8 heads with
-     head dimension of 192 (1536 / 8).
- position_encoder_length : int, default=25
-     Kernel size for the convolutional positional encoding layer. The paper uses a
-     receptive field of 25 with 16 groups.
- enc_width : tuple of int, default=(3, 2, 2, 2, 2, 2)
-     Kernel sizes for each of the 6 convolutional blocks in the encoder. Each value
-     corresponds to one block.
- enc_downsample : tuple of int, default=(3, 2, 2, 2, 2, 2)
-     Stride values for each of the 6 convolutional blocks in the encoder. The total
-     downsampling factor is the product of all strides (3 x 2 x 2 x 2 x 2 x 2 = 96).
- start_token : int or float, default=-5
-     Value used to fill the start token embedding that is prepended to the BENDR sequence
-     before input to the transformer. This token's output is used as the aggregate
-     representation for classification.
- final_layer : bool, default=True
-     If True, includes a final linear classification layer that maps from encoder_h to
-     n_outputs. If False, the model outputs the contextualized features directly.
- encoder_only : bool, default=False
-     If True, bypass the contextualizer and use 4-chunk temporal pooling on the encoder
-     output instead. This corresponds to the encoder-only configuration described in
-     Section 2.4 and Table 2 of Kostas et al. (2021) [bendr]_, which outperformed the
-     full model on 4 out of 5 downstream tasks. The encoder output is split into 4 equal
-     temporal chunks, each chunk is mean-pooled, and the results are concatenated to
-     produce a feature vector of size ``encoder_h * 4`` (2048-dim with default settings).
-     The contextualizer is still created (to allow loading pretrained weights) but is not
-     used in the forward pass. Requires input length of at least
-     ``4 * product(enc_downsample)`` samples (384 with default downsampling of 96x).
- .. rubric:: Hugging Face Hub integration
- When the optional ``huggingface_hub`` package is installed, all models
- automatically gain the ability to be pushed to and loaded from the
- Hugging Face Hub. Install with::
-     pip install braindecode[hub]
- **Pushing a model to the Hub:**
- .. code::
-     from braindecode.models import BENDR
-     # Train your model
-     model = BENDR(n_chans=22, n_outputs=4, n_times=1000)
-     # ... training code ...
-     # Push to the Hub
-     model.push_to_hub(
-         repo_id="username/my-bendr-model",
-         commit_message="Initial model upload",
-     )
- **Loading a model from the Hub:**
- .. code::
-     from braindecode.models import BENDR
-     # Load pretrained model
-     model = BENDR.from_pretrained("username/my-bendr-model")
-     # Load with a different number of outputs (head is rebuilt automatically)
-     model = BENDR.from_pretrained("username/my-bendr-model", n_outputs=4)
- **Extracting features and replacing the head:**
- .. code::
-     import torch
-     x = torch.randn(1, model.n_chans, model.n_times)
-     # Extract encoder features (consistent dict across all models)
-     out = model(x, return_features=True)
-     features = out["features"]
-     # Replace the classification head
-     model.reset_head(n_outputs=10)
- **Saving and restoring full configuration:**
- .. code::
-     import json
-     config = model.get_config()            # all __init__ params
-     with open("config.json", "w") as f:
-         json.dump(config, f)
-     model2 = BENDR.from_config(config)    # reconstruct (no weights)
- All model parameters (both EEG-specific and model-specific such as
- dropout rates, activation functions, number of filters) are automatically
- saved to the Hub and restored when loading.
- See :ref:`load-pretrained-models` for a complete tutorial.</main>
-</div>
 ## Citation
-Please cite both the original paper for this architecture (see the
-*References* section above) and braindecode:
 ```bibtex
 @article{aristimunha2025braindecode,

   - braindecode
   - foundation-model
   - convolutional
 ---
 # BENDR
+BENDR (BErt-inspired Neural Data Representations) from Kostas et al. (2021) [1].
+> **Architecture-only repository.** Documents the
 > `braindecode.models.BENDR` class. **No pretrained weights are
+> distributed here.** Instantiate the model and train it on your own
+> data.
 ## Quick start
 )
 ```
+The signal-shape arguments above are illustrative defaults — adjust to
+match your recording.
 ## Documentation
+- Full API reference: <https://braindecode.org/stable/generated/braindecode.models.BENDR.html>
+- Interactive browser (live instantiation, parameter counts):
   <https://huggingface.co/spaces/braindecode/model-explorer>
 - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/bendr.py#L60>
+## Architecture
+![BENDR architecture](https://www.frontiersin.org/files/Articles/653659/fnhum-15-653659-HTML/image_m/fnhum-15-653659-g001.jpg)
+## Parameters
+| Parameter | Type | Description |
+|---|---|---|
+| `encoder_h` | int, default=512 | Hidden size (number of output channels) of the convolutional encoder. This determines the dimensionality of the BENDR feature vectors produced by the encoder. |
+| `contextualizer_hidden` | int, default=3076 | Hidden size of the feedforward layer within each transformer block. The paper uses approximately 2x the transformer dimension (3076 ~ 2 x 1536). |
+| `projection_head` | bool, default=False | If True, adds a projection layer at the end of the encoder to project back to the input feature size. This is used during self-supervised pre-training but typically disabled during fine-tuning. |
+| `drop_prob` | float, default=0.1 | Dropout probability applied throughout the model. The paper recommends 0.15 for pre-training and 0.0 for fine-tuning. Default is 0.1 as a compromise. |
+| `layer_drop` | float, default=0.0 | Probability of dropping entire transformer layers during training (LayerDrop regularization [4]). The paper uses 0.01 for pre-training and 0.0 for fine-tuning. |
+| `activation` | `torch.nn.Module`, default=`torch.nn.GELU` | Activation function used in the encoder convolutional blocks. The paper uses GELU activation throughout. |
+| `transformer_layers` | int, default=8 | Number of transformer encoder layers in the contextualizer. The paper uses 8 layers. |
+| `transformer_heads` | int, default=8 | Number of attention heads in each transformer layer. The paper uses 8 heads with head dimension of 192 (1536 / 8). |
+| `position_encoder_length` | int, default=25 | Kernel size for the convolutional positional encoding layer. The paper uses a receptive field of 25 with 16 groups. |
+| `enc_width` | tuple of int, default=(3, 2, 2, 2, 2, 2) | Kernel sizes for each of the 6 convolutional blocks in the encoder. Each value corresponds to one block. |
+| `enc_downsample` | tuple of int, default=(3, 2, 2, 2, 2, 2) | Stride values for each of the 6 convolutional blocks in the encoder. The total downsampling factor is the product of all strides (3 x 2 x 2 x 2 x 2 x 2 = 96). |
+| `start_token` | int or float, default=-5 | Value used to fill the start token embedding that is prepended to the BENDR sequence before input to the transformer. This token's output is used as the aggregate representation for classification. |
+| `final_layer` | bool, default=True | If True, includes a final linear classification layer that maps from encoder_h to n_outputs. If False, the model outputs the contextualized features directly. |
+| `encoder_only` | bool, default=False | If True, bypass the contextualizer and use 4-chunk temporal pooling on the encoder output instead. This corresponds to the encoder-only configuration described in Section 2.4 and Table 2 of Kostas et al. (2021) [1], which outperformed the full model on 4 out of 5 downstream tasks. The encoder output is split into 4 equal temporal chunks, each chunk is mean-pooled, and the results are concatenated to produce a feature vector of size `encoder_h * 4` (2048-dim with default settings). The contextualizer is still created (to allow loading pretrained weights) but is not used in the forward pass. Requires input length of at least `4 * product(enc_downsample)` samples (384 with default downsampling of 96x). |
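The `encoder_only` pooling described in the table can be illustrated with a dependency-free sketch. It is not the braindecode implementation: the function names are hypothetical, plain lists stand in for tensors, and both the chunk-major concatenation order and the dropping of leftover time steps are assumptions.

```python
import math
from statistics import fmean

ENC_DOWNSAMPLE = (3, 2, 2, 2, 2, 2)  # default strides from the table above


def min_input_length(strides, n_chunks=4):
    """Minimum input length from the table: n_chunks * product of strides."""
    return n_chunks * math.prod(strides)


def chunk_mean_pool(encoded, n_chunks=4):
    """Mean-pool each temporal chunk and concatenate the results.

    `encoded` holds `encoder_h` rows, each a list over encoder time steps.
    Assumption: steps that do not fill an equal chunk are dropped.
    """
    seq_len = len(encoded[0])
    if seq_len < n_chunks:
        raise ValueError("need at least one encoder step per chunk")
    chunk = seq_len // n_chunks
    pooled = []
    for k in range(n_chunks):
        for row in encoded:
            pooled.append(fmean(row[k * chunk:(k + 1) * chunk]))
    return pooled


# Toy encoder output: 2 feature rows, 8 time steps.
toy = [[0, 1, 2, 3, 4, 5, 6, 7], [10, 11, 12, 13, 14, 15, 16, 17]]
print(min_input_length(ENC_DOWNSAMPLE))  # 384
print(chunk_mean_pool(toy))  # [0.5, 10.5, 2.5, 12.5, 4.5, 14.5, 6.5, 16.5]
```

The output length is always `n_chunks` times the number of feature rows, which is where the 2048-dimensional vector for `encoder_h=512` comes from.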
+## References
+1. Kostas, D., Aroca-Ouellette, S., & Rudzicz, F. (2021). BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15, 653659. https://doi.org/10.3389/fnhum.2021.653659
+2. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds), Advances in Neural Information Processing Systems (Vol. 33, pp. 12449-12460). https://dl.acm.org/doi/10.5555/3495724.3496768
+3. Huang, T. K., Liang, S., Jha, A., & Salakhutdinov, R. (2020). Improving Transformer Optimization Through Better Initialization. In International Conference on Machine Learning (pp. 4475-4483). PMLR. https://dl.acm.org/doi/10.5555/3524938.3525354
+4. Fan, A., Grave, E., & Joulin, A. (2020). Reducing Transformer Depth on Demand with Structured Dropout. International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=SylO2yStDr
 ## Citation
+Cite the original architecture paper (see *References* above) and braindecode:
 ```bibtex
 @article{aristimunha2025braindecode,