+---
+license: bsd-3-clause
+library_name: braindecode
+pipeline_tag: feature-extraction
+tags:
+  - eeg
+  - biosignal
+  - pytorch
+  - neuroscience
+  - braindecode
+  - foundation-model
+  - convolutional
+  - transformer
+---
+# BENDR
+BENDR (BErt-inspired Neural Data Representations) from Kostas et al. (2021).
+> **Architecture-only repository.** This repo documents the
+> `braindecode.models.BENDR` class. **No pretrained weights are
+> distributed here** — instantiate the model and train it on your own
+> data, or fine-tune from a published foundation-model checkpoint
+> separately.
+## Quick start
+```bash
+pip install braindecode
+```
+```python
+from braindecode.models import BENDR
+model = BENDR(
+    n_chans=20,
+    sfreq=256,
+    input_window_seconds=4.0,
+    n_outputs=2,
+)
+```
+The signal-shape arguments above are example defaults — adjust them
+to match your recording.
+## Documentation
+- Full API reference (parameters, references, architecture figure):
+  <https://braindecode.org/stable/generated/braindecode.models.BENDR.html>
+- Interactive browser with live instantiation:
+  <https://huggingface.co/spaces/braindecode/model-explorer>
+- Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/bendr.py#L60>
+## Architecture description
+The block below is the rendered class docstring (parameters,
+references, architecture figure where available).
+<div class='bd-doc'><main>
+<p>BENDR (BErt-inspired Neural Data Representations) from Kostas et al. (2021) [bendr]_.</p>
+<span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#5cb85c;color:white;font-size:11px;font-weight:600;margin-right:4px;">Convolution</span><span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#d9534f;color:white;font-size:11px;font-weight:600;margin-right:4px;">Foundation Model</span>
+ .. figure:: https://www.frontiersin.org/files/Articles/653659/fnhum-15-653659-HTML/image_m/fnhum-15-653659-g001.jpg
+     :align: center
+     :alt: BENDR Architecture
+     :width: 1000px
+ The **BENDR** architecture adapts techniques used for language modeling (LM) toward the
+ development of encephalography modeling (EM) [bendr]_. It utilizes a self-supervised
+ training objective to learn compressed representations of raw EEG signals [bendr]_. The
+ model is capable of modeling completely novel raw EEG sequences recorded with differing
+ hardware and subjects, aiming for transferable performance across a variety of downstream
+ BCI and EEG classification tasks [bendr]_.
+ .. rubric:: Architectural Overview
+ BENDR is adapted from wav2vec 2.0 [wav2vec2]_ and is composed of two main stages: a
+ feature extractor (Convolutional stage) that produces BErt-inspired Neural Data
+ Representations (BENDR), followed by a transformer encoder (Contextualizer) [bendr]_.
+ .. rubric:: Macro Components
+ - `BENDR.encoder` **(Convolutional Stage/Feature Extractor)**
+     - *Operations.* A stack of six short-receptive field 1D convolutions [bendr]_. Each
+       block consists of 1D convolution, GroupNorm, and GELU activation.
+     - *Role.* Takes raw data :math:`X_{raw}` and dramatically downsamples it to a new
+       sequence of vectors (BENDR) [bendr]_. Each resulting vector has a length of 512.
+ - `BENDR.contextualizer` **(Transformer Encoder)**
+     - *Operations.* A transformer encoder that uses layered, multi-head self-attention
+       [bendr]_. It employs T-Fixup weight initialization [tfixup]_ and uses 8 layers
+       and 8 heads.
+     - *Role.* Maps the sequence of BENDR vectors to a contextualized sequence. The output
+       of a fixed start token is typically used as the aggregate representation for
+       downstream classification [bendr]_.
+ - `Contextualizer.position_encoder` **(Positional Encoding)**
+     - *Operations.* An additive (grouped) convolution layer with a receptive field of 25
+       and 16 groups [bendr]_.
+     - *Role.* Encodes position information before the input enters the transformer.
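The start-token readout described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names ``prepend_start_token`` and ``aggregate_representation`` are hypothetical and not part of braindecode, and an identity function stands in for the transformer so the example runs without any dependencies.

```python
from typing import Callable, List

Vector = List[float]
VectorSequence = List[Vector]  # one vector per encoded time step


def prepend_start_token(sequence: VectorSequence, fill_value: float = -5.0) -> VectorSequence:
    """Return a new sequence with a constant-valued token at position 0."""
    if not sequence:
        raise ValueError("sequence must contain at least one vector")
    dim = len(sequence[0])
    return [[fill_value] * dim] + [list(vec) for vec in sequence]


def aggregate_representation(
    sequence: VectorSequence,
    contextualizer: Callable[[VectorSequence], VectorSequence],
    fill_value: float = -5.0,
) -> Vector:
    """Run the contextualizer and keep only the output at the start-token position."""
    contextualized = contextualizer(prepend_start_token(sequence, fill_value))
    return contextualized[0]


def identity(seq: VectorSequence) -> VectorSequence:
    """Hypothetical stand-in for the transformer; returns its input unchanged."""
    return seq


# Toy input: 3 encoded time steps with 4 features each (real BENDR vectors have 512).
toy_sequence = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]

with_token = prepend_start_token(toy_sequence)
summary = aggregate_representation(toy_sequence, identity)
print(len(with_token), summary)
```

With a real contextualizer, the vector at position 0 would mix in information from every time step through self-attention, which is why it can serve as a single summary for classification.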
+ .. rubric:: How the information is encoded temporally, spatially, and spectrally
+ * **Temporal.**
+   The convolutional encoder uses a stack of blocks where the stride matches the receptive
+   field (e.g., 3 for the first block, 2 for subsequent blocks) [bendr]_. This process
+   downsamples the raw data by a factor of 96 (3 x 2 x 2 x 2 x 2 x 2), so a recording
+   sampled at 256 Hz yields an effective sampling frequency of approximately 2.67 Hz.
+ * **Spatial.**
+   To maintain simplicity and reduce complexity, the convolutional stage uses **1D
+   convolutions** and elects not to mix EEG channels across the first stage [bendr]_. The
+   input includes 20 channels (19 EEG channels and one relative amplitude channel).
+ * **Spectral.**
+   The convolution operations implicitly extract features from the raw EEG signal [bendr]_.
+   The representations (BENDR) are derived from the raw waveform using convolutional
+   operations followed by sequence modeling [wav2vec2]_.
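The temporal arithmetic above can be checked with a few lines of Python. The sketch assumes the default strides, a 256 Hz recording, and a 4-second window (the Quick start values). The step count is only approximate, because the exact output length depends on how each convolution pads its input.

```python
import math

enc_downsample = (3, 2, 2, 2, 2, 2)  # stride of each encoder block
sfreq = 256.0                        # input sampling frequency in Hz
input_window_seconds = 4.0           # example window length

# Total downsampling is the product of the per-block strides.
total_factor = math.prod(enc_downsample)

# Each BENDR vector summarises roughly `total_factor` raw samples.
effective_sfreq = sfreq / total_factor

# Rough number of encoded steps; the exact value depends on convolution padding.
n_times = int(sfreq * input_window_seconds)
approx_steps = n_times // total_factor

print(total_factor, round(effective_sfreq, 2), n_times, approx_steps)
# prints: 96 2.67 1024 10
```

A 4-second window therefore reaches the transformer as a sequence of only about ten vectors, which keeps self-attention cheap even for long recordings.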
+ .. rubric:: Additional Mechanisms
+ - **Self-Supervision (Pre-training).** Uses a masked sequence learning approach (adapted
+   from wav2vec 2.0 [wav2vec2]_) where contiguous spans of BENDR sequences are masked, and
+   the model is trained to identify the original encoded vector at each masked position,
+   using the transformer output, from among a set of negative distractors [bendr]_.
+ - **Regularization.** LayerDrop [layerdrop]_ and Dropout (at probabilities 0.01 and 0.15,
+   respectively) are used during pre-training [bendr]_. The implementation also uses T-Fixup
+   scaling for parameter initialization [tfixup]_.
+ - **Input Conditioning.** A fixed token (a vector filled with the value **-5**) is
+   prepended to the BENDR sequence before input to the transformer, serving as the aggregate
+   representation token [bendr]_.
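To make the self-supervised objective more concrete, here is a minimal sketch of a contrastive loss for a single masked position, in the spirit of wav2vec 2.0. It is an assumption-laden illustration, not the braindecode or original BENDR training code: the function names are hypothetical, cosine similarity is assumed as the scoring function, and the temperature of 0.1 is an arbitrary illustrative value.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(
    context: Sequence[float],
    target: Sequence[float],
    distractors: List[Sequence[float]],
    temperature: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Cross-entropy of picking `target` (index 0) among target + distractors."""
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_norm - logits[0]


# Transformer output that matches the true encoded vector: low loss.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Transformer output that matches a distractor instead: high loss.
bad = contrastive_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
print(good < bad)
```

Here ``context`` plays the role of the transformer output at a masked position, ``target`` is the unmasked encoder vector for that position, and ``distractors`` are encoder vectors sampled from other positions.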
+ .. important::
+    **Pre-trained Weights Available**
+    This model has pre-trained weights available on the Hugging Face Hub.
+    You can load them using:
+    .. code:: python
+        from braindecode.models import BENDR
+        # Load pre-trained model from Hugging Face Hub
+        # you can specify `n_outputs` for your downstream task
+        model = BENDR.from_pretrained("braindecode/braindecode-bendr", n_outputs=2)
+    To push your own trained model to the Hub:
+    .. code:: python
+        # After training your model
+        model.push_to_hub(
+            repo_id="username/my-bendr-model", commit_message="Upload trained BENDR model"
+        )
+    Requires installing ``braindecode[hub]`` for Hub integration.
+ Notes
+ -----
+ * The full BENDR architecture contains a large number of parameters; configuration (1)
+   involved training over **one billion parameters** [bendr]_.
+ * A randomly initialized full BENDR architecture was generally ineffective at solving
+   downstream tasks without prior self-supervised training [bendr]_.
+ * The pre-training task (contrastive predictive coding via masking) is generalizable,
+   exhibiting strong uniformity of performance across novel subjects, hardware, and
+   tasks [bendr]_.
+ .. warning::
+     **Important:** To utilize the full potential of BENDR, the model requires
+     **self-supervised pre-training** on large, unlabeled EEG datasets (like TUEG) followed
+     by subsequent fine-tuning on the specific downstream classification task [bendr]_.
+ References
+ ----------
+ .. [bendr] Kostas, D., Aroca-Ouellette, S., & Rudzicz, F. (2021).
+    BENDR: Using transformers and a contrastive self-supervised learning task to learn from
+    massive amounts of EEG data.
+    Frontiers in Human Neuroscience, 15, 653659.
+    https://doi.org/10.3389/fnhum.2021.653659
+ .. [wav2vec2] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020).
+    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
+    In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.),
+    Advances in Neural Information Processing Systems (Vol. 33, pp. 12449-12460).
+    https://dl.acm.org/doi/10.5555/3495724.3496768
+ .. [tfixup] Huang, X. S., Perez, F., Ba, J., & Volkovs, M. (2020).
+    Improving Transformer Optimization Through Better Initialization.
+    In International Conference on Machine Learning (pp. 4475-4483). PMLR.
+    https://dl.acm.org/doi/10.5555/3524938.3525354
+ .. [layerdrop] Fan, A., Grave, E., & Joulin, A. (2020).
+    Reducing Transformer Depth on Demand with Structured Dropout.
+    International Conference on Learning Representations.
+    Retrieved from https://openreview.net/forum?id=SylO2yStDr
+ Parameters
+ ----------
+ encoder_h : int, default=512
+     Hidden size (number of output channels) of the convolutional encoder. This determines
+     the dimensionality of the BENDR feature vectors produced by the encoder.
+ contextualizer_hidden : int, default=3076
+     Hidden size of the feedforward layer within each transformer block. The paper uses
+     approximately 2x the transformer dimension (3076 ~ 2 x 1536 = 3072).
+ projection_head : bool, default=False
+     If True, adds a projection layer at the end of the encoder to project back to the
+     input feature size. This is used during self-supervised pre-training but typically
+     disabled during fine-tuning.
+ drop_prob : float, default=0.1
+     Dropout probability applied throughout the model. The paper recommends 0.15 for
+     pre-training and 0.0 for fine-tuning. Default is 0.1 as a compromise.
+ layer_drop : float, default=0.0
+     Probability of dropping entire transformer layers during training (LayerDrop
+     regularization [layerdrop]_). The paper uses 0.01 for pre-training and 0.0 for
+     fine-tuning.
+ activation : :class:`torch.nn.Module`, default=:class:`torch.nn.GELU`
+     Activation function used in the encoder convolutional blocks. The paper uses GELU
+     activation throughout.
+ transformer_layers : int, default=8
+     Number of transformer encoder layers in the contextualizer. The paper uses 8 layers.
+ transformer_heads : int, default=8
+     Number of attention heads in each transformer layer. The paper uses 8 heads with
+     head dimension of 192 (1536 / 8).
+ position_encoder_length : int, default=25
+     Kernel size for the convolutional positional encoding layer. The paper uses a
+     receptive field of 25 with 16 groups.
+ enc_width : tuple of int, default=(3, 2, 2, 2, 2, 2)
+     Kernel sizes for each of the 6 convolutional blocks in the encoder. Each value
+     corresponds to one block.
+ enc_downsample : tuple of int, default=(3, 2, 2, 2, 2, 2)
+     Stride values for each of the 6 convolutional blocks in the encoder. The total
+     downsampling factor is the product of all strides (3 x 2 x 2 x 2 x 2 x 2 = 96).
+ start_token : int or float, default=-5
+     Value used to fill the start token embedding that is prepended to the BENDR sequence
+     before input to the transformer. This token's output is used as the aggregate
+     representation for classification.
+ final_layer : bool, default=True
+     If True, includes a final linear classification layer that maps from encoder_h to
+     n_outputs. If False, the model outputs the contextualized features directly.
+ encoder_only : bool, default=False
+     If True, bypass the contextualizer and use 4-chunk temporal pooling on the encoder
+     output instead. This corresponds to the encoder-only configuration described in
+     Section 2.4 and Table 2 of Kostas et al. (2021) [bendr]_, which outperformed the
+     full model on 4 out of 5 downstream tasks. The encoder output is split into 4 equal
+     temporal chunks, each chunk is mean-pooled, and the results are concatenated to
+     produce a feature vector of size ``encoder_h * 4`` (2048-dim with default settings).
+     The contextualizer is still created (to allow loading pretrained weights) but is not
+     used in the forward pass. Requires input length of at least
+     ``4 * product(enc_downsample)`` samples (384 with default downsampling of 96x).
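The ``encoder_only`` pooling described above can be sketched as follows. This is a dependency-free illustration under assumptions, not the braindecode implementation: the function name ``chunked_mean_pool`` is hypothetical, features are laid out as ``[channel][time]``, chunk boundaries are spread evenly when the length is not divisible by 4, and the chunk-major concatenation order is a guess.

```python
import math
from typing import List


def chunked_mean_pool(features: List[List[float]], n_chunks: int = 4) -> List[float]:
    """Mean-pool each temporal chunk and concatenate the results.

    `features` is laid out as [channel][time]; the output holds
    len(features) * n_chunks values, ordered chunk by chunk.
    """
    n_steps = len(features[0])
    if n_steps < n_chunks:
        raise ValueError(f"need at least {n_chunks} time steps, got {n_steps}")
    # Assumption: integer boundaries spread as evenly as possible over time.
    bounds = [i * n_steps // n_chunks for i in range(n_chunks + 1)]
    pooled: List[float] = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        for channel in features:
            chunk = channel[start:stop]
            pooled.append(sum(chunk) / len(chunk))
    return pooled


# Toy encoder output: 2 feature channels, 8 encoded time steps.
toy_features = [
    [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
    [0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0],
]
pooled = chunked_mean_pool(toy_features)
print(len(pooled), pooled)

# Minimum raw input length so that each of the 4 chunks gets at least one step.
enc_downsample = (3, 2, 2, 2, 2, 2)
min_samples = 4 * math.prod(enc_downsample)
print(min_samples)
```

With the default ``encoder_h=512``, the same pooling would yield a 2048-dimensional feature vector, matching the size stated above.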
+ .. rubric:: Hugging Face Hub integration
+ When the optional ``huggingface_hub`` package is installed, all models
+ automatically gain the ability to be pushed to and loaded from the
+ Hugging Face Hub. Install with::
+     pip install braindecode[hub]
+ **Pushing a model to the Hub:**
+ .. code:: python
+     from braindecode.models import BENDR
+     # Train your model
+     model = BENDR(n_chans=22, n_outputs=4, n_times=1000)
+     # ... training code ...
+     # Push to the Hub
+     model.push_to_hub(
+         repo_id="username/my-bendr-model",
+         commit_message="Initial model upload",
+     )
+ **Loading a model from the Hub:**
+ .. code:: python
+     from braindecode.models import BENDR
+     # Load pretrained model
+     model = BENDR.from_pretrained("username/my-bendr-model")
+     # Load with a different number of outputs (head is rebuilt automatically)
+     model = BENDR.from_pretrained("username/my-bendr-model", n_outputs=4)
+ **Extracting features and replacing the head:**
+ .. code:: python
+     import torch
+     x = torch.randn(1, model.n_chans, model.n_times)
+     # Extract encoder features (consistent dict across all models)
+     out = model(x, return_features=True)
+     features = out["features"]
+     # Replace the classification head
+     model.reset_head(n_outputs=10)
+ **Saving and restoring full configuration:**
+ .. code:: python
+     import json
+     config = model.get_config()            # all __init__ params
+     with open("config.json", "w") as f:
+         json.dump(config, f)
+     model2 = BENDR.from_config(config)    # reconstruct (no weights)
+ All model parameters (both EEG-specific and model-specific such as
+ dropout rates, activation functions, number of filters) are automatically
+ saved to the Hub and restored when loading.
+ See :ref:`load-pretrained-models` for a complete tutorial.</main>
+</div>
+## Citation
+Please cite both the original paper for this architecture (see the
+*References* section above) and braindecode:
+```bibtex
+@article{aristimunha2025braindecode,
+  title   = {Braindecode: a deep learning library for raw electrophysiological data},
+  author  = {Aristimunha, Bruno and others},
+  journal = {Zenodo},
+  year    = {2025},
+  doi     = {10.5281/zenodo.17699192},
+}
+```
+## License
+BSD-3-Clause for the model code (matching braindecode).
+Pretraining-derived weights, if you fine-tune from a checkpoint,
+inherit the license of that checkpoint and its training corpus.