bruAristimunha committed on
Commit 58c2d67 · verified · 1 Parent(s): 808fd41

Replace with clean markdown card

Files changed (1):
  1. README.md +32 -266

README.md CHANGED
@@ -10,18 +10,16 @@ tags:
   - braindecode
   - foundation-model
   - convolutional
- - transformer
  ---

  # InterpolatedBENDR

  Channel-interpolating wrapper around :class:`BENDR`.

- > **Architecture-only repository.** This repo documents the
  > `braindecode.models.InterpolatedBENDR` class. **No pretrained weights are
- > distributed here** instantiate the model and train it on your own
- > data, or fine-tune from a published foundation-model checkpoint
- > separately.

  ## Quick start

@@ -40,284 +38,52 @@ model = InterpolatedBENDR(
  )
  ```

- The signal-shape arguments above are example defaults — adjust them
- to match your recording.

  ## Documentation
-
- - Full API reference (parameters, references, architecture figure):
-   <https://braindecode.org/stable/generated/braindecode.models.InterpolatedBENDR.html>
- - Interactive browser with live instantiation:
    <https://huggingface.co/spaces/braindecode/model-explorer>
  - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/interpolated.py#L1>

- ## Architecture description
-
- The block below is the rendered class docstring (parameters,
- references, architecture figure where available).
-
- <div class='bd-doc'><main>
- <p>Channel-interpolating wrapper around :class:`BENDR`.</p>
- <p>:bdg-dark-line:`Channel`</p>
- <p>Accepts arbitrary user <span class="docutils literal">chs_info</span> and projects them to the
- backbone's canonical channel set via
- :class:`~braindecode.modules.ChannelInterpolationLayer`.</p>
- <p>For all other parameters and behavior see the backbone
- documentation reproduced below.</p>
- <p>BENDR (BErt-inspired Neural Data Representations) from Kostas et al (2021) [bendr]_.</p>
- <span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#5cb85c;color:white;font-size:11px;font-weight:600;margin-right:4px;">Convolution</span><span style="display:inline-block;padding:2px 8px;border-radius:4px;background:#d9534f;color:white;font-size:11px;font-weight:600;margin-right:4px;">Foundation Model</span>
-
- .. figure:: https://www.frontiersin.org/files/Articles/653659/fnhum-15-653659-HTML/image_m/fnhum-15-653659-g001.jpg
-    :align: center
-    :alt: BENDR Architecture
-    :width: 1000px
-
- The **BENDR** architecture adapts techniques used for language modeling (LM) toward the
- development of encephalography modeling (EM) [bendr]_. It utilizes a self-supervised
- training objective to learn compressed representations of raw EEG signals [bendr]_. The
- model is capable of modeling completely novel raw EEG sequences recorded with differing
- hardware and subjects, aiming for transferable performance across a variety of downstream
- BCI and EEG classification tasks [bendr]_.
-
- .. rubric:: Architectural Overview
-
- BENDR is adapted from wav2vec 2.0 [wav2vec2]_ and is composed of two main stages: a
- feature extractor (Convolutional stage) that produces BErt-inspired Neural Data
- Representations (BENDR), followed by a transformer encoder (Contextualizer) [bendr]_.
-
- .. rubric:: Macro Components
-
- - `BENDR.encoder` **(Convolutional Stage/Feature Extractor)**
-   - *Operations.* A stack of six short-receptive field 1D convolutions [bendr]_. Each
-     block consists of 1D convolution, GroupNorm, and GELU activation.
-   - *Role.* Takes raw data :math:`X_{raw}` and dramatically downsamples it to a new
-     sequence of vectors (BENDR) [bendr]_. Each resulting vector has a length of 512.
- - `BENDR.contextualizer` **(Transformer Encoder)**
-   - *Operations.* A transformer encoder that uses layered, multi-head self-attention
-     [bendr]_. It employs T-Fixup weight initialization [tfixup]_ and uses 8 layers
-     and 8 heads.
-   - *Role.* Maps the sequence of BENDR vectors to a contextualized sequence. The output
-     of a fixed start token is typically used as the aggregate representation for
-     downstream classification [bendr]_.
- - `Contextualizer.position_encoder` **(Positional Encoding)**
-   - *Operations.* An additive (grouped) convolution layer with a receptive field of 25
-     and 16 groups [bendr]_.
-   - *Role.* Encodes position information before the input enters the transformer.
-
- .. rubric:: How the information is encoded temporally, spatially, and spectrally
-
- * **Temporal.**
-   The convolutional encoder uses a stack of blocks where the stride matches the receptive
-   field (e.g., 3 for the first block, 2 for subsequent blocks) [bendr]_. This process
-   downsamples the raw data by a factor of 96, resulting in an effective sampling frequency
-   of approximately 2.67 Hz.
- * **Spatial.**
-   To maintain simplicity and reduce complexity, the convolutional stage uses **1D
-   convolutions** and elects not to mix EEG channels across the first stage [bendr]_. The
-   input includes 20 channels (19 EEG channels and one relative amplitude channel).
- * **Spectral.**
-   The convolution operations implicitly extract features from the raw EEG signal [bendr]_.
-   The representations (BENDR) are derived from the raw waveform using convolutional
-   operations followed by sequence modeling [wav2vec2]_.
-
- .. rubric:: Additional Mechanisms
-
- - **Self-Supervision (Pre-training).** Uses a masked sequence learning approach (adapted
-   from wav2vec 2.0 [wav2vec2]_) where contiguous spans of BENDR sequences are masked, and
-   the model attempts to reconstruct the original underlying encoded vector based on the
-   transformer output and a set of negative distractors [bendr]_.
- - **Regularization.** LayerDrop [layerdrop]_ and Dropout (at probabilities 0.01 and 0.15,
-   respectively) are used during pre-training [bendr]_. The implementation also uses T-Fixup
-   scaling for parameter initialization [tfixup]_.
- - **Input Conditioning.** A fixed token (a vector filled with the value **-5**) is
-   prepended to the BENDR sequence before input to the transformer, serving as the aggregate
-   representation token [bendr]_.
-
- .. important::
-    **Pre-trained Weights Available**
-
-    This model has pre-trained weights available on the Hugging Face Hub.
-    You can load them using:
-
-    .. code:: python
-
-       from braindecode.models import BENDR
-
-       # Load pre-trained model from Hugging Face Hub
-       # you can specify `n_outputs` for your downstream task
-       model = BENDR.from_pretrained("braindecode/braindecode-bendr", n_outputs=2)
-
-    To push your own trained model to the Hub:
-
-    .. code:: python
-
-       # After training your model
-       model.push_to_hub(
-           repo_id="username/my-bendr-model", commit_message="Upload trained BENDR model"
-       )
-
-    Requires installing ``braindecode[hub]`` for Hub integration.
-
- Notes
- -----
- * The full BENDR architecture contains a large number of parameters; configuration (1)
-   involved training over **one billion parameters** [bendr]_.
- * A randomly initialized full BENDR architecture was generally ineffective at solving
-   downstream tasks without prior self-supervised training [bendr]_.
- * The pre-training task (contrastive predictive coding via masking) is generalizable,
-   exhibiting strong uniformity of performance across novel subjects, hardware, and
-   tasks [bendr]_.
-
- .. warning::
-
-    **Important:** To utilize the full potential of BENDR, the model requires
-    **self-supervised pre-training** on large, unlabeled EEG datasets (like TUEG) followed
-    by subsequent fine-tuning on the specific downstream classification task [bendr]_.
-
- References
- ----------
- .. [bendr] Kostas, D., Aroca-Ouellette, S., & Rudzicz, F. (2021).
-    BENDR: Using transformers and a contrastive self-supervised learning task to learn from
-    massive amounts of EEG data.
-    Frontiers in Human Neuroscience, 15, 653659.
-    https://doi.org/10.3389/fnhum.2021.653659
- .. [wav2vec2] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020).
-    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
-    In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.),
-    Advances in Neural Information Processing Systems (Vol. 33, pp. 12449-12460).
-    https://dl.acm.org/doi/10.5555/3495724.3496768
- .. [tfixup] Huang, T. K., Liang, S., Jha, A., & Salakhutdinov, R. (2020).
-    Improving Transformer Optimization Through Better Initialization.
-    In International Conference on Machine Learning (pp. 4475-4483). PMLR.
-    https://dl.acm.org/doi/10.5555/3524938.3525354
- .. [layerdrop] Fan, A., Grave, E., & Joulin, A. (2020).
-    Reducing Transformer Depth on Demand with Structured Dropout.
-    International Conference on Learning Representations.
-    Retrieved from https://openreview.net/forum?id=SylO2yStDr
-
- Parameters
- ----------
- encoder_h : int, default=512
-     Hidden size (number of output channels) of the convolutional encoder. This determines
-     the dimensionality of the BENDR feature vectors produced by the encoder.
- contextualizer_hidden : int, default=3076
-     Hidden size of the feedforward layer within each transformer block. The paper uses
-     approximately 2x the transformer dimension (3076 ~ 2 x 1536).
- projection_head : bool, default=False
-     If True, adds a projection layer at the end of the encoder to project back to the
-     input feature size. This is used during self-supervised pre-training but typically
-     disabled during fine-tuning.
- drop_prob : float, default=0.1
-     Dropout probability applied throughout the model. The paper recommends 0.15 for
-     pre-training and 0.0 for fine-tuning. Default is 0.1 as a compromise.
- layer_drop : float, default=0.0
-     Probability of dropping entire transformer layers during training (LayerDrop
-     regularization [layerdrop]_). The paper uses 0.01 for pre-training and 0.0 for
-     fine-tuning.
- activation : :class:`torch.nn.Module`, default=:class:`torch.nn.GELU`
-     Activation function used in the encoder convolutional blocks. The paper uses GELU
-     activation throughout.
- transformer_layers : int, default=8
-     Number of transformer encoder layers in the contextualizer. The paper uses 8 layers.
- transformer_heads : int, default=8
-     Number of attention heads in each transformer layer. The paper uses 8 heads with
-     head dimension of 192 (1536 / 8).
- position_encoder_length : int, default=25
-     Kernel size for the convolutional positional encoding layer. The paper uses a
-     receptive field of 25 with 16 groups.
- enc_width : tuple of int, default=(3, 2, 2, 2, 2, 2)
-     Kernel sizes for each of the 6 convolutional blocks in the encoder. Each value
-     corresponds to one block.
- enc_downsample : tuple of int, default=(3, 2, 2, 2, 2, 2)
-     Stride values for each of the 6 convolutional blocks in the encoder. The total
-     downsampling factor is the product of all strides (3 x 2 x 2 x 2 x 2 x 2 = 96).
- start_token : int or float, default=-5
-     Value used to fill the start token embedding that is prepended to the BENDR sequence
-     before input to the transformer. This token's output is used as the aggregate
-     representation for classification.
- final_layer : bool, default=True
-     If True, includes a final linear classification layer that maps from encoder_h to
-     n_outputs. If False, the model outputs the contextualized features directly.
- encoder_only : bool, default=False
-     If True, bypass the contextualizer and use 4-chunk temporal pooling on the encoder
-     output instead. This corresponds to the encoder-only configuration described in
-     Section 2.4 and Table 2 of Kostas et al. (2021) [bendr]_, which outperformed the
-     full model on 4 out of 5 downstream tasks. The encoder output is split into 4 equal
-     temporal chunks, each chunk is mean-pooled, and the results are concatenated to
-     produce a feature vector of size ``encoder_h * 4`` (2048-dim with default settings).
-     The contextualizer is still created (to allow loading pretrained weights) but is not
-     used in the forward pass. Requires input length of at least
-     ``4 * product(enc_downsample)`` samples (384 with default downsampling of 96x).
-
- .. rubric:: Hugging Face Hub integration
-
- When the optional ``huggingface_hub`` package is installed, all models
- automatically gain the ability to be pushed to and loaded from the
- Hugging Face Hub. Install with::
-
-     pip install braindecode[hub]
-
- **Pushing a model to the Hub:**
-
- .. code:: python
-
-    from braindecode.models import BENDR
-
-    # Train your model
-    model = BENDR(n_chans=22, n_outputs=4, n_times=1000)
-    # ... training code ...
-
-    # Push to the Hub
-    model.push_to_hub(
-        repo_id="username/my-bendr-model",
-        commit_message="Initial model upload",
-    )
-
- **Loading a model from the Hub:**
-
- .. code:: python
-
-    from braindecode.models import BENDR
-
-    # Load pretrained model
-    model = BENDR.from_pretrained("username/my-bendr-model")
-
-    # Load with a different number of outputs (head is rebuilt automatically)
-    model = BENDR.from_pretrained("username/my-bendr-model", n_outputs=4)
-
- **Extracting features and replacing the head:**
-
- .. code:: python
-
-    import torch
-
-    x = torch.randn(1, model.n_chans, model.n_times)
-    # Extract encoder features (consistent dict across all models)
-    out = model(x, return_features=True)
-    features = out["features"]
-
-    # Replace the classification head
-    model.reset_head(n_outputs=10)
-
- **Saving and restoring full configuration:**
-
- .. code:: python
-
-    import json
-
-    config = model.get_config()  # all __init__ params
-    with open("config.json", "w") as f:
-        json.dump(config, f)
-
-    model2 = BENDR.from_config(config)  # reconstruct (no weights)
-
- All model parameters (both EEG-specific and model-specific such as
- dropout rates, activation functions, number of filters) are automatically
- saved to the Hub and restored when loading.
-
- See :ref:`load-pretrained-models` for a complete tutorial.</main>
- </div>
 
  ## Citation

- Please cite both the original paper for this architecture (see the
- *References* section above) and braindecode:

  ```bibtex
  @article{aristimunha2025braindecode,
 
  - braindecode
  - foundation-model
  - convolutional
  ---

  # InterpolatedBENDR

  Channel-interpolating wrapper around :class:`BENDR`.

+ > **Architecture-only repository.** Documents the
  > `braindecode.models.InterpolatedBENDR` class. **No pretrained weights are
+ > distributed here.** Instantiate the model and train it on your own
+ > data.

  ## Quick start

  )
  ```

+ The signal-shape arguments above are illustrative defaults — adjust to
+ match your recording.

  ## Documentation
+ - Full API reference: <https://braindecode.org/stable/generated/braindecode.models.InterpolatedBENDR.html>
+ - Interactive browser (live instantiation, parameter counts):
    <https://huggingface.co/spaces/braindecode/model-explorer>
  - Source on GitHub: <https://github.com/braindecode/braindecode/blob/master/braindecode/models/interpolated.py#L1>
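The wrapper's defining feature, projecting an arbitrary input montage onto the backbone's canonical channel set, can be pictured as a position-weighted linear map over channels. Below is a toy sketch of that idea in plain Python; it is an illustration only, not braindecode's `ChannelInterpolationLayer` (the positions, weighting scheme, and function names are invented for the example):

```python
# Toy channel interpolation: map recordings from an arbitrary montage onto
# a canonical channel set by inverse-squared-distance weighting of channel
# positions. Illustration only; NOT braindecode's actual implementation.

def interpolation_weights(src_pos, dst_pos):
    """One normalized weight row per canonical channel over source channels."""
    weights = []
    for dx, dy in dst_pos:
        raw = []
        for sx, sy in src_pos:
            d2 = (dx - sx) ** 2 + (dy - sy) ** 2
            raw.append(1e6 if d2 == 0 else 1.0 / d2)  # exact match dominates
        total = sum(raw)
        weights.append([w / total for w in raw])
    return weights

def interpolate(signal, weights):
    """signal: list of source-channel time series -> canonical-channel series."""
    n_times = len(signal[0])
    return [
        [sum(w * ch[t] for w, ch in zip(row, signal)) for t in range(n_times)]
        for row in weights
    ]

# Two source channels at x=0 and x=1; one canonical channel midway at x=0.5,
# so its signal is the average of the two sources.
w = interpolation_weights([(0.0, 0.0), (1.0, 0.0)], [(0.5, 0.0)])
out = interpolate([[1.0, 1.0], [3.0, 3.0]], w)
print(out)  # [[2.0, 2.0]]
```

In the real model this projection feeds the interpolated channels to the frozen-shape BENDR backbone, so any `chs_info` the user supplies can be accepted.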

+ ## Architecture

+ ![InterpolatedBENDR architecture](https://www.frontiersin.org/files/Articles/653659/fnhum-15-653659-HTML/image_m/fnhum-15-653659-g001.jpg)

+ ## Parameters

+ | Parameter | Type | Description |
+ |---|---|---|
+ | `encoder_h` | int, default=512 | Hidden size (number of output channels) of the convolutional encoder. This determines the dimensionality of the BENDR feature vectors produced by the encoder. |
+ | `contextualizer_hidden` | int, default=3076 | Hidden size of the feedforward layer within each transformer block. The paper uses approximately 2x the transformer dimension (3076 ~ 2 x 1536). |
+ | `projection_head` | bool, default=False | If True, adds a projection layer at the end of the encoder to project back to the input feature size. This is used during self-supervised pre-training but typically disabled during fine-tuning. |
+ | `drop_prob` | float, default=0.1 | Dropout probability applied throughout the model. The paper recommends 0.15 for pre-training and 0.0 for fine-tuning. Default is 0.1 as a compromise. |
+ | `layer_drop` | float, default=0.0 | Probability of dropping entire transformer layers during training (LayerDrop regularization [layerdrop]). The paper uses 0.01 for pre-training and 0.0 for fine-tuning. |
+ | `activation` | :class:`torch.nn.Module`, default=:class:`torch.nn.GELU` | Activation function used in the encoder convolutional blocks. The paper uses GELU activation throughout. |
+ | `transformer_layers` | int, default=8 | Number of transformer encoder layers in the contextualizer. The paper uses 8 layers. |
+ | `transformer_heads` | int, default=8 | Number of attention heads in each transformer layer. The paper uses 8 heads with head dimension of 192 (1536 / 8). |
+ | `position_encoder_length` | int, default=25 | Kernel size for the convolutional positional encoding layer. The paper uses a receptive field of 25 with 16 groups. |
+ | `enc_width` | tuple of int, default=(3, 2, 2, 2, 2, 2) | Kernel sizes for each of the 6 convolutional blocks in the encoder. Each value corresponds to one block. |
+ | `enc_downsample` | tuple of int, default=(3, 2, 2, 2, 2, 2) | Stride values for each of the 6 convolutional blocks in the encoder. The total downsampling factor is the product of all strides (3 x 2 x 2 x 2 x 2 x 2 = 96). |
+ | `start_token` | int or float, default=-5 | Value used to fill the start token embedding that is prepended to the BENDR sequence before input to the transformer. This token's output is used as the aggregate representation for classification. |
+ | `final_layer` | bool, default=True | If True, includes a final linear classification layer that maps from encoder_h to n_outputs. If False, the model outputs the contextualized features directly. |
+ | `encoder_only` | bool, default=False | If True, bypass the contextualizer and use 4-chunk temporal pooling on the encoder output instead. This corresponds to the encoder-only configuration described in Section 2.4 and Table 2 of Kostas et al. (2021) [bendr], which outperformed the full model on 4 out of 5 downstream tasks. The encoder output is split into 4 equal temporal chunks, each chunk is mean-pooled, and the results are concatenated to produce a feature vector of size `encoder_h * 4` (2048-dim with default settings). The contextualizer is still created (to allow loading pretrained weights) but is not used in the forward pass. Requires input length of at least `4 * product(enc_downsample)` samples (384 with default downsampling of 96x). |
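The shape arithmetic quoted in the table (the 96x downsampling factor, the 384-sample minimum for `encoder_only`, the 2048-dim pooled feature, the 192-dim attention heads) can be verified in a few lines of plain Python. The 256 Hz raw sampling rate below is an assumption inferred from the quoted ~2.67 Hz effective rate; the table itself states only the 96x factor:

```python
from math import prod

# Defaults taken from the parameter table above
enc_downsample = (3, 2, 2, 2, 2, 2)
encoder_h = 512
transformer_heads = 8

factor = prod(enc_downsample)     # total temporal downsampling of the encoder
print(factor)                     # 96

print(4 * factor)                 # 384: minimum n_times when encoder_only=True

print(encoder_h * 4)              # 2048: four mean-pooled chunks concatenated

print(1536 // transformer_heads)  # 192: per-head dimension quoted in the table

# Assumed 256 Hz raw sampling reproduces the quoted ~2.67 Hz effective rate
print(round(256 / factor, 2))     # 2.67
```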

+ ## References

+ 1. Kostas, D., Aroca-Ouellette, S., & Rudzicz, F. (2021). BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15, 653659. https://doi.org/10.3389/fnhum.2021.653659
+ 2. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 12449-12460). https://dl.acm.org/doi/10.5555/3495724.3496768
+ 3. Huang, T. K., Liang, S., Jha, A., & Salakhutdinov, R. (2020). Improving Transformer Optimization Through Better Initialization. In International Conference on Machine Learning (pp. 4475-4483). PMLR. https://dl.acm.org/doi/10.5555/3524938.3525354
+ 4. Fan, A., Grave, E., & Joulin, A. (2020). Reducing Transformer Depth on Demand with Structured Dropout. International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=SylO2yStDr

  ## Citation

+ Cite the original architecture paper (see *References* above) and braindecode:

  ```bibtex
  @article{aristimunha2025braindecode,