esunAI committed on
Commit 68593e5 · verified · 1 Parent(s): b042698

Add comprehensive documentation: compressor_decompressor_latex.tex

documentation/compressor_decompressor.tex ADDED
@@ -0,0 +1,390 @@
+ \section{Transformer-Based Compression and Decompression Architecture}
+ \label{sec:compression}
+
+ The compression-decompression pipeline bridges the high-dimensional ESM-2 embeddings and the compact latent space used for flow matching generation. The architecture is an hourglass built from transformer self-attention, mean pooling over adjacent positions, and a linear projection. It reduces the embedding dimension by $16\times$ (1280 to 80) and halves the sequence length, with the goal of preserving the semantic protein information carried by the embeddings.
+
+ \subsection{Compression Architecture Overview}
+
+ The compressor $\mathcal{C}: \mathbb{R}^{L \times 1280} \rightarrow \mathbb{R}^{\lfloor L/2 \rfloor \times 80}$ transforms normalized ESM-2 embeddings into a compressed latent representation suitable for flow matching. The architecture follows an hourglass design inspired by ProtFlow, combining pooling along the sequence axis with transformer self-attention. For even $L$ the output length is exactly $L/2$, which is the notation used in the rest of this section.
+
+ \subsubsection{Compressor Network Design}
+ \label{sec:compressor_design}
+
+ The compressor applies five stages: input normalization, a pre-pooling transformer, hourglass pooling, a post-pooling transformer, and a bounded linear projection:
+
+ \begin{align}
+ \mathbf{H}^{(0)} &= \text{LayerNorm}(\mathbf{H}^{(norm)}) \label{eq:comp_input_norm}\\
+ \mathbf{H}^{(pre)} &= \text{TransformerEncoder}_{\text{pre}}(\mathbf{H}^{(0)}) \label{eq:comp_pre_transformer}\\
+ \mathbf{H}^{(pool)} &= \text{HourglassPool}(\mathbf{H}^{(pre)}) \label{eq:comp_hourglass_pool}\\
+ \mathbf{H}^{(post)} &= \text{TransformerEncoder}_{\text{post}}(\mathbf{H}^{(pool)}) \label{eq:comp_post_transformer}\\
+ \mathbf{Z}^{(comp)} &= \tanh(\text{LayerNorm}(\mathbf{H}^{(post)}) \mathbf{W}^{(proj)} + \mathbf{b}^{(proj)}) \label{eq:comp_final_projection}
+ \end{align}
+
+ where both $\text{TransformerEncoder}_{\text{pre}}$ and $\text{TransformerEncoder}_{\text{post}}$ consist of 2 transformer layers each, maintaining the full ESM-2 dimensionality (1280) until the final projection.
+
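The shape bookkeeping implied by Eqs.~\eqref{eq:comp_input_norm} to \eqref{eq:comp_final_projection} can be sketched in plain Python. The function name \texttt{compressor\_shapes} and its defaults are illustrative assumptions; only the widths (1280, 80) and the halving rule come from the text, and no actual computation is performed.

```python
def compressor_shapes(seq_len, d_model=1280, d_latent=80):
    """Return the (length, width) of one sequence after each compressor stage.

    Hypothetical helper for illustration: it tracks shapes only.
    """
    pooled_len = seq_len // 2  # an odd final position is dropped before pooling
    return {
        "input_norm": (seq_len, d_model),
        "pre_transformer": (seq_len, d_model),
        "hourglass_pool": (pooled_len, d_model),
        "post_transformer": (pooled_len, d_model),
        "projection": (pooled_len, d_latent),
    }


# Walk through the stages for a sequence of odd length.
for stage, shape in compressor_shapes(51).items():
    print(f"{stage:>16}: {shape}")
```

The width stays at 1280 until the last stage, so both transformer blocks operate at full ESM-2 dimensionality, as stated above.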
+ \subsubsection{Hourglass Pooling Strategy}
+ \label{sec:hourglass_pooling}
+
+ The hourglass pooling operation halves the sequence length while keeping adjacent residues together; when $L$ is odd, the final position is dropped first. The shorter sequence reduces the cost of every later stage, including the flow matching model that operates on the latent:
+
+ \begin{align}
+ \text{HourglassPool}(\mathbf{H}) &= \begin{cases}
+ \text{Pool}(\mathbf{H}[:, :L-1, :]) & \text{if } L \text{ is odd} \\
+ \text{Pool}(\mathbf{H}) & \text{if } L \text{ is even}
+ \end{cases} \label{eq:hourglass_length_handling}
+ \end{align}
+
+ The pooling operation groups adjacent residue positions and averages their representations:
+
+ \begin{align}
+ \mathbf{H}^{(grouped)} &= \text{Reshape}(\mathbf{H}, [B, L/2, 2, D]) \label{eq:reshape_for_pooling}\\
+ \mathbf{H}^{(pool)} &= \frac{1}{2}\sum_{k=1}^{2} \mathbf{H}^{(grouped)}[:, :, k, :] \label{eq:mean_pooling}
+ \end{align}
+
+ Each pooled position therefore summarizes two neighboring residues, so sequence order is retained at half the original resolution.
+
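A minimal sketch of Eqs.~\eqref{eq:hourglass_length_handling} to \eqref{eq:mean_pooling} for a single sequence follows. Plain Python lists stand in for tensors, and the name \texttt{hourglass\_pool} is chosen here for illustration; a tensor implementation would use a reshape followed by a mean over the pair axis, as in the equations.

```python
def hourglass_pool(seq):
    """Average adjacent positions of one sequence given as a list of vectors.

    seq: list of L vectors (lists of floats), all of the same width.
    Returns floor(L / 2) vectors; a trailing odd position is dropped.
    """
    even_len = len(seq) - (len(seq) % 2)
    pooled = []
    for i in range(0, even_len, 2):
        left, right = seq[i], seq[i + 1]
        pooled.append([(a + b) / 2.0 for a, b in zip(left, right)])
    return pooled


# Five positions of width 2: the fifth position is discarded before pooling.
example = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 9.0]]
print(hourglass_pool(example))  # [[2.0, 3.0], [6.0, 7.0]]
```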
+ \subsubsection{Final Projection and Activation}
+ \label{sec:comp_projection}
+
+ The final projection layer reduces dimensionality from 1280 to 80 (16× compression) with tanh activation to ensure bounded outputs:
+
+ \begin{align}
+ \mathbf{W}^{(proj)} &\in \mathbb{R}^{1280 \times 80}, \quad \mathbf{b}^{(proj)} \in \mathbb{R}^{80} \label{eq:projection_parameters}\\
+ \mathbf{Z}^{(comp)} &= \tanh(\text{LayerNorm}(\mathbf{H}^{(post)}) \mathbf{W}^{(proj)} + \mathbf{b}^{(proj)}) \in (-1, 1)^{L/2 \times 80} \label{eq:bounded_compression}
+ \end{align}
+
+ The tanh activation ensures that compressed embeddings remain in a bounded range, facilitating stable flow matching training.
+
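As a worked check on the compression figures (a sketch that assumes even $L$ and counts stored scalars only), the ratio of values per sequence before and after compression is:

\begin{align*}
\frac{N_{\text{in}}}{N_{\text{out}}} &= \frac{L \cdot 1280}{(L/2) \cdot 80} = \frac{1280\,L}{40\,L} = 32
\end{align*}

The factor of 16 is the reduction in embedding dimension alone ($1280 / 80$); the remaining factor of 2 comes from the halved sequence length. The symbols $N_{\text{in}}$ and $N_{\text{out}}$ are introduced here only for this calculation.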
+ \subsection{Decompression Architecture}
+
+ The decompressor $\mathcal{D}: \mathbb{R}^{L/2 \times 80} \rightarrow \mathbb{R}^{L \times 1280}$ reconstructs full-dimensional ESM-2 embeddings from compressed representations. It reverses the compressor's operations: dimension expansion, unpooling along the sequence axis, and transformer refinement. If the original length was odd, the position dropped during pooling is not restored, so the output has $2\lfloor L/2 \rfloor$ positions.
+
+ \subsubsection{Decompressor Network Design}
+ \label{sec:decompressor_design}
+
+ The decompressor employs a three-stage reconstruction process:
+
+ \begin{align}
+ \mathbf{H}^{(expanded)} &= \text{LayerNorm}(\mathbf{Z}^{(comp)}) \mathbf{W}^{(expand)} + \mathbf{b}^{(expand)} \label{eq:decomp_expansion}\\
+ \mathbf{H}^{(unpool)} &= \text{HourglassUnpool}(\mathbf{H}^{(expanded)}) \label{eq:decomp_unpooling}\\
+ \mathbf{H}^{(recon)} &= \text{TransformerEncoder}_{\text{decode}}(\mathbf{H}^{(unpool)}) \label{eq:decomp_transformer}
+ \end{align}
+
+ where $\mathbf{W}^{(expand)} \in \mathbb{R}^{80 \times 1280}$ and $\mathbf{b}^{(expand)} \in \mathbb{R}^{1280}$ expand the compressed representation back to ESM-2 dimensionality.
+
+ \subsubsection{Hourglass Unpooling Operation}
+ \label{sec:hourglass_unpooling}
+
+ The unpooling operation reverses the compression by duplicating each compressed position to restore the original sequence length:
+
+ \begin{align}
+ \text{HourglassUnpool}(\mathbf{H}^{(expanded)}) &= \text{repeat\_interleave}(\mathbf{H}^{(expanded)}, 2, \text{dim}=1) \label{eq:repeat_interleave}
+ \end{align}
+
+ This operation doubles the sequence length, restoring the spatial resolution lost during compression:
+
+ \begin{align}
+ \mathbf{H}^{(unpool)}[b, 2i, :] &= \mathbf{H}^{(expanded)}[b, i, :] \label{eq:unpool_even}\\
+ \mathbf{H}^{(unpool)}[b, 2i+1, :] &= \mathbf{H}^{(expanded)}[b, i, :] \label{eq:unpool_odd}
+ \end{align}
+
+ for $i = 0, 1, \ldots, L/2-1$, effectively creating identical copies for adjacent positions.
+
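The index rule in Eqs.~\eqref{eq:unpool_even} and \eqref{eq:unpool_odd} can be sketched for one sequence as follows. Plain lists stand in for tensors and the function name is illustrative; in PyTorch the same effect is obtained with \texttt{torch.repeat\_interleave} along the sequence axis.

```python
def hourglass_unpool(pooled):
    """Duplicate every position once, doubling the sequence length.

    Mirrors repeat_interleave(x, 2, dim=1) for a single sequence stored
    as a list of vectors.
    """
    unpooled = []
    for vec in pooled:
        unpooled.append(list(vec))  # position 2i
        unpooled.append(list(vec))  # position 2i + 1
    return unpooled


latent = [[2.0, 3.0], [6.0, 7.0]]
restored = hourglass_unpool(latent)
print(restored)  # [[2.0, 3.0], [2.0, 3.0], [6.0, 7.0], [6.0, 7.0]]
```

Unpooling restores the length but not the content: each pair of output positions holds the same vector until the refinement stage processes it.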
+ \subsubsection{Transformer Refinement}
+ \label{sec:decomp_refinement}
+
+ The final transformer encoder (2 layers) refines the unpooled representations to recover fine-grained positional information lost during compression:
+
+ \begin{align}
+ \mathbf{H}^{(recon)} = \text{TransformerEncoder}_{\text{decode}}(\mathbf{H}^{(unpool)}) \label{eq:refinement_transformer}
+ \end{align}
+
+ This refinement stage is crucial for recovering the subtle positional dependencies present in ESM-2 embeddings.
+
+ \subsection{Training Methodology and Optimization}
+
+ The compressor and decompressor are trained jointly on a reconstruction loss, using AdamW with linear warmup, cosine annealing, and gradient clipping.
+
+ \subsubsection{Reconstruction Loss Function}
+ \label{sec:reconstruction_loss}
+
+ The training objective minimizes mean squared error between original and reconstructed embeddings:
+
+ \begin{align}
+ \mathcal{L}_{\text{recon}}(\theta_{\mathcal{C}}, \theta_{\mathcal{D}}) &= \mathbb{E}_{\mathbf{H} \sim \mathcal{T}} \left[ \|\mathbf{H} - \mathcal{D}(\mathcal{C}(\mathbf{H}; \theta_{\mathcal{C}}); \theta_{\mathcal{D}})\|_2^2 \right] \label{eq:mse_loss}
+ \end{align}
+
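+ where $\mathcal{T}$ is the training data distribution and $\theta_{\mathcal{C}}, \theta_{\mathcal{D}}$ are the compressor and decompressor parameters, respectively. In training, the squared norm is divided by the number of tensor elements (Algorithm~\ref{alg:joint_training}), which gives the mean squared error. For odd $L$ the reconstruction has $L-1$ positions, so the target has to be brought to the same length (for example by truncation) before the difference is taken.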
+
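The loss in Eq.~\eqref{eq:mse_loss}, averaged over elements, can be sketched for one sequence as below. The function name is hypothetical, and ignoring a trailing position when the original is one position longer than the reconstruction is an assumption of this sketch; the text does not specify how odd lengths are handled in the loss.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared error over all elements of two sequences of vectors.

    If `original` is longer than `reconstructed` (odd length before
    pooling), the extra trailing positions are ignored. That truncation
    is an assumption made for this illustration.
    """
    n_pos = min(len(original), len(reconstructed))
    total, count = 0.0, 0
    for orig_vec, rec_vec in zip(original[:n_pos], reconstructed[:n_pos]):
        for a, b in zip(orig_vec, rec_vec):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0


h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three positions (odd length)
h_rec = [[1.0, 2.0], [2.0, 4.0]]           # two reconstructed positions
print(reconstruction_mse(h, h_rec))  # 0.25
```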
+ \subsubsection{Advanced Learning Rate Scheduling}
+ \label{sec:lr_scheduling}
+
+ Training uses linear warmup followed by cosine annealing:
+
+ \begin{align}
+ \text{lr}_{\text{warmup}}(t) &= \text{lr}_{\max} \cdot \frac{t}{T_{\text{warmup}}} \quad \text{for } t \leq T_{\text{warmup}} \label{eq:warmup_lr}\\
+ \text{lr}_{\text{cosine}}(t) &= \text{lr}_{\min} + \frac{1}{2}(\text{lr}_{\max} - \text{lr}_{\min})\left(1 + \cos\left(\frac{\pi(t - T_{\text{warmup}})}{T_{\text{total}} - T_{\text{warmup}}}\right)\right) \quad \text{for } t > T_{\text{warmup}} \label{eq:cosine_lr}
+ \end{align}
+
+ with hyperparameters $\text{lr}_{\max} = 10^{-3}$, $\text{lr}_{\min} = 8 \times 10^{-5}$, and $T_{\text{warmup}} = 10{,}000$ steps; $T_{\text{total}}$ is the total number of optimizer steps.
+
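The schedule in Eqs.~\eqref{eq:warmup_lr} and \eqref{eq:cosine_lr} can be written as a single function of the step count. This is a sketch: the function name is illustrative, and \texttt{total\_steps=100\_000} is a placeholder because the text does not give a value for $T_{\text{total}}$.

```python
import math


def learning_rate(step, lr_max=1e-3, lr_min=8e-5,
                  warmup_steps=10_000, total_steps=100_000):
    """Linear warmup to lr_max, then cosine annealing down to lr_min.

    total_steps is a placeholder value; the source does not state T_total.
    """
    if step <= warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at lr_min once total_steps is passed
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, mid-warmup, the peak, mid-decay, and the end.
for t in (0, 5_000, 10_000, 55_000, 100_000):
    print(t, f"{learning_rate(t):.6f}")
```

The two branches agree at $t = T_{\text{warmup}}$, where both give $\text{lr}_{\max}$, so the schedule is continuous.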
+ \subsubsection{Normalization and Regularization}
+ \label{sec:normalization_reg}
+
+ The architecture incorporates several regularization techniques:
+
+ \begin{itemize}
+ \item \textbf{Layer Normalization}: Applied to the compressor input, before the final projection, before the decompressor expansion, and inside every transformer layer
+ \item \textbf{Dropout}: 0.1 dropout rate in transformer feedforward layers during training
+ \item \textbf{Weight Decay}: $10^{-4}$ weight decay in AdamW optimizer
+ \item \textbf{Gradient Clipping}: Maximum gradient norm of 1.0 to prevent exploding gradients
+ \end{itemize}
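To make the gradient clipping item concrete, the following sketch rescales a set of gradient vectors so that their joint L2 norm does not exceed a maximum. It is a plain Python illustration with a hypothetical function name; PyTorch's \texttt{clip\_grad\_norm\_} applies the same idea in place to parameter gradients.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their joint L2 norm <= max_norm.

    Returns (clipped_grads, original_norm). Gradients already within the
    limit are returned as unchanged copies.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm


# Two one-element gradients with joint norm 5 are scaled down by a factor of 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

All gradients share one scale factor, so the update direction is preserved and only its magnitude is limited.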
+
+ \subsection{Architecture Specifications}
+
+ \subsubsection{Transformer Layer Configuration}
+ \label{sec:transformer_config}
+
+ Both compressor and decompressor transformer layers share identical specifications:
+
+ \begin{itemize}
+ \item \textbf{Model Dimension}: $d_{\text{model}} = 1280$ (matching ESM-2)
+ \item \textbf{Attention Heads}: $n_{\text{heads}} = 8$
+ \item \textbf{Feedforward Dimension}: $d_{\text{ff}} = 5120$ (4× model dimension)
+ \item \textbf{Activation Function}: GELU in feedforward sublayers
+ \item \textbf{Layer Normalization}: Post-normalization, applied after each residual addition as in Algorithms~\ref{alg:compressor} and~\ref{alg:decompressor}
+ \item \textbf{Residual Connections}: Around each sublayer
+ \end{itemize}
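The configuration above fixes the size of each transformer layer. The sketch below counts parameters for one layer, assuming a standard encoder layer (biased query, key, value, and output projections, a two-layer feedforward block, and two LayerNorms); the function name and that layer layout are assumptions, not details confirmed by the text.

```python
def encoder_layer_params(d_model=1280, d_ff=5120):
    """Parameter count of one standard transformer encoder layer.

    Assumes biased Q/K/V and output projections, a two-layer feedforward
    block, and two LayerNorms (weight + bias each). The number of heads
    does not change the total.
    """
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, output
    feedforward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)
    return attention + feedforward + layer_norms


per_layer = encoder_layer_params()
compressor_layers = 4 * per_layer    # two pre-pooling + two post-pooling layers
decompressor_layers = 2 * per_layer  # two refinement layers
print(f"{per_layer:,} per layer")
print(f"{compressor_layers:,} in compressor layers")
print(f"{decompressor_layers:,} in decompressor layers")
```

Under these assumptions the feedforward block accounts for about two thirds of each layer, so $d_{\text{ff}}$ is the setting that most affects model size.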
+
+ \subsubsection{Memory and Computational Efficiency}
+ \label{sec:efficiency}
+
+ The compression architecture is optimized for computational efficiency:
+
+ \begin{itemize}
+ \item \textbf{Parameter Count}:
+ \begin{itemize}
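+ \item Compressor: $\sim$79M parameters (four encoder layers of $\sim$19.7M each under the configuration in Section~\ref{sec:transformer_config}, plus the projection)
+ \item Decompressor: $\sim$39M parameters (two encoder layers plus the expansion)
+ \item Total: $\sim$118M parameters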
+ \end{itemize}
+ \item \textbf{Training Memory}: $\sim$12GB GPU memory for batch size 32
+ \item \textbf{Inference Speed}: $\sim$1000 sequences/second on A100 GPU
+ \item \textbf{Compression Ratio}: 16× reduction in embedding dimension
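+ \item \textbf{Storage Savings}: 94\% reduction in embedding storage from the dimension reduction alone ($1 - 1/16 \approx 93.75\%$), or about 97\% once the halved sequence length is included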
+ \end{itemize}
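The storage figures can be checked with a short calculation. The sketch assumes embeddings are stored as 32-bit floats and uses hypothetical helper names; only the widths (1280, 80) and the halved length come from the text.

```python
def embedding_storage_bytes(seq_len, width, bytes_per_value=4):
    """Bytes needed to store one embedding matrix (float32 by default)."""
    return seq_len * width * bytes_per_value


def storage_saving(seq_len, d_model=1280, d_latent=80):
    """Fractional saving from storing the latent instead of the full embedding."""
    full = embedding_storage_bytes(seq_len, d_model)
    latent = embedding_storage_bytes(seq_len // 2, d_latent)
    return 1.0 - latent / full


# A 50-residue sequence: 256,000 bytes uncompressed versus 8,000 bytes compressed.
print(embedding_storage_bytes(50, 1280), embedding_storage_bytes(25, 80))
print(f"{storage_saving(50):.4f}")
```

The saving is independent of the numeric precision as long as both representations use the same one.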
+
+ \subsection{Performance Metrics and Validation}
+
+ \subsubsection{Reconstruction Quality}
+ \label{sec:reconstruction_quality}
+
+ The trained compressor-decompressor achieves high-fidelity reconstruction:
+
+ \begin{itemize}
+ \item \textbf{MSE Loss}: $< 0.01$ on validation set
+ \item \textbf{Cosine Similarity}: $> 0.95$ between original and reconstructed embeddings
+ \item \textbf{Pearson Correlation}: $> 0.98$ across all embedding dimensions
+ \item \textbf{Max Absolute Error}: $< 0.1$ per embedding component
+ \end{itemize}
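The metrics listed above can be computed per embedding vector as in the following sketch. The function name is illustrative, and how the reported figures were aggregated over positions and sequences is not stated in the text, so this shows only the per-vector definitions.

```python
import math


def reconstruction_metrics(original, reconstructed):
    """Cosine similarity, Pearson correlation and max absolute error
    between two equal-length vectors of floats.

    Assumes neither vector is all-zero or constant, otherwise the
    normalizing terms are zero.
    """
    n = len(original)
    dot = sum(a * b for a, b in zip(original, reconstructed))
    norm_o = math.sqrt(sum(a * a for a in original))
    norm_r = math.sqrt(sum(b * b for b in reconstructed))
    mean_o = sum(original) / n
    mean_r = sum(reconstructed) / n
    cov = sum((a - mean_o) * (b - mean_r) for a, b in zip(original, reconstructed))
    std_o = math.sqrt(sum((a - mean_o) ** 2 for a in original))
    std_r = math.sqrt(sum((b - mean_r) ** 2 for b in reconstructed))
    return {
        "cosine": dot / (norm_o * norm_r),
        "pearson": cov / (std_o * std_r),
        "max_abs_error": max(abs(a - b) for a, b in zip(original, reconstructed)),
    }


metrics = reconstruction_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.5])
print(metrics)
```

Cosine similarity and Pearson correlation are both insensitive to a uniform rescaling of the reconstruction, which is why an absolute error measure is reported alongside them.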
+
+ \subsubsection{Downstream Task Preservation}
+ \label{sec:downstream_preservation}
+
+ Compressed embeddings maintain performance on downstream tasks:
+
+ \begin{itemize}
+ \item \textbf{AMP Classification}: $< 2\%$ accuracy drop using compressed embeddings
+ \item \textbf{Secondary Structure}: $< 3\%$ accuracy drop on DSSP prediction
+ \item \textbf{Contact Prediction}: $< 5\%$ precision drop on contact maps
+ \item \textbf{Homology Detection}: $< 1\%$ AUC drop on SCOP fold recognition
+ \end{itemize}
+
+ \begin{algorithm}[h]
+ \caption{Transformer-Based Compressor}
+ \label{alg:compressor}
+ \begin{algorithmic}[1]
+ \REQUIRE Normalized ESM-2 embeddings $\mathbf{H}^{(norm)} \in \mathbb{R}^{B \times L \times 1280}$
+ \REQUIRE Trained compressor parameters $\theta_{\mathcal{C}}$
+ \ENSURE Compressed embeddings $\mathbf{Z}^{(comp)} \in \mathbb{R}^{B \times L/2 \times 80}$
+
+ \STATE \textbf{// Stage 1: Input Normalization}
+ \STATE $\mathbf{H}^{(0)} \leftarrow \text{LayerNorm}(\mathbf{H}^{(norm)})$ \COMMENT{Stabilize input distributions}
+
+ \STATE \textbf{// Stage 2: Pre-Pooling Transformer Processing}
+ \FOR{$\ell = 1$ to $2$} \COMMENT{2 pre-pooling transformer layers}
+ \STATE $\mathbf{A}^{(\ell)} \leftarrow \text{MultiHeadAttention}(\mathbf{H}^{(\ell-1)}, \mathbf{H}^{(\ell-1)}, \mathbf{H}^{(\ell-1)})$
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \mathbf{H}^{(\ell-1)} + \text{Dropout}(\mathbf{A}^{(\ell)})$ \COMMENT{Residual connection}
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \text{LayerNorm}(\mathbf{H}^{(\ell)})$ \COMMENT{Post-attention normalization}
+
+ \STATE $\mathbf{F}^{(\ell)} \leftarrow \text{GELU}(\mathbf{H}^{(\ell)} \mathbf{W}_1^{(\ell)} + \mathbf{b}_1^{(\ell)}) \mathbf{W}_2^{(\ell)} + \mathbf{b}_2^{(\ell)}$ \COMMENT{FFN}
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \mathbf{H}^{(\ell)} + \text{Dropout}(\mathbf{F}^{(\ell)})$ \COMMENT{Residual connection}
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \text{LayerNorm}(\mathbf{H}^{(\ell)})$ \COMMENT{Post-FFN normalization}
+ \ENDFOR
+ \STATE $\mathbf{H}^{(pre)} \leftarrow \mathbf{H}^{(2)}$
+
+ \STATE \textbf{// Stage 3: Hourglass Pooling}
+ \IF{$L \bmod 2 = 1$} \COMMENT{Handle odd sequence lengths}
+ \STATE $\mathbf{H}^{(pre)} \leftarrow \mathbf{H}^{(pre)}[:, :L-1, :]$ \COMMENT{Remove last position}
+ \STATE $L \leftarrow L - 1$
+ \ENDIF
+ \STATE $\mathbf{H}^{(grouped)} \leftarrow \text{Reshape}(\mathbf{H}^{(pre)}, [B, L/2, 2, 1280])$
+ \STATE $\mathbf{H}^{(pool)} \leftarrow \text{Mean}(\mathbf{H}^{(grouped)}, \text{dim}=2)$ \COMMENT{Average adjacent positions}
+
+ \STATE \textbf{// Stage 4: Post-Pooling Transformer Processing}
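+ \STATE $\mathbf{H}^{(2)} \leftarrow \mathbf{H}^{(pool)}$ \COMMENT{Post-pooling layers take the pooled tensor as input}
+ \FOR{$\ell = 3$ to $4$} \COMMENT{2 post-pooling transformer layers}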
+ \STATE \textbf{// Same transformer operations as pre-pooling layers}
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \text{TransformerLayer}(\mathbf{H}^{(\ell-1)})$
+ \ENDFOR
+ \STATE $\mathbf{H}^{(post)} \leftarrow \mathbf{H}^{(4)}$
+
+ \STATE \textbf{// Stage 5: Final Projection and Activation}
+ \STATE $\mathbf{H}^{(proj\_input)} \leftarrow \text{LayerNorm}(\mathbf{H}^{(post)})$
+ \STATE $\mathbf{Z}^{(comp)} \leftarrow \tanh(\mathbf{H}^{(proj\_input)} \mathbf{W}^{(proj)} + \mathbf{b}^{(proj)})$
+
+ \RETURN $\mathbf{Z}^{(comp)}$
+ \end{algorithmic}
+ \end{algorithm}
+
+ \begin{algorithm}[h]
+ \caption{Transformer-Based Decompressor}
+ \label{alg:decompressor}
+ \begin{algorithmic}[1]
+ \REQUIRE Compressed embeddings $\mathbf{Z}^{(comp)} \in \mathbb{R}^{B \times L/2 \times 80}$
+ \REQUIRE Trained decompressor parameters $\theta_{\mathcal{D}}$
+ \ENSURE Reconstructed embeddings $\mathbf{H}^{(recon)} \in \mathbb{R}^{B \times L \times 1280}$
+
+ \STATE \textbf{// Stage 1: Dimension Expansion}
+ \STATE $\mathbf{Z}^{(norm)} \leftarrow \text{LayerNorm}(\mathbf{Z}^{(comp)})$ \COMMENT{Normalize compressed input}
+ \STATE $\mathbf{H}^{(expanded)} \leftarrow \mathbf{Z}^{(norm)} \mathbf{W}^{(expand)} + \mathbf{b}^{(expand)}$ \COMMENT{80 → 1280 dimensions}
+
+ \STATE \textbf{// Stage 2: Hourglass Unpooling}
+ \STATE $\mathbf{H}^{(unpool)} \leftarrow \text{repeat\_interleave}(\mathbf{H}^{(expanded)}, 2, \text{dim}=1)$ \COMMENT{L/2 → L length}
+
+ \STATE \textbf{// Equivalent element-wise form of the unpooling above (no additional computation)}
+ \FOR{$b = 0$ to $B-1$} \COMMENT{For each sequence in the batch}
+ \FOR{$i = 0$ to $L/2-1$} \COMMENT{For each compressed position}
+ \STATE $\mathbf{H}^{(unpool)}[b, 2i, :] \leftarrow \mathbf{H}^{(expanded)}[b, i, :]$ \COMMENT{Even positions}
+ \STATE $\mathbf{H}^{(unpool)}[b, 2i+1, :] \leftarrow \mathbf{H}^{(expanded)}[b, i, :]$ \COMMENT{Odd positions}
+ \ENDFOR
+ \ENDFOR
+
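+ \STATE \textbf{// Stage 3: Transformer Refinement}
+ \STATE $\mathbf{H}^{(0)} \leftarrow \mathbf{H}^{(unpool)}$ \COMMENT{Input to the refinement layers}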
+ \FOR{$\ell = 1$ to $2$} \COMMENT{2 refinement transformer layers}
+ \STATE $\mathbf{A}^{(\ell)} \leftarrow \text{MultiHeadAttention}(\mathbf{H}^{(\ell-1)}, \mathbf{H}^{(\ell-1)}, \mathbf{H}^{(\ell-1)})$
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \mathbf{H}^{(\ell-1)} + \text{Dropout}(\mathbf{A}^{(\ell)})$ \COMMENT{Residual connection}
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \text{LayerNorm}(\mathbf{H}^{(\ell)})$ \COMMENT{Post-attention normalization}
+
+ \STATE $\mathbf{F}^{(\ell)} \leftarrow \text{GELU}(\mathbf{H}^{(\ell)} \mathbf{W}_1^{(\ell)} + \mathbf{b}_1^{(\ell)}) \mathbf{W}_2^{(\ell)} + \mathbf{b}_2^{(\ell)}$
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \mathbf{H}^{(\ell)} + \text{Dropout}(\mathbf{F}^{(\ell)})$ \COMMENT{Residual connection}
+ \STATE $\mathbf{H}^{(\ell)} \leftarrow \text{LayerNorm}(\mathbf{H}^{(\ell)})$ \COMMENT{Post-FFN normalization}
+ \ENDFOR
+
+ \STATE $\mathbf{H}^{(recon)} \leftarrow \mathbf{H}^{(2)}$ \COMMENT{Final reconstructed embeddings}
+
+ \RETURN $\mathbf{H}^{(recon)}$
+ \end{algorithmic}
+ \end{algorithm}
+
+ \begin{algorithm}[h]
+ \caption{Joint Compressor-Decompressor Training}
+ \label{alg:joint_training}
+ \begin{algorithmic}[1]
+ \REQUIRE Training dataset $\mathcal{T} = \{\mathbf{H}_1^{(norm)}, \ldots, \mathbf{H}_N^{(norm)}\}$
+ \REQUIRE Hyperparameters: $\text{lr}_{\max}, \text{lr}_{\min}, T_{\text{warmup}}, T_{\text{total}}$
+ \ENSURE Trained compressor $\mathcal{C}(\cdot; \theta_{\mathcal{C}}^*)$ and decompressor $\mathcal{D}(\cdot; \theta_{\mathcal{D}}^*)$
+
+ \STATE \textbf{// Initialize models and optimizer}
+ \STATE $\theta_{\mathcal{C}}, \theta_{\mathcal{D}} \leftarrow \text{InitializeParameters}()$
+ \STATE $\text{optimizer} \leftarrow \text{AdamW}(\{\theta_{\mathcal{C}}, \theta_{\mathcal{D}}\}, \text{lr}=\text{lr}_{\max}, \text{weight\_decay}=10^{-4})$
+
+ \STATE \textbf{// Set up learning rate schedulers (stepped once per batch)}
+ \STATE $\text{warmup\_sched} \leftarrow \text{LinearLR}(\text{optimizer}, \text{start\_factor}=10^{-8}, \text{end\_factor}=1.0, \text{total\_iters}=T_{\text{warmup}})$
+ \STATE $\text{cosine\_sched} \leftarrow \text{CosineAnnealingLR}(\text{optimizer}, T_{\max}=T_{\text{total}} - T_{\text{warmup}}, \eta_{\min}=\text{lr}_{\min})$
+ \STATE $\text{scheduler} \leftarrow \text{SequentialLR}(\text{optimizer}, [\text{warmup\_sched}, \text{cosine\_sched}], \text{milestones}=[T_{\text{warmup}}])$
+
+ \FOR{$\text{epoch} = 1$ to $\text{EPOCHS}$}
+ \STATE $\text{total\_loss} \leftarrow 0$
+ \FOR{$\mathbf{H}^{(batch)} \in \text{DataLoader}(\mathcal{T}, \text{batch\_size}=32, \text{shuffle}=\text{True})$}
+ \STATE \textbf{// Forward pass through compressor-decompressor}
+ \STATE $\mathbf{Z}^{(comp)} \leftarrow \mathcal{C}(\mathbf{H}^{(batch)}; \theta_{\mathcal{C}})$ \COMMENT{Compress}
+ \STATE $\mathbf{H}^{(recon)} \leftarrow \mathcal{D}(\mathbf{Z}^{(comp)}; \theta_{\mathcal{D}})$ \COMMENT{Decompress}
+
+ \STATE \textbf{// Compute reconstruction loss}
+ \STATE $\mathcal{L} \leftarrow \|\mathbf{H}^{(batch)} - \mathbf{H}^{(recon)}\|_2^2 / |\mathbf{H}^{(batch)}|$ \COMMENT{MSE loss}
+
+ \STATE \textbf{// Backward pass and optimization}
+ \STATE $\text{optimizer.zero\_grad()}$
+ \STATE $\mathcal{L}.\text{backward()}$
+ \STATE $\text{torch.nn.utils.clip\_grad\_norm\_}(\{\theta_{\mathcal{C}}, \theta_{\mathcal{D}}\}, \text{max\_norm}=1.0)$
+ \STATE $\text{optimizer.step()}$
+ \STATE $\text{scheduler.step()}$
+
+ \STATE $\text{total\_loss} \leftarrow \text{total\_loss} + \mathcal{L}.\text{item()}$
+ \ENDFOR
+
+ \STATE $\text{avg\_loss} \leftarrow \text{total\_loss} / |\text{DataLoader}|$
+ \STATE \textbf{print} $f$"Epoch \{epoch\}: Average MSE = \{avg\_loss:.6f\}"
+
+ \IF{$\text{epoch} \bmod 5 = 0$} \COMMENT{Save checkpoint every 5 epochs}
+ \STATE $\text{SaveCheckpoint}(\theta_{\mathcal{C}}, \theta_{\mathcal{D}}, \text{optimizer}, \text{avg\_loss}, \text{epoch})$
+ \ENDIF
+ \ENDFOR
+
+ \STATE \textbf{// Save final trained models}
+ \STATE $\text{SaveModel}(\theta_{\mathcal{C}}, \text{"final\_compressor\_model.pth"})$
+ \STATE $\text{SaveModel}(\theta_{\mathcal{D}}, \text{"final\_decompressor\_model.pth"})$
+
+ \RETURN $\theta_{\mathcal{C}}^*, \theta_{\mathcal{D}}^*$
+ \end{algorithmic}
+ \end{algorithm}
+
+ \begin{algorithm}[h]
+ \caption{Hourglass Pooling and Unpooling Operations}
+ \label{alg:hourglass_operations}
+ \begin{algorithmic}[1]
+ \REQUIRE Input tensor $\mathbf{X} \in \mathbb{R}^{B \times L \times D}$
+ \ENSURE Pooled tensor $\mathbf{X}^{(pool)} \in \mathbb{R}^{B \times L/2 \times D}$ and unpooled tensor $\mathbf{X}^{(unpool)} \in \mathbb{R}^{B \times L \times D}$
+
+ \STATE \textbf{// Hourglass Pooling Operation}
+ \FUNCTION{HourglassPool}{$\mathbf{X}$}
+ \STATE $B, L, D \leftarrow \mathbf{X}.\text{shape}$
+
+ \IF{$L \bmod 2 = 1$} \COMMENT{Handle odd sequence lengths}
+ \STATE $\mathbf{X} \leftarrow \mathbf{X}[:, :L-1, :]$ \COMMENT{Remove last position}
+ \STATE $L \leftarrow L - 1$
+ \ENDIF
+
+ \STATE $\mathbf{X}^{(grouped)} \leftarrow \text{Reshape}(\mathbf{X}, [B, L/2, 2, D])$ \COMMENT{Group adjacent positions}
+ \STATE $\mathbf{X}^{(pool)} \leftarrow \text{Mean}(\mathbf{X}^{(grouped)}, \text{dim}=2)$ \COMMENT{Average grouped positions}
+
+ \RETURN $\mathbf{X}^{(pool)}$
+ \ENDFUNCTION
+
+ \STATE \textbf{// Hourglass Unpooling Operation}
+ \FUNCTION{HourglassUnpool}{$\mathbf{X}^{(pool)}$}
+ \STATE $B, L_{pool}, D \leftarrow \mathbf{X}^{(pool)}.\text{shape}$
+ \STATE $L \leftarrow 2 \times L_{pool}$ \COMMENT{Double the sequence length}
+
+ \STATE $\mathbf{X}^{(unpool)} \leftarrow \text{repeat\_interleave}(\mathbf{X}^{(pool)}, 2, \text{dim}=1)$
+
+ \STATE \textbf{// Verify unpooling correctness}
+ \FOR{$b = 0$ to $B-1$}
+ \FOR{$i = 0$ to $L_{pool}-1$}
+ \STATE \textbf{assert} $\mathbf{X}^{(unpool)}[b, 2i, :] = \mathbf{X}^{(pool)}[b, i, :]$
+ \STATE \textbf{assert} $\mathbf{X}^{(unpool)}[b, 2i+1, :] = \mathbf{X}^{(pool)}[b, i, :]$
+ \ENDFOR
+ \ENDFOR
+
+ \RETURN $\mathbf{X}^{(unpool)}$
+ \ENDFUNCTION
+
+ \STATE \textbf{// Round trip: pooling followed by unpooling}
+ \STATE $\mathbf{X}^{(pool)} \leftarrow \text{HourglassPool}(\mathbf{X})$
+ \STATE $\mathbf{X}^{(unpool)} \leftarrow \text{HourglassUnpool}(\mathbf{X}^{(pool)})$
+ \STATE \textbf{// Note: in general $\mathbf{X}^{(unpool)} \neq \mathbf{X}$, because averaging discards the difference within each pair}
+ \STATE \textbf{// Only the pairwise means and the (even) sequence length are recovered}
+
+ \RETURN $\mathbf{X}^{(pool)}, \mathbf{X}^{(unpool)}$
+ \end{algorithmic}
+ \end{algorithm}