Add comprehensive documentation: encoder_process_latex.tex
documentation/encoder_process.tex
ADDED
|
@@ -0,0 +1,254 @@
| 1 |
+
\section{ESM-2 Contextual Encoder and Compression Pipeline}
|
| 2 |
+
\label{sec:encoder}
|
| 3 |
+
|
| 4 |
+
Our encoder transforms raw amino acid sequences into compressed contextual embeddings suitable for flow matching generation. The pipeline consists of four main stages: (1) sequence preprocessing and validation, (2) ESM-2 contextual embedding extraction, (3) statistical normalization, and (4) transformer-based compression with hourglass pooling.
|
| 5 |
+
|
| 6 |
+
\subsection{Encoder Architecture Overview}
|
| 7 |
+
|
| 8 |
+
The complete encoding pipeline $\mathcal{E}: \mathcal{S} \rightarrow \mathbb{R}^{L' \times d_{comp}}$ transforms sequences $s \in \mathcal{S}$ from the amino acid alphabet to compressed embeddings, where $L' = L/2$ due to hourglass pooling and $d_{comp} = 80$ is the compressed dimension:
|
| 9 |
+
|
| 10 |
+
\begin{align}
|
| 11 |
+
s &\rightarrow \mathbf{H}^{(esm)} \rightarrow \mathbf{H}^{(norm)} \rightarrow \mathbf{Z}^{(comp)} \label{eq:encoding_pipeline}
|
| 12 |
+
\end{align}
|
| 13 |
+
|
| 14 |
+
\subsubsection{Sequence Preprocessing and Validation}
|
| 15 |
+
\label{sec:preprocessing}
|
| 16 |
+
|
| 17 |
+
Input sequences undergo rigorous preprocessing to ensure compatibility with ESM-2 and biological validity:
|
| 18 |
+
|
| 19 |
+
\begin{enumerate}
|
| 20 |
+
\item \textbf{Canonical Amino Acid Filtering}: Only sequences containing the 20 canonical amino acids $\mathcal{A} = \{$A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y$\}$ are accepted.
|
| 21 |
+
|
| 22 |
+
\item \textbf{Length Constraints}: Sequences are filtered to $L_{min} \leq |s| \leq L_{max}$ where $L_{min} = 2$ and $L_{max} = 50$ for antimicrobial peptides.
|
| 23 |
+
|
| 24 |
+
\item \textbf{Sequence Standardization}: All sequences are converted to uppercase and stripped of whitespace.
|
| 25 |
+
|
| 26 |
+
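\item \textbf{Padding and Truncation}: Sequences are standardized to length $L = 50$ by appending padding tokens to shorter sequences. Truncation to the first $L$ residues acts only as a safeguard, since the length filter above already excludes sequences longer than $L_{max} = 50$.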
|
| 27 |
+
\end{enumerate}
|
| 28 |
+
|
| 29 |
+
The preprocessing function $\text{Preprocess}(s)$ ensures uniform input format:
|
| 30 |
+
|
| 31 |
+
\begin{align}
|
| 32 |
+
s' = \begin{cases}
|
| 33 |
+
s \oplus \mathbf{0}^{L-|s|} & \text{if } |s| < L \\
|
| 34 |
+
s_{1:L} & \text{if } |s| \geq L
|
| 35 |
+
\end{cases} \label{eq:padding}
|
| 36 |
+
\end{align}
|
| 37 |
+
|
| 38 |
+
where $\oplus$ denotes concatenation and $\mathbf{0}^{k}$ represents $k$ padding tokens.
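The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the pipeline's actual code: the names \texttt{preprocess}, \texttt{CANONICAL\_AAS}, and \texttt{PAD} are hypothetical, the string \texttt{<pad>} stands in for the padding token $\mathbf{0}$ of Eq.~\ref{eq:padding}, and standardization is applied before filtering (an assumption) so that lowercase input is not rejected.

```python
from typing import List, Optional

CANONICAL_AAS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
L_MIN, L_MAX, TARGET_LEN = 2, 50, 50
PAD = "<pad>"  # hypothetical padding symbol, stands in for the 0 token


def preprocess(seq: str) -> Optional[List[str]]:
    """Return a token list of length TARGET_LEN, or None if the sequence is rejected."""
    # Standardize: remove all whitespace and convert to uppercase.
    seq = "".join(seq.split()).upper()
    # Length filter: L_MIN <= |s| <= L_MAX.
    if not (L_MIN <= len(seq) <= L_MAX):
        return None
    # Canonical amino acid filter.
    if any(ch not in CANONICAL_AAS for ch in seq):
        return None
    # Truncate (a no-op while L_MAX <= TARGET_LEN), then pad on the right.
    tokens = list(seq[:TARGET_LEN])
    return tokens + [PAD] * (TARGET_LEN - len(tokens))


example = preprocess(" gigkflhsakkfgkafvgeimns ")
```

Under these assumptions, the 23-residue example yields 23 residue tokens followed by 27 padding tokens, while a sequence containing a non-canonical symbol such as \texttt{X} returns \texttt{None}.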
|
| 39 |
+
|
| 40 |
+
\subsubsection{ESM-2 Contextual Embedding Extraction}
|
| 41 |
+
\label{sec:esm_embedding}
|
| 42 |
+
|
| 43 |
+
We utilize the pre-trained ESM-2 model (esm2\_t33\_650M\_UR50D) to extract contextual per-residue embeddings. ESM-2's 33-layer transformer architecture captures evolutionary relationships and structural constraints learned from 65 million protein sequences.
|
| 44 |
+
|
| 45 |
+
The embedding extraction process follows ESM-2's standard protocol:
|
| 46 |
+
|
| 47 |
+
\begin{align}
|
| 48 |
+
\mathbf{T} &= \text{Tokenize}(s') \in \mathbb{Z}^{L+2} \label{eq:tokenization}\\
|
| 49 |
+
\mathbf{H}^{(raw)} &= \text{ESM-2}_{33}(\mathbf{T}) \in \mathbb{R}^{(L+2) \times 1280} \label{eq:esm_forward}\\
|
| 50 |
+
\mathbf{H}^{(esm)} &= \mathbf{H}^{(raw)}_{2:L+1, :} \in \mathbb{R}^{L \times 1280} \label{eq:cls_eos_removal}
|
| 51 |
+
\end{align}
|
| 52 |
+
|
| 53 |
+
where tokenization adds special CLS and EOS tokens, and we extract representations from the 33rd (final) layer, removing the special tokens to obtain per-residue embeddings.
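The index conventions in Eq.~\ref{eq:cls_eos_removal} (1-based, inclusive) and in Algorithm~\ref{alg:encoder} (0-based, half-open) select the same $L$ positions. The NumPy sketch below illustrates the slice on a random stand-in array. It does not load ESM-2, and the helper name \texttt{strip\_special\_tokens} is hypothetical.

```python
import numpy as np


def strip_special_tokens(h_raw: np.ndarray, seq_len: int) -> np.ndarray:
    """Drop the leading CLS and trailing EOS positions.

    h_raw has shape (batch, seq_len + 2, dim); the result has shape
    (batch, seq_len, dim).
    """
    if h_raw.shape[1] != seq_len + 2:
        raise ValueError("expected seq_len + 2 positions (CLS + residues + EOS)")
    # 0-based half-open slice 1:seq_len+1 equals the 1-based inclusive range 2..L+1.
    return h_raw[:, 1:seq_len + 1, :]


# Random stand-in for a final-layer output; no ESM-2 model is involved here.
h_raw = np.random.default_rng(0).normal(size=(4, 52, 1280)).astype(np.float32)
h_esm = strip_special_tokens(h_raw, seq_len=50)
```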
|
| 54 |
+
|
| 55 |
+
\subsubsection{Statistical Normalization}
|
| 56 |
+
\label{sec:normalization}
|
| 57 |
+
|
| 58 |
+
To stabilize training and ensure consistent embedding magnitudes across the dataset, we apply a two-stage normalization scheme computed from dataset statistics:
|
| 59 |
+
|
| 60 |
+
\begin{align}
|
| 61 |
+
\boldsymbol{\mu} &= \mathbb{E}[\mathbf{H}^{(esm)}], \quad \boldsymbol{\sigma}^2 = \text{Var}[\mathbf{H}^{(esm)}] \label{eq:dataset_stats}\\
|
| 62 |
+
\mathbf{H}^{(z)} &= \text{clamp}\left(\frac{\mathbf{H}^{(esm)} - \boldsymbol{\mu}}{\boldsymbol{\sigma} + \epsilon}, -4, 4\right) \label{eq:z_score}\\
|
| 63 |
+
\boldsymbol{\mu}_{min} &= \min(\mathbf{H}^{(z)}), \quad \boldsymbol{\mu}_{max} = \max(\mathbf{H}^{(z)}) \label{eq:minmax_stats}\\
|
| 64 |
+
\mathbf{H}^{(norm)} &= \text{clamp}\left(\frac{\mathbf{H}^{(z)} - \boldsymbol{\mu}_{min}}{\boldsymbol{\mu}_{max} - \boldsymbol{\mu}_{min} + \epsilon}, 0, 1\right) \label{eq:minmax_norm}
|
| 65 |
+
\end{align}
|
| 66 |
+
|
| 67 |
+
where $\epsilon = 10^{-8}$ prevents division by zero and the clamping operations bound outlier values. All statistics are computed per embedding dimension over the dataset (Algorithm~\ref{alg:dataset_stats}). The scheme combines z-score standardization with min-max scaling to produce embeddings in $[0, 1]^{L \times 1280}$.
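To make the two stages concrete, the following NumPy sketch applies Eqs.~\ref{eq:z_score} and~\ref{eq:minmax_norm} given precomputed statistics. The function name \texttt{normalize} and the argument names \texttt{z\_min} and \texttt{z\_max} (for $\boldsymbol{\mu}_{min}$ and $\boldsymbol{\mu}_{max}$) are hypothetical, and the statistics in the toy call are placeholders, not values from our dataset.

```python
import numpy as np

EPS = 1e-8


def normalize(h_esm, mu, sigma, z_min, z_max, eps=EPS):
    """Clamped z-score followed by clamped min-max scaling."""
    # Stage 1: z-score with values clamped to [-4, 4].
    h_z = np.clip((h_esm - mu) / (sigma + eps), -4.0, 4.0)
    # Stage 2: min-max scaling with values clamped to [0, 1].
    return np.clip((h_z - z_min) / (z_max - z_min + eps), 0.0, 1.0)


# Toy input with placeholder statistics (mu=0, sigma=1, z range [-4, 4]).
h = np.array([[-10.0, 0.0, 2.0, 10.0]])
out = normalize(h, mu=0.0, sigma=1.0, z_min=-4.0, z_max=4.0)
# out is approximately [[0.0, 0.5, 0.75, 1.0]]
```

The two outliers ($\pm 10$) are clamped to $\pm 4$ in the first stage and therefore map to the ends of the unit interval.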
|
| 68 |
+
|
| 69 |
+
\subsubsection{Transformer-Based Compression with Hourglass Pooling}
|
| 70 |
+
\label{sec:compression}
|
| 71 |
+
|
| 72 |
+
The compressor $\mathcal{C}: \mathbb{R}^{L \times 1280} \rightarrow \mathbb{R}^{L/2 \times 80}$ employs an hourglass architecture inspired by ProtFlow, combining transformer self-attention with pooling along the sequence dimension for efficient compression:
|
| 73 |
+
|
| 74 |
+
\begin{align}
|
| 75 |
+
\mathbf{H}^{(0)} &= \text{LayerNorm}(\mathbf{H}^{(norm)}) \label{eq:input_norm}\\
|
| 76 |
+
\mathbf{H}^{(pre)} &= \text{TransformerEncoder}^{(2)}(\mathbf{H}^{(0)}) \label{eq:pre_transformer}\\
|
| 77 |
+
\mathbf{H}^{(pool)} &= \text{HourglassPool}(\mathbf{H}^{(pre)}) \label{eq:hourglass_pool}\\
|
| 78 |
+
\mathbf{H}^{(post)} &= \text{TransformerEncoder}^{(2)}(\mathbf{H}^{(pool)}) \label{eq:post_transformer}\\
|
| 79 |
+
\mathbf{Z}^{(comp)} &= \tanh(\text{LayerNorm}(\mathbf{H}^{(post)}) \mathbf{W}^{(proj)} + \mathbf{b}^{(proj)}) \label{eq:final_projection}
|
| 80 |
+
\end{align}
|
| 81 |
+
|
| 82 |
+
The hourglass pooling operation reduces sequence length while preserving critical information:
|
| 83 |
+
|
| 84 |
+
\begin{align}
|
| 85 |
+
\text{HourglassPool}(\mathbf{H}) = \begin{cases}
|
| 86 |
+
\text{Reshape}(\mathbf{H}_{1:L-1}, [B, (L-1)/2, 2, D]) & \text{if } L \text{ is odd} \\
|
| 87 |
+
\text{Reshape}(\mathbf{H}, [B, L/2, 2, D]) & \text{otherwise}
|
| 88 |
+
\end{cases} \label{eq:reshape_pool}
|
| 89 |
+
\end{align}
|
| 90 |
+
|
| 91 |
+
followed by mean pooling across the grouped dimension:
|
| 92 |
+
|
| 93 |
+
\begin{align}
|
| 94 |
+
\mathbf{H}^{(pool)} = \text{Mean}(\text{Reshape}(\mathbf{H}), \text{dim}=2) \label{eq:mean_pool}
|
| 95 |
+
\end{align}
|
| 96 |
+
|
| 97 |
+
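This pooling strategy halves the sequence length, which cuts the cost of the subsequent self-attention layers (quadratic in sequence length) roughly fourfold, while each pooled position still summarizes a pair of adjacent residues.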
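The reshape-and-average operation of Eqs.~\ref{eq:reshape_pool} and~\ref{eq:mean_pool} can be sketched in NumPy as follows. This is an illustrative stand-in for the actual implementation, and the function name \texttt{hourglass\_pool} is an assumption.

```python
import numpy as np


def hourglass_pool(h: np.ndarray) -> np.ndarray:
    """Average adjacent position pairs: (B, L, D) -> (B, L // 2, D)."""
    b, length, d = h.shape
    if length % 2 == 1:
        # Odd length: drop the last position so that positions pair up.
        h = h[:, :length - 1, :]
        length -= 1
    # Group adjacent positions, then average over the group axis.
    return h.reshape(b, length // 2, 2, d).mean(axis=2)


# One sequence of odd length 5 with D = 2: rows [0,1], [2,3], [4,5], [6,7], [8,9].
h = np.arange(10, dtype=np.float64).reshape(1, 5, 2)
pooled = hourglass_pool(h)
# pooled[0] is [[1., 2.], [5., 6.]]; the last row [8, 9] is discarded.
```

Note that for odd $L$ the final position contributes nothing to the output. With the fixed length $L = 50$ used here this branch is never taken.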
|
| 98 |
+
|
| 99 |
+
\subsection{Transformer Architecture Details}
|
| 100 |
+
|
| 101 |
+
Both pre-pooling and post-pooling transformer encoders use identical architectures:
|
| 102 |
+
|
| 103 |
+
\begin{itemize}
|
| 104 |
+
\item \textbf{Layers}: 2 transformer encoder layers each (4 total)
|
| 105 |
+
\item \textbf{Attention Heads}: 8 attention heads per layer
|
| 106 |
+
\item \textbf{Hidden Dimension}: 1280 (matching ESM-2)
|
| 107 |
+
\item \textbf{Feedforward Dimension}: 5120 ($4\times$ hidden dimension)
|
| 108 |
+
\item \textbf{Activation}: GELU activation in feedforward layers
|
| 109 |
+
\item \textbf{Dropout}: 0.1 dropout rate during training
|
| 110 |
+
\end{itemize}
|
| 111 |
+
|
| 112 |
+
The final projection layer $\mathbf{W}^{(proj)} \in \mathbb{R}^{1280 \times 80}$ compresses to the target dimension with tanh activation to bound outputs in $[-1, 1]^{L/2 \times 80}$.
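As a worked example under the settings stated above ($L = 50$, hidden dimension 1280, $d_{comp} = 80$), the tensor shapes through the compressor and the resulting reduction in element count are:

\begin{align*}
\mathbf{H}^{(norm)},\ \mathbf{H}^{(0)},\ \mathbf{H}^{(pre)} &\in \mathbb{R}^{50 \times 1280}, \\
\mathbf{H}^{(pool)},\ \mathbf{H}^{(post)} &\in \mathbb{R}^{25 \times 1280}, \\
\mathbf{Z}^{(comp)} &\in \mathbb{R}^{25 \times 80}, \\
\frac{50 \cdot 1280}{25 \cdot 80} &= \frac{64{,}000}{2{,}000} = 32.
\end{align*}

The factor of 32 is the product of the $2\times$ length reduction from pooling and the $16\times$ reduction from the projection. It is derived here from the stated dimensions rather than taken from a measurement.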
|
| 113 |
+
|
| 114 |
+
\subsection{Training Objective and Optimization}
|
| 115 |
+
|
| 116 |
+
The compressor-decompressor pair is trained end-to-end with a reconstruction loss to ensure information preservation:
|
| 117 |
+
|
| 118 |
+
\begin{align}
|
| 119 |
+
\mathcal{L}_{\text{recon}} &= \mathbb{E}_{\mathbf{H} \sim p_{\text{data}}} \left[ \|\mathbf{H} - \mathcal{D}(\mathcal{C}(\mathbf{H}))\|_2^2 \right] \label{eq:reconstruction_loss}
|
| 120 |
+
\end{align}
|
| 121 |
+
|
| 122 |
+
where $\mathcal{C}$ is the compressor, $\mathcal{D}$ is the decompressor, and $p_{\text{data}}$ denotes the empirical distribution of normalized embeddings $\mathbf{H}^{(norm)}$.
|
| 123 |
+
|
| 124 |
+
Training employs AdamW optimization with cosine annealing:
|
| 125 |
+
|
| 126 |
+
\begin{align}
|
| 127 |
+
\text{lr}(t) = \text{lr}_{\min} + \frac{1}{2}(\text{lr}_{\max} - \text{lr}_{\min})(1 + \cos(\pi t / T)) \label{eq:cosine_schedule}
|
| 128 |
+
\end{align}
|
| 129 |
+
|
| 130 |
+
with a linear warmup over the first $T_{\text{warmup}} = 10{,}000$ steps:
|
| 131 |
+
|
| 132 |
+
\begin{align}
|
| 133 |
+
\text{lr}_{\text{warmup}}(t) = \text{lr}_{\max} \cdot \frac{t}{T_{\text{warmup}}} \label{eq:warmup_schedule}
|
| 134 |
+
\end{align}
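The two schedules in Eqs.~\ref{eq:cosine_schedule} and~\ref{eq:warmup_schedule} can be combined into a single function, sketched below in plain Python. How the phases are joined is not specified above. The sketch assumes that the cosine phase begins when warmup ends, with $t$ counted from that point and $T$ equal to the number of remaining steps. The function name \texttt{learning\_rate} and all numeric values in the example call are placeholders, not our training configuration.

```python
import math


def learning_rate(step, lr_max, lr_min, warmup_steps, total_steps):
    """Linear warmup to lr_max, then cosine annealing down to lr_min."""
    if step < warmup_steps:
        # Warmup phase: lr rises linearly from 0 to lr_max.
        return lr_max * step / warmup_steps
    # Cosine phase (assumption: t restarts at 0 when warmup ends).
    t = min(step - warmup_steps, total_steps - warmup_steps)
    T = max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / T))


# Placeholder hyperparameters, for illustration only.
schedule = [
    learning_rate(s, lr_max=1e-4, lr_min=1e-6, warmup_steps=10_000, total_steps=100_000)
    for s in (0, 5_000, 10_000, 100_000)
]
```

With these placeholder values the schedule starts at zero, reaches half of $\text{lr}_{\max}$ midway through warmup, peaks at $\text{lr}_{\max}$ when warmup ends, and decays to $\text{lr}_{\min}$ at the final step.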
|
| 135 |
+
|
| 136 |
+
\subsection{Computational Efficiency and Scalability}
|
| 137 |
+
|
| 138 |
+
The encoder pipeline is optimized for large-scale processing:
|
| 139 |
+
|
| 140 |
+
\begin{itemize}
|
| 141 |
+
\item \textbf{Batch Processing}: Dynamic batching with GPU memory management
|
| 142 |
+
\item \textbf{Memory Optimization}: Gradient checkpointing and mixed precision training
|
| 143 |
+
\item \textbf{Parallel Processing}: Multi-GPU support with data parallelism
|
| 144 |
+
\item \textbf{Storage Efficiency}: Individual and combined tensor storage formats
|
| 145 |
+
\end{itemize}
|
| 146 |
+
|
| 147 |
+
Processing statistics for our dataset:
|
| 148 |
+
\begin{itemize}
|
| 149 |
+
\item \textbf{Dataset Size}: 6,983 validated AMP sequences
|
| 150 |
+
\item \textbf{Processing Speed}: $\sim$100 sequences/second on an A100 GPU
|
| 151 |
+
\item \textbf{Memory Usage}: $\sim$8\,GB of GPU memory at batch size 32
|
| 152 |
+
\item \textbf{Storage Requirements}: $\sim$2.1\,GB for compressed embeddings
|
| 153 |
+
\end{itemize}
|
| 154 |
+
|
| 155 |
+
\subsection{Embedding Quality and Validation}
|
| 156 |
+
|
| 157 |
+
The compressed embeddings maintain high fidelity to the original ESM-2 representations:
|
| 158 |
+
|
| 159 |
+
\begin{itemize}
|
| 160 |
+
\item \textbf{Reconstruction MSE}: $< 0.01$ on validation set
|
| 161 |
+
\item \textbf{Cosine Similarity}: $> 0.95$ between original and reconstructed embeddings
|
| 162 |
+
\item \textbf{Downstream Performance}: Maintained classification accuracy on AMP prediction tasks
|
| 163 |
+
\item \textbf{Compression Ratio}: $16\times$ reduction in embedding dimension ($1280 \rightarrow 80$)
|
| 164 |
+
\end{itemize}
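The first two metrics can be computed as in the NumPy sketch below. The exact reduction used for the reported numbers is not stated above. The sketch assumes an element-wise mean for the MSE and a per-residue cosine similarity averaged over all positions, and the function names are hypothetical.

```python
import numpy as np


def reconstruction_mse(h: np.ndarray, h_rec: np.ndarray) -> float:
    """Mean squared error over all elements."""
    return float(np.mean((h - h_rec) ** 2))


def mean_cosine_similarity(h: np.ndarray, h_rec: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity along the embedding axis, averaged over positions."""
    dot = np.sum(h * h_rec, axis=-1)
    norms = np.linalg.norm(h, axis=-1) * np.linalg.norm(h_rec, axis=-1)
    return float(np.mean(dot / (norms + eps)))


# Toy tensors of shape (batch=1, positions=2, dim=2).
h = np.array([[[1.0, 0.0], [0.0, 2.0]]])
h_rec = np.array([[[0.0, 1.0], [0.0, 2.0]]])
mse = reconstruction_mse(h, h_rec)
cos = mean_cosine_similarity(h, h_rec)
```

In the toy example the first position is reconstructed orthogonally and the second exactly, so both the MSE and the mean cosine similarity come out at about 0.5.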
|
| 165 |
+
|
| 166 |
+
\begin{algorithm}[h]
|
| 167 |
+
\caption{ESM-2 Contextual Encoder Pipeline}
|
| 168 |
+
\label{alg:encoder}
|
| 169 |
+
\begin{algorithmic}[1]
|
| 170 |
+
\REQUIRE Raw amino acid sequences $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$
|
| 171 |
+
\REQUIRE Pre-trained ESM-2 model and compressor weights
|
| 172 |
+
\REQUIRE Dataset normalization statistics $\{\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\mu}_{min}, \boldsymbol{\mu}_{max}\}$
|
| 173 |
+
\ENSURE Compressed embeddings $\mathbf{Z}^{(comp)} \in \mathbb{R}^{N \times L/2 \times 80}$
|
| 174 |
+
|
| 175 |
+
\STATE \textbf{// Stage 1: Sequence Preprocessing}
|
| 176 |
+
\FOR{$i = 1$ to $N$}
|
| 177 |
+
\STATE $s_i' \leftarrow \text{Preprocess}(s_i)$ \COMMENT{Filter, pad/truncate to length $L$}
|
| 178 |
+
\STATE \textbf{assert} $|s_i'| = L$ and $s_i' \in (\mathcal{A} \cup \{\mathbf{0}\})^L$ \COMMENT{Validate canonical AAs and padding}
|
| 179 |
+
\ENDFOR
|
| 180 |
+
|
| 181 |
+
\STATE \textbf{// Stage 2: ESM-2 Embedding Extraction}
|
| 182 |
+
\STATE $\mathcal{B} \leftarrow \text{CreateBatches}(\{s_1', \ldots, s_N'\}, \text{batch\_size})$
|
| 183 |
+
\FOR{$\mathbf{B} \in \mathcal{B}$}
|
| 184 |
+
\STATE $\mathbf{T} \leftarrow \text{ESM2Tokenize}(\mathbf{B})$ \COMMENT{Add CLS/EOS tokens}
|
| 185 |
+
\STATE $\mathbf{H}^{(raw)} \leftarrow \text{ESM-2}_{33}(\mathbf{T})$ \COMMENT{Extract layer 33 representations}
|
| 186 |
+
\STATE $\mathbf{H}^{(esm)} \leftarrow \mathbf{H}^{(raw)}[:, 1:L+1, :]$ \COMMENT{Remove CLS/EOS tokens}
|
| 187 |
+
\ENDFOR
|
| 188 |
+
|
| 189 |
+
\STATE \textbf{// Stage 3: Statistical Normalization}
|
| 190 |
+
\FOR{$i = 1$ to $N$}
|
| 191 |
+
\STATE $\mathbf{H}_i^{(z)} \leftarrow \text{clamp}\left(\frac{\mathbf{H}_i^{(esm)} - \boldsymbol{\mu}}{\boldsymbol{\sigma} + \epsilon}, -4, 4\right)$
|
| 192 |
+
\STATE $\mathbf{H}_i^{(norm)} \leftarrow \text{clamp}\left(\frac{\mathbf{H}_i^{(z)} - \boldsymbol{\mu}_{min}}{\boldsymbol{\mu}_{max} - \boldsymbol{\mu}_{min} + \epsilon}, 0, 1\right)$
|
| 193 |
+
\ENDFOR
|
| 194 |
+
|
| 195 |
+
\STATE \textbf{// Stage 4: Transformer Compression}
|
| 196 |
+
\FOR{$i = 1$ to $N$}
|
| 197 |
+
\STATE $\mathbf{H}_i^{(0)} \leftarrow \text{LayerNorm}(\mathbf{H}_i^{(norm)})$ \COMMENT{Input normalization}
|
| 198 |
+
\STATE $\mathbf{H}_i^{(pre)} \leftarrow \text{TransformerEncoder}^{(2)}(\mathbf{H}_i^{(0)})$ \COMMENT{Pre-pooling layers}
|
| 199 |
+
\STATE $\mathbf{H}_i^{(pool)} \leftarrow \text{HourglassPool}(\mathbf{H}_i^{(pre)})$ \COMMENT{Spatial pooling}
|
| 200 |
+
\STATE $\mathbf{H}_i^{(post)} \leftarrow \text{TransformerEncoder}^{(2)}(\mathbf{H}_i^{(pool)})$ \COMMENT{Post-pooling layers}
|
| 201 |
+
\STATE $\mathbf{Z}_i^{(comp)} \leftarrow \tanh(\text{LayerNorm}(\mathbf{H}_i^{(post)}) \mathbf{W}^{(proj)} + \mathbf{b}^{(proj)})$
|
| 202 |
+
\ENDFOR
|
| 203 |
+
|
| 204 |
+
\STATE $\mathbf{Z}^{(comp)} \leftarrow \text{Stack}(\{\mathbf{Z}_1^{(comp)}, \ldots, \mathbf{Z}_N^{(comp)}\})$
|
| 205 |
+
\RETURN $\mathbf{Z}^{(comp)}$
|
| 206 |
+
\end{algorithmic}
|
| 207 |
+
\end{algorithm}
|
| 208 |
+
|
| 209 |
+
\begin{algorithm}[h]
|
| 210 |
+
\caption{Hourglass Pooling Operation}
|
| 211 |
+
\label{alg:hourglass_pool}
|
| 212 |
+
\begin{algorithmic}[1]
|
| 213 |
+
\REQUIRE Input embeddings $\mathbf{H} \in \mathbb{R}^{B \times L \times D}$
|
| 214 |
+
\ENSURE Pooled embeddings $\mathbf{H}^{(pool)} \in \mathbb{R}^{B \times \lfloor L/2 \rfloor \times D}$
|
| 215 |
+
|
| 216 |
+
\IF{$L \bmod 2 = 1$} \COMMENT{Handle odd sequence lengths}
|
| 217 |
+
\STATE $\mathbf{H} \leftarrow \mathbf{H}[:, :L-1, :]$ \COMMENT{Remove last position}
|
| 218 |
+
\STATE $L \leftarrow L - 1$
|
| 219 |
+
\ENDIF
|
| 220 |
+
|
| 221 |
+
\STATE $\mathbf{H}^{(reshaped)} \leftarrow \text{Reshape}(\mathbf{H}, [B, L/2, 2, D])$ \COMMENT{Group adjacent positions}
|
| 222 |
+
\STATE $\mathbf{H}^{(pool)} \leftarrow \text{Mean}(\mathbf{H}^{(reshaped)}, \text{dim}=2)$ \COMMENT{Average grouped positions}
|
| 223 |
+
|
| 224 |
+
\RETURN $\mathbf{H}^{(pool)}$
|
| 225 |
+
\end{algorithmic}
|
| 226 |
+
\end{algorithm}
|
| 227 |
+
|
| 228 |
+
\begin{algorithm}[h]
|
| 229 |
+
\caption{Dataset Statistics Computation}
|
| 230 |
+
\label{alg:dataset_stats}
|
| 231 |
+
\begin{algorithmic}[1]
|
| 232 |
+
\REQUIRE ESM-2 embeddings $\{\mathbf{H}_1^{(esm)}, \ldots, \mathbf{H}_N^{(esm)}\}$
|
| 233 |
+
\ENSURE Normalization statistics $\{\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\mu}_{min}, \boldsymbol{\mu}_{max}\}$
|
| 234 |
+
|
| 235 |
+
\STATE $\mathbf{H}^{(flat)} \leftarrow \text{Concatenate}(\{\text{Flatten}(\mathbf{H}_i^{(esm)})\}_{i=1}^N)$ \COMMENT{Flatten all embeddings}
|
| 236 |
+
|
| 237 |
+
\STATE \textbf{// Compute z-score statistics}
|
| 238 |
+
\STATE $\boldsymbol{\mu} \leftarrow \text{Mean}(\mathbf{H}^{(flat)}, \text{dim}=0)$ \COMMENT{Per-dimension mean}
|
| 239 |
+
\STATE $\boldsymbol{\sigma}^2 \leftarrow \text{Var}(\mathbf{H}^{(flat)}, \text{dim}=0)$ \COMMENT{Per-dimension variance}
|
| 240 |
+
\STATE $\boldsymbol{\sigma} \leftarrow \sqrt{\boldsymbol{\sigma}^2}$ \COMMENT{Per-dimension standard deviation}
|
| 241 |
+
|
| 242 |
+
\STATE \textbf{// Apply z-score normalization}
|
| 243 |
+
\STATE $\mathbf{H}^{(z)} \leftarrow \text{clamp}\left(\frac{\mathbf{H}^{(flat)} - \boldsymbol{\mu}}{\boldsymbol{\sigma} + \epsilon}, -4, 4\right)$ \COMMENT{Add epsilon for stability, as in Eq.~\ref{eq:z_score}}
|
| 244 |
+
|
| 245 |
+
\STATE \textbf{// Compute min-max statistics}
|
| 246 |
+
\STATE $\boldsymbol{\mu}_{min} \leftarrow \text{Min}(\mathbf{H}^{(z)}, \text{dim}=0)$ \COMMENT{Per-dimension minimum}
|
| 247 |
+
\STATE $\boldsymbol{\mu}_{max} \leftarrow \text{Max}(\mathbf{H}^{(z)}, \text{dim}=0)$ \COMMENT{Per-dimension maximum}
|
| 248 |
+
|
| 249 |
+
\STATE \textbf{// Save statistics for inference}
|
| 250 |
+
\STATE $\text{Save}(\{\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\mu}_{min}, \boldsymbol{\mu}_{max}\}, \texttt{normalization\_stats.pt})$
|
| 251 |
+
|
| 252 |
+
\RETURN $\{\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\mu}_{min}, \boldsymbol{\mu}_{max}\}$
|
| 253 |
+
\end{algorithmic}
|
| 254 |
+
\end{algorithm}
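The statistics computation in Algorithm~\ref{alg:dataset_stats} can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the pipeline's code: the function name \texttt{compute\_stats} and the dictionary keys are hypothetical, the population variance (NumPy's default) is used because the estimator is not specified above, $\epsilon$ is added to $\boldsymbol{\sigma}$ as in Eq.~\ref{eq:z_score}, and the save step is omitted.

```python
import numpy as np


def compute_stats(embeddings, eps=1e-8):
    """Per-dimension statistics from a list of (L, D) embedding arrays."""
    # Flatten all residue positions into one (N * L, D) matrix.
    flat = np.concatenate([e.reshape(-1, e.shape[-1]) for e in embeddings], axis=0)
    mu = flat.mean(axis=0)
    sigma = flat.std(axis=0)  # population standard deviation (assumption)
    # Clamped z-scores, from which the min-max statistics are taken.
    h_z = np.clip((flat - mu) / (sigma + eps), -4.0, 4.0)
    return {
        "mu": mu,
        "sigma": sigma,
        "z_min": h_z.min(axis=0),
        "z_max": h_z.max(axis=0),
    }


# Toy data: 10 random "embeddings" of shape (50, 8) instead of (50, 1280).
rng = np.random.default_rng(0)
toy = [rng.normal(size=(50, 8)) for _ in range(10)]
stats = compute_stats(toy)
```

Because the minimum and maximum are taken after clamping, they always lie within $[-4, 4]$.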
|