Commit eaad7fb by AbstractPhil · verified · 1 Parent(s): d80e36b

Update analysis_bert_large_clip-vit-b+bigG+dino2-l-16.txt

analysis_bert_large_clip-vit-b+bigG+dino2-l-16.txt CHANGED
@@ -0,0 +1,309 @@
+ Loading BERT-large...
+ config.json: 100%
+  571/571 [00:00<00:00, 70.6kB/s]
+ model.safetensors:  65%
+  871M/1.34G [00:04<00:13, 34.8MB/s]
+ Loading weights: 100%
+  391/391 [00:00<00:00, 1112.55it/s, Materializing param=pooler.dense.weight]
+ BertModel LOAD REPORT from: google-bert/bert-large-uncased
+ Key                                        | Status     |  |
+ -------------------------------------------+------------+--+-
+ cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  |
+ cls.predictions.transform.dense.weight     | UNEXPECTED |  |
+ cls.predictions.transform.dense.bias       | UNEXPECTED |  |
+ cls.seq_relationship.bias                  | UNEXPECTED |  |
+ cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  |
+ cls.predictions.bias                       | UNEXPECTED |  |
+ cls.seq_relationship.weight                | UNEXPECTED |  |
+
+ Notes:
+ - UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+
+ ======================================================================
+ MODEL: BERT-large (1024d, 24L, 16H)
+ ======================================================================
+
+ --- WEIGHT CATALOG ---
+ embedding     :  3 matrices,  31,780,864 params, shapes={'(2, 1024)', '(512, 1024)', '(30522, 1024)'}
+ mlp_down      : 24 matrices, 100,663,296 params, shapes={'(1024, 4096)'}
+ mlp_up        : 24 matrices, 100,663,296 params, shapes={'(4096, 1024)'}
+ pooler        :  1 matrices,   1,048,576 params, shapes={'(1024, 1024)'}
+ self_attn_k   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ self_attn_o   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ self_attn_q   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ self_attn_v   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ TOTAL         : 334,819,328 params (2D only)
+
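The catalog totals can be re-derived from the listed shapes and matrix counts. The following is a minimal arithmetic check in plain Python; the `CATALOG` dictionary and `count_params` helper are illustrative names that only transcribe the BERT-large rows above, they are not taken from the analysis script.

```python
from math import prod

# (copies, shapes) transcribed from the BERT-large WEIGHT CATALOG rows.
# The embedding row lists three distinct shapes, so it is entered as one
# copy of each shape; every other row is N copies of a single shape.
CATALOG = {
    "embedding":   (1,  [(2, 1024), (512, 1024), (30522, 1024)]),
    "mlp_down":    (24, [(1024, 4096)]),
    "mlp_up":      (24, [(4096, 1024)]),
    "pooler":      (1,  [(1024, 1024)]),
    "self_attn_k": (24, [(1024, 1024)]),
    "self_attn_o": (24, [(1024, 1024)]),
    "self_attn_q": (24, [(1024, 1024)]),
    "self_attn_v": (24, [(1024, 1024)]),
}


def count_params(entry):
    """Number of scalar parameters for one catalog row."""
    copies, shapes = entry
    return copies * sum(prod(shape) for shape in shapes)


totals = {name: count_params(entry) for name, entry in CATALOG.items()}
grand_total = sum(totals.values())

for name, value in totals.items():
    print(f"{name:<13} : {value:>11,} params")
print(f"{'TOTAL':<13} : {grand_total:,} params (2D only)")
```

The per-row counts and the grand total reproduce the figures printed in the catalog, which confirms that the TOTAL line covers only the 2-D matrices listed.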
+ --- SVD EFFECTIVE RANK ---
+ Type          StableRank       PR  Active%   Rank90      Condition
+ mlp_down           52.17   882.95    1.000    838.0           23.0
+ mlp_up             27.37   856.14    1.000    832.8           33.8
+ self_attn_k        37.72   597.14    0.949    642.1        16649.8
+ self_attn_o       125.04   662.72    0.976    660.3        20582.9
+ self_attn_q        50.84   606.30    0.956    643.2        60065.9
+ self_attn_v       113.04   653.41    0.974    658.8        59710.1
+
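The log does not say how the columns of the SVD table are defined. The sketch below uses the standard definitions as an assumption: stable rank as squared Frobenius norm over squared spectral norm, PR as the participation ratio of the squared singular values, Rank90 as the number of singular values needed to reach 90% of the spectral energy, and Condition as the ratio of largest to smallest singular value. The `active_tol` cutoff used for Active% is a guess, and `spectrum_stats` is a hypothetical helper name.

```python
import numpy as np


def spectrum_stats(weight, energy=0.90, active_tol=1e-3):
    """Summarize the singular-value spectrum of one 2-D weight matrix."""
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    s2 = s ** 2                                   # energy per singular direction
    cumulative = np.cumsum(s2) / s2.sum()
    return {
        # ||W||_F^2 / ||W||_2^2
        "stable_rank": float(s2.sum() / s2[0]),
        # (sum s^2)^2 / sum s^4
        "participation_ratio": float(s2.sum() ** 2 / (s2 ** 2).sum()),
        # share of singular values above a relative cutoff (assumed definition)
        "active_fraction": float((s > active_tol * s[0]).mean()),
        # smallest k whose leading singular values hold `energy` of the total
        "rank_at_energy": int(np.searchsorted(cumulative, energy) + 1),
        "condition": float(s[0] / s[-1]) if s[-1] > 0 else float("inf"),
    }


# Demo on a random matrix; real use would loop over a model's weight tensors.
rng = np.random.default_rng(0)
demo = spectrum_stats(rng.normal(size=(64, 256)))
print({key: round(value, 2) for key, value in demo.items()})
```

The table reports one row per weight type, so values such as a fractional Rank90 suggest per-layer statistics averaged across layers; under that reading, each row would be the mean of `spectrum_stats` over the 24 matrices of that type.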
+ --- SPARSITY TOPOLOGY ---
+ Type            <0.0001   <0.001    <0.01     <0.1
+ embedding        0.0018   0.0184   0.1815   0.9699
+ mlp_down         0.0025   0.0251   0.2466   0.9954
+ mlp_up           0.0023   0.0231   0.2283   0.9944
+ pooler           0.0028   0.0280   0.2741   0.9981
+ self_attn_k      0.0022   0.0221   0.2178   0.9913
+ self_attn_o      0.0031   0.0308   0.2997   0.9990
+ self_attn_q      0.0022   0.0218   0.2149   0.9907
+ self_attn_v      0.0029   0.0294   0.2852   0.9989
+ FULL MODEL       0.0024   0.0242   0.2373   0.9925
+
+ --- Q/K/V SPARSITY COMPARISON (<0.1 threshold) ---
+ self_attn_q : 99.1%
+ self_attn_k : 99.1%
+ self_attn_v : 99.9%
+
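The sparsity columns read as the fraction of weights whose magnitude falls below each threshold. A minimal sketch under that assumption follows; the log does not confirm that the comparison is on absolute value, and `magnitude_profile` is a hypothetical helper name.

```python
import numpy as np

THRESHOLDS = (1e-4, 1e-3, 1e-2, 1e-1)


def magnitude_profile(weights, thresholds=THRESHOLDS):
    """Fraction of entries with |w| below each threshold, pooled over matrices."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    return {t: float((flat < t).mean()) for t in thresholds}


# Demo on two small random matrices standing in for one weight type.
rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.03, size=(16, 16)) for _ in range(2)]
for threshold, fraction in magnitude_profile(layers).items():
    print(f"<{threshold:g}: {fraction:.4f}")
```

Passing every matrix of a model at once would give a pooled figure comparable to the FULL MODEL row, while passing only the matrices of one type would give a per-type row.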
+ --- QK SIMILARITY MANIFOLD ---
+ Layer  StableRk       PR   Pos   Neg   SymDev   TopEig
+     0      6.42   266.22   457   567   1.0866    20.61
+     1      3.60   194.32   454   570   1.0641    29.77
+     2      6.14   215.22   474   550   1.1773    23.79
+     3      6.11   162.90   468   556   1.1421    34.35
+     4      5.74   237.65   455   569   1.1145    30.60
+     5      6.30   255.58   460   564   1.1704    26.16
+ ... (24 layers total)
+    23      4.07   206.01   525   499   0.8791    43.56
+
+ Positive eig fraction: layer 0 = 0.446, last = 0.513
+
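The log does not define the QK table columns. One plausible reading, offered here only as an assumption, is that each layer's query and key weights are combined into the bilinear form M = W_q^T W_k, that Pos and Neg count the signs of the eigenvalues of its symmetric part, that SymDev compares the antisymmetric part to the symmetric part, and that TopEig is the largest eigenvalue. The function name `qk_spectrum` and the exact SymDev normalization are hypothetical.

```python
import numpy as np


def qk_spectrum(w_q, w_k):
    """Eigen-summary of the bilinear form x^T (W_q^T W_k) y for one layer.

    w_q and w_k use the (out_features, in_features) layout of a linear layer.
    """
    m = w_q.T @ w_k
    sym = 0.5 * (m + m.T)    # symmetric part, real eigenvalues
    anti = 0.5 * (m - m.T)   # antisymmetric part
    eig = np.linalg.eigvalsh(sym)  # ascending order
    pos = int((eig > 0).sum())
    neg = int((eig < 0).sum())
    return {
        "pos": pos,
        "neg": neg,
        "pos_fraction": pos / eig.size,
        "sym_dev": float(np.linalg.norm(anti) / np.linalg.norm(sym)),
        "top_eig": float(eig[-1]),
    }


# Demo with random square projections; real use would pass per-layer weights.
rng = np.random.default_rng(0)
print(qk_spectrum(rng.normal(size=(32, 32)), rng.normal(size=(32, 32))))
```

Whatever the exact construction, the reported fractions are consistent with Pos / (Pos + Neg): 457 / 1024 rounds to 0.446 for layer 0 and 525 / 1024 rounds to 0.513 for layer 23.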
+ --- MLP DEAD NEURONS ---
+ Dead (<1% mean): 0/98304 (0.00%)
+ Weak (<10% mean): 0/98304 (0.00%)
+
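The denominator 98304 equals 24 layers times 4096 hidden units, so the count is per MLP neuron. A minimal sketch of such a count follows, assuming a neuron's strength is the L2 norm of its row in the (4096, 1024) `mlp_up` matrix and that the mean is taken over all layers together; neither detail is stated in the log, and `count_below_mean` is a hypothetical helper name.

```python
import numpy as np


def count_below_mean(mlp_up_layers, fraction):
    """Count neurons whose row norm is below `fraction` of the mean row norm."""
    norms = np.concatenate([np.linalg.norm(w, axis=1) for w in mlp_up_layers])
    flagged = int((norms < fraction * norms.mean()).sum())
    return flagged, int(norms.size)


# Demo: two small layers, with one neuron zeroed out on purpose.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 4)) for _ in range(2)]
layers[0][3] = 0.0
for label, fraction in (("Dead (<1% mean)", 0.01), ("Weak (<10% mean)", 0.10)):
    flagged, total = count_below_mean(layers, fraction)
    print(f"{label}: {flagged}/{total} ({100 * flagged / total:.2f}%)")
```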
+ --- CROSS-LAYER CORRELATION (adjacent pairs) ---
+ self_attn_q : adj_mean=0.0002, adj_range=[-0.0035, 0.0036]
+ self_attn_k : adj_mean=0.0003, adj_range=[-0.0036, 0.0033]
+ mlp_up      : adj_mean=0.0315, adj_range=[0.0239, 0.0494]
+
+
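The adjacent-pair figures can be read as a correlation between the same weight type in consecutive layers. The sketch below assumes a Pearson correlation over flattened matrices; the log does not say which similarity measure the script uses, and `adjacent_correlation` is a hypothetical helper name.

```python
import numpy as np


def adjacent_correlation(layers):
    """Pearson correlation between flattened weights of consecutive layers."""
    corrs = [
        float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
        for a, b in zip(layers, layers[1:])
    ]
    return {
        "adj_mean": float(np.mean(corrs)),
        "adj_range": (min(corrs), max(corrs)),
    }


# Demo: independent random layers should correlate near zero.
rng = np.random.default_rng(0)
stack = [rng.normal(size=(32, 32)) for _ in range(4)]
stats = adjacent_correlation(stack)
print(f"adj_mean={stats['adj_mean']:.4f}, "
      f"adj_range=[{stats['adj_range'][0]:.4f}, {stats['adj_range'][1]:.4f}]")
```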
+ Loading CLIP-ViT-B/16 (LAION)...
+ open_clip_model.safetensors: 100%
+  599M/599M [00:03<00:00, 218MB/s]
+
+ ======================================================================
+ MODEL: CLIP-ViT-B/16 LAION (768d, 12L, 12H)
+ ======================================================================
+
+ --- WEIGHT CATALOG ---
+ embedding     :  1 matrices,     151,296 params, shapes={'(197, 768)'}
+ mlp_down      : 12 matrices,  28,311,552 params, shapes={'(768, 3072)'}
+ mlp_up        : 12 matrices,  28,311,552 params, shapes={'(3072, 768)'}
+ projection    :  1 matrices,     393,216 params, shapes={'(768, 512)'}
+ self_attn_o   : 12 matrices,   7,077,888 params, shapes={'(768, 768)'}
+ self_attn_qkv : 12 matrices,  21,233,664 params, shapes={'(2304, 768)'}
+ TOTAL         : 85,479,168 params (2D only)
+
+ --- SVD EFFECTIVE RANK ---
+ Type          StableRank       PR  Active%   Rank90      Condition
+ mlp_down          125.16   644.07    1.000    601.9           43.5
+ mlp_up             59.69   631.06    0.993    603.4          372.0
+ self_attn_o        77.37   515.53    0.967    491.8        37372.2
+ self_attn_qkv      94.39   552.43    0.929    546.8        18558.2
+
+ --- SPARSITY TOPOLOGY ---
+ Type            <0.0001   <0.001    <0.01     <0.1
+ embedding        0.0202   0.2072   0.8578   0.9983
+ mlp_down         0.0145   0.0992   0.5794   0.9999
+ mlp_up           0.0101   0.0797   0.5233   0.9999
+ projection       0.0058   0.0573   0.5237   1.0000
+ self_attn_o      0.0066   0.0655   0.5525   0.9999
+ self_attn_qkv    0.0535   0.1189   0.5087   0.9999
+ FULL MODEL       0.0221   0.0949   0.5413   0.9999
+
+ --- Q/K/V SPARSITY COMPARISON (<0.1 threshold) ---
+ self_attn_qkv : 100.0%
+
+ --- QK SIMILARITY MANIFOLD ---
+ Layer  StableRk       PR   Pos   Neg   SymDev   TopEig
+     0      2.06    59.23   386   382   1.0944     5.51
+     1      3.51    82.73   447   321   0.8367     8.79
+     2      8.48   108.22   401   367   0.9786     4.70
+     3     22.88   193.84   406   362   1.0676     2.31
+     4     20.20   196.57   401   367   1.1014     2.38
+     5     26.05   249.44   384   384   1.1135     1.80
+ ... (12 layers total)
+    11     49.71   360.27   413   355   1.3842     0.53
+
+ Positive eig fraction: layer 0 = 0.503, last = 0.538
+
+ --- MLP DEAD NEURONS ---
+ Dead (<1% mean): 1316/36864 (3.57%)
+ Weak (<10% mean): 1356/36864 (3.68%)
+
+ --- CROSS-LAYER CORRELATION (adjacent pairs) ---
+ self_attn_qkv : adj_mean=-0.0004, adj_range=[-0.0024, 0.0013]
+ mlp_up        : adj_mean=0.0075, adj_range=[0.0000, 0.0304]
+
+
+ Loading DINOv2-large...
+ config.json: 100%
+  549/549 [00:00<00:00, 69.5kB/s]
+ model.safetensors:  94%
+  1.15G/1.22G [00:06<00:03, 20.6MB/s]
+ Loading weights: 100%
+  439/439 [00:00<00:00, 1139.02it/s, Materializing param=layernorm.weight]
+
+ ======================================================================
+ MODEL: DINOv2-large (1024d, 24L, 16H)
+ ======================================================================
+
+ --- WEIGHT CATALOG ---
+ embedding     :  1 matrices,       1,024 params, shapes={'(1, 1024)'}
+ mlp_down      : 24 matrices, 100,663,296 params, shapes={'(1024, 4096)'}
+ mlp_up        : 24 matrices, 100,663,296 params, shapes={'(4096, 1024)'}
+ self_attn_k   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ self_attn_o   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ self_attn_q   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ self_attn_v   : 24 matrices,  25,165,824 params, shapes={'(1024, 1024)'}
+ TOTAL         : 301,990,912 params (2D only)
+
+ --- SVD EFFECTIVE RANK ---
+ Type          StableRank       PR  Active%   Rank90      Condition
+ mlp_down           94.40   810.58    1.000    805.1           39.8
+ mlp_up             58.43   764.26    0.979    769.8           50.2
+ self_attn_k        55.47   485.95    0.827    533.2      1024763.2
+ self_attn_o        85.58   642.50    0.955    636.4        83125.7
+ self_attn_q        57.74   477.74    0.826    536.0       630324.9
+ self_attn_v        94.84   590.99    0.932    610.2       490421.1
+
+ --- SPARSITY TOPOLOGY ---
+ Type            <0.0001   <0.001    <0.01     <0.1
+ embedding        1.0000   1.0000   1.0000   1.0000
+ mlp_down         0.0072   0.0714   0.6036   0.9999
+ mlp_up           0.0078   0.0687   0.5577   0.9999
+ self_attn_k      0.0081   0.0774   0.5406   0.9998
+ self_attn_o      0.0069   0.0687   0.5753   1.0000
+ self_attn_q      0.0088   0.0793   0.5452   0.9997
+ self_attn_v      0.0088   0.0861   0.5810   1.0000
+ FULL MODEL       0.0077   0.0727   0.5740   0.9999
+
+ --- Q/K/V SPARSITY COMPARISON (<0.1 threshold) ---
+ self_attn_q : 100.0%
+ self_attn_k : 100.0%
+ self_attn_v : 100.0%
+
+ --- QK SIMILARITY MANIFOLD ---
+ Layer  StableRk       PR   Pos   Neg   SymDev   TopEig
+     0      1.23     5.71   510   514   1.3859    12.89
+     1      5.40    35.56   515   509   1.0933     3.52
+     2      4.28    74.13   531   493   1.0389     4.57
+     3      4.49    80.31   559   465   1.0370     6.89
+     4      7.19   121.15   524   500   1.0951     4.28
+     5      7.72   117.31   551   473   0.9584     5.87
+ ... (24 layers total)
+    23      6.71   341.20   561   463   1.1911     2.44
+
+ Positive eig fraction: layer 0 = 0.498, last = 0.548
+
+ --- MLP DEAD NEURONS ---
+ Dead (<1% mean): 0/98304 (0.00%)
+ Weak (<10% mean): 0/98304 (0.00%)
+
+ --- CROSS-LAYER CORRELATION (adjacent pairs) ---
+ self_attn_q : adj_mean=-0.0003, adj_range=[-0.0027, 0.0035]
+ self_attn_k : adj_mean=-0.0002, adj_range=[-0.0026, 0.0030]
+ mlp_up      : adj_mean=0.0058, adj_range=[0.0006, 0.0217]
+
+
+ Loading CLIP-ViT-bigG/14 (LAION)...
+ open_clip_model.safetensors: 100%
+  10.2G/10.2G [00:29<00:00, 377MB/s]
+
+ ======================================================================
+ MODEL: CLIP-ViT-bigG/14 LAION (1664d, 48L, 16H)
+ ======================================================================
+
+ --- WEIGHT CATALOG ---
+ embedding     :  1 matrices,     427,648 params, shapes={'(257, 1664)'}
+ mlp_down      : 48 matrices, 654,311,424 params, shapes={'(1664, 8192)'}
+ mlp_up        : 48 matrices, 654,311,424 params, shapes={'(8192, 1664)'}
+ projection    :  1 matrices,   2,129,920 params, shapes={'(1664, 1280)'}
+ self_attn_o   : 48 matrices, 132,907,008 params, shapes={'(1664, 1664)'}
+ self_attn_qkv : 48 matrices, 398,721,024 params, shapes={'(4992, 1664)'}
+ TOTAL         : 1,842,808,448 params (2D only)
+
+ --- SVD EFFECTIVE RANK ---
+ Type          StableRank       PR  Active%   Rank90      Condition
+ mlp_down           58.27   757.89    0.644    855.5   5983209984.0
+ mlp_up             23.11   992.74    0.804   1045.1      6682717.5
+ self_attn_o        48.31   547.82    0.531    593.5   5320487424.0
+ self_attn_qkv     102.36   834.12    0.757    890.4      1150494.6
+
+ --- SPARSITY TOPOLOGY ---
+ Type            <0.0001   <0.001    <0.01     <0.1
+ embedding        0.0255   0.2521   0.9654   0.9991
+ mlp_down         0.3578   0.4691   0.6310   0.9473
+ mlp_up           0.1763   0.3691   0.7113   1.0000
+ projection       0.0047   0.0469   0.4397   1.0000
+ self_attn_o      0.3510   0.4770   0.6900   0.9838
+ self_attn_qkv    0.1685   0.2917   0.7124   0.9999
+ FULL MODEL       0.2514   0.3952   0.6812   0.9801
+
+ --- Q/K/V SPARSITY COMPARISON (<0.1 threshold) ---
+ self_attn_qkv : 100.0%
+
+ --- QK SIMILARITY MANIFOLD ---
+ Layer  StableRk       PR   Pos   Neg   SymDev   TopEig
+     0      1.18     9.24   829   835   1.0608    13.79
+     1      2.50    32.71   834   830   1.0916     3.81
+     2      1.63    11.28   831   833   0.8739     2.24
+     3      2.06    13.32   832   832   1.2697     2.45
+     4      1.96    23.28   836   828   1.1835     6.06
+     5      3.96    41.52   839   825   1.0728     4.42
+ ... (48 layers total)
+    47     32.79   637.78   968   696   1.2396     1.92
+
+ Positive eig fraction: layer 0 = 0.498, last = 0.582
+
+ --- MLP DEAD NEURONS ---
+ Dead (<1% mean): 0/393216 (0.00%)
+ Weak (<10% mean): 24163/393216 (6.14%)
+
+ --- CROSS-LAYER CORRELATION (adjacent pairs) ---
+ self_attn_qkv : adj_mean=0.0000, adj_range=[-0.0029, 0.0017]
+ mlp_up        : adj_mean=0.0552, adj_range=[-0.0053, 0.2689]
+
+
+ ======================================================================
+ CROSS-MODEL COMPARISON
+ ======================================================================
+
+ --- Q SPARSITY (<0.1 threshold) ---
+ Model                                            Q       K       V     QKV
+ BERT-large (1024d, 24L, 16H)                 99.1%   99.1%   99.9%       -
+ CLIP-ViT-B/16 LAION (768d, 12L, 12H)             -       -       -  100.0%
+ DINOv2-large (1024d, 24L, 16H)              100.0%  100.0%  100.0%       -
+ CLIP-ViT-bigG/14 LAION (1664d, 48L, 16H)         -       -       -  100.0%
+ T5-Small (512d, 6L, 8H) [reference]          93.7%   19.2%   12.1%       -
+ T5-Base (768d, 12L, 12H) [reference]         99.4%   30.0%   16.2%       -
+
+ --- SVD STABLE RANK (mean across layers) ---
+ Model                                            Q       K       V  MLP_up
+ BERT-large (1024d, 24L, 16H)                  50.8    37.7   113.0    27.4
+ CLIP-ViT-B/16 LAION (768d, 12L, 12H)             -       -       -    59.7
+ DINOv2-large (1024d, 24L, 16H)                57.7    55.5    94.8    58.4
+ CLIP-ViT-bigG/14 LAION (1664d, 48L, 16H)         -       -       -    23.1
+
+ --- QK MANIFOLD: POSITIVE EIGENVALUE FRACTION ---
+ Model                                        First    Last   Trend
+ BERT-large (1024d, 24L, 16H)                 0.446   0.513  +0.066
+ CLIP-ViT-B/16 LAION (768d, 12L, 12H)         0.503   0.538  +0.035
+ DINOv2-large (1024d, 24L, 16H)               0.498   0.548  +0.050
+ CLIP-ViT-bigG/14 LAION (1664d, 48L, 16H)     0.498   0.582  +0.084
+
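The Trend column does not always equal the difference of the two rounded fractions beside it (for BERT-large, 0.513 - 0.446 = 0.067, yet the table shows +0.066). The values are consistent with a trend computed from the unrounded Pos / (Pos + Neg) ratios of the first and last layers, as this small check illustrates. The counts are transcribed from the per-model QK tables in this log; `QK_COUNTS` and the helper names are illustrative.

```python
# (Pos, Neg) eigenvalue counts for the first and last layer of each model,
# copied from the QK SIMILARITY MANIFOLD tables in this log.
QK_COUNTS = {
    "BERT-large":       ((457, 567), (525, 499)),
    "CLIP-ViT-B/16":    ((386, 382), (413, 355)),
    "DINOv2-large":     ((510, 514), (561, 463)),
    "CLIP-ViT-bigG/14": ((829, 835), (968, 696)),
}


def positive_fraction(pos, neg):
    """Share of positive eigenvalues in one layer."""
    return pos / (pos + neg)


def trend(first, last):
    """Change in positive fraction from first to last layer, unrounded."""
    return positive_fraction(*last) - positive_fraction(*first)


for name, (first, last) in QK_COUNTS.items():
    print(f"{name:<18} {positive_fraction(*first):.3f}  "
          f"{positive_fraction(*last):.3f}  {trend(first, last):+.3f}")
```

Formatted to three decimals, all four rows reproduce the First, Last, and Trend values in the table above.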
+ --- MLP DEAD NEURONS (<1% of mean) ---
+ BERT-large (1024d, 24L, 16H)             : 0/98304 (0.00%)
+ CLIP-ViT-B/16 LAION (768d, 12L, 12H)     : 1316/36864 (3.57%)
+ DINOv2-large (1024d, 24L, 16H)           : 0/98304 (0.00%)
+ CLIP-ViT-bigG/14 LAION (1664d, 48L, 16H) : 0/393216 (0.00%)
+
+ ======================================================================
+ BATTERY COMPLETE
+ ======================================================================