Molbap (HF Staff) committed
Commit cf5f9aa · 1 Parent(s): e7f22ff
Files changed (1):
  1. content/article.md (+38, -9)
content/article.md CHANGED
@@ -164,11 +164,37 @@ Semantics stay in the model (a Linear stays a Linear), distribution is orthogona
164
 
165
 
166
  ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
167
- With th
168
 
169
  ## <a id="community-kernels"></a>Community Kernels
170
 
171
- The same principle extends to normalization, activation, and other hot paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface)
172
 
173
  ```python
174
  @use_kernel_forward_from_hub("RMSNorm")
@@ -191,9 +217,10 @@ To get this graph, I used the heuristic of modular inheritance.
191
  2. In this `modular` file, what models, configurations and processings are imported?
192
  3. Recurse through the model list that way.
193
 
194
- So what do we see? Llama is a basis for many models, and it shows.
195
  Radically different architectures such as mamba have spawned their own dependency subgraph.
196
- {{{fragment-d3-graph}}}
 
197
 
198
 
199
  But there is no similar miracle for VLMs across the board.
@@ -204,7 +231,9 @@ One problem is, this is only for `modular` models. Several models do NOT have a
204
 
205
  ## Too many models, yet not enough, are alike
206
 
207
- So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters stringed together, but it is a correct proxy for now. You can check out [[find_dependencies.py]] .
 
 
208
 
209
  {{{fragment-terminal}}}
210
 
@@ -215,14 +244,14 @@ The yellow areas are places where models are very different to each other. We ca
215
 
216
 We don't have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main areas where we can improve.
217
 
218
- So initially I thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an llm decoder in 95% of the existing VLMs. It would have looked like something like
219
 
220
  ```python
221
  class InputsEmbeddingMixerMixin(nn.Module):
222
    # hypothetical: merge the image/video features into the text inputs_embeds here
    ...
223
  ```
224
 
225
- But this is breaking [Standardize, don't abstract](#standardize-dont-abstract). Embedding mixin is part of the model, removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file.
226
 
227
  This is the current state of abstractions across a modeling file:
228
 
@@ -279,7 +308,7 @@ But this is _within_ the modeling file, not in the `PreTrainedModel` base class.
279
 So the question arises naturally: how can we modularize more?
280
 I took again a similarity measure and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
281
 
282
- {{fragment-modular-growth}}
283
 
284
  ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
285
 
@@ -338,7 +367,7 @@ Adding a model to transformers means:
338
  - having it immediately available to the community
339
  - usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
340
 
341
- This cements
342
 
343
  ## Cooking faster CUDA warmups
344
 
 
164
 
165
 
166
  ## <a id="layers-attentions-caches"></a> Layers, attentions and caches
167
+
168
+ Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify, in a configuration-based fashion, how each layer is implemented. Thus we defined an allow-list of layer types that a configuration can then reference, layer by layer:
169
+
170
+
171
+ ```python
172
+ ALLOWED_LAYER_TYPES = (
173
+ "full_attention",
174
+ "sliding_attention",
175
+ "chunked_attention",
176
+ "linear_attention",
177
+ ...
178
+ )
179
+ ```
180
+
181
+ and the configuration can be _explicit_ about which attention type is used in which layer; see e.g. gpt-oss, which alternates sliding and full attention:
182
+
183
+ ```python
184
+ "layer_types": [
185
+ "sliding_attention",
186
+ "full_attention",
187
+ ...,
188
+ "sliding_attention",
189
+ "full_attention"
190
+ ],
191
+ ```
192
+
193
+ This is [minimal](#minimal-user-api) to implement on the user side, and it keeps the modeling code untouched. It is also [easy to tweak](#modular-toolbox).
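
To make this concrete, here is a minimal sketch of how a decoder layer can pick its attention implementation from `config.layer_types`. The class names (`FullAttention`, `SlidingWindowAttention`, `DecoderLayer`) and the dispatch dict are hypothetical placeholders, not the actual transformers wiring:

```python
import torch.nn as nn

# Hypothetical stand-ins for real attention implementations, only here to keep
# the sketch self-contained.
class FullAttention(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()

class SlidingWindowAttention(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()

ATTENTION_CLASSES = {
    "full_attention": FullAttention,
    "sliding_attention": SlidingWindowAttention,
}

class DecoderLayer(nn.Module):
    """Reads its attention flavor from the config instead of hardcoding it."""
    def __init__(self, config, layer_idx):
        super().__init__()
        layer_type = config.layer_types[layer_idx]  # e.g. "sliding_attention"
        self.self_attn = ATTENTION_CLASSES[layer_type](config, layer_idx)
```

The caching policy can key off the same declaration (for instance a sliding-window cache for `sliding_attention` layers), which is why the list lives in the configuration rather than in the modeling code.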
194
 
195
  ## <a id="community-kernels"></a>Community Kernels
196
 
197
+ The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):
198
 
199
  ```python
200
  @use_kernel_forward_from_hub("RMSNorm")
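# What follows is a minimal sketch, not the exact transformers module that carries
# this decorator (in practice a Llama-style RMSNorm): the annotation lets a matching
# community kernel from the Hub replace `forward` at runtime, while the plain PyTorch
# implementation below stays the semantic reference and the fallback.
# Assumes the usual `import torch` / `import torch.nn as nn` of a modeling file.
class DummyRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # Standard RMSNorm: scale by the reciprocal root mean square of the activations.
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```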
 
217
  2. In this `modular` file, what models, configurations and processings are imported?
218
  3. Recurse through the model list that way.
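
A rough sketch of that heuristic (the function name, paths, and regex are illustrative, not the actual script): parse each `modular_*.py`, record which other models it imports from, and build the graph from those edges.

```python
import re
from pathlib import Path

def modular_dependency_graph(models_dir: str) -> dict[str, set[str]]:
    """Map each model that ships a modular_*.py file to the models it imports from."""
    # Catches both absolute (transformers.models.llama. ...) and relative (..llama. ...) imports.
    pattern = re.compile(r"from\s+(?:transformers\.models\.|\.\.)([a-z0-9_]+)\.")
    graph: dict[str, set[str]] = {}
    for modular in Path(models_dir).glob("*/modular_*.py"):
        model = modular.parent.name
        deps = {dep for dep in pattern.findall(modular.read_text()) if dep != model}
        graph[model] = deps
    return graph

# e.g. modular_dependency_graph("src/transformers/models")
```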
219
 
220
+ So what do we see? Llama is a basis for many models, and it shows.
221
  Radically different architectures such as mamba have spawned their own dependency subgraph.
222
+
223
+ {{{fragment-dependency-graph}}}
224
 
225
 
226
  But there is no similar miracle for VLMs across the board.
 
231
 
232
  ## Too many models, yet not enough, are alike
233
 
234
+ So I looked into Jaccard similarity, which we use to measure how much two sets overlap. I know that code is more than a set of characters strung together, but it is a reasonable proxy for now. You can check out [[find_dependencies.py]].
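
As an illustration, a simplified stand-in (not the actual `find_dependencies.py`, and the function name is made up): the proxy boils down to comparing the sets of identifiers that two modeling files use.

```python
import re
from pathlib import Path

def jaccard_similarity(file_a: str, file_b: str) -> float:
    """Jaccard index over the sets of identifiers appearing in two source files."""
    def identifiers(path: str) -> set[str]:
        return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]+", Path(path).read_text()))
    a, b = identifiers(file_a), identifiers(file_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# e.g. jaccard_similarity("models/llama/modeling_llama.py", "models/gemma/modeling_gemma.py")
```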
235
+
236
+ {{{fragment-model-timeline}}}
237
 
238
  {{{fragment-terminal}}}
239
 
 
244
 
245
 We don't have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main areas where we can improve.
246
 
247
+ For instance, I thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into the LLM decoder in 95% of the existing VLMs. It would have looked something like this:
248
 
249
  ```python
250
  class InputsEmbeddingMixerMixin(nn.Module):
251
    # hypothetical: merge the image/video features into the text inputs_embeds here
    ...
252
  ```
253
 
254
+ But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). The embedding mixin is part of the model; removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file.
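
For context, the logic that stays inline is small anyway. Here is a simplified sketch (not the actual Qwen2.5-VL code; the function name is illustrative) of the placeholder-scatter pattern a reader finds directly in the modeling file:

```python
import torch

def mix_inputs_embeds(inputs_embeds, image_features, input_ids, image_token_id):
    """Scatter projected image features into the text embeddings at placeholder positions."""
    mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, image_features.to(inputs_embeds.dtype))
```

Keeping this visible in the modeling file is precisely the point: the reader sees where and how the modalities meet without chasing a mixin in another file.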
255
 
256
  This is the current state of abstractions across a modeling file:
257
 
 
308
 So the question arises naturally: how can we modularize more?
309
 I took again a similarity measure and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
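
For the embedding path, a minimal sketch of what such a scan can look like (the encoder choice and the function name are illustrative, not the Space's exact code): embed each modeling file and rank pairs by cosine similarity.

```python
from pathlib import Path
from sentence_transformers import SentenceTransformer, util

def rank_similar_models(files: list[str], top_k: int = 5):
    """Embed modeling files and return the most similar pairs by cosine similarity."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
    texts = [Path(f).read_text() for f in files]     # long files get truncated by the encoder
    embeddings = model.encode(texts, convert_to_tensor=True)
    scores = util.cos_sim(embeddings, embeddings)
    pairs = [
        (files[i], files[j], float(scores[i][j]))
        for i in range(len(files))
        for j in range(i + 1, len(files))
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top_k]
```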
310
 
311
+ {{{fragment-loc-growth}}}
312
 
313
  ## <a id="encoders-ftw"></a> The neverending stories of encoder models.
314
 
 
367
  - having it immediately available to the community
368
  - usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
369
 
370
+ This further cements the need for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is serving software far better optimized than us. At the time of writing, more effort is going in that direction. We already have vLLM-compatible configs for VLMs (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files).
371
 
372
  ## Cooking faster CUDA warmups
373