## <a id="layers-attentions-caches"></a> Layers, attentions and caches

Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify, in a configuration-based fashion, how each layer is implemented. Thus we defined a mapping of the allowed layer types:

```python
ALLOWED_LAYER_TYPES = (
    "full_attention",
    "sliding_attention",
    "chunked_attention",
    "linear_attention",
    ...
)
```

and the configuration can be _explicit_ about which attention type is in which layer; see e.g. gpt-oss, which alternates sliding and full attention:

```python
"layer_types": [
    "sliding_attention",
    "full_attention",
    ...,
    "sliding_attention",
    "full_attention"
],
```

This is [minimal](#minimal-user-api) to implement on the user side, and allows us to keep the modeling untouched. It is also [easy to tweak](#modular-toolbox).
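
For illustration, here is a minimal sketch of what configuration-driven layer types enable: each layer reads its own entry from `config.layer_types` and instantiates the matching attention. The class names and the simplified forward are assumptions of this sketch, not the transformers implementation.

```python
import torch.nn as nn

# Placeholder attention modules for this sketch; real implementations differ.
class FullAttention(nn.Module):
    def forward(self, hidden_states, attention_mask=None):
        return hidden_states

class SlidingWindowAttention(nn.Module):
    def forward(self, hidden_states, attention_mask=None):
        return hidden_states

ATTENTION_CLASSES = {
    "full_attention": FullAttention,
    "sliding_attention": SlidingWindowAttention,
}

class DecoderLayer(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        # Each layer looks up its own type, e.g. "sliding_attention" or "full_attention".
        self.attention_type = config.layer_types[layer_idx]
        self.self_attn = ATTENTION_CLASSES[self.attention_type]()

    def forward(self, hidden_states, attention_mask=None):
        return self.self_attn(hidden_states, attention_mask)
```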

## <a id="community-kernels"></a>Community Kernels

The same principle extends to normalization, activation, and other code paths. The model defines **semantics**; a kernel defines **how** to execute them faster. We annotate the module to borrow a community‑provided forward, keeping a [consistent public surface](#consistent-public-surface):

```python
@use_kernel_forward_from_hub("RMSNorm")
```
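
For context, a minimal, self-contained sketch of how such an annotated module looks, assuming the decorator is the one exposed by the `kernels` package and using a generic RMSNorm body (not a specific transformers class):

```python
import torch
import torch.nn as nn
from kernels import use_kernel_forward_from_hub

@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(nn.Module):
    """Plain PyTorch semantics; a Hub-provided kernel may take over forward()."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```

The decorated class keeps its plain PyTorch forward as the fallback; swapping in a community kernel only changes how the same computation is executed, not what it means.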

2. In this `modular` file, what models, configurations and processings are imported?
3. Recurse through the model list that way (a sketch of this scan follows below).
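
A rough sketch of that heuristic, assuming the standard repository layout and the relative-import style of `modular_*.py` files (this is not the actual script):

```python
import re
from collections import defaultdict
from pathlib import Path

MODELS_DIR = Path("src/transformers/models")  # assumed layout
IMPORT_RE = re.compile(r"from \.\.(\w+)\.(?:modeling|configuration|processing)_\w+ import")

def build_dependency_graph():
    """Map each model with a modular file to the models it borrows from."""
    graph = defaultdict(set)
    for modular_file in MODELS_DIR.glob("*/modular_*.py"):
        model = modular_file.parent.name
        for match in IMPORT_RE.finditer(modular_file.read_text()):
            graph[model].add(match.group(1))  # edge: model -> model it derives from
    return graph
```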

So what do we see? Llama is a basis for many models, and it shows.
Radically different architectures such as mamba have spawned their own dependency subgraph.

{{{fragment-dependency-graph}}}

But there is no similar miracle for VLMs across the board.

## Too many models, yet not enough, are alike

So I looked into Jaccard similarity, which we use to measure set differences. I know that code is more than a set of characters strung together, but it is a reasonable proxy for now. You can check out [[find_dependencies.py]].
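
As a rough sketch of the idea (not the actual script), the Jaccard index between two modeling files can be computed over their sets of code tokens:

```python
import re
from pathlib import Path

def jaccard_similarity(path_a: str, path_b: str) -> float:
    """Jaccard index over the sets of identifiers/tokens in two source files."""
    tokenize = lambda text: set(re.findall(r"\w+", text))
    a = tokenize(Path(path_a).read_text())
    b = tokenize(Path(path_b).read_text())
    return len(a & b) / len(a | b) if a | b else 0.0

# e.g. jaccard_similarity("modeling_llama.py", "modeling_mistral.py")
```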

{{{fragment-model-timeline}}}

{{{fragment-terminal}}}

We don't have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attention bridges). This is one of the main areas where we can improve.

For instance, I thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into the LLM decoder in 95% of the existing VLMs. It would have looked something like this:

```python
class InputsEmbeddingMixerMixin(nn.Module):
    ...
```
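
For concreteness, the kind of logic such a mixin would have wrapped, and which each VLM currently keeps inline in its own modeling file, is roughly the image-token scatter below. The tensor names and the placeholder-token convention are assumptions of this sketch:

```python
import torch

def mix_inputs_embeds(inputs_embeds, image_embeds, image_token_mask):
    """Scatter projected image features into the text embedding sequence.

    inputs_embeds:    (batch, seq_len, hidden) text embeddings with placeholder positions
    image_embeds:     (num_image_tokens, hidden) vision features after projection
    image_token_mask: (batch, seq_len) bool mask marking the placeholder positions
    """
    mask = image_token_mask.unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, image_embeds.to(inputs_embeds.dtype))
```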

But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). The embing mixin is part of the model; removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file.

This is the current state of abstractions across a modeling file:

So the question arises naturally: how can we modularize more?
I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
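
A sketch of the embedding-based variant, assuming a small general-purpose SentenceTransformers model (long modeling files exceed the model's context and get truncated, so this is only a coarse signal):

```python
from pathlib import Path
from sentence_transformers import SentenceTransformer, util

files = sorted(Path("src/transformers/models").glob("*/modeling_*.py"))  # assumed layout
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([f.read_text() for f in files], convert_to_tensor=True)

# Pairwise cosine similarity between modeling files: high off-diagonal values are merge candidates.
similarity = util.cos_sim(embeddings, embeddings)
```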

{{{fragment-loc-growth}}}

## <a id="encoders-ftw"></a> The neverending stories of encoder models

Adding a model to transformers means:

- having it immediately available to the community
- having it usable in vLLM, SGLang, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures, [as seen in this great blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html) (a usage sketch follows below)
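
As a usage sketch, assuming the vLLM transformers backend flag and an illustrative model name, forcing vLLM to run an architecture through its transformers implementation looks roughly like:

```python
from vllm import LLM, SamplingParams

# model_impl="transformers" asks vLLM to use the transformers modeling code as its backend.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```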

This cements the need for a [consistent public surface](#consistent-public-surface) even more: we are now a backend, and there is software more optimized than us to handle serving. At the time of writing, more effort is going into that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files).

## Cooking faster CUDA warmups