Molbap (HF Staff) committed on
Commit 4bb470e · 1 Parent(s): 3a3c4d7
Files changed (2)
  1. content/article.md +26 -23
  2. dist/index.html +23 -22
content/article.md CHANGED
@@ -173,9 +173,7 @@ That gives an "effective LOC" curve: the **maintenance surface**.
173
 
174
  ๐—๐˜‚๐˜€๐˜ ๐—น๐—ผ๐—ผ๐—ธ ๐—ฎ๐˜ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜: ๐˜๐—ต๐—ฒ ๐—ด๐—ฟ๐—ผ๐˜„๐˜๐—ต ๐—ฟ๐—ฎ๐˜๐—ฒ ๐—ผ๐—ณ ๐—น๐—ถ๐—ป๐—ฒ๐˜€ ๐—ผ๐—ณ ๐—ฐ๐—ผ๐—ฑ๐—ฒ ๐—ฐ๐—ผ๐—น๐—น๐—ฎ๐—ฝ๐˜€๐—ฒ๐—ฑ! Counting raw ๐š–๐š˜๐š๐šŽ๐š•๐š’๐š—๐š_*.๐š™๐šข (with โ€œCopied fromโ€ฆโ€ everywhere) we were around 362 new LOC/day; with ๐š–๐š˜๐š๐šž๐š•๐šŠ๐š› in place the effective rate is ~25 LOC/day. About ๐Ÿญ๐Ÿฑร— ๐—น๐—ผ๐˜„๐—ฒ๐—ฟ! Had we continued with a strict "one model, one file" policy who knows where we'd have ended up.
175
 
176
- Less code to hand-maintain means fewer places to break.
177
-
178
- Cyclomatic complexity isnโ€™t LOC, but they strongly correlate. As Les Hatton notes, defects scale like ๐™™ ~ ๐™ญ ๐™ก๐™ฃ ๐™ญ. Lower ๐˜… (lower loc) helps.
179
 
180
  {{{fragment-loc-growth}}}
181
 
@@ -191,13 +189,13 @@ However, we were adding specific torch operations for each backend (sdpa, flash-
191
 
192
  ### <a id="attention-classes"></a> External Attention classes
193
 
194
- Externalising the [attention classes](#external-attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
195
-
196
  We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:
197
 
198
 We keep a `Callable` for the naive implementation of the attention, called "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.
199
 
200
- In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked, and use other Callables, including kernel bindings.
 
 
201
 
202
  ```python
203
  attention_interface: Callable = eager_attention_forward
@@ -232,7 +230,7 @@ Hence, we want to touch [minimally](#minimal-user-api) to the modeling code, and
232
 
233
 The alternative would be to modify parent classes with code specific to each parallelization scheme, which would leak infrastructure concerns into the modeling code.
234
 
235
- It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
236
 
237
  {{{fragment-tp-plan}}}
238
 
@@ -324,14 +322,14 @@ If you've checked out llava, you've seen that llava_video is a red node, connect
324
 
325
 We don't have a cookbook for common VLM patterns yet (image-token scatter, multi-tower encoders, cross-attention bridges). This is one of the main areas where we can improve.
326
 
327
- For instance, I thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an llm decoder in 95% of the existing VLMs. It would have looked like something like
328
 
329
  ```python
330
  class InputsEmbeddingMixerMixin(nn.Module):
331
  #
332
  ```
333
 
334
- But this is [abstracting away an important component of the modeling.](#standardize-dont-abstract). Embedding mixin is part of the model, removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file.
335
 
336
  This is the current state of abstractions across a modeling file:
337
 
@@ -383,15 +381,6 @@ The following [Pull request to standardize placeholder masking](https://github.c
383
 
384
  But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
385
 
386
-
387
- ### <a id="encoders-ftw"></a> Embedding models, now and forever.
388
-
389
- Models popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.
390
-
391
- {{{fragment-model-visualisation}}}
392
-
393
- As the codebase grows, with our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers), we need to maintain this one as well. Retrieval use-cases, smart dbs, like FAISS-based indexing rely on it, and thus indirectly on transformers.
394
-
395
  ### On image processing and processors
396
 
397
 Choosing to be a `torch`-first software meant shedding a tremendous amount of support code for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
@@ -408,13 +397,24 @@ This is an overall objective: there's no `transformers` without its community.
408
 
409
  Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
410
 
411
- Among the most valuable contributions to `transformers` is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other softwares.
 
 
 
 
 
 
 
 
 
 
 
412
 
413
  In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
414
 
415
  So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
416
 
417
- ### A surgical toolbox for model development
418
 
419
  ### Attention visualisation
420
 
@@ -435,7 +435,7 @@ It just works with PyTorch models and is especially useful when aligning outputs
435
 
436
  ### Cooking faster CUDA warmups
437
 
438
- Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup` which improved massively the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.
439
 
440
  {{{fragment-warmup_demo}}}
441
 
@@ -460,9 +460,12 @@ Continuous batching is in itself very much linked to the great work of vLLM with
460
 
461
  ## Community reusability
462
 
463
- Transformers-serve is transformers-first, for sure, but it's not limited to that. Adding a model to transformers means:
 
 
 
464
  - having it immediately available to the community
465
- - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
466
 
467
 This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is being put in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast): [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
468
 
 
173
 
174
  ๐—๐˜‚๐˜€๐˜ ๐—น๐—ผ๐—ผ๐—ธ ๐—ฎ๐˜ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜: ๐˜๐—ต๐—ฒ ๐—ด๐—ฟ๐—ผ๐˜„๐˜๐—ต ๐—ฟ๐—ฎ๐˜๐—ฒ ๐—ผ๐—ณ ๐—น๐—ถ๐—ป๐—ฒ๐˜€ ๐—ผ๐—ณ ๐—ฐ๐—ผ๐—ฑ๐—ฒ ๐—ฐ๐—ผ๐—น๐—น๐—ฎ๐—ฝ๐˜€๐—ฒ๐—ฑ! Counting raw ๐š–๐š˜๐š๐šŽ๐š•๐š’๐š—๐š_*.๐š™๐šข (with โ€œCopied fromโ€ฆโ€ everywhere) we were around 362 new LOC/day; with ๐š–๐š˜๐š๐šž๐š•๐šŠ๐š› in place the effective rate is ~25 LOC/day. About ๐Ÿญ๐Ÿฑร— ๐—น๐—ผ๐˜„๐—ฒ๐—ฟ! Had we continued with a strict "one model, one file" policy who knows where we'd have ended up.
175
 
176
+ Less code to hand-maintain means fewer places to break: cyclomatic complexity isnโ€™t LOC, but they strongly correlate.
 
 
177
 
178
  {{{fragment-loc-growth}}}
179
 
 
189
 
190
  ### <a id="attention-classes"></a> External Attention classes
191
 
 
 
192
  We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:
193
 
194
 We keep a `Callable` for the naive implementation of the attention, called "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.
195
 
196
+ In other words, we moved from a class interface to a function interface: to use more complex attention implementations, the config is checked and other Callables can be used, including kernel bindings that are much faster when they are available.
197
+
198
+ This exemplifies the fact that we prefer to have an interface that is [standard, but not abstract](#standardize-dont-abstract).
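To make the function interface concrete, here is a hedged sketch of registering a custom attention Callable (it follows the registration pattern from the attention-interface docs linked above; names, shapes, and the checkpoint are illustrative, and real implementations also handle dropout and backend-specific kwargs):

```python
import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def my_eager_attention(module, query, key, value, attention_mask=None, scaling=None, **kwargs):
    # query: (batch, num_heads, q_len, head_dim); key/value may carry fewer KV heads (GQA),
    # so expand them to match the query heads before the matmuls.
    num_groups = query.shape[1] // key.shape[1]
    key = key.repeat_interleave(num_groups, dim=1)
    value = value.repeat_interleave(num_groups, dim=1)
    scaling = scaling if scaling is not None else query.shape[-1] ** -0.5
    scores = torch.matmul(query, key.transpose(-2, -1)) * scaling
    if attention_mask is not None:
        scores = scores + attention_mask[:, :, :, : key.shape[-2]]
    probs = torch.softmax(scores, dim=-1)
    # Return (batch, q_len, num_heads, head_dim), the layout the modeling code expects downstream.
    return torch.matmul(probs, value).transpose(1, 2).contiguous(), probs

# Register the Callable under a name, then select it like any built-in backend ("eager", "sdpa", ...).
AttentionInterface.register("my_eager", my_eager_attention)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", attn_implementation="my_eager")
```

The snippet that follows in the modeling code shows the other side of the contract: the configured implementation is simply looked up by name at call time.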
199
 
200
  ```python
201
  attention_interface: Callable = eager_attention_forward
 
230
 
231
 The alternative would be to modify parent classes with code specific to each parallelization scheme, which would leak infrastructure concerns into the modeling code.
232
 
233
+ It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
234
 
235
  {{{fragment-tp-plan}}}
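As a hedged usage sketch (the `tp_plan="auto"` entry point and the `torchrun` launch follow the tensor-parallelism docs; the checkpoint name is only illustrative):

```python
# Launch with: torchrun --nproc-per-node 4 tp_inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # illustrative checkpoint that ships a predefined tp_plan
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism shards linear layers across GPUs, ", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```

The modeling code itself stays untouched; the plan is resolved by the `ParallelInterface` at load time.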
236
 
 
322
 
323
 We don't have a cookbook for common VLM patterns yet (image-token scatter, multi-tower encoders, cross-attention bridges). This is one of the main areas where we can improve.
324
 
325
+ For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into the LLM decoder in 95% of existing VLMs. It would have looked something like this:
326
 
327
  ```python
328
  class InputsEmbeddingMixerMixin(nn.Module):
329
  #
330
  ```
331
 
332
+ But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). The embedding mixin is part of the model; removing it would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.
333
 
334
  This is the current state of abstractions across a modeling file:
335
 
 
381
 
382
  But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
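For the curious, a hedged and simplified sketch of the placeholder-masking pattern being standardized here (not the exact Qwen2-VL code, which also handles video tokens and validates token counts):

```python
import torch

def scatter_image_embeddings(inputs_embeds, image_features, input_ids, image_token_id):
    # inputs_embeds: (batch, seq_len, hidden); image_features: (num_image_tokens, hidden).
    # Mark every sequence position holding the <image> placeholder token...
    special_image_mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    # ...and write the vision-tower features into exactly those slots, leaving text embeddings intact.
    return inputs_embeds.masked_scatter(special_image_mask, image_features.to(inputs_embeds.dtype))
```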
383
 
 
 
 
 
 
 
 
 
 
384
  ### On image processing and processors
385
 
386
 Choosing to be a `torch`-first software meant shedding a tremendous amount of support code for `jax` and `TensorFlow`, and it also meant that we could be more liberal in the number of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
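A hedged usage sketch (the `use_fast` flag is documented; exact defaults and device-placement arguments vary across versions, and the checkpoint is only an example):

```python
import torch
from transformers import AutoImageProcessor

# Request the torchvision-backed "fast" processor and feed it a torch tensor directly,
# instead of round-tripping through PIL / numpy.
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
image = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)  # channels-first uint8 image
batch = processor(images=image, return_tensors="pt")
print(batch["pixel_values"].shape, batch["pixel_values"].dtype)
```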
 
397
 
398
  Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
399
 
400
+ Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
401
+
402
+ A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
403
+
404
+ ### <a id="encoders-ftw"></a> Model popularity
405
+
406
+ Talking about dependencies, we can look at download counts as a proxy for model popularity. One thing we see is the prominence of encoders: their usage lies in embeddings; check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoder side of the library viable, usable, and fine-tunable.
407
+
408
+ {{{fragment-model-visualisation}}}
409
+
410
+ As the codebase grows, we also need to maintain our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.
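As a small illustration of that reliance (a hedged sketch using the long-standing sentence-transformers API; the checkpoint is only an example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # a transformers encoder under the hood
docs = [
    "Modular transformers keeps the maintenance surface small.",
    "Encoders power embeddings for retrieval and indexing.",
]
query_emb = model.encode("Why do encoders still matter?", convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_embs))  # rank documents; FAISS would index doc_embs at scale
```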
411
+
412
 
413
  In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
414
 
415
  So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
416
 
417
+ ## A surgical toolbox for model development
418
 
419
  ### Attention visualisation
420
 
 
435
 
436
  ### Cooking faster CUDA warmups
437
 
438
+ Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One recent addition was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved loading performance by pre-allocating GPU memory to avoid malloc bottlenecks during model loading: roughly a 7x speedup for an 8B model and 6x for a 32B one. You can check out [the source](https://github.com/huggingface/transformers/pull/36380)!
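The idea is simple enough to sketch (hedged: this is the gist, not the actual `caching_allocator_warmup` implementation, which sizes the allocation from the checkpoint's shards and device map):

```python
import torch

def warmup_caching_allocator(total_param_bytes: int, device: str = "cuda") -> None:
    # One large allocation primes PyTorch's caching allocator: the block is requested from CUDA once...
    buffer = torch.empty(total_param_bytes, dtype=torch.uint8, device=device)
    # ...and releasing it hands the memory back to the allocator's pool, so the many per-tensor
    # copies made during from_pretrained() reuse cached blocks instead of hitting cudaMalloc each time.
    del buffer
```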
439
 
440
  {{{fragment-warmup_demo}}}
441
 
 
460
 
461
  ## Community reusability
462
 
463
+ Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
464
+
465
+ Adding a model to transformers means:
466
+
467
  - having it immediately available to the community
468
+ - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great vLLM x HF blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
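For instance, a hedged sketch of steering vLLM onto its transformers backend (the `model_impl` flag is described in the linked blog post; the checkpoint is purely illustrative):

```python
from vllm import LLM, SamplingParams

# model_impl="transformers" asks vLLM to run the architecture through transformers modeling code.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(["The transformers backend lets vLLM serve "], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```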
469
 
470
  This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.
471
 
dist/index.html CHANGED
@@ -293,8 +293,7 @@ If it only has a modeling file, we add its LOC count.
293
 However, if a model has a modular_*.py and a corresponding automatically generated modeling_*.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.</p>
294
  <p>That gives an โ€œeffective LOCโ€ curve: the ๐—บ๐—ฎ๐—ถ๐—ป๐˜๐—ฒ๐—ป๐—ฎ๐—ป๐—ฐ๐—ฒ ๐˜€๐˜‚๐—ฟ๐—ณ๐—ฎ๐—ฐ๐—ฒ.</p>
295
  <p>๐—๐˜‚๐˜€๐˜ ๐—น๐—ผ๐—ผ๐—ธ ๐—ฎ๐˜ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜: ๐˜๐—ต๐—ฒ ๐—ด๐—ฟ๐—ผ๐˜„๐˜๐—ต ๐—ฟ๐—ฎ๐˜๐—ฒ ๐—ผ๐—ณ ๐—น๐—ถ๐—ป๐—ฒ๐˜€ ๐—ผ๐—ณ ๐—ฐ๐—ผ๐—ฑ๐—ฒ ๐—ฐ๐—ผ๐—น๐—น๐—ฎ๐—ฝ๐˜€๐—ฒ๐—ฑ! Counting raw ๐š–๐š˜๐š๐šŽ๐š•๐š’๐š—๐š_*.๐š™๐šข (with โ€œCopied fromโ€ฆโ€ everywhere) we were around 362 new LOC/day; with ๐š–๐š˜๐š๐šž๐š•๐šŠ๐š› in place the effective rate is ~25 LOC/day. About ๐Ÿญ๐Ÿฑร— ๐—น๐—ผ๐˜„๐—ฒ๐—ฟ! Had we continued with a strict โ€œone model, one fileโ€ policy who knows where weโ€™d have ended up.</p>
296
- <p>Less code to hand-maintain means fewer places to break.</p>
297
- <p>Cyclomatic complexity isnโ€™t LOC, but they strongly correlate. As Les Hatton notes, defects scale like ๐™™ ~ ๐™ญ ๐™ก๐™ฃ ๐™ญ. Lower ๐˜… (lower loc) helps.</p>
298
  <p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
299
  <p>Thereโ€™s a sharp drop near the end, itโ€™s due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
300
  <p>Of course, it is not only this effort that allowed to reduce the maintenance load.</p>
@@ -302,10 +301,10 @@ However, if a model has a modular_<em>.py and a corresponding automatically gene
302
  <p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
303
  <p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasnโ€™t a <a href="#minimal-user-api">minimal user api</a>.</p>
304
  <h3><a id="attention-classes"></a> External Attention classes</h3>
305
- <p>Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
306
  <p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
307
  <p>We keep a <code>Callable</code> for the naive implementation of the attention, called โ€œeagerโ€ computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user had <code>torch</code> installed, which is a requirement in any case.</p>
308
- <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked, and use other Callables, including kernel bindings.</p>
 
309
  <pre><code class="language-python">attention_interface: Callable = eager_attention_forward
310
  if self.config._attn_implementation != &quot;eager&quot;:
311
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
@@ -324,7 +323,7 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, &quot;shape: (B, C, H, W)&quot
324
  We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a <code>nn.Linear</code> layer - should be always expressed in the same way, regardless of how it is placed.</p>
325
  <p>Hence, we want to touch <a href="#minimal-user-api">minimally</a> to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
326
  <p>The alternative would be to modify parent classes specific to their</p>
327
- <p>It is written once in the config and passed to <code>.from_pretrained()</code>. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
328
  <p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
329
  base_model_tp_plan = {
330
  "layers.*.self_attn.q_proj": "colwise",
@@ -403,11 +402,11 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
403
  <p>If youโ€™ve checked out llava, youโ€™ve seen that llava_video is a red node, connected by a red edge to llava: itโ€™s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
404
  <h3>VLM improvements, avoiding abstraction</h3>
405
  <p>We donโ€™t have cookbook for common VLM patterns (image token scatter, multiโ€‘tower encoders, crossโ€‘attn bridges). This is one of the main improvement points where we can work.</p>
406
- <p>For instance, I thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an llm decoder in 95% of the existing VLMs. It would have looked like something like</p>
407
  <pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
408
  #
409
  </code></pre>
410
- <p>But this is <a href="#standardize-dont-abstract">abstracting away an important component of the modeling.</a>. Embedding mixin is part of the model, removing it would break it. A user opening <code>modeling_qwen2.5_vl</code> should not have to go to another file.</p>
411
  <p>This is the current state of abstractions across a modeling file:</p>
412
  <p><img src="static/Bloatedness_visualizer.png" alt="Bloatedness visualizer showing abstraction levels"></p>
413
  <p>The following <a href="https://github.com/huggingface/transformers/pull/39777">Pull request to standardize placeholder masking</a> is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:</p>
@@ -452,8 +451,18 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
452
  return special_image_mask, special_video_mask
453
  </code></pre>
454
  <p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because itโ€™d break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
455
- <h3><a id="encoders-ftw"></a> Embedding models, now and forever.</h3>
456
- <p>Models popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.</p>
 
 
 
 
 
 
 
 
 
 
457
  <p><html>
458
  <head><meta charset="utf-8" /></head>
459
  <body>
@@ -4340,18 +4349,9 @@ return Plotly;
4340
  </body>
4341
  </html></p>
4342
  <p>As the codebase grows, with our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>, we need to maintain this one as well. Retrieval use-cases, smart dbs, like FAISS-based indexing rely on it, and thus indirectly on transformers.</p>
4343
- <h3>On image processing and processors</h3>
4344
- <p>Choosing to be a <code>torch</code>-first software meant relieving a tremendous amount of support from <code>jax </code> and <code>TensorFlow</code> , and it also meant that we could be more lenient into the amount of torch-dependent utilities that we were able to add. One of these is the <em>fast processing</em> of images. Where they were before assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code> and <code>torchvision</code>native inputs allowed up to speed up massively the processing time for each model.</p>
4345
- <p>The gains in performance are immense, up to 20x speed for most models when compiled torchvision ops. Further, it allows to have the whole pipeline solely on GPU.</p>
4346
- <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
4347
- <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
4348
- <h2>Reduce barrier to entry/contribution</h2>
4349
- <p>This is an overall objective: thereโ€™s no <code>transformers</code> without its community.</p>
4350
- <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
4351
- <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other softwares.</p>
4352
  <p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4353
  <p>So, how do these design choices, these โ€œtenetsโ€ influence development of models and overall usage of transformers?</p>
4354
- <h3>A surgical toolbox for model development</h3>
4355
  <h3>Attention visualisation</h3>
4356
  <p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. it allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
4357
  <p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual โ€œcausal-onlyโ€ models.</p>
@@ -4405,7 +4405,7 @@ return Plotly;
4405
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
4406
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4407
  <h3>Cooking faster CUDA warmups</h3>
4408
- <p>Having a clean <em>external</em> API allows us to work on the true inner workings of transformers. One of the few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code> which improved massively the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.</p>
4409
  <p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
4410
 
4411
  <div class=warmup-demo>
@@ -4469,10 +4469,11 @@ curl -X POST http://localhost:8000/v1/chat/completions \
4469
  <p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
4470
  <p>Continuous batching is in itself very much linked to the great work of vLLM with the <code>paged attention kernel</code>, further justifying the facilitation of <a href="#community-kernels">external kernels</a>.</p>
4471
  <h2>Community reusability</h2>
4472
- <p>Transformers-serve is transformers-first, for sure, but itโ€™s not limited to that. Adding a model to transformers means:</p>
 
4473
  <ul>
4474
  <li>having it immediately available to the community</li>
4475
- <li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
4476
  </ul>
4477
  <p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and thereโ€™s more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
4478
  <h2>What is coming next</h2>
 
293
 However, if a model has a modular_*.py and a corresponding automatically generated modeling_*.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.</p>
294
  <p>That gives an โ€œeffective LOCโ€ curve: the ๐—บ๐—ฎ๐—ถ๐—ป๐˜๐—ฒ๐—ป๐—ฎ๐—ป๐—ฐ๐—ฒ ๐˜€๐˜‚๐—ฟ๐—ณ๐—ฎ๐—ฐ๐—ฒ.</p>
295
  <p>๐—๐˜‚๐˜€๐˜ ๐—น๐—ผ๐—ผ๐—ธ ๐—ฎ๐˜ ๐˜๐—ต๐—ฒ ๐—ฟ๐—ฒ๐˜€๐˜‚๐—น๐˜: ๐˜๐—ต๐—ฒ ๐—ด๐—ฟ๐—ผ๐˜„๐˜๐—ต ๐—ฟ๐—ฎ๐˜๐—ฒ ๐—ผ๐—ณ ๐—น๐—ถ๐—ป๐—ฒ๐˜€ ๐—ผ๐—ณ ๐—ฐ๐—ผ๐—ฑ๐—ฒ ๐—ฐ๐—ผ๐—น๐—น๐—ฎ๐—ฝ๐˜€๐—ฒ๐—ฑ! Counting raw ๐š–๐š˜๐š๐šŽ๐š•๐š’๐š—๐š_*.๐š™๐šข (with โ€œCopied fromโ€ฆโ€ everywhere) we were around 362 new LOC/day; with ๐š–๐š˜๐š๐šž๐š•๐šŠ๐š› in place the effective rate is ~25 LOC/day. About ๐Ÿญ๐Ÿฑร— ๐—น๐—ผ๐˜„๐—ฒ๐—ฟ! Had we continued with a strict โ€œone model, one fileโ€ policy who knows where weโ€™d have ended up.</p>
296
+ <p>Less code to hand-maintain means fewer places to break: cyclomatic complexity isnโ€™t LOC, but they strongly correlate.</p>
 
297
  <p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
298
  <p>Thereโ€™s a sharp drop near the end, itโ€™s due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
299
  <p>Of course, it is not only this effort that allowed to reduce the maintenance load.</p>
 
301
  <p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
302
  <p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasnโ€™t a <a href="#minimal-user-api">minimal user api</a>.</p>
303
  <h3><a id="attention-classes"></a> External Attention classes</h3>
 
304
  <p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
305
  <p>We keep a <code>Callable</code> for the naive implementation of the attention, called โ€œeagerโ€ computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user had <code>torch</code> installed, which is a requirement in any case.</p>
306
+ <p>In other words, we moved from a class interface to a function interface: to use more complex attention implementations, the config is checked and other Callables can be used, including kernel bindings that are much faster when they are available.</p>
307
+ <p>This exemplifies the fact that we prefer to have an interface that is <a href="#standardize-dont-abstract">standard, but not abstract</a>.</p>
308
  <pre><code class="language-python">attention_interface: Callable = eager_attention_forward
309
  if self.config._attn_implementation != &quot;eager&quot;:
310
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
 
323
  We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a <code>nn.Linear</code> layer - should be always expressed in the same way, regardless of how it is placed.</p>
324
  <p>Hence, we want to touch <a href="#minimal-user-api">minimally</a> to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
325
  <p>The alternative would be to modify parent classes specific to their</p>
326
+ <p>It is written once in the config and passed to <code>.from_pretrained()</code>. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
327
  <p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
328
  base_model_tp_plan = {
329
  "layers.*.self_attn.q_proj": "colwise",
 
402
  <p>If youโ€™ve checked out llava, youโ€™ve seen that llava_video is a red node, connected by a red edge to llava: itโ€™s a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
403
  <h3>VLM improvements, avoiding abstraction</h3>
404
  <p>We donโ€™t have cookbook for common VLM patterns (image token scatter, multiโ€‘tower encoders, crossโ€‘attn bridges). This is one of the main improvement points where we can work.</p>
405
+ <p>For instance, we thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into the LLM decoder in 95% of existing VLMs. It would have looked something like this:</p>
406
  <pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
407
  #
408
  </code></pre>
409
+ <p>But this is <a href="#standardize-dont-abstract">abstracting away an important component of the modeling</a>. The embedding mixin is part of the model; removing it would break it. A user opening <a href="https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5"><code>modeling_qwen2.5_vl</code></a> should not have to go to another file to understand how it works.</p>
410
  <p>This is the current state of abstractions across a modeling file:</p>
411
  <p><img src="static/Bloatedness_visualizer.png" alt="Bloatedness visualizer showing abstraction levels"></p>
412
  <p>The following <a href="https://github.com/huggingface/transformers/pull/39777">Pull request to standardize placeholder masking</a> is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:</p>
 
451
  return special_image_mask, special_video_mask
452
  </code></pre>
453
  <p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because itโ€™d break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
454
+ <h3>On image processing and processors</h3>
455
+ <p>Choosing to be a <code>torch</code>-first software meant shedding a tremendous amount of support code for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal in the number of torch-dependent utilities we add. One of these is the <em>fast processing</em> of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
456
+ <p>The gains in performance are immense: up to 20x speedup for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on the GPU.</p>
457
+ <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
458
+ <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
459
+ <h2>Reduce barrier to entry/contribution</h2>
460
+ <p>This is an overall objective: thereโ€™s no <code>transformers</code> without its community.</p>
461
+ <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
462
+ <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. Very recently, <a href="https://huggingface.co/blog/welcome-openai-gpt-oss">OpenAI added GPT-OSS</a>, which prompted the addition of many new features to the library in order to support <a href="https://huggingface.co/openai/gpt-oss-120b">their model</a>.</p>
463
+ <p>A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many finetunes are registered for <a href="https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b">gpt-oss 120b</a>, despite its size!</p>
464
+ <h3><a id="encoders-ftw"></a> Model popularity</h3>
465
+ <p>Talking about dependencies, we can look at download counts as a proxy for model popularity. One thing we see is the prominence of encoders: their usage lies in embeddings; check out <a href="https://huggingface.co/blog/embeddinggemma">EmbeddingGemma</a> for a modern recap. Hence, it is vital to keep the encoder side of the library viable, usable, and fine-tunable.</p>
466
  <p><html>
467
  <head><meta charset="utf-8" /></head>
468
  <body>
 
4349
  </body>
4350
  </html></p>
4351
  <p>As the codebase grows, with our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>, we need to maintain this one as well. Retrieval use-cases, smart dbs, like FAISS-based indexing rely on it, and thus indirectly on transformers.</p>
 
 
 
 
 
 
 
 
 
4352
  <p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
4353
  <p>So, how do these design choices, these โ€œtenetsโ€ influence development of models and overall usage of transformers?</p>
4354
+ <h2>A surgical toolbox for model development</h2>
4355
  <h3>Attention visualisation</h3>
4356
  <p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. it allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
4357
  <p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual โ€œcausal-onlyโ€ models.</p>
 
4405
  <p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
4406
  <p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
4407
  <h3>Cooking faster CUDA warmups</h3>
4408
+ <p>Having a clean <em>external</em> API allows us to work on the <a href="#code-is-product">true inner workings of transformers</a>. One recent addition was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved loading performance by pre-allocating GPU memory to avoid malloc bottlenecks during model loading: roughly a 7x speedup for an 8B model and 6x for a 32B one. You can check out <a href="https://github.com/huggingface/transformers/pull/36380">the source</a>!</p>
4409
  <p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
4410
 
4411
  <div class=warmup-demo>
 
4469
  <p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
4470
  <p>Continuous batching is in itself very much linked to the great work of vLLM with the <code>paged attention kernel</code>, further justifying the facilitation of <a href="#community-kernels">external kernels</a>.</p>
4471
  <h2>Community reusability</h2>
4472
+ <p>Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be <em>reused</em> at large by the open-source ecosystem.</p>
4473
+ <p>Adding a model to transformers means:</p>
4474
  <ul>
4475
  <li>having it immediately available to the community</li>
4476
+ <li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great vLLM x HF blog post.</a></li>
4477
  </ul>
4478
  <p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and thereโ€™s more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
4479
  <h2>What is coming next</h2>