- content/article.md  +26 -23
- dist/index.html  +23 -22
content/article.md
CHANGED
@@ -173,9 +173,7 @@ That gives an "effective LOC" curve: the maintenance surface.

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` (with "Copied from…" everywhere) we were around 362 new LOC/day; with `modular` in place the effective rate is ~25 LOC/day. About **15× lower**! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.

- Less code to hand-maintain means fewer places to break.
-
- Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale like k ln k. Lower complexity (lower LOC) helps.

{{{fragment-loc-growth}}}
@@ -191,13 +189,13 @@ However, we were adding specific torch operations for each backend (sdpa, flash-

### <a id="attention-classes"></a> External Attention classes

- Externalising the [attention classes](#external-attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
-
We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:

We keep a `Callable` for the naive implementation of attention, called "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.

- In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables are used, including kernel bindings.

```python
attention_interface: Callable = eager_attention_forward
@@ -232,7 +230,7 @@ Hence, we want to touch [minimally](#minimal-user-api) to the modeling code, and

The alternative would be to modify parent classes specific to their

- It is written once in the config and passed to `.from_pretrained()`.

{{{fragment-tp-plan}}}
@@ -324,14 +322,14 @@ If you've checked out llava, you've seen that llava_video is a red node, connect

We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.

- For instance,

```python
class InputsEmbeddingMixerMixin(nn.Module):
    #
```

- But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). Embedding mixin is part of the model, removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file.

This is the current state of abstractions across a modeling file:
@@ -383,15 +381,6 @@ The following [Pull request to standardize placeholder masking](https://github.c

But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.

-
- ### <a id="encoders-ftw"></a> Embedding models, now and forever.
-
- Model popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.
-
- {{{fragment-model-visualisation}}}
-
- As the codebase grows, we also need to maintain our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
-
### On image processing and processors

Choosing to be a `torch`-first library meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
@@ -408,13 +397,24 @@ This is an overall objective: there's no `transformers` without its community.

Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.

- Among the most valuable contributions to `transformers` is of course the addition of new models.
In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?

-
### Attention visualisation
@@ -435,7 +435,7 @@ It just works with PyTorch models and is especially useful when aligning outputs

### Cooking faster CUDA warmups

- Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.

{{{fragment-warmup_demo}}}
@@ -460,9 +460,12 @@ Continuous batching is in itself very much linked to the great work of vLLM with

## Community reusability

- Transformers-serve is transformers-first, for sure, but

- having it immediately available to the community
- - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` (with "Copied from…" everywhere) we were around 362 new LOC/day; with `modular` in place the effective rate is ~25 LOC/day. About **15× lower**! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.
+ Less code to hand-maintain means fewer places to break: cyclomatic complexity isn't LOC, but they strongly correlate.
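To make the measurement concrete, here is a minimal sketch of how such an "effective LOC" count can be computed, assuming the usual `src/transformers/models/<name>/` layout (the paths and the counting script are illustrative, not the exact tooling behind the chart below):

```python
from pathlib import Path

def effective_loc(models_root: str = "src/transformers/models") -> int:
    """Count maintained lines: prefer modular_*.py when it exists, else modeling_*.py."""
    total = 0
    for model_dir in Path(models_root).iterdir():
        if not model_dir.is_dir():
            continue
        modular = sorted(model_dir.glob("modular_*.py"))
        modeling = sorted(model_dir.glob("modeling_*.py"))
        # The auto-generated modeling file is "free" whenever a modular file exists.
        for f in (modular or modeling):
            total += len(f.read_text(encoding="utf-8").splitlines())
    return total

if __name__ == "__main__":
    print(effective_loc())
```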

{{{fragment-loc-growth}}}

### <a id="attention-classes"></a> External Attention classes

We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:

We keep a `Callable` for the naive implementation of attention, called "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.

+ In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables can be used, including kernel bindings that are much faster, if they are available.
+
+ This exemplifies the fact that we prefer to have an interface that is [standard, but not abstract](#standardize-dont-abstract).
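As a sketch of what the function interface enables, before the resolution snippet below: a user can register their own attention Callable and select it by name, without touching the modeling code. This follows the documented `AttentionInterface` registry; the function body, the ignored extra kwargs, and the checkpoint name are illustrative.

```python
import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def my_sdpa_attention(module, query, key, value, attention_mask=None, **kwargs):
    # Same contract as eager_attention_forward: return (attn_output, attn_weights).
    # Extra kwargs (scaling, dropout, ...) are ignored here for brevity.
    out = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask
    )
    return out.transpose(1, 2).contiguous(), None

AttentionInterface.register("my_sdpa", my_sdpa_attention)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", attn_implementation="my_sdpa"
)
```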

```python
attention_interface: Callable = eager_attention_forward

The alternative would be to modify parent classes specific to their

+ It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
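For illustration, the user-facing side is roughly this: the plan shipped in the config is picked up with `tp_plan="auto"`, and the script is meant to be launched with `torchrun` across several GPUs (model name illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any checkpoint whose config ships a tp_plan
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, tp_plan="auto"  # use the plan written in the config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Tensor parallelism shards weights, not modeling code:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```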

{{{fragment-tp-plan}}}

We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.

+ For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:

```python
class InputsEmbeddingMixerMixin(nn.Module):
    #
```

+ But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). Embedding mixin is part of the model, removing it would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.

This is the current state of abstractions across a modeling file:
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
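As a hedged illustration of the pattern being standardized (tensor and argument names are hypothetical, not the exact Qwen2 VL code): locate the placeholder tokens in the text stream, then scatter the encoder outputs into those positions.

```python
import torch

def merge_image_embeddings(input_ids, inputs_embeds, image_embeds, image_token_id):
    # Boolean mask of the <image> placeholder tokens, broadcast to the hidden dimension.
    special_image_mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    # Scatter the vision-tower outputs into the text embedding stream, replacing placeholders.
    return inputs_embeds.masked_scatter(special_image_mask, image_embeds.to(inputs_embeds.dtype))
```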

### On image processing and processors

Choosing to be a `torch`-first library meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
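A hedged sketch of what this looks like from the user side (checkpoint name illustrative; the fast processor is requested with `use_fast=True`, and its documented `device` argument keeps the processing on GPU):

```python
import torch
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
# torch-native inputs, no round-trip through numpy/PIL.
images = [torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8) for _ in range(8)]
batch = processor(images=images, return_tensors="pt", device="cuda" if torch.cuda.is_available() else "cpu")
print(batch["pixel_values"].shape, batch["pixel_values"].device)
```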
Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.

+ Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
+
+ A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many fine-tunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
+
+ ### <a id="encoders-ftw"></a> Model popularity
+
+ Speaking of dependencies, we can take a look at model downloads as a proxy for popularity. One thing we see is the prominence of encoders: their usage is driven by embeddings; check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoder part of the library viable, usable, and fine-tunable.
+
+ {{{fragment-model-visualisation}}}
+
+ As the codebase grows, we also need to maintain our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
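For example, the typical embed-and-retrieve loop that these encoders power through Sentence Transformers looks like this (model name illustrative; FAISS or any vector database would take over at scale):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # a transformers encoder under the hood
docs = [
    "Encoders turn text into embeddings for retrieval.",
    "Paris is the capital of France.",
]
query_emb = model.encode("Why do encoder models still matter?", convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_embs))  # rank documents by cosine similarity
```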
In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?

+ ## A surgical toolbox for model development

### Attention visualisation

### Cooking faster CUDA warmups

+ Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading: a 7x speed-up for an 8B model and 6x for a 32B one. You can check out [the source](https://github.com/huggingface/transformers/pull/36380)!
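The intuition, as a hedged sketch (illustrative sizes, not the actual `caching_allocator_warmup` implementation): make one large allocation up front so the CUDA caching allocator already owns the memory when checkpoint shards stream in, instead of paying many small cudaMalloc calls.

```python
import torch

def naive_warmup(total_param_bytes: int, device: str = "cuda") -> None:
    # One big allocation primes torch's caching allocator...
    buffer = torch.empty(total_param_bytes, dtype=torch.uint8, device=device)
    # ...and freeing it returns the memory to the allocator's pool, not to the driver,
    # so subsequent per-layer weight copies reuse it without new mallocs.
    del buffer

if torch.cuda.is_available():
    naive_warmup(8_000_000_000 * 2)  # e.g. roughly an 8B-parameter model in bf16
```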

{{{fragment-warmup_demo}}}

## Community reusability

+ Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
+
+ Adding a model to transformers means:
+
- having it immediately available to the community
+ - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code, as sketched below. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures, [as seen in this great vLLM x HF blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html).
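As sketched here, the vLLM integration described in the linked post lets a transformers architecture be served by explicitly selecting the transformers backend (model name illustrative):

```python
from vllm import LLM, SamplingParams

# model_impl="transformers" forces the transformers modeling code as the backend.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(["Adding a model to transformers means "], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```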
This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.

dist/index.html
CHANGED
@@ -293,8 +293,7 @@ If it only has a modeling file, we add its LOC count.
|
|
| 293 |
However, if a model has a modular_<em>.py and a corresponding automatically generated modeling_</em>/.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.</p>
<p>That gives an "effective LOC" curve: the maintenance surface.</p>
<p>Just look at the result: the growth rate of lines of code collapsed! Counting raw <code>modeling_*.py</code> (with "Copied from…" everywhere) we were around 362 new LOC/day; with <code>modular</code> in place the effective rate is ~25 LOC/day. About 15× lower! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.</p>
- <p>Less code to hand-maintain means fewer places to break.</p>
- <p>Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale like k ln k. Lower complexity (lower LOC) helps.</p>
|
| 298 |
<p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
|
| 299 |
<p>Thereโs a sharp drop near the end, itโs due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
<p>Of course, it is not only this effort that allowed us to reduce the maintenance load.</p>

@@ -302,10 +301,10 @@ However, if a model has a modular_<em>.py and a corresponding automatically gene
<p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
<p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention), but it wasn't a <a href="#minimal-user-api">minimal user api</a>.</p>
|
| 304 |
<h3><a id="attention-classes"></a> External Attention classes</h3>
|
| 305 |
-
<p>Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
|
| 306 |
<p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
<p>We keep a <code>Callable</code> for the naive implementation of attention, called "eager" computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user has <code>torch</code> installed, which is a requirement in any case.</p>
- <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables are used, including kernel bindings.</p>
|
|
|
|
| 309 |
<pre><code class="language-python">attention_interface: Callable = eager_attention_forward
|
| 310 |
if self.config._attn_implementation != "eager":
|
| 311 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
|
@@ -324,7 +323,7 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"
|
|
| 324 |
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a <code>nn.Linear</code> layer - should be always expressed in the same way, regardless of how it is placed.</p>
|
| 325 |
<p>Hence, we want to touch <a href="#minimal-user-api">minimally</a> to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
|
| 326 |
<p>The alternative would be to modify parent classes specific to their</p>
|
| 327 |
-
<p>It is written once in the config and passed to <code>.from_pretrained()</code>.
|
| 328 |
<p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
|
| 329 |
base_model_tp_plan = {
|
| 330 |
"layers.*.self_attn.q_proj": "colwise",
|
|
@@ -403,11 +402,11 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
|
|
| 403 |
<p>If youโve checked out llava, youโve seen that llava_video is a red node, connected by a red edge to llava: itโs a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
|
| 404 |
<h3>VLM improvements, avoiding abstraction</h3>
<p>We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.</p>
|
| 406 |
-
<p>For instance,
|
| 407 |
<pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
|
| 408 |
#
|
| 409 |
</code></pre>
- <p>But this is <a href="#standardize-dont-abstract">abstracting away an important component of the modeling</a>. Embedding mixin is part of the model, removing it would break it. A user opening <code>modeling_qwen2.5_vl</code> should not have to go to another file.</p>
|
| 411 |
<p>This is the current state of abstractions across a modeling file:</p>
|
| 412 |
<p><img src="static/Bloatedness_visualizer.png" alt="Bloatedness visualizer showing abstraction levels"></p>
|
| 413 |
<p>The following <a href="https://github.com/huggingface/transformers/pull/39777">Pull request to standardize placeholder masking</a> is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:</p>
|
|
@@ -452,8 +451,18 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
|
|
| 452 |
return special_image_mask, special_video_mask
|
| 453 |
</code></pre>
|
| 454 |
<p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because itโd break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
- <h3><a id="encoders-ftw"></a> Embedding models, now and forever.</h3>
- <p>Model popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.</p>
<p><html>
|
| 458 |
<head><meta charset="utf-8" /></head>
|
| 459 |
<body>
|
|
@@ -4340,18 +4349,9 @@ return Plotly;
|
|
| 4340 |
</body>
|
| 4341 |
</html></p>
<p>As the codebase grows, we also need to maintain our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>. Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
- <h3>On image processing and processors</h3>
- <p>Choosing to be a <code>torch</code>-first library meant relieving a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the <em>fast processing</em> of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
- <p>The gains in performance are immense, up to 20x speed-ups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.</p>
- <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
- <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
- <h2>Reduce barrier to entry/contribution</h2>
- <p>This is an overall objective: there's no <code>transformers</code> without its community.</p>
- <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
- <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.</p>
<p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
<p>So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?</p>
|
| 4355 |
<h3>Attention visualisation</h3>
|
| 4356 |
<p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. it allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
|
| 4357 |
<p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual โcausal-onlyโ models.</p>
|
|
@@ -4405,7 +4405,7 @@ return Plotly;
|
|
| 4405 |
<p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
|
| 4406 |
<p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
|
| 4407 |
<h3>Cooking faster CUDA warmups</h3>
- <p>Having a clean <em>external</em> API allows us to work on the true inner workings of transformers. One of the few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.</p>
|
| 4409 |
<p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
|
| 4410 |
|
| 4411 |
<div class=warmup-demo>
|
|
@@ -4469,10 +4469,11 @@ curl -X POST http://localhost:8000/v1/chat/completions \
|
|
| 4469 |
<p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
|
| 4470 |
<p>Continuous batching is in itself very much linked to the great work of vLLM with the <code>paged attention kernel</code>, further justifying the facilitation of <a href="#community-kernels">external kernels</a>.</p>
|
| 4471 |
<h2>Community reusability</h2>
|
| 4472 |
-
<p>Transformers-serve is transformers-first, for sure, but
|
|
|
|
| 4473 |
<ul>
|
| 4474 |
<li>having it immediately available to the community</li>
|
| 4475 |
-
<li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
|
| 4476 |
</ul>
|
| 4477 |
<p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and thereโs more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
|
| 4478 |
<h2>What is coming next</h2>
|
|
|
|
| 293 |
However, if a model has a modular_<em>.py and a corresponding automatically generated modeling_</em>/.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.</p>
<p>That gives an "effective LOC" curve: the maintenance surface.</p>
<p>Just look at the result: the growth rate of lines of code collapsed! Counting raw <code>modeling_*.py</code> (with "Copied from…" everywhere) we were around 362 new LOC/day; with <code>modular</code> in place the effective rate is ~25 LOC/day. About 15× lower! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.</p>
+ <p>Less code to hand-maintain means fewer places to break: cyclomatic complexity isn't LOC, but they strongly correlate.</p>
|
|
|
|
| 297 |
<p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
|
| 298 |
<p>Thereโs a sharp drop near the end, itโs due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
<p>Of course, it is not only this effort that allowed us to reduce the maintenance load.</p>
|
|
|
|
| 301 |
<p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
|
| 302 |
<p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasnโt a <a href="#minimal-user-api">minimal user api</a>.</p>
|
| 303 |
<h3><a id="attention-classes"></a> External Attention classes</h3>
|
|
|
|
| 304 |
<p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
<p>We keep a <code>Callable</code> for the naive implementation of attention, called "eager" computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user has <code>torch</code> installed, which is a requirement in any case.</p>
+ <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables can be used, including kernel bindings that are much faster, if they are available.</p>
|
| 307 |
+
<p>This exemplifies the fact that we prefer to have an interface that is <a href="#standardize-dont-abstract">standard, but not abstract</a>.</p>
|
| 308 |
<pre><code class="language-python">attention_interface: Callable = eager_attention_forward
|
| 309 |
if self.config._attn_implementation != "eager":
|
| 310 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
|
|
|
| 323 |
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a <code>nn.Linear</code> layer - should be always expressed in the same way, regardless of how it is placed.</p>
|
| 324 |
<p>Hence, we want to touch <a href="#minimal-user-api">minimally</a> to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
|
| 325 |
<p>The alternative would be to modify parent classes specific to their</p>
|
| 326 |
+
<p>It is written once in the config and passed to <code>.from_pretrained()</code>. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
|
| 327 |
<p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
|
| 328 |
base_model_tp_plan = {
|
| 329 |
"layers.*.self_attn.q_proj": "colwise",
|
|
|
|
| 402 |
<p>If youโve checked out llava, youโve seen that llava_video is a red node, connected by a red edge to llava: itโs a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
|
| 403 |
<h3>VLM improvements, avoiding abstraction</h3>
<p>We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.</p>
+ <p>For instance, we thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
|
| 406 |
<pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
|
| 407 |
#
|
| 408 |
</code></pre>
+ <p>But this is <a href="#standardize-dont-abstract">abstracting away an important component of the modeling</a>. Embedding mixin is part of the model, removing it would break it. A user opening <a href="https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5"><code>modeling_qwen2.5_vl</code></a> should not have to go to another file to understand how it works.</p>
|
| 410 |
<p>This is the current state of abstractions across a modeling file:</p>
|
| 411 |
<p><img src="static/Bloatedness_visualizer.png" alt="Bloatedness visualizer showing abstraction levels"></p>
|
| 412 |
<p>The following <a href="https://github.com/huggingface/transformers/pull/39777">Pull request to standardize placeholder masking</a> is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:</p>
|
|
|
|
| 451 |
return special_image_mask, special_video_mask
|
| 452 |
</code></pre>
|
| 453 |
<p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because itโd break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
+ <h3>On image processing and processors</h3>
+ <p>Choosing to be a <code>torch</code>-first library meant relieving a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the <em>fast processing</em> of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
+ <p>The gains in performance are immense, up to 20x speed-ups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.</p>
|
| 457 |
+
<p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
|
| 458 |
+
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
|
| 459 |
+
<h2>Reduce barrier to entry/contribution</h2>
|
| 460 |
+
<p>This is an overall objective: thereโs no <code>transformers</code> without its community.</p>
|
| 461 |
+
<p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
|
| 462 |
+
<p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. Very recently, <a href="https://huggingface.co/blog/welcome-openai-gpt-oss">OpenAI added GPT-OSS</a>, which prompted the addition of many new features to the library in order to support <a href="https://huggingface.co/openai/gpt-oss-120b">their model</a>.</p>
+ <p>A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many fine-tunes are registered for <a href="https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b">gpt-oss 120b</a>, despite its size!</p>
+ <h3><a id="encoders-ftw"></a> Model popularity</h3>
+ <p>Speaking of dependencies, we can take a look at model downloads as a proxy for popularity. One thing we see is the prominence of encoders: their usage is driven by embeddings; check out <a href="https://huggingface.co/blog/embeddinggemma">EmbeddingGemma</a> for a modern recap. Hence, it is vital to keep the encoder part of the library viable, usable, and fine-tunable.</p>
|
| 466 |
<p><html>
|
| 467 |
<head><meta charset="utf-8" /></head>
|
| 468 |
<body>
|
|
|
|
| 4349 |
</body>
|
| 4350 |
</html></p>
<p>As the codebase grows, we also need to maintain our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>. Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
<p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
<p>So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?</p>
+ <h2>A surgical toolbox for model development</h2>
|
| 4355 |
<h3>Attention visualisation</h3>
|
| 4356 |
<p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. it allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
|
| 4357 |
<p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual โcausal-onlyโ models.</p>
|
|
|
|
| 4405 |
<p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
|
| 4406 |
<p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
|
| 4407 |
<h3>Cooking faster CUDA warmups</h3>
+ <p>Having a clean <em>external</em> API allows us to work on the <a href="#code-is-product">true inner workings of transformers</a>. One of the few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading: a 7x speed-up for an 8B model and 6x for a 32B one. You can check out <a href="https://github.com/huggingface/transformers/pull/36380">the source</a>!</p>
|
| 4409 |
<p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
|
| 4410 |
|
| 4411 |
<div class=warmup-demo>
|
|
|
|
| 4469 |
<p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
|
| 4470 |
<p>Continuous batching is in itself very much linked to the great work of vLLM with the <code>paged attention kernel</code>, further justifying the facilitation of <a href="#community-kernels">external kernels</a>.</p>
|
| 4471 |
<h2>Community reusability</h2>
|
| 4472 |
+
<p>Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be <em>reused</em> at large by the open-source ecosystem.</p>
|
| 4473 |
+
<p>Adding a model to transformers means:</p>
|
| 4474 |
<ul>
|
| 4475 |
<li>having it immediately available to the community</li>
|
| 4476 |
+
<li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great vLLM x HF blog post.</a></li>
|
| 4477 |
</ul>
|
| 4478 |
<p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and thereโs more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
|
| 4479 |
<h2>What is coming next</h2>
|