- content/article.md  +26 -23
- dist/index.html  +23 -22
content/article.md
CHANGED
@@ -173,9 +173,7 @@ That gives an "effective LOC" curve: the maintenance surface.

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` (with "Copied from…" everywhere) we were around 362 new LOC/day; with `modular` in place the effective rate is ~25 LOC/day. About **15× lower**! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.

- Less code to hand-maintain means fewer places to break.
-
- Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale like k ln k. Lower complexity (lower LOC) helps.

{{{fragment-loc-growth}}}
@@ -191,13 +189,13 @@ However, we were adding specific torch operations for each backend (sdpa, flash-

### <a id="attention-classes"></a> External Attention classes

- Externalising the [attention classes](#external-attention-classes) has moved out a lot of repeated code that was [standard](#standardize-dont-abstract).
-
We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:

We keep a `Callable` for the naive implementation of attention, called "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.

- In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables are used, including kernel bindings.

```python
attention_interface: Callable = eager_attention_forward
@@ -232,7 +230,7 @@ Hence, we want to touch [minimally](#minimal-user-api) to the modeling code, and

The alternative would be to modify parent classes specific to their

- It is written once in the config and passed to `.from_pretrained()`.

{{{fragment-tp-plan}}}
@@ -324,14 +322,14 @@ If you've checked out llava, you've seen that llava_video is a red node, connect

We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.

- For instance,

```python
class InputsEmbeddingMixerMixin(nn.Module):
    #
```

- But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). Embedding mixin is part of the model, removing it would break it. A user opening `modeling_qwen2.5_vl` should not have to go to another file.

This is the current state of abstractions across a modeling file:
@@ -383,15 +381,6 @@ The following [Pull request to standardize placeholder masking](https://github.c

But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.

-
- ### <a id="encoders-ftw"></a> Embedding models, now and forever.
-
- Model popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.
-
- {{{fragment-model-visualisation}}}
-
- As the codebase grows, we also need to maintain our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
-
### On image processing and processors

Choosing to be a `torch`-first library meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
@@ -408,13 +397,24 @@ This is an overall objective: there's no `transformers` without its community.

Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.

- Among the most valuable contributions to `transformers` is of course the addition of new models.
In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?

-
### Attention visualisation
@@ -435,7 +435,7 @@ It just works with PyTorch models and is especially useful when aligning outputs

### Cooking faster CUDA warmups

- Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.

{{{fragment-warmup_demo}}}
@@ -460,9 +460,12 @@ Continuous batching is in itself very much linked to the great work of vLLM with

## Community reusability

- Transformers-serve is transformers-first, for sure, but

- having it immediately available to the community
- - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.

**Just look at the result: the growth rate of lines of code collapsed!** Counting raw `modeling_*.py` (with "Copied from…" everywhere) we were around 362 new LOC/day; with `modular` in place the effective rate is ~25 LOC/day. About **15× lower**! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.
+ Less code to hand-maintain means fewer places to break: cyclomatic complexity isn't LOC, but they strongly correlate.
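To make the measurement concrete, here is a minimal sketch of how such an "effective LOC" count can be computed, assuming the usual `src/transformers/models/<name>/` layout (the paths and the counting script are illustrative, not the exact tooling behind the chart below):

```python
from pathlib import Path

def effective_loc(models_root: str = "src/transformers/models") -> int:
    """Count maintained lines: prefer modular_*.py when it exists, else modeling_*.py."""
    total = 0
    for model_dir in Path(models_root).iterdir():
        if not model_dir.is_dir():
            continue
        modular = sorted(model_dir.glob("modular_*.py"))
        modeling = sorted(model_dir.glob("modeling_*.py"))
        # The auto-generated modeling file is "free" whenever a modular file exists.
        for f in (modular or modeling):
            total += len(f.read_text(encoding="utf-8").splitlines())
    return total

if __name__ == "__main__":
    print(effective_loc())
```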

{{{fragment-loc-growth}}}

### <a id="attention-classes"></a> External Attention classes

We moved to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allowed the following:

We keep a `Callable` for the naive implementation of attention, called "eager" computation. This Callable is named `eager_attention_forward`, and can be run as long as the user has `torch` installed, which is a requirement in any case.

+ In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables can be used, including kernel bindings that are much faster, if they are available.
+
+ This exemplifies the fact that we prefer to have an interface that is [standard, but not abstract](#standardize-dont-abstract).
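As a sketch of what the function interface enables, before the resolution snippet below: a user can register their own attention Callable and select it by name, without touching the modeling code. This follows the documented `AttentionInterface` registry; the function body, the ignored extra kwargs, and the checkpoint name are illustrative.

```python
import torch
from transformers import AttentionInterface, AutoModelForCausalLM

def my_sdpa_attention(module, query, key, value, attention_mask=None, **kwargs):
    # Same contract as eager_attention_forward: return (attn_output, attn_weights).
    # Extra kwargs (scaling, dropout, ...) are ignored here for brevity.
    out = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask
    )
    return out.transpose(1, 2).contiguous(), None

AttentionInterface.register("my_sdpa", my_sdpa_attention)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", attn_implementation="my_sdpa"
)
```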

```python
attention_interface: Callable = eager_attention_forward

The alternative would be to modify parent classes specific to their

+ It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
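For illustration, the user-facing side is roughly this: the plan shipped in the config is picked up with `tp_plan="auto"`, and the script is meant to be launched with `torchrun` across several GPUs (model name illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any checkpoint whose config ships a tp_plan
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, tp_plan="auto"  # use the plan written in the config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Tensor parallelism shards weights, not modeling code:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```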

{{{fragment-tp-plan}}}

We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.

+ For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:

```python
class InputsEmbeddingMixerMixin(nn.Module):
    #
```

+ But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). Embedding mixin is part of the model, removing it would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.

This is the current state of abstractions across a modeling file:
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
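As a hedged illustration of the pattern being standardized (tensor and argument names are hypothetical, not the exact Qwen2 VL code): locate the placeholder tokens in the text stream, then scatter the encoder outputs into those positions.

```python
import torch

def merge_image_embeddings(input_ids, inputs_embeds, image_embeds, image_token_id):
    # Boolean mask of the <image> placeholder tokens, broadcast to the hidden dimension.
    special_image_mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    # Scatter the vision-tower outputs into the text embedding stream, replacing placeholders.
    return inputs_embeds.masked_scatter(special_image_mask, image_embeds.to(inputs_embeds.dtype))
```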

### On image processing and processors

Choosing to be a `torch`-first library meant relieving a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
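A hedged sketch of what this looks like from the user side (checkpoint name illustrative; the fast processor is requested with `use_fast=True`, and its documented `device` argument keeps the processing on GPU):

```python
import torch
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
# torch-native inputs, no round-trip through numpy/PIL.
images = [torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8) for _ in range(8)]
batch = processor(images=images, return_tensors="pt", device="cuda" if torch.cuda.is_available() else "cpu")
print(batch["pixel_values"].shape, batch["pixel_values"].device)
```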
Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.

+ Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
+
+ A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many fine-tunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
+
+ ### <a id="encoders-ftw"></a> Model popularity
+
+ Speaking of dependencies, we can take a look at model downloads as a proxy for popularity. One thing we see is the prominence of encoders: their usage is driven by embeddings; check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoder part of the library viable, usable, and fine-tunable.
+
+ {{{fragment-model-visualisation}}}
+
+ As the codebase grows, we also need to maintain our friend codebase [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
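For example, the typical embed-and-retrieve loop that these encoders power through Sentence Transformers looks like this (model name illustrative; FAISS or any vector database would take over at scale):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # a transformers encoder under the hood
docs = [
    "Encoders turn text into embeddings for retrieval.",
    "Paris is the capital of France.",
]
query_emb = model.encode("Why do encoder models still matter?", convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_embs))  # rank documents by cosine similarity
```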
In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?

+ ## A surgical toolbox for model development

### Attention visualisation

### Cooking faster CUDA warmups

+ Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One of the few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading: a 7x speed-up for an 8B model and 6x for a 32B one. You can check out [the source](https://github.com/huggingface/transformers/pull/36380)!
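The intuition, as a hedged sketch (illustrative sizes, not the actual `caching_allocator_warmup` implementation): make one large allocation up front so the CUDA caching allocator already owns the memory when checkpoint shards stream in, instead of paying many small cudaMalloc calls.

```python
import torch

def naive_warmup(total_param_bytes: int, device: str = "cuda") -> None:
    # One big allocation primes torch's caching allocator...
    buffer = torch.empty(total_param_bytes, dtype=torch.uint8, device=device)
    # ...and freeing it returns the memory to the allocator's pool, not to the driver,
    # so subsequent per-layer weight copies reuse it without new mallocs.
    del buffer

if torch.cuda.is_available():
    naive_warmup(8_000_000_000 * 2)  # e.g. roughly an 8B-parameter model in bf16
```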

{{{fragment-warmup_demo}}}

## Community reusability

+ Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
+
+ Adding a model to transformers means:
+
- having it immediately available to the community
+ - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code, as sketched below. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures, [as seen in this great vLLM x HF blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html).
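As sketched here, the vLLM integration described in the linked post lets a transformers architecture be served by explicitly selecting the transformers backend (model name illustrative):

```python
from vllm import LLM, SamplingParams

# model_impl="transformers" forces the transformers modeling code as the backend.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(["Adding a model to transformers means "], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```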
This cements the need even more for a [consistent public surface](#consistent-public-surface): we are now a backend, and there's more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.

dist/index.html
CHANGED
@@ -293,8 +293,7 @@ If it only has a modeling file, we add its LOC count.
|
|
| 293 |
However, if a model has a modular_<em>.py and a corresponding automatically generated modeling_</em>/.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.</p>
<p>That gives an "effective LOC" curve: the maintenance surface.</p>
<p>Just look at the result: the growth rate of lines of code collapsed! Counting raw <code>modeling_*.py</code> (with "Copied from…" everywhere) we were around 362 new LOC/day; with <code>modular</code> in place the effective rate is ~25 LOC/day. About 15× lower! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.</p>
- <p>Less code to hand-maintain means fewer places to break.</p>
- <p>Cyclomatic complexity isn't LOC, but they strongly correlate. As Les Hatton notes, defects scale like k ln k. Lower complexity (lower LOC) helps.</p>
|
| 298 |
<p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
|
| 299 |
<p>Thereโs a sharp drop near the end, itโs due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
<p>Of course, it is not only this effort that allowed us to reduce the maintenance load.</p>

@@ -302,10 +301,10 @@ However, if a model has a modular_<em>.py and a corresponding automatically gene
<p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
<p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention), but it wasn't a <a href="#minimal-user-api">minimal user api</a>.</p>
|
| 304 |
<h3><a id="attention-classes"></a> External Attention classes</h3>
|
| 305 |
-
<p>Externalising the <a href="#external-attention-classes">attention classes</a> has moved out a lot of repeated code that was <a href="#standardize-dont-abstract">standard</a>.</p>
|
| 306 |
<p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
<p>We keep a <code>Callable</code> for the naive implementation of attention, called "eager" computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user has <code>torch</code> installed, which is a requirement in any case.</p>
- <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables are used, including kernel bindings.</p>
|
|
|
|
| 309 |
<pre><code class="language-python">attention_interface: Callable = eager_attention_forward
|
| 310 |
if self.config._attn_implementation != "eager":
|
| 311 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
|
@@ -324,7 +323,7 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"
|
|
| 324 |
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a <code>nn.Linear</code> layer - should be always expressed in the same way, regardless of how it is placed.</p>
|
| 325 |
<p>Hence, we want to touch <a href="#minimal-user-api">minimally</a> to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
|
| 326 |
<p>The alternative would be to modify parent classes specific to their</p>
|
| 327 |
-
<p>It is written once in the config and passed to <code>.from_pretrained()</code>.
|
| 328 |
<p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
|
| 329 |
base_model_tp_plan = {
|
| 330 |
"layers.*.self_attn.q_proj": "colwise",
|
|
@@ -403,11 +402,11 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
|
|
| 403 |
<p>If youโve checked out llava, youโve seen that llava_video is a red node, connected by a red edge to llava: itโs a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
|
| 404 |
<h3>VLM improvements, avoiding abstraction</h3>
<p>We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.</p>
|
| 406 |
-
<p>For instance,
|
| 407 |
<pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
|
| 408 |
#
|
| 409 |
</code></pre>
- <p>But this is <a href="#standardize-dont-abstract">abstracting away an important component of the modeling</a>. Embedding mixin is part of the model, removing it would break it. A user opening <code>modeling_qwen2.5_vl</code> should not have to go to another file.</p>
|
| 411 |
<p>This is the current state of abstractions across a modeling file:</p>
|
| 412 |
<p><img src="static/Bloatedness_visualizer.png" alt="Bloatedness visualizer showing abstraction levels"></p>
|
| 413 |
<p>The following <a href="https://github.com/huggingface/transformers/pull/39777">Pull request to standardize placeholder masking</a> is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:</p>
|
|
@@ -452,8 +451,18 @@ As you can see, there is a small DETR island, a little llava pocket, and so on,
|
|
| 452 |
return special_image_mask, special_video_mask
|
| 453 |
</code></pre>
|
| 454 |
<p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because itโd break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
- <h3><a id="encoders-ftw"></a> Embedding models, now and forever.</h3>
- <p>Model popularity speaks for itself! This is because the usage of encoders lies in embeddings. So we have to keep the encoders part viable, usable, fine-tune-able.</p>
<p><html>
|
| 458 |
<head><meta charset="utf-8" /></head>
|
| 459 |
<body>
|
|
@@ -4340,18 +4349,9 @@ return Plotly;
|
|
| 4340 |
</body>
|
| 4341 |
</html></p>
<p>As the codebase grows, we also need to maintain our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>. Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
- <h3>On image processing and processors</h3>
- <p>Choosing to be a <code>torch</code>-first library meant relieving a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the <em>fast processing</em> of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
- <p>The gains in performance are immense, up to 20x speed-ups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.</p>
- <p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
- <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
- <h2>Reduce barrier to entry/contribution</h2>
- <p>This is an overall objective: there's no <code>transformers</code> without its community.</p>
- <p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
- <p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. A second one is the ability to fine-tune and pipeline these models into many other pieces of software.</p>
<p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
<p>So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?</p>
|
| 4355 |
<h3>Attention visualisation</h3>
|
| 4356 |
<p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. it allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
|
| 4357 |
<p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual โcausal-onlyโ models.</p>
|
|
@@ -4405,7 +4405,7 @@ return Plotly;
|
|
| 4405 |
<p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
|
| 4406 |
<p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
|
| 4407 |
<h3>Cooking faster CUDA warmups</h3>
- <p>Having a clean <em>external</em> API allows us to work on the true inner workings of transformers. One of the few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.</p>
|
| 4409 |
<p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
|
| 4410 |
|
| 4411 |
<div class=warmup-demo>
|
|
@@ -4469,10 +4469,11 @@ curl -X POST http://localhost:8000/v1/chat/completions \
|
|
| 4469 |
<p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
|
| 4470 |
<p>Continuous batching is in itself very much linked to the great work of vLLM with the <code>paged attention kernel</code>, further justifying the facilitation of <a href="#community-kernels">external kernels</a>.</p>
|
| 4471 |
<h2>Community reusability</h2>
|
| 4472 |
-
<p>Transformers-serve is transformers-first, for sure, but
|
|
|
|
| 4473 |
<ul>
|
| 4474 |
<li>having it immediately available to the community</li>
|
| 4475 |
-
<li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great blog post.</a></li>
|
| 4476 |
</ul>
|
| 4477 |
<p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and thereโs more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
|
| 4478 |
<h2>What is coming next</h2>
|
|
|
|
| 293 |
However, if a model has a modular_<em>.py and a corresponding automatically generated modeling_</em>/.py, we only count the LOC under the modular file. The modeling code has no maintenance cost as it is strictly dependent on the modular file.</p>
<p>That gives an "effective LOC" curve: the maintenance surface.</p>
<p>Just look at the result: the growth rate of lines of code collapsed! Counting raw <code>modeling_*.py</code> (with "Copied from…" everywhere) we were around 362 new LOC/day; with <code>modular</code> in place the effective rate is ~25 LOC/day. About 15× lower! Had we continued with a strict "one model, one file" policy, who knows where we'd have ended up.</p>
+ <p>Less code to hand-maintain means fewer places to break: cyclomatic complexity isn't LOC, but they strongly correlate.</p>
|
|
|
|
| 297 |
<p><iframe src=https://molbap-loc-1.hf.space style="width:100%; height:680px; border:0" allow="clipboard-read; clipboard-write; fullscreen" referrerpolicy=no-referrer-when-downgrade></iframe></p>
|
| 298 |
<p>Thereโs a sharp drop near the end, itโs due to us <a href="https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc">removing support for Jax and TensorFlow</a> library-wide.</p>
<p>Of course, it is not only this effort that allowed us to reduce the maintenance load.</p>
|
|
|
|
| 301 |
<p>The <em>attention computation</em> itself happens at a <em>lower</em> level of abstraction than the model itself.</p>
|
| 302 |
<p>However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasnโt a <a href="#minimal-user-api">minimal user api</a>.</p>
|
| 303 |
<h3><a id="attention-classes"></a> External Attention classes</h3>
|
|
|
|
| 304 |
<p>We moved to an <a href="https://huggingface.co/docs/transformers/en/attention_interface">attention interface</a> that allowed the following:</p>
<p>We keep a <code>Callable</code> for the naive implementation of attention, called "eager" computation. This Callable is named <code>eager_attention_forward</code>, and can be run as long as the user has <code>torch</code> installed, which is a requirement in any case.</p>
+ <p>In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked and other Callables can be used, including kernel bindings that are much faster, if they are available.</p>
|
| 307 |
+
<p>This exemplifies the fact that we prefer to have an interface that is <a href="#standardize-dont-abstract">standard, but not abstract</a>.</p>
|
| 308 |
<pre><code class="language-python">attention_interface: Callable = eager_attention_forward
|
| 309 |
if self.config._attn_implementation != "eager":
|
| 310 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
|
|
|
| 323 |
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a <code>nn.Linear</code> layer - should be always expressed in the same way, regardless of how it is placed.</p>
|
| 324 |
<p>Hence, we want to touch <a href="#minimal-user-api">minimally</a> to the modeling code, and only modify it when <em>architectural changes</em> are involved. For instance, for tensor parallelism, we instead now specify a simple <code>tp_plan</code>.</p>
|
| 325 |
<p>The alternative would be to modify parent classes specific to their</p>
|
| 326 |
+
<p>It is written once in the config and passed to <code>.from_pretrained()</code>. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal <code>ParallelInterface</code>, which wires to sharding implementations <code>ColwiseParallel</code>, <code>RowwiseParallel</code>, packed variants, and so on.</p>
|
| 327 |
<p><pre><code class="language-python"># In the model's config (example: ERNIE 4.5-style decoder blocks)
|
| 328 |
base_model_tp_plan = {
|
| 329 |
"layers.*.self_attn.q_proj": "colwise",
|
|
|
|
| 402 |
<p>If youโve checked out llava, youโve seen that llava_video is a red node, connected by a red edge to llava: itโs a candidate, something that we can <em>likely</em> remodularize, <a href="#backwards-compatibility">not touching the actual model</a> but being much more readable with <a href="#do-repeat-yourself">DRY*</a>.</p>
|
| 403 |
<h3>VLM improvements, avoiding abstraction</h3>
<p>We don't have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attn bridges). This is one of the main points where we can improve.</p>
+ <p>For instance, we thought of abstracting away the mixing of <code>inputs_embeds</code>, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:</p>
|
| 406 |
<pre><code class="language-python">class InputsEmbeddingMixerMixin(nn.Module):
|
| 407 |
#
|
| 408 |
</code></pre>
+ <p>But this is <a href="#standardize-dont-abstract">abstracting away an important component of the modeling</a>. Embedding mixin is part of the model, removing it would break it. A user opening <a href="https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5"><code>modeling_qwen2.5_vl</code></a> should not have to go to another file to understand how it works.</p>
|
| 410 |
<p>This is the current state of abstractions across a modeling file:</p>
|
| 411 |
<p><img src="static/Bloatedness_visualizer.png" alt="Bloatedness visualizer showing abstraction levels"></p>
|
| 412 |
<p>The following <a href="https://github.com/huggingface/transformers/pull/39777">Pull request to standardize placeholder masking</a> is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:</p>
|
|
|
|
| 451 |
return special_image_mask, special_video_mask
|
| 452 |
</code></pre>
|
| 453 |
<p>But this is <em>within</em> the modeling file, not in the <code>PreTrainedModel</code> base class. It will not move away from it, because itโd break the <a href="#one-model-one-file">self-contained logic</a> of the model.</p>
+ <h3>On image processing and processors</h3>
+ <p>Choosing to be a <code>torch</code>-first library meant relieving a tremendous amount of support for <code>jax</code> and <code>TensorFlow</code>, and it also meant that we could be more liberal about the amount of torch-dependent utilities we were able to add. One of these is the <em>fast processing</em> of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing <code>torch</code>- and <code>torchvision</code>-native inputs allowed us to massively speed up the processing time for each model.</p>
+ <p>The gains in performance are immense, up to 20x speed-ups for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.</p>
|
| 457 |
+
<p><img src="static/fast_image_processors.png" alt="Fast Image Processors Performance"></p>
|
| 458 |
+
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
|
| 459 |
+
<h2>Reduce barrier to entry/contribution</h2>
|
| 460 |
+
<p>This is an overall objective: thereโs no <code>transformers</code> without its community.</p>
|
| 461 |
+
<p>Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.</p>
|
| 462 |
+
<p>Among the most valuable contributions to <code>transformers</code> is of course the addition of new models. Very recently, <a href="https://huggingface.co/blog/welcome-openai-gpt-oss">OpenAI added GPT-OSS</a>, which prompted the addition of many new features to the library in order to support <a href="https://huggingface.co/openai/gpt-oss-120b">their model</a>.</p>
+ <p>A second one is the ability to fine-tune and pipeline these models into many other pieces of software. Check on the Hub how many fine-tunes are registered for <a href="https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b">gpt-oss 120b</a>, despite its size!</p>
+ <h3><a id="encoders-ftw"></a> Model popularity</h3>
+ <p>Speaking of dependencies, we can take a look at model downloads as a proxy for popularity. One thing we see is the prominence of encoders: their usage is driven by embeddings; check out <a href="https://huggingface.co/blog/embeddinggemma">EmbeddingGemma</a> for a modern recap. Hence, it is vital to keep the encoder part of the library viable, usable, and fine-tunable.</p>
|
| 466 |
<p><html>
|
| 467 |
<head><meta charset="utf-8" /></head>
|
| 468 |
<body>
|
|
|
|
| 4349 |
</body>
|
| 4350 |
</html></p>
<p>As the codebase grows, we also need to maintain our friend codebase <a href="https://huggingface.co/sentence-transformers">Sentence Transformers</a>. Retrieval use-cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.</p>
<p>In that regard, we DO want to be a modular toolbox, being <a href="#minimal-user-api">minimal</a> enough and well documented enough so any ML/AI developer can use <code>transformers</code> without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.</p>
<p>So, how do these design choices, these "tenets", influence the development of models and the overall usage of transformers?</p>
+ <h2>A surgical toolbox for model development</h2>
|
| 4355 |
<h3>Attention visualisation</h3>
|
| 4356 |
<p>All models have the same API internally for attention computation, thanks to <a href="#external-attention-classes">the externalisation of attention classes</a>. it allows us to build cool tools to visualize the inner workings of the attention mechanism.</p>
|
| 4357 |
<p>One particular piece of machinery is the <code>attention mask</code>. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual โcausal-onlyโ models.</p>
|
|
|
|
| 4405 |
<p>It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, aligned with our <a href="#source-of-truth">core guideline</a>.</p>
|
| 4406 |
<p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
|
| 4407 |
<h3>Cooking faster CUDA warmups</h3>
+ <p>Having a clean <em>external</em> API allows us to work on the <a href="#code-is-product">true inner workings of transformers</a>. One of the few recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading: a 7x speed-up for an 8B model and 6x for a 32B one. You can check out <a href="https://github.com/huggingface/transformers/pull/36380">the source</a>!</p>
|
| 4409 |
<p><style>.warmup-demo body{background-color:#f5f5f5;margin:0;padding:20px;font-family:Segoe UI,Tahoma,Geneva,Verdana,sans-serif}.warmup-demo .container{background:#fff;border-radius:12px;max-width:1200px;margin:0 auto;padding:30px;box-shadow:0 4px 6px #0000001a}.warmup-demo h1{text-align:center;color:#333;margin-bottom:10px}.warmup-demo .subtitle{text-align:center;color:#666;margin-bottom:30px;font-size:16px}.warmup-demo .demo-container{gap:40px;margin-bottom:30px;display:flex}.warmup-demo .side{background:#fafafa;border:2px solid #ddd;border-radius:8px;flex:1;padding:20px}.warmup-demo .side h2{text-align:center;color:#333;margin-top:0}.warmup-demo .no-warmup h2{color:#d63384}.warmup-demo .with-warmup h2{color:#198754}.warmup-demo .memory-area{background:#fff;border:2px dashed #ccc;border-radius:6px;height:400px;margin:20px 0;padding:10px;position:relative;overflow:hidden}.warmup-demo .layer-box{background:#fff;border:2px solid #666;border-radius:4px;width:80px;height:30px;margin:3px;transition:all .3s;display:inline-block;position:relative}.warmup-demo .layer-box.allocating{background:#e9ecef;border-color:#adb5bd}.warmup-demo .layer-box.allocating:after{content:"malloc";color:#666;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .layer-box.loaded{background:#d1e7dd;border-color:#198754}.warmup-demo .layer-box.loaded:after{content:"data";color:#198754;font-size:10px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container{background:#fff;border:3px solid #666;border-radius:6px;width:100%;height:60px;margin-bottom:20px;position:relative;overflow:hidden}.warmup-demo .warmup-container.allocated{background:#e7f1ff;border-color:#0d6efd}.warmup-demo .warmup-container:before{content:"Pre-allocated Memory Pool";color:#666;z-index:1;font-size:14px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .warmup-container.allocated:before{color:#0d6efd}.warmup-demo .warmup-fill{z-index:2;background:linear-gradient(90deg,#198754,#20c997);border-radius:3px;width:0%;height:100%;transition:width .5s;position:relative}.warmup-demo .warmup-fill:after{content:"Layer Data Loading";color:#fff;white-space:nowrap;font-size:12px;font-weight:700;position:absolute;top:50%;left:50%;transform:translate(-50%,-50%)}.warmup-demo .timing{text-align:center;min-height:30px;margin:15px 0;font-size:24px;font-weight:700}.warmup-demo .no-warmup .timing{color:#d63384}.warmup-demo .with-warmup .timing{color:#198754}.warmup-demo .controls{text-align:center;margin:30px 0}.warmup-demo .btn{color:#fff;cursor:pointer;background:#0d6efd;border:none;border-radius:6px;margin:0 10px;padding:12px 24px;font-size:16px;transition:background .3s}.warmup-demo .btn:hover{background:#0b5ed7}.warmup-demo .btn:disabled{cursor:not-allowed;background:#6c757d}.warmup-demo .description{background:#f8f9fa;border-radius:6px;margin-top:15px;padding:15px;font-size:14px;line-height:1.5}.warmup-demo .phase-indicator{color:#666;text-align:center;min-height:20px;margin-top:10px;font-size:14px}.warmup-demo .layer-counter{text-align:center;color:#495057;margin:10px 0;font-size:16px}</style>
|
| 4410 |
|
| 4411 |
<div class=warmup-demo>
|
|
|
|
| 4469 |
<p>This provides an OpenAI-compatible API with features like <a href="https://github.com/huggingface/transformers/pull/38085">continuous batching</a> (also check <a href="https://github.com/huggingface/transformers/pull/40426">here</a>) for better GPU utilization.</p>
|
| 4470 |
<p>Continuous batching is in itself very much linked to the great work of vLLM with the <code>paged attention kernel</code>, further justifying the facilitation of <a href="#community-kernels">external kernels</a>.</p>
|
| 4471 |
<h2>Community reusability</h2>
|
| 4472 |
+
<p>Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be <em>reused</em> at large by the open-source ecosystem.</p>
|
| 4473 |
+
<p>Adding a model to transformers means:</p>
|
| 4474 |
<ul>
|
| 4475 |
<li>having it immediately available to the community</li>
|
| 4476 |
+
<li>having it immediately usable in vLLM, <a href="https://huggingface.co/blog/transformers-backend-sglang">SGLang</a>, and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures <a href="https://blog.vllm.ai/2025/04/11/transformers-backend.html">as seen in this great vLLM x HF blog post.</a></li>
|
| 4477 |
</ul>
|
| 4478 |
<p>This cements the need even more for a <a href="#consistent-public-surface">consistent public surface</a>: we are now a backend, and thereโs more optimized software than us to handle serving. At the time of writing, more effort is done in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast), <a href="https://github.com/huggingface/transformers/pull/40696/files">here for GLM4 video support</a>, and here for <a href="https://github.com/huggingface/transformers/pull/40132">MoE support</a> for instance.</p>
|
| 4479 |
<h2>What is coming next</h2>
|