diff --git "a/app/dist/index.html" "b/app/dist/index.html" --- "a/app/dist/index.html" +++ "b/app/dist/index.html" @@ -12,15 +12,16 @@ document.documentElement.setAttribute("data-theme", theme); } catch {} })(); - -

Maintain the unmaintainable:
1M python loc, 400+ models


Gemma3n graph

As you can see, the GenerationMixin node is already very heavy: it encompasses all of the utilities around .generate and, in the dependency graph, it is second only to nn.Module. That means every decision we make to abstract something else has to be taken extremely carefully.

The pull request that standardized placeholder masking is a good example of the kind of change that is acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions in the sequence, so we can share a single function that does it. For Qwen2 VL, for instance, the logic boils down to scattering the image embeddings into the placeholder positions of the text embeddings.
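
A minimal, self-contained sketch of that pattern (the helper name and shapes here are illustrative, not the literal Qwen2 VL code):

```python
import torch

def insert_multimodal_embeddings(
    input_ids: torch.LongTensor,   # (batch, seq_len) token ids
    inputs_embeds: torch.Tensor,   # (batch, seq_len, hidden) text embeddings
    image_embeds: torch.Tensor,    # (num_image_tokens, hidden) from the vision encoder
    image_token_id: int,
) -> torch.Tensor:
    # Boolean mask marking every placeholder position in the sequence.
    placeholder_mask = (input_ids == image_token_id).unsqueeze(-1)

    # Sanity check: the encoder must produce exactly one embedding per placeholder.
    n_placeholders = int(placeholder_mask.sum())
    if n_placeholders != image_embeds.shape[0]:
        raise ValueError(
            f"Got {image_embeds.shape[0]} image embeddings for {n_placeholders} placeholder tokens"
        )

    # Scatter the vision embeddings into the placeholder slots, leaving text untouched.
    placeholder_mask = placeholder_mask.expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(placeholder_mask, image_embeds.to(inputs_embeds.dtype))
```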


The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest. Next: power tools enabled by a consistent API.

Models popularity

Speaking of dependencies, we can look at download counts to gauge the popularity of transformers models. One thing that stands out is the prominence of encoders: their usage is driven by embeddings, just check out EmbeddingGemma for a modern recap. Hence, it is vital to keep the encoder side of the library viable, usable and fine-tunable.
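
In practice, the embedding use case is a handful of lines of transformers code: load an encoder checkpoint, run it, and pool the token states. A quick sketch (the checkpoint name is only an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any encoder checkpoint works here; this name is just an example.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["Maintain the unmaintainable.", "400+ models, one library."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over real tokens only, then L2-normalise to get sentence embeddings.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
print(embeddings.shape)  # (2, hidden_size)
```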


Attention visualisation

All models share the same internal API for attention computation, thanks to the externalisation of the attention classes. This allows us to build cool tools to visualize the inner workings of the attention mechanism.
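
Because every model returns its attention weights in the same format, such a tool only has to be written once. A rough sketch of the kind of probe it is built on (the checkpoint is just a convenient small example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM works; gpt2 is just small and fast
tokenizer = AutoTokenizer.from_pretrained(name)
# "eager" attention materialises the attention weights so they can be returned.
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer,
# regardless of which architecture produced it -- that is the shared API.
first_layer = out.attentions[0][0]  # (num_heads, seq_len, seq_len)
print(first_layer.shape)
# e.g. plt.imshow(first_layer[0]) gives the attention map of head 0
```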

One particular piece of machinery is the attention mask. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.
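
To picture that pattern, here is a toy construction of a prefix-LM style mask, bidirectional over the prefix and causal afterwards (an illustration, not the PaliGemma implementation):

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean mask where True means "this position may be attended to".

    The first `prefix_len` tokens (e.g. image + prompt text) attend to each
    other bidirectionally; everything after them is standard causal.
    """
    # Start from a causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Let every query inside the prefix see the whole prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
# tensor([[1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1]])
```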


Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth.” Next: CUDA warmup reduces load-time stalls without touching modeling semantics.

Cooking faster CUDA warmups

Having a clean external API allows us to work on the true inner workings of transformers. One recent addition was the CUDA warmup via caching_allocator_warmup, which massively improved loading by pre-allocating GPU memory up front to avoid malloc bottlenecks while the model loads: roughly a 7x speedup for an 8B model and 6x for a 32B one. You can check out the source!
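
The idea behind the warmup, roughly, is to make one large allocation per device up front so that PyTorch's caching allocator already holds the memory by the time the checkpoint shards are materialized. A sketch of that idea (illustrative only, not the actual caching_allocator_warmup code):

```python
import torch

def warm_up_cuda_allocator(total_param_bytes: int, device: torch.device) -> None:
    """Rough sketch of the caching-allocator warmup idea (not the library code).

    Allocating one large block and releasing it leaves the memory in PyTorch's
    caching allocator, so the many per-tensor allocations performed while
    loading checkpoint shards are served from the cache instead of each
    triggering a cudaMalloc call.
    """
    # One big allocation roughly the size of the model's parameters on this device.
    warmup = torch.empty(total_param_bytes, dtype=torch.uint8, device=device)
    # Dropping the tensor returns the block to the caching allocator, not to the driver.
    del warmup

if torch.cuda.is_available():
    # e.g. ~16 GB for an 8B model in bf16 (2 bytes per parameter) -- illustrative numbers.
    warm_up_cuda_allocator(8_000_000_000 * 2, torch.device("cuda:0"))
```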
