Molbap (HF Staff) committed (verified)
Commit cabd939 · Parent(s): 231c051

Update content/article.md

Files changed (1):
  1. content/article.md +7 -7
content/article.md CHANGED
@@ -264,7 +264,7 @@ So what do we see? Llama is a basis for many models, and it shows.
 Radically different architectures such as Mamba have spawned their own dependency subgraph.
 [code relatedness](d3_dependency_graph.html)
 
- ![[Pasted image 20250729153809.png]]
+ ![[graoh_modular_related_models.png]]
 
 But there is no similar miracle for VLMs across the board.
 As you can see, there is a small DETR island, a little Llava pocket, and so on, but it is not comparable to the centrality observed around Llama.
@@ -278,7 +278,7 @@ So I looked into Jaccard similarity, which we use to measure set differences. I
 
 {{TERMINAL}}
 
- ![[Pasted image 20250728175655.png]]
+ ![[Jaccard_similarity_plot.png]]
 
 The yellow areas are places where models are very different to each other. We can see islands here and there corresponding to model families. Llava goes with Llava-onevision, LlavaNext, LlavaNext-video, etc.
 ## VLM improvements, avoiding abstraction
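To make the measurement concrete, here is a minimal sketch of a Jaccard comparison between two modeling files. It is not the exact script behind the heatmap above; the helper names (`identifiers`, `jaccard`) and the file paths are illustrative.

```python
import re
from pathlib import Path

def identifiers(path: Path) -> set[str]:
    # Treat a modeling file as a set of identifiers (crude, but fast).
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", path.read_text()))

def jaccard(a: set[str], b: set[str]) -> float:
    # Jaccard index: |A ∩ B| / |A ∪ B|; 1.0 means identical sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Illustrative paths inside a transformers checkout.
llava = identifiers(Path("src/transformers/models/llava/modeling_llava.py"))
llava_next = identifiers(Path("src/transformers/models/llava_next/modeling_llava_next.py"))
print(f"Jaccard similarity: {jaccard(llava, llava_next):.2f}")
```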
@@ -296,7 +296,7 @@ But this is breaking [Standardize, don't abstract](#standardize-dont-abstract).
 
 This is the current state of abstractions across a modeling file:
 
- ![[Pasted image 20250728181550.png]]
+ ![[Bloatedness_visualizer.png]]
 
 The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:
 
@@ -350,14 +350,13 @@ So the question arises naturally: How can we modularize more?
 I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. Understandably, [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
 
 
- ![[Pasted image 20250729174627.png]]
+ ![[modular_candidates.png]]
 
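For the embedding-based option, a minimal sketch of what such a comparison could look like with SentenceTransformers is below. The model name and the two code snippets are placeholders, not what the Space actually runs.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder embedding model; the Space may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

snippets = [
    "class LlavaForConditionalGeneration(nn.Module): ...",
    "class LlavaNextForConditionalGeneration(nn.Module): ...",
]
embeddings = model.encode(snippets)

# Cosine similarity between the two code snippets.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```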
 ## <a id="encoders-ftw"></a> Encoders win!
 
 Model popularity speaks for itself! This is, of course, because encoders are used above all to produce embeddings. So we have to keep the encoders part viable, usable, and fine-tune-able.
 
- ![[Pasted image 20250728175753.png]]
-
+ ![[popular_models_barplot.png]]
 ## On image processing and processors
 
 Choosing to be `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
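As a rough sketch of why this pays off (not the actual fast image processor code), resizing and normalization can stay on `torch` tensors end to end instead of round-tripping through numpy or PIL; the sizes and normalization values below are arbitrary.

```python
import torch
from torchvision.transforms.v2 import functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# A batch of images already held as a uint8 tensor (B, C, H, W).
images = torch.randint(0, 256, (8, 3, 640, 480), dtype=torch.uint8, device=device)

# Resize + rescale + normalize without ever leaving torch.
images = F.resize(images, [336, 336])
images = images.to(torch.float32) / 255.0
images = F.normalize(images, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
```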
@@ -387,7 +386,8 @@ Because it is all PyTorch (even more so now that we support only PyTorch)
 
 It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our core guideline, [source of truth for model definitions](#source-of-truth).
 
- ![[Pasted image 20250813175317.png]]
+ ![[model_debugger.png]]
+
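For intuition only, this is roughly what intercepting forward calls looks like in plain PyTorch; it is not the transformers debugging utility itself, and the checkpoint and dummy inputs are placeholders.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # placeholder checkpoint
captured = {}

def make_hook(name):
    def hook(module, args, output):
        # Record each submodule's output so it can be diffed against a reference run.
        captured[name] = output
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]

input_ids = torch.randint(0, model.config.vocab_size, (1, 8))
model(input_ids)

for h in handles:
    h.remove()
print(f"captured {len(captured)} intermediate outputs")
```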
 ### Transformers-serve
 
 Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAPI-like pattern.
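As a hedged sketch of that pattern: assuming a local transformers-serve instance is listening on port 8000 and exposes an OpenAI-style chat completions route (the port, route, and model id here are assumptions, not documented guarantees), a client could talk to it like this.

```python
import requests

# Assumed local endpoint; adjust host, port, and route to match your transformers-serve setup.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```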
 