Molbap (HF Staff) committed (verified)
Commit cabd939 · Parent(s): 231c051

Update content/article.md

Files changed (1):
  1. content/article.md +7 -7
content/article.md CHANGED
@@ -264,7 +264,7 @@ So what do we see? Llama is a basis for many models, and it shows.
 Radically different architectures such as Mamba have spawned their own dependency subgraph.
 [code relatedness](d3_dependency_graph.html)
 
- ![[Pasted image 20250729153809.png]]
+ ![[graoh_modular_related_models.png]]
 
 But there is no similar miracle for VLMs across the board.
 As you can see, there is a small DETR island, a little Llava pocket, and so on, but it is not comparable to the centrality observed around Llama.
@@ -278,7 +278,7 @@ So I looked into Jaccard similarity, which we use to measure set differences. I
 
 {{TERMINAL}}
 
- ![[Pasted image 20250728175655.png]]
+ ![[Jaccard_similarity_plot.png]]
 
 The yellow areas are places where models are very different to each other. We can see islands here and there corresponding to model families. Llava goes with Llava-onevision, LlavaNext, LlavaNext-video, etc.
 ## VLM improvements, avoiding abstraction
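To make the measurement concrete, here is a minimal sketch of a Jaccard comparison between two modeling files. It is not the exact script behind the heatmap above; the helper names (`identifiers`, `jaccard`) and the file paths are illustrative.

```python
import re
from pathlib import Path

def identifiers(path: Path) -> set[str]:
    # Treat a modeling file as a set of identifiers (crude, but fast).
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", path.read_text()))

def jaccard(a: set[str], b: set[str]) -> float:
    # Jaccard index: |A ∩ B| / |A ∪ B|; 1.0 means identical sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Illustrative paths inside a transformers checkout.
llava = identifiers(Path("src/transformers/models/llava/modeling_llava.py"))
llava_next = identifiers(Path("src/transformers/models/llava_next/modeling_llava_next.py"))
print(f"Jaccard similarity: {jaccard(llava, llava_next):.2f}")
```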
@@ -296,7 +296,7 @@ But this is breaking [Standardize, don't abstract](#standardize-dont-abstract).
 
 This is the current state of abstractions across a modeling file:
 
- ![[Pasted image 20250728181550.png]]
+ ![[Bloatedness_visualizer.png]]
 
 The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:
 
@@ -350,14 +350,13 @@ So the question arises naturally: How can we modularize more?
 I took a similarity measure again and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. Understandably, [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
 
 
- ![[Pasted image 20250729174627.png]]
+ ![[modular_candidates.png]]
 
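For the embedding-based option, a minimal sketch of what such a comparison could look like with SentenceTransformers is below. The model name and the two code snippets are placeholders, not what the Space actually runs.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder embedding model; the Space may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

snippets = [
    "class LlavaForConditionalGeneration(nn.Module): ...",
    "class LlavaNextForConditionalGeneration(nn.Module): ...",
]
embeddings = model.encode(snippets)

# Cosine similarity between the two code snippets.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```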
 ## <a id="encoders-ftw"></a> Encoders win!
 
 Model popularity speaks for itself! This is, of course, because encoders are used above all to produce embeddings. So we have to keep the encoders part viable, usable, and fine-tune-able.
 
- ![[Pasted image 20250728175753.png]]
-
+ ![[popular_models_barplot.png]]
 ## On image processing and processors
 
 Choosing to be `torch`-first software meant shedding a tremendous amount of support for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where images were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
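As a rough sketch of why this pays off (not the actual fast image processor code), resizing and normalization can stay on `torch` tensors end to end instead of round-tripping through numpy or PIL; the sizes and normalization values below are arbitrary.

```python
import torch
from torchvision.transforms.v2 import functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# A batch of images already held as a uint8 tensor (B, C, H, W).
images = torch.randint(0, 256, (8, 3, 640, 480), dtype=torch.uint8, device=device)

# Resize + rescale + normalize without ever leaving torch.
images = F.resize(images, [336, 336])
images = images.to(torch.float32) / 255.0
images = F.normalize(images, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
```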
@@ -387,7 +386,8 @@ Because it is all PyTorch (even more so now that we support only PyTorch)
 
 It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our core guideline, [source of truth for model definitions](#source-of-truth).
 
- ![[Pasted image 20250813175317.png]]
+ ![[model_debugger.png]]
+
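For intuition only, this is roughly what intercepting forward calls looks like in plain PyTorch; it is not the transformers debugging utility itself, and the checkpoint and dummy inputs are placeholders.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # placeholder checkpoint
captured = {}

def make_hook(name):
    def hook(module, args, output):
        # Record each submodule's output so it can be diffed against a reference run.
        captured[name] = output
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]

input_ids = torch.randint(0, model.config.vocab_size, (1, 8))
model(input_ids)

for h in handles:
    h.remove()
print(f"captured {len(captured)} intermediate outputs")
```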
 ### Transformers-serve
 
 Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAPI-like pattern.
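As a hedged sketch of that pattern: assuming a local transformers-serve instance is listening on port 8000 and exposes an OpenAI-style chat completions route (the port, route, and model id here are assumptions, not documented guarantees), a client could talk to it like this.

```python
import requests

# Assumed local endpoint; adjust host, port, and route to match your transformers-serve setup.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```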
 