Update content/article.md
So what do we see? Llama is a basis for many models, and it shows.
Radically different architectures such as mamba have spawned their own dependency subgraph.
[code relatedness](d3_dependency_graph.html)
![[graoh_modular_related_models.png]]
But there is no similar miracle for VLMs across the board.
As you can see, there is a small DETR island, a little Llava pocket, and so on, but it's not comparable to the centrality observed around Llama.
{{TERMINAL}}
![[Jaccard_similarity_plot.png]]
The yellow areas are places where models are very different from each other. We can see islands here and there corresponding to model families: Llava goes with Llava-onevision, LlavaNext, LlavaNext-video, and so on.
## VLM improvements, avoiding abstraction
This is the current state of abstractions across a modeling file:
![[Bloatedness_visualizer.png]]
The following [Pull request to standardize placeholder masking](https://github.com/huggingface/transformers/pull/39777) is a good example of the kind of change that is acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it boils down to scattering the vision embeddings into the positions held by the image placeholder tokens.
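
A minimal sketch of that pattern, assuming a standalone helper (the name, signature, and checks below are illustrative, not the PR's actual code):

```python
import torch

def insert_encoder_embeddings(
    input_ids: torch.LongTensor,   # (batch, seq_len) token ids
    inputs_embeds: torch.Tensor,   # (batch, seq_len, hidden) text embeddings
    encoder_embeds: torch.Tensor,  # (num_placeholders, hidden), e.g. vision features
    placeholder_token_id: int,     # id of the <image> / <video> placeholder token
) -> torch.Tensor:
    """Scatter encoder embeddings into the slots occupied by placeholder tokens."""
    # Boolean mask of placeholder positions, broadcast over the hidden dimension.
    mask = (input_ids == placeholder_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    # Each placeholder token must receive exactly one encoder embedding.
    num_slots = int(mask[..., 0].sum())
    if num_slots != encoder_embeds.shape[0]:
        raise ValueError(
            f"Got {num_slots} placeholder tokens but {encoder_embeds.shape[0]} encoder embeddings."
        )
    encoder_embeds = encoder_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
    return inputs_embeds.masked_scatter(mask, encoder_embeds)
```

Sharing one such helper across VLMs, instead of a slightly different copy in every modeling file, is the kind of de-duplication this change points toward.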
So the question arises naturally: How can we modularize more?

I again took a similarity measure and looked at the existing graphs. The tool is available on this [ZeroGPU-enabled Space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor). It scans the whole transformers repository and outputs a graph of candidates across models, using either a Jaccard similarity index (simple) or a SentenceTransformers embedding model. It is understandable that [encoder models still have the lion's share of the game](#encoders-ftw). See also [Tom Aarsen and Arthur Bresnu's great blog post on sparse embeddings](https://huggingface.co/blog/train-sparse-encoder).
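
As a rough illustration of the simpler of those two measures, here is a sketch of a Jaccard-based scan over modeling files; the directory layout, identifier-level tokenization, and threshold are assumptions, not the Space's actual implementation:

```python
import re
from itertools import combinations
from pathlib import Path

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two identifier sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def identifiers(path: Path) -> set[str]:
    # Crude signal: the set of identifiers appearing in a modeling file.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", path.read_text(encoding="utf-8")))

def modular_candidates(models_dir: str, threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Return pairs of models whose modeling files overlap above `threshold`."""
    files = {p.parent.name: identifiers(p) for p in Path(models_dir).glob("*/modeling_*.py")}
    pairs = []
    for a, b in combinations(sorted(files), 2):
        score = jaccard(files[a], files[b])
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return sorted(pairs, key=lambda t: -t[2])

# e.g. modular_candidates("src/transformers/models") on a local checkout
```

High-scoring pairs are natural candidates for a shared `modular_*.py` definition.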
![[modular_candidates.png]]
## <a id="encoders-ftw"></a> Encoders win!
Model popularity speaks for itself! This is because the main use of encoders is, obviously, producing embeddings. So we have to keep the encoder side viable, usable, and fine-tunable.
![[popular_models_barplot.png]]
## On image processing and processors
Choosing to be a `torch`-first library meant shedding a tremendous amount of support code for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch`- and `torchvision`-native inputs allowed us to massively speed up the processing time for each model.
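
As a sketch of what this looks like from the user side (the checkpoint name is only an example, and this assumes it ships a fast, torchvision-backed processor), `use_fast=True` selects the fast class and the processor consumes `torch` tensors directly:

```python
import torch
from transformers import AutoImageProcessor

# use_fast=True picks the torchvision-backed image processor when the checkpoint has one.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)

# Fast processors accept torch tensors directly, avoiding PIL/NumPy round-trips;
# here, two fake uint8 images in (channels, height, width) layout.
images = [torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8) for _ in range(2)]
batch = processor(images=images, return_tensors="pt")
print(batch["pixel_values"].shape)  # resizing, rescaling and normalization happen on the torch side
```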
The model debugger just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our core guideline, [source of truth for model definitions](#source-of-truth).
![[model_debugger.png]]
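
As a minimal, hand-rolled sketch of the same idea rather than the library's actual debugger (every name below is illustrative), plain PyTorch forward hooks are enough to record per-module output statistics from one forward pass and diff them against a reference implementation:

```python
import torch

def capture_layer_stats(model: torch.nn.Module, inputs: dict) -> dict:
    """Run one forward pass and record (shape, mean, std) of every submodule's output."""
    records, handles = {}, []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if isinstance(out, torch.Tensor):
                # Summary statistics are usually enough to spot where two ports diverge.
                records[name] = (tuple(out.shape), out.float().mean().item(), out.float().std().item())
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        for handle in handles:
            handle.remove()
    return records

# Compare capture_layer_stats(ported_model, inputs) with the same capture on the
# reference implementation to find the first layer where the outputs drift apart.
```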
### Transformers-serve
Having all these models readily available allows using all of them with transformers-serve, and enables interfacing with them through an OpenAI-like API.
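
For instance, assuming a transformers-serve instance running locally and exposing an OpenAI-compatible chat completions route (the URL, port, and model name below are placeholders):

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder URL and port
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",    # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Any client that already speaks that API shape should work unchanged against the local server.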
|