Victor Sanh (VictorSanh)
VictorSanh's activity

posted an update 1 day ago
Glad to see Idefics2 making its way into the awesome OpenVLM Leaderboard, which ranks VLMs. 🏆
2nd in its category (<10B parameters and open weights)!

While InternLM-XComposer2 uses proprietary data, Idefics2 is built solely using openly available data.

Leaderboard: opencompass/open_vlm_leaderboard
Model: HuggingFaceM4/idefics2-8b
posted an update 8 days ago
Can't wait to see a multimodal Llama 3!

We released a resource that might come in handy: The Cauldron 🍯

The Cauldron is a massive manually curated collection of 50 vision-language datasets for instruction fine-tuning: 3.6M images and 30.3M query/answer pairs.

It covers a large variety of downstream uses: visual question answering on natural images, OCR, document/charts/figures/tables understanding, textbook/academic questions, reasoning, captioning, spotting differences between two images, and screenshot-to-code.

HuggingFaceM4/the_cauldron
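A minimal sketch of how one Cauldron sample might be unrolled into chat turns for fine-tuning. The layout of the texts field (a list of user/assistant pairs) and the subset name "vqav2" are assumptions based on dataset-card conventions, not guaranteed:

```python
def to_chat_turns(texts):
    """Flatten a list of {"user": ..., "assistant": ...} Q/A pairs
    (the assumed layout of a Cauldron sample's "texts" field)
    into alternating chat turns."""
    turns = []
    for qa in texts:
        turns.append({"role": "user", "content": qa["user"]})
        turns.append({"role": "assistant", "content": qa["assistant"]})
    return turns

if __name__ == "__main__":
    # Requires the `datasets` library; streaming avoids downloading
    # the full 3.6M-image collection just to peek at one sample.
    from datasets import load_dataset

    ds = load_dataset("HuggingFaceM4/the_cauldron", "vqav2",
                      split="train", streaming=True)
    sample = next(iter(ds))
    print(to_chat_turns(sample["texts"])[:2])
```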
posted an update 12 days ago
New open multimodal model in town: Idefics2!

💪 Strong 8B-parameter model: often on par with open 30B counterparts.
🔓Open license: Apache 2.0.
🚀 Strong improvement over Idefics1: +12 points on VQAv2, +30 points on TextVQA while having 10x fewer parameters.
📚 Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
🕵️‍♀️ Transparent training data: inspect and build upon all the data (10s of TB of data) we trained on.
🔲 More natural image processing: incorporating strategies to process images at their native resolution and native aspect ratio.
📸 High-resolution images: supporting image resolutions up to 980 x 980, with strategies that allow trading computational efficiency for performance.
😎 2 checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. Chat version to come.

Resources: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blogpost: https://huggingface.co/blog/idefics2
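A minimal inference sketch, assuming the transformers AutoProcessor / AutoModelForVision2Seq interface for Idefics2; the local file chart.png and the question are placeholders:

```python
def build_messages(question):
    """Idefics2 chat format: interleave an image placeholder with text
    in a single user turn."""
    return [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": question}]}]

if __name__ == "__main__":
    # Requires `transformers`, `torch`, and enough GPU memory for 8B weights.
    import torch
    from PIL import Image
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16, device_map="auto")

    image = Image.open("chart.png")  # placeholder input image
    prompt = processor.apply_chat_template(
        build_messages("What does this chart show?"),
        add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(processor.batch_decode(out, skip_special_tokens=True)[0])
```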
posted an update about 1 month ago
When Greg Brockman demoed GPT4 by hand-sketching a joke website on a piece of paper and asking the system to convert it into an HTML webpage, it blew my mind.

Can you build your own Screenshot-to-HTML system with far fewer resources?

With this new resource, most likely yes! Current vision-language models can learn this task with the right data (and the right tricks).

We have iterated on WebSight-v0.1 and are releasing its v0.2.
WebSight is an open dataset of synthetically generated webpages with their corresponding rendered screenshot.

A few noticeable improvements:
- 💨From traditional CSS to Tailwind CSS. Tailwind embeds the styling directly in the HTML class attribute and is much more compact
- 🚛2M pairs of synthetic HTML webpages with their associated rendered screenshot, along with the prompt generated by an LLM to create that webpage
- 🖼️Much more visually appealing pages with the integration of real images

👀Blog: https://huggingface.co/blog/websight
💽Dataset: HuggingFaceM4/WebSight
📜Technical report: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
🎮Want to create your own synthetic data pipelines? A starting point: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing

Built with @HugoLaurencon & @Leyo
posted an update about 2 months ago
Can you beat an AI at Raven puzzles?

HuggingFaceM4/ai_raven

The most powerful vision+language AI systems like Gemini or GPT4V struggle with this problem when used out-of-the-box ( How Far Are We from Intelligent Visual Deductive Reasoning? (2403.04732)).

But when properly trained, a small ~8B model can be very accurate at these IQ tests, solely based on visual inputs!

Raven's Progressive Matrices are visual intelligence tests invented in the 1930s and designed to measure abstract reasoning and problem-solving ability. The test consists of a series of matrices or patterns with one part missing; the test-taker must identify the missing piece from a set of options.

Such puzzles can be procedurally generated at scale. HuggingFaceM4/RAVEN is one example. The complexity of the puzzles is then controlled by the complexity of the generation procedure.
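A toy sketch of that procedural-generation idea, with a far simpler rule than real RAVEN generators (each row is a permutation of the same three shapes, so the blanked cell is fully determined by the other two cells in its row):

```python
import random

def make_puzzle(rng):
    """Toy Raven-style 3x3 matrix: each row is a permutation of the same
    three shapes, so the blanked bottom-right cell is determined by its row.
    Real generators vary shape, size, color, and count under richer rules."""
    shapes = ["circle", "square", "triangle"]
    grid = [rng.sample(shapes, 3) for _ in range(3)]
    answer = grid[2][2]
    grid[2][2] = None  # the missing piece the test-taker must identify
    return grid, shapes, answer

grid, options, answer = make_puzzle(random.Random(0))
```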

We fine-tuned an early checkpoint of our upcoming vision-and-language model idefics2 on that dataset. The resulting checkpoint yields ~91% accuracy! No chain of thought, no pre-processing of the image, no additional inputs or metadata: just the RAVEN problem fed to the model as a standalone image (plus a short instruction, “Which figure should complete the logical sequence?”), with the standard cross-entropy training objective.

Yet more evidence that in many cases, for a given well-scoped problem, you will be better off paying to collect & annotate data and fine-tuning a model on that data (i.e., building your own AI) than wastefully trying to solve that problem with a gigantic general-purpose model you call through a paid API!
posted an update about 2 months ago
An increasing number of engineers and researchers are developing foundation models. Navigating the tools, resources, codebases, and best-practice guides is daunting for new contributors.

Introducing the Foundation Model Development Cheatsheet, a succinct guide with 250+ resources & tools for:
📖 sourcing data
🔍 documenting & audits
🌍 environmental impact
🥊 risks & harms eval
🎮 release & monitoring

https://fmcheatsheet.org/

👐 What tools & resources should appear in that cheatsheet? Contributions encouraged!

This is the result of a large collaboration between many organizations promoting open science, spearheaded by @Shayne 🔥
replied to clem's post 3 months ago

Something I am very excited about with synthetic data is the increased ability to tune the data so that it looks exactly the way you want.

We typically spend a lot of time filtering web-scale data by building heuristics that detect "poor-quality" samples. With control over the data creation process, you can quickly tune the generation to give the data specific properties. Often it's just a matter of telling your model to do X and not to do Y.
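A minimal sketch of the kind of heuristics meant here, with made-up thresholds: flag documents that are too short or dominated by non-alphanumeric symbols:

```python
def looks_low_quality(text, max_symbol_ratio=0.3, min_words=5):
    """Toy web-data filter: flag documents that are too short or
    dominated by non-alphanumeric symbols. Thresholds are illustrative."""
    words = text.split()
    if len(words) < min_words:
        return True
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) > max_symbol_ratio

docs = [
    "Login | >>> $$$ ### !!!",  # navigation/markup debris
    "A clear paragraph explaining how the model was trained on curated data.",
]
kept = [d for d in docs if not looks_low_quality(d)]
```

With synthetic data, the point above is that you steer the generator instead of writing ever more filters like this after the fact.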