When Greg Brockman demoed GPT-4 by hand-sketching a joke website on a piece of paper and asking the system to convert it into an HTML webpage, it blew my mind.

Can you build your own screenshot-to-HTML system with far fewer resources?

With this new resource, most likely yes! Current vision-language models can learn this task with the right data (and the right tricks).

We have iterated on WebSight-v0.1 and are releasing v0.2.
WebSight is an open dataset of synthetically generated webpages paired with their rendered screenshots.

A few noticeable improvements:
- 💨From traditional CSS to Tailwind CSS. Tailwind embeds the styling directly in the HTML class attribute, which is much more compact (see the sketch after this list)
- 🚛2M pairs of synthetic HTML webpages with their rendered screenshots, along with the LLM-generated prompt used to create each webpage
- 🖼️Much more visually appealing pages with the integration of real images
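
To make the Tailwind point concrete, here is a minimal sketch of how one HTML/screenshot pair can be produced: all the styling sits in class attributes, and a headless browser renders the page. This is an illustration, not the actual WebSight pipeline; it assumes Playwright is installed and uses the Tailwind CDN build.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# Tailwind utilities live directly in each tag's class attribute: no stylesheet.
html = """
<html>
  <head><script src="https://cdn.tailwindcss.com"></script></head>
  <body class="bg-gray-50">
    <div class="mx-auto mt-10 max-w-md rounded-xl bg-white p-6 shadow">
      <h1 class="text-2xl font-bold text-gray-900">Hello, WebSight</h1>
      <p class="mt-2 text-gray-600">One compact class string instead of CSS rules.</p>
    </div>
  </body>
</html>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.set_content(html, wait_until="networkidle")  # let the CDN script apply styles
    page.screenshot(path="screenshot.png", full_page=True)
    browser.close()
```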

👀Blog: https://huggingface.co/blog/websight
💽Dataset: HuggingFaceM4/WebSight
📜Technical report: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
🎮Want to create your own synthetic data pipelines? A starting point: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing
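
If the notebook is too much to start with, the core generation step is tiny. Here is a hedged sketch, assuming any instruction-tuned LLM behind the transformers text-generation pipeline; the model name and the bakery concept are placeholders, not what WebSight actually used:

```python
from transformers import pipeline

# Placeholder generator: swap in whichever instruction-tuned LLM you have access to.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

concept = "a landing page for a small artisanal bakery"  # hypothetical website concept
prompt = (
    f"Write a complete, self-contained HTML page using Tailwind CSS for {concept}. "
    "Return only the HTML."
)
out = generator(prompt, max_new_tokens=1024, do_sample=True, temperature=0.7)
html = out[0]["generated_text"][len(prompt):]  # the pipeline echoes the prompt; strip it
# Render `html` to a screenshot (e.g. with the Playwright snippet above) to get a pair.
```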

Built with @HugoLaurencon & @Leyo
Can you beat an AI at Raven puzzles?

HuggingFaceM4/ai_raven

The most powerful vision+language AI systems, like Gemini or GPT-4V, struggle with this problem when used out of the box (How Far Are We from Intelligent Visual Deductive Reasoning? (2403.04732)).

But when properly trained, a small ~8B model can be very accurate at these IQ tests, based solely on visual inputs!

Raven's Progressive Matrices are visual intelligence tests, invented in the 1930s, designed to measure abstract reasoning and problem-solving ability. Each test item is a matrix of patterns with one part missing; the test-taker must identify the missing piece from a set of options.

Such puzzles can be procedurally generated at scale. HuggingFaceM4/RAVEN is one example. The complexity of the puzzles is then controlled by the complexity of the generation procedure.
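
To give an idea of what "procedurally generated" means here, a toy sketch (far simpler than RAVEN's shape grammars) where a single count-progression rule already yields unlimited puzzles, and the step size acts as a crude complexity knob:

```python
import random

def generate_puzzle(rng: random.Random, max_step: int = 3):
    """Toy 3x3 Raven-style matrix over one attribute (an object count).

    Each row increases by a fixed step; the bottom-right cell is removed
    and must be picked from four candidate answers. max_step is a crude
    complexity knob, standing in for RAVEN's much richer rule set.
    """
    step = rng.randint(1, max_step)
    rows = [[s, s + step, s + 2 * step] for s in (rng.randint(1, 4) for _ in range(3))]
    answer = rows[2][2]
    rows[2][2] = None  # the missing piece
    options = {answer}
    while len(options) < 4:  # distractors that break the row rule
        candidate = answer + rng.randint(-3, 3)
        if candidate > 0:
            options.add(candidate)
    options = sorted(options)
    return rows, options, options.index(answer)

grid, options, label = generate_puzzle(random.Random(0))
print(grid, options, "correct index:", label)
```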

We fine-tuned an early checkpoint of our upcoming vision-and-language model idefics2 on that dataset. The resulting checkpoint yields ~91% accuracy! No chain of thought, no pre-processing of the image, no additional inputs or metadata: just the RAVEN problem fed to the model as a standalone image (and a short instruction: “Which figure should complete the logical sequence?”), with standard cross-entropy as the training objective.
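
For readers who want to reproduce that recipe once idefics2 is released, here is a rough sketch of the training signal with the transformers API. The checkpoint name is a placeholder (the model is unreleased at the time of writing), and the exact chat/prompt formatting will depend on the final processor:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

ckpt = "HuggingFaceM4/idefics2-8b"  # placeholder name: not released yet
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForVision2Seq.from_pretrained(ckpt, torch_dtype=torch.bfloat16)

puzzle = Image.open("raven_puzzle.png")  # the full matrix + answer options as one image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Which figure should complete the logical sequence?"},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=False)
text += " 3"  # hypothetical gold answer appended as the generation target

inputs = processor(text=text, images=[puzzle], return_tensors="pt")
inputs["labels"] = inputs["input_ids"].clone()  # standard next-token cross-entropy
# (in a real run, mask the prompt and image-token positions in labels with -100)
loss = model(**inputs).loss  # backpropagate this in your training loop
```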

Yet more evidence that, for many well-scoped problems, you will be better off paying to collect and annotate data and fine-tuning a model on it (i.e., building your own AI) than wastefully throwing a gigantic general-purpose model, called through a paid API, at the problem!