- 💪 Strong 8B-parameter model: often on par with open 30B counterparts.
- 🔓 Open license: Apache 2.0.
- 🚀 Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
- 📚 Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
- 🕵️‍♀️ Transparent training data: inspect and build upon all the data (tens of TB) we trained on.
- 🔲 More natural image processing: incorporating strategies to treat images in their native resolution and native aspect ratio.
- 📸 High-resolution images: image resolutions up to 980 x 980, with strategies that allow trading computational efficiency for performance.
- 😎 2 checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. A chat version is to come.
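To give a concrete feel for the instruction fine-tuned checkpoint, here is a minimal inference sketch with 🤗 transformers. The Hub id `HuggingFaceM4/idefics2-8b`, the image URL, and the chat-template flow are assumptions based on standard Hugging Face conventions; double-check them against the model card.

```python
# Minimal inference sketch for the instruction fine-tuned checkpoint.
# Assumptions: the model is published as "HuggingFaceM4/idefics2-8b" and is
# supported by AutoProcessor / AutoModelForVision2Seq; verify against the
# model card before relying on this.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"  # assumed Hub id
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint, device_map="auto")

# Placeholder image URL; swap in any chart/figure you want to ask about.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Build a single-turn conversation with one image and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```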
When Greg Brockman demoed GPT-4 by hand-sketching a joke website on a piece of paper and asking the system to convert it into an HTML webpage, it blew my mind.
Can you build your own screenshot-to-HTML system with far fewer resources?
With this new resource, most likely yes! Current vision-language models can learn this task with the right data (and the right tricks).
We have iterated on WebSight v0.1 and are now releasing v0.2. WebSight is an open dataset of synthetically generated webpages paired with their rendered screenshots.
A few notable improvements (see the loading sketch after this list):
- 💨 From traditional CSS to Tailwind CSS. Tailwind is CSS embedded directly in the HTML `class` attribute and is much more compact.
- 🚛 2M pairs of synthetic HTML webpages with their associated rendered screenshots, along with the prompt generated by an LLM to create each webpage.
- 🖼️ Much more visually appealing pages thanks to the integration of real images.
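If you want to poke at the data yourself, here is a minimal sketch using the 🤗 datasets library. The Hub id `HuggingFaceM4/WebSight` and the column names (`image`, `text`, `llm_generated_idea`) are assumptions, so check the dataset card for the exact schema and whether a version config (e.g. "v0.2") must be passed.

```python
# Minimal sketch for browsing WebSight pairs.
# Assumptions: the dataset lives on the Hub as "HuggingFaceM4/WebSight" and
# exposes columns "image" (rendered screenshot), "text" (Tailwind HTML) and
# "llm_generated_idea" (the LLM prompt); verify against the dataset card.
from datasets import load_dataset

# Stream so we don't download the full 2M-pair dataset up front.
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)

for sample in ds.take(3):
    screenshot = sample["image"]         # PIL image of the rendered page
    html = sample["text"]                # Tailwind-based HTML source
    idea = sample["llm_generated_idea"]  # the prompt used to create the page
    print(idea)
    print(html[:300], "...")             # note the compact Tailwind classes
    print("screenshot size:", screenshot.size)
```

Printing the first few hundred characters of `text` is a quick way to see the Tailwind effect: styling lives in short utility classes on each element instead of a separate stylesheet, which keeps the HTML targets compact for a vision-language model to generate.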