arxiv:2403.09029

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Published on Mar 14 · Featured in Daily Papers on Mar 15

Abstract

Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: given a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into corresponding HTML code has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML code and corresponding screenshots. We fine-tune a foundational VLM on our dataset and demonstrate its proficiency in converting webpage screenshots to functional HTML code. To accelerate research in this area, we open-source WebSight.
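Since the dataset is open-sourced, here is a minimal sketch of how one might inspect a few screenshot/HTML pairs with the `datasets` library. The repo id `HuggingFaceM4/WebSight` and the `image`/`text` column names are assumptions based on the Hugging Face Hub listing, not details stated in the abstract.

```python
from datasets import load_dataset

# Stream the dataset so the full 2M pairs are not downloaded up front.
# Assumed repo id and split; adjust to the actual Hub release if they differ.
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)

# Each record is assumed to pair a rendered screenshot with the HTML that
# produced it, under the "image" and "text" columns respectively.
example = next(iter(ds))
example["image"].save("screenshot.png")  # PIL image of the rendered page
print(example["text"][:500])             # first 500 chars of the HTML source
```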

Community

Congrats on the great work! Our arXiv paper https://arxiv.org/abs/2305.14637, from one year ago, is one of the earliest works addressing the same problem. Looking forward to more work on the topic!

Paper author

Thanks @zhoutianyi for the reference; we indeed missed your paper.
We'll add it to the related work section if we update this technical report after the next iteration!

