clem
posted an update Jan 16
Is synthetic data the future of AI? 🔥🔥🔥

@HugoLaurencon @Leyo & @VictorSanh are introducing HuggingFaceM4/WebSight, a multimodal dataset featuring 823,000 pairs of synthetically generated HTML/CSS code and screenshots of the corresponding rendered websites, for training GPT-4V-like models 🌐💻

While crafting their upcoming foundation vision language model, they faced the challenge of converting website screenshots into usable HTML/CSS code. Most VLMs suck at this, and there was no public dataset available for this specific task, so they decided to create their own.

They prompted existing LLMs to generate 823k HTML/CSS files for very simple websites. Through supervised fine-tuning of a vision language model on WebSight, they were able to generate the code to reproduce a website component, given a screenshot.

You can explore the dataset here: HuggingFaceM4/WebSight
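
If you want to poke at it programmatically, here's a minimal sketch using the `datasets` library, streaming so you don't pull all 823k pairs at once; the exact column names are my assumption, so check the dataset card:

```python
from datasets import load_dataset

# Stream the dataset rather than downloading all ~823k pairs up front.
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)

# Peek at one example; the exact fields (e.g. a rendered screenshot plus the
# HTML/CSS source) may differ from what you expect, so check the dataset card.
example = next(iter(ds))
print(example.keys())
```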

What do you think?

Yes, absolutely! We use synthetic data to create high-end open-source datasets like OpenOrca. We are beginning to use synthetic-data-driven results in reinforcement learning too.


🔥🔥🔥

I think it will be a big focus of 2024. I also believe there is a lot of scope for creative approaches to generating synthetic data that don't always rely on a big GPU budget (though this will also be important!). As an example, I created haiku_dpo, a synthetic dataset for making LLMs better at writing haiku. Development happened locally on a laptop, with <1 hour of Colab GPU time used at the end to generate a larger dataset. I think this will be an area where many community members can contribute through their creativity rather than their GPU budget.
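
To give a flavor of that kind of low-budget workflow (this is not the actual haiku_dpo pipeline), here's a minimal sketch that turns model-generated haiku candidates into DPO-style preference pairs using a rough local heuristic; the candidate texts and the 5-7-5 scoring rule are assumptions for the example:

```python
import re

def rough_syllables(line: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", line.lower()))) if line.strip() else 0

def haiku_score(text: str) -> int:
    """Lower is better: distance from the 5-7-5 syllable pattern."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    if len(lines) != 3:
        return 100  # heavily penalise anything that isn't three lines
    return sum(abs(rough_syllables(l) - t) for l, t in zip(lines, [5, 7, 5]))

def to_dpo_pairs(prompt: str, candidates: list[str]) -> list[dict]:
    """Rank candidates by the heuristic; pair best vs. worst as chosen/rejected."""
    ranked = sorted(candidates, key=haiku_score)
    return [{"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}]

# Toy usage: candidates would normally come from sampling an LLM several times.
candidates = [
    "old pond in the rain\na frog jumps into the water\nsound of a splash fades",
    "this is not a haiku at all, just one long rambling line of text",
]
print(to_dpo_pairs("Write a haiku about a frog and a pond.", candidates))
```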

VLMs aren't really my cup of tea, but maybe this will motivate me to get started, ngl!

Definitely. We created the H4rmony dataset by making GPT-4 formulate prompts, using ecolinguistic principles, to unveil the ecological stance of LLMs. Then we asked it to generate completions by playing roles (ecologically aware / unaware / ambivalent). The prompts/completions were ultimately verified by ecolinguists. We called this approach RLRHV, Reinforcement Learning by Role-playing and Human Verification. We RL'd various models with the H4rmony dataset, and the results are very encouraging.
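
Not the actual H4rmony pipeline, but a minimal sketch of the role-playing idea described above; the role descriptions and the generate() stub are assumptions for illustration, and in practice GPT-4 (or another LLM) would produce the completions before human verification:

```python
# Hypothetical sketch of role-conditioned completion generation; generate()
# stands in for a real LLM call and is not part of any library.
ROLES = {
    "aware": "You deeply consider ecological impact in every answer.",
    "unaware": "You ignore ecological considerations entirely.",
    "ambivalent": "You acknowledge ecological issues but treat them as minor.",
}

def generate(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: replace with an actual LLM call.
    return f"[completion written as: {system_prompt}]"

def role_play_completions(prompt: str) -> dict:
    """Produce one completion per ecological stance for later human verification."""
    return {role: generate(persona, prompt) for role, persona in ROLES.items()}

print(role_play_completions("Should the city expand its airport?"))
```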

Synthetic data will certainly play a big role in training future models. I'm interested to know more about the process of cleaning up synthetic data and what methods exist for identifying high-quality clusters for specific output behaviors within larger messy datasets.

Something I am very excited about with synthetic data is the increased ability to tune the data so that it looks exactly the way you want.

We typically spend a lot of time filtering web-scale data by building heuristics that detect "poor-quality" samples. With control over the data creation process, you can quickly tune the generation process to give some specific properties to the data. Often it's just about telling your model to do X, and not to do Y.
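
For contrast, here's a minimal sketch of the kind of heuristic filtering typically applied to web-scale data (the thresholds are arbitrary assumptions, not anyone's production values); with synthetic data, the same constraints can instead be stated up front in the generation prompt:

```python
def looks_poor_quality(text: str) -> bool:
    """Toy heuristics in the spirit of web-data quality filters."""
    words = text.split()
    if len(words) < 20:                                           # too short to be useful
        return True
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:  # mostly symbols/digits
        return True
    if len(set(words)) / len(words) < 0.3:                        # highly repetitive
        return True
    return False

# With synthetic generation, you tell the model what to do (and not do) instead:
GENERATION_PROMPT = (
    "Write a self-contained paragraph of at least 20 words, "
    "in plain prose, without repeating yourself or inserting boilerplate."
)

print(looks_poor_quality("buy now!!! $$$ click here"))  # True
```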

Synthetic data is a safe and effective way to train small, medium, and large models. I have trained several models using synthetic data. My models, viz. Scarlett, Carl, Frank, Jordan, and Code, are all trained on synthetic datasets.
Thanks for sharing this new dataset. It will be very useful to all model developers.


Hat tip @soldni 🎉

I'm interested in best practices for making diverse synthetic data. Anyone have suggestions or references?


I would recommend following @dvilasuero and other folks at Argilla, who are doing some very cool work in this area, particularly via their distilabel library.

I'm also working on a modest intro to synthetic data generation here 🤗
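
On the diversity question above, one common practice is to seed each generation with varied attributes (topic, persona, style) rather than reusing a single prompt. Here's a minimal, library-free sketch of that idea; the attribute lists and the make_prompt template are assumptions, and the LLM call itself is left out:

```python
import itertools
import random

# Seed attributes to diversify synthetic generations; these lists are
# purely illustrative, not taken from any particular dataset.
TOPICS = ["gardening", "astronomy", "personal finance", "cooking"]
PERSONAS = ["a curious beginner", "a skeptical expert", "a busy parent"]
STYLES = ["step-by-step", "conversational", "formal"]

def make_prompt(topic: str, persona: str, style: str) -> str:
    return (
        f"As {persona}, ask a question about {topic}, "
        f"then answer it in a {style} tone."
    )

# Sample a spread of combinations instead of reusing one template verbatim.
random.seed(0)
combos = random.sample(list(itertools.product(TOPICS, PERSONAS, STYLES)), k=5)
for topic, persona, style in combos:
    print(make_prompt(topic, persona, style))  # each prompt would go to an LLM
```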