clem
posted an update Jan 16
Is synthetic data the future of AI? 🔥🔥🔥

@HugoLaurencon @Leyo & @VictorSanh are introducing HuggingFaceM4/WebSight, a multimodal dataset featuring 823,000 pairs of synthetically generated HTML/CSS code and screenshots of the corresponding rendered websites, for training GPT-4V-like models 🌐💻

While crafting their upcoming foundation vision language model, they faced the challenge of converting website screenshots into usable HTML/CSS code. Most VLMs suck at this, and there was no public dataset available for this specific task, so they decided to create their own.

They prompted existing LLMs to generate 823k HTML/CSS files for very simple websites. Through supervised fine-tuning of a vision language model on WebSight, they were able to generate the code to reproduce a website component, given a screenshot.

You can explore the dataset here: HuggingFaceM4/WebSight
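
If you want to poke at it programmatically, here's a minimal sketch using the `datasets` library, streaming so you don't pull all 823k pairs at once; the exact column names are my assumption, so check the dataset card:

```python
from datasets import load_dataset

# Stream the dataset rather than downloading all ~823k pairs up front.
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)

# Peek at one example; the exact fields (e.g. a rendered screenshot plus the
# HTML/CSS source) may differ from what you expect, so check the dataset card.
example = next(iter(ds))
print(example.keys())
```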

What do you think?

Yes, absolutely! We use synthetic data to create high-end open-source datasets like OpenOrca. We are beginning to use synthetic-data-driven results in reinforcement learning too.


🔥🔥🔥

I think it will be a big focus of 2024. I also believe there is a lot of scope for creative approaches to generating synthetic data that don't always rely on a big GPU budget (though this will also be important!). As an example, I created haiku_dpo, a synthetic dataset for making LLMs better at writing haiku. Development happened locally on a laptop, with <1 hour of Colab GPU time used at the end to generate a larger dataset. I think this will be an area where many community members can contribute through their creativity rather than their GPU budget.
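
To give a flavor of that kind of low-budget workflow (this is not the actual haiku_dpo pipeline), here's a minimal sketch that turns model-generated haiku candidates into DPO-style preference pairs using a rough local heuristic; the candidate texts and the 5-7-5 scoring rule are assumptions for the example:

```python
import re

def rough_syllables(line: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", line.lower()))) if line.strip() else 0

def haiku_score(text: str) -> int:
    """Lower is better: distance from the 5-7-5 syllable pattern."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    if len(lines) != 3:
        return 100  # heavily penalise anything that isn't three lines
    return sum(abs(rough_syllables(l) - t) for l, t in zip(lines, [5, 7, 5]))

def to_dpo_pairs(prompt: str, candidates: list[str]) -> list[dict]:
    """Rank candidates by the heuristic; pair best vs. worst as chosen/rejected."""
    ranked = sorted(candidates, key=haiku_score)
    return [{"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}]

# Toy usage: candidates would normally come from sampling an LLM several times.
candidates = [
    "old pond in the rain\na frog jumps into the water\nsound of a splash fades",
    "this is not a haiku at all, just one long rambling line of text",
]
print(to_dpo_pairs("Write a haiku about a frog and a pond.", candidates))
```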

VLMs aren't really my cup of tea, but maybe this will motivate me to get started, ngl!

Definitely. We created the H4rmony dataset by making GPT-4 formulate prompts, using ecolinguistic principles, to unveil the ecological stance of LLMs. Then we asked it to generate completions by playing roles (ecologically aware / unaware / ambivalent). The prompts/completions were ultimately verified by ecolinguists. We called this approach RLRHV, Reinforcement Learning by Role-playing and Human Verification. We RL'd various models with the H4rmony dataset, and the results are very encouraging.
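
Not the actual H4rmony pipeline, but a minimal sketch of the role-playing idea described above; the role descriptions and the generate() stub are assumptions for illustration, and in practice GPT-4 (or another LLM) would produce the completions before human verification:

```python
# Hypothetical sketch of role-conditioned completion generation; generate()
# stands in for a real LLM call and is not part of any library.
ROLES = {
    "aware": "You deeply consider ecological impact in every answer.",
    "unaware": "You ignore ecological considerations entirely.",
    "ambivalent": "You acknowledge ecological issues but treat them as minor.",
}

def generate(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: replace with an actual LLM call.
    return f"[completion written as: {system_prompt}]"

def role_play_completions(prompt: str) -> dict:
    """Produce one completion per ecological stance for later human verification."""
    return {role: generate(persona, prompt) for role, persona in ROLES.items()}

print(role_play_completions("Should the city expand its airport?"))
```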

Synthetic data will certainly play a big role in training future models. I'm interested to know more about the process of cleaning up synthetic data and what methods exist for identifying high-quality clusters for specific output behaviors within larger messy datasets.

Something I am very excited about with synthetic data is the increased ability to tune the data so that it looks exactly the way you want.

We typically spend a lot of time filtering web-scale data by building heuristics that detect "poor-quality" samples. With control over the data creation process, you can quickly tune the generation process to give some specific properties to the data. Often it's just about telling your model to do X, and not to do Y.
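
For contrast, here's a minimal sketch of the kind of heuristic filtering typically applied to web-scale data (the thresholds are arbitrary assumptions, not anyone's production values); with synthetic data, the same constraints can instead be stated up front in the generation prompt:

```python
def looks_poor_quality(text: str) -> bool:
    """Toy heuristics in the spirit of web-data quality filters."""
    words = text.split()
    if len(words) < 20:                                           # too short to be useful
        return True
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:  # mostly symbols/digits
        return True
    if len(set(words)) / len(words) < 0.3:                        # highly repetitive
        return True
    return False

# With synthetic generation, you tell the model what to do (and not do) instead:
GENERATION_PROMPT = (
    "Write a self-contained paragraph of at least 20 words, "
    "in plain prose, without repeating yourself or inserting boilerplate."
)

print(looks_poor_quality("buy now!!! $$$ click here"))  # True
```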

Synthetic data is a safe and effective way to train small, medium, and large models. I have trained several models using synthetic data. My models, viz. Scarlett, Carl, Frank, Jordan, and Code, are all trained on synthetic datasets.
Thanks for sharing this new dataset. It will be very useful to all model developers.


Hat tip @soldni 🎉

I'm interested in best practices for making diverse synthetic data. Anyone have suggestions or references?


I would recommend following @dvilasuero and other folks at Argilla, who are doing some very cool work in this area, particularly via their distilabel library.

I'm also working on a modest intro to synthetic data generation here 🤗
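
On the diversity question above, one common practice is to seed each generation with varied attributes (topic, persona, style) rather than reusing a single prompt. Here's a minimal, library-free sketch of that idea; the attribute lists and the make_prompt template are assumptions, and the LLM call itself is left out:

```python
import itertools
import random

# Seed attributes to diversify synthetic generations; these lists are
# purely illustrative, not taken from any particular dataset.
TOPICS = ["gardening", "astronomy", "personal finance", "cooking"]
PERSONAS = ["a curious beginner", "a skeptical expert", "a busy parent"]
STYLES = ["step-by-step", "conversational", "formal"]

def make_prompt(topic: str, persona: str, style: str) -> str:
    return (
        f"As {persona}, ask a question about {topic}, "
        f"then answer it in a {style} tone."
    )

# Sample a spread of combinations instead of reusing one template verbatim.
random.seed(0)
combos = random.sample(list(itertools.product(TOPICS, PERSONAS, STYLES)), k=5)
for topic, persona, style in combos:
    print(make_prompt(topic, persona, style))  # each prompt would go to an LLM
```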