davanstrien posted an update 12 days ago
Several methods/models have recently been shared to generate synthetic data from minimal or no initial seeds, essentially creating data directly from raw text.

IMO, these approaches that rely on smaller models for synthetic data generation are quite valuable for scaling up synthetic data and democratizing access to creating domain-specific synthetic datasets.

I've compiled a collection of Gradio demos showcasing some of these methods here: davanstrien/synthetic-data-generation-demos-667573f248b97360ff3668a5

This is great, thanks for posting these Gradio demos in one collection.

@davanstrien Thanks for the wonderful demos! Just wanted to highlight that we recently released Bonito, an open-source model that converts a user's raw text into an instruction-tuning dataset. It would be awesome if you could add our model to the collection! Happy to help :)

Model: https://huggingface.co/BatsResearch/bonito-v1
Paper: https://arxiv.org/abs/2402.18334
GitHub: https://github.com/BatsResearch/bonito

@nihalnayak would love to add Bonito (I really like how many tasks it supports!). Do you already have a Spaces demo for it?